Prasad Wagle's talk discussed how Twitter extracts insights from its large volumes of data. Twitter collects hundreds of millions of tweets and interactions per day from over 300 million monthly active users, creating big data challenges around velocity, volume, and variety. Twitter stores this data in hundreds of petabytes across large Hadoop clusters and processes it using batch tools like Hadoop and Spark as well as real-time tools like Heron. Insights are generated through basic analytics like user counts, A/B testing of new features, and custom data science work including machine learning models for recommendations, content filtering, and ad targeting. Systems, programming, and statistical skills are needed to effectively extract value from Twitter's big data.
Extracting Insights from Data at Twitter
1. Extracting Insights from Data at Twitter
Prasad Wagle
Technical Lead, Core Data and Metrics, Data Platform
twitter.com/prasadwagle
Jan 26, 2016
2. ● What are the properties of Big Data at Twitter?
● Where do we store it and how do we process it?
● What do we learn from the data?
Overview of the talk
3. ● Velocity: Rate at which data is created
○ 313 million monthly active users. (June 2016)
○ Hundreds of millions of Tweets are sent per day. TPS record:
one-second peak of 143,199 Tweets per second
○ 100 Billion interaction events per day
● Volume: 100s of petabytes of data
● Variety: Tweets, Users, Client events and many more
○ Client event logs have a unified Thrift format for a wide variety of
application events
3Vs of Big Data @Twitter
4. Data Processing Big Picture (diagram)
● Sources: Production systems feed Real-time streams (Eventbus, Kafka) and Batch storage (Hadoop: HDFS, MapReduce)
● Analytics Tools
○ Batch: Scalding, Spark
○ Real-time: Heron
○ Lambda (Batch + Real-time): Summingbird, TSAR
○ Interactive: Presto, Vertica, R
● Analytics Front-ends: Custom Dashboards, Tableau, Apache Zeppelin, Command line tools
● Foundation: Data Abstraction Layer (DAL), Pipeline Orchestration
6. ● Batch Processing Engine - Hadoop
● Real-time Processing Engine - Heron
● Core Data Libraries - Scalding, Summingbird, Tsar, Parquet
● Data Pipeline - Data Access Layer (DAL), Orchestration
● Interactive SQL - Presto, Vertica
● Data Visualization - Tableau, Apache Zeppelin
● Core Data and Metrics
Data Platform Projects
7. ● Largest Hadoop clusters in the world, some > 10K nodes
● Store 100s of petabytes of data
● More than 100K daily jobs
● Improvements to open source Hadoop software
● hRaven - tool that collects runtime data of Hadoop jobs and lets users
visualize job metrics
○ YARN Timeline Server is the next-gen hRaven
● Log pipeline software (Scribe -> HDFS)
○ Scribe is being replaced by Flume
Hadoop
8. ● Heron - a real-time, distributed, fault-tolerant stream processing engine
● Successor to Storm, API-compatible with Storm
● Analyze data as it is being produced
● > 400 real-time jobs, 500 B events / day processed, 25 - 200 ms latency
● Use cases
○ Real-time impression and engagement counts
○ Real-time trends, recommendations, spam detection
Real-time Processing
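The real-time counting use case above can be sketched without the Heron API (which is out of scope here): the essential idea is keeping per-key counts over a moving time window as events stream in. This is a toy illustration, not how Heron topologies are written.

```python
from collections import defaultdict, deque

class SlidingWindowCounter:
    """Counts events per key over the last `window` time units.
    A toy stand-in for a real-time counting job; not the Heron API."""
    def __init__(self, window):
        self.window = window
        self.events = deque()           # (timestamp, key), oldest first
        self.counts = defaultdict(int)

    def add(self, timestamp, key):
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, key = self.events.popleft()
            self.counts[key] -= 1
            if self.counts[key] == 0:
                del self.counts[key]

    def count(self, key):
        return self.counts.get(key, 0)

# Impressions for tweet "t1" over a 60-second window:
w = SlidingWindowCounter(window=60)
w.add(0, "t1")
w.add(30, "t1")
w.add(70, "t1")        # the event at t=0 falls out of the window
print(w.count("t1"))   # 2
```

A production engine distributes this state across many workers and handles failures; the windowing logic itself is this simple.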
9. ● Tools that make it easy to create MapReduce and Heron jobs
● Scalding
○ Scala DSL on top of Cascading
● Summingbird
○ Lambda architecture: real-time and batch
● Tsar: TimeSeries AggregatoR
○ DSL implemented on top of Summingbird
Core Data Libraries
10. ● DAL is a service that simplifies the discovery, usage, and maintainability
of data
● Users work with logical datasets
● Physical dataset describes the serialization of a logical dataset to a
specific location (Hadoop, Vertica) and format
● Logical dataset can simultaneously exist in multiple places
● Users can use logical dataset name to consume data with different
tools like Scalding, Presto
Data Access Layer (DAL)
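The logical/physical split described above can be sketched as a small registry: one logical name maps to several physical copies, and each tool resolves the copy it can read. All names and methods here are illustrative assumptions, not Twitter's actual DAL API.

```python
# Hypothetical sketch of DAL's logical/physical dataset split.
class DataAccessLayer:
    def __init__(self):
        self._physical = {}   # logical name -> {system: (location, format)}

    def register(self, logical, system, location, fmt):
        # Record one physical materialization of a logical dataset.
        self._physical.setdefault(logical, {})[system] = (location, fmt)

    def resolve(self, logical, system):
        # A tool (Scalding, Presto, ...) asks for the copy it can read.
        return self._physical[logical][system]

dal = DataAccessLayer()
# One logical dataset, simultaneously materialized in two places:
dal.register("interactions", "hadoop", "/logs/interactions", "parquet")
dal.register("interactions", "vertica", "prod.interactions", "table")

print(dal.resolve("interactions", "hadoop"))   # ('/logs/interactions', 'parquet')
```

Because producers and consumers only ever name the logical dataset, a physical copy can be moved or re-serialized without breaking any job.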
11. ● The Eagleeye web application is the front-end for end users
● Users discover datasets with Eagleeye
● Eagleeye displays metadata like owners and schema
● Applications' access to datasets is recorded
● This enables Eagleeye to show dependency graphs for a dataset - the jobs
that produce it and the jobs that consume it
Data Access Layer (DAL)
15. ● Statebird service
○ Tracks state of batch jobs
○ Used to manage dependencies
Pipeline Orchestration
16. ● Interactive means that results of a query are available in the range of
seconds to a few minutes
● SQL is still the lingua franca for ad hoc data analysis
● Vertica
○ Columnar architecture, high performance analytics queries
● Presto
○ Data in HDFS in Parquet format
Interactive SQL
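The kind of ad hoc question answered by interactive SQL can be shown with Python's built-in sqlite3 standing in for Presto or Vertica (the query shape, not the engine, is the point here; the table and data are invented for illustration):

```python
import sqlite3

# sqlite3 stands in for Presto/Vertica; a real query would scan
# Parquet files on HDFS or a columnar Vertica table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (user_id INTEGER, day TEXT)")
conn.executemany("INSERT INTO tweets VALUES (?, ?)",
                 [(1, "2016-01-25"), (1, "2016-01-25"), (2, "2016-01-25"),
                  (2, "2016-01-26"), (3, "2016-01-26")])

# Ad hoc question: tweets and distinct tweeting users per day.
rows = conn.execute("""
    SELECT day, COUNT(*) AS tweets, COUNT(DISTINCT user_id) AS users
    FROM tweets GROUP BY day ORDER BY day
""").fetchall()
print(rows)  # [('2016-01-25', 3, 2), ('2016-01-26', 2, 2)]
```

The "interactive" property is about latency: the same GROUP BY over petabytes is what Presto and Vertica make answerable in seconds to minutes.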
17. ● Custom Dashboards
● Apache Zeppelin Strengths
○ Notebook metaphor - notebook is a collection of notes, each note
is a collection of paragraphs (queries)
○ Web-based report authoring, collaborative like Google Docs
○ Very easy to create a note and then share it
○ > 2K notes, 18K queries
○ Supports JDBC (Presto, Vertica, MySQL)
○ Open source, Easy to add new interpreters like Scalding
Data Visualization
18. ● Tableau Strengths
○ Easy to create reports, does not require SQL expertise
○ Built in analytics functions e.g. Rank, Percentile
○ Polished visualizations
○ Row level security
Data Visualization
19. ● A big part of data analysis is data cleansing
● It makes sense to do this once
● Core Data
○ Create pipelines to create “verified” datasets like Users, Tweets,
Interactions
○ Reliable and easy to use
● Core Metrics
○ Create pipelines to compute Twitter’s important metrics
○ DAU, MAU, Tweet Impressions
Core Data and Metrics
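A "verified" dataset pipeline of the kind described above boils down to applying cleansing rules once, centrally, so every consumer inherits them. The rules below (drop malformed records, duplicates, and records missing key fields) are assumed for illustration, not Twitter's actual criteria.

```python
# Illustrative cleansing pass producing a "verified" users dataset.
def verify_users(raw):
    seen = set()
    verified = []
    for rec in raw:
        uid = rec.get("id")
        if uid is None or uid in seen:   # drop malformed and duplicate records
            continue
        if not rec.get("created_at"):    # drop records missing key fields
            continue
        seen.add(uid)
        verified.append({"id": uid, "created_at": rec["created_at"]})
    return verified

raw = [
    {"id": 1, "created_at": "2016-01-01"},
    {"id": 1, "created_at": "2016-01-01"},     # duplicate
    {"id": None, "created_at": "2016-01-02"},  # malformed
    {"id": 2, "created_at": ""},               # missing field
    {"id": 3, "created_at": "2016-01-03"},
]
print(len(verify_users(raw)))  # 2
```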
21. ● Analytics - Basic Counting
● A/B Testing
● Data Science - Custom analysis
● Data Science - Machine Learning
Data Processing
22. ● Daily/Monthly Active Users
● Number of Tweets, Retweets, Likes
● Tweet Impressions
● Logic is relatively simple
● Challenges: scale and timeliness
○ Results for previous day should be available by 10 am
○ Some metrics are real-time
Basic Counting
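As the slide says, the logic behind metrics like DAU is relatively simple; the challenge is running it over 100 B events by a 10 am deadline. The counting itself, on a toy event log, looks like this:

```python
from collections import defaultdict

# Toy DAU/MAU computation over (day, user) interaction events.
# The real pipelines run at petabyte scale in Scalding; the logic is this simple.
events = [
    ("2016-01-25", 1), ("2016-01-25", 1), ("2016-01-25", 2),
    ("2016-01-26", 2), ("2016-01-26", 3),
]

daily = defaultdict(set)
for day, user in events:
    daily[day].add(user)

dau = {day: len(users) for day, users in daily.items()}   # distinct users per day
mau = len(set.union(*daily.values()))                     # distinct users overall

print(dau)  # {'2016-01-25': 2, '2016-01-26': 2}
print(mau)  # 3
```

At scale, the per-day sets become distributed aggregations (or approximate distinct counts), but the metric definitions stay the same.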
23. ● Goal: find the number of impressions and engagements for a tweet
● Real-time
● Used in analytics.twitter.com
Example - Counting Tweet Impressions
25. ● TSAR job is converted to a Summingbird job
● Summingbird job creates
○ Real-time pipeline with Heron
○ Batch pipeline with Scalding
● Users access results using TSAR query service
● Write once, run batch and real-time
Example - Counting Tweet Impressions
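The "write once, run batch and real-time" property works because the aggregation is a monoid: merging is associative, so a batch result over replayed logs and a real-time result over the streaming tail combine into one answer. A minimal sketch, using Python's Counter as the monoid (illustrative, not the Summingbird API):

```python
from collections import Counter

def count_impressions(events):
    # The single aggregation definition, shared by both pipelines.
    return Counter(events)

batch_events = ["t1", "t1", "t2"]      # replayed from logs on HDFS
realtime_events = ["t1", "t2", "t2"]   # streaming tail of the current period

# Counter addition is the monoid merge: batch + real-time = total.
merged = count_impressions(batch_events) + count_impressions(realtime_events)
print(merged["t1"], merged["t2"])  # 3 3
```

Because the merge is associative, the batch job can later recompute and replace the real-time portion (correcting any streaming inaccuracies) without changing the aggregation code.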
26. ● Experimentation is at the heart of Twitter’s product development cycle
● Expertise needed in Statistics and Technology
A/B Testing Framework
27. ● Goal: an informative experiment
● Minimize false positive and false negative errors
● How many users do we need to sample?
● How long should we run the experiment?
A/B Testing Statistics
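The "how many users?" question on this slide has a standard answer for comparing two proportions. A sketch with the usual defaults (two-sided alpha = 0.05, 80% power; the z-values are the corresponding standard normal quantiles; the 10% baseline rate is an invented example):

```python
import math

def sample_size(p1, p2, z_alpha=1.96, z_beta=0.8416):
    """Users per bucket to detect a shift from p1 to p2.

    z_alpha: quantile for two-sided alpha = 0.05 (false positives)
    z_beta:  quantile for 80% power (false negatives)
    """
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Detecting a 1-point lift in a 10% engagement rate:
n = sample_size(0.10, 0.11)
print(n)  # 14749 users per bucket
```

Run length then follows from how fast the experiment accrues that many users per bucket; smaller effects drive the required sample up quadratically.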
28. ● Processes 100 B events daily; compute-intensive.
● Metrics computed using Scalding pipeline that combines client event
logs, internal user models, and other datasets.
● Lightweight statistics are computed in a streaming job using TSAR
running on Heron.
A/B Testing Technology
29. ● Cause of spikes and dips in key metrics
● Growth Trends
○ By country, client
● Analysis to understand user behavior
○ Creators vs Consumers
○ Distribution of followers
○ User clusters
● Analysis to inform product feature decisions
Data Science - Custom Analysis
30. ● Recommendations
○ Users: WTF - who to follow
○ Tweets: Algorithmic timeline
● Cortex, Deep learning based on Torch framework
○ Identify NSFW images
○ Recognize what is happening in live feeds
Data Science - Machine Learning
31. ● Product Safety
○ Detect fake accounts
○ Detect tweet spam and abuse
● Ad Targeting
○ Promoted Trends, Accounts and Tweets
○ Show only if it is likely to be interesting and relevant to that user
○ Predict click probability using signals including what a user
chooses to follow, how they interact with a Tweet and what they
retweet
Machine Learning
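Click-probability prediction of the kind described above is commonly a logistic model over engagement signals. The sketch below uses invented weights and features loosely matching the signals the slide lists (follows, past interactions, retweets); it shows the shape of the approach, not Twitter's actual model.

```python
import math

def click_probability(follows_advertiser, past_interactions, retweets):
    # Hypothetical hand-set weights; a real model learns these from data.
    z = -3.0 + 1.5 * follows_advertiser + 0.4 * past_interactions + 0.3 * retweets
    return 1.0 / (1.0 + math.exp(-z))   # logistic function -> probability

p_cold = click_probability(0, 0, 0)   # user with no engagement signals
p_warm = click_probability(1, 3, 2)   # engaged follower of the advertiser
print(round(p_cold, 3), round(p_warm, 3))  # 0.047 0.574
```

The "show only if likely to be relevant" rule then becomes a threshold on this probability, trading ad load against relevance.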
32. ● Systems (Hadoop, Vertica)
○ Necessary because higher-level abstractions are leaky
● Programming (Scala, Scalding, SQL)
● Math (Statistics, Linear Algebra)
Ideal Talent Stack
(Diagram: a Systems - Programming - Statistics spectrum, with Data Engineers toward the systems end and Data Scientists toward the statistics end)
33. Data Platform and Data Science
work hand-in-hand
to extract insights from Big Data at Twitter
Summary