2. The Twitter Data Lifecycle
‣ Data Input
‣ Data Storage
‣ Data Analysis
‣ Data Products
3. The Twitter Data Lifecycle
‣ Data Input: Scribe, Crane
‣ Data Storage: Elephant Bird, HBase
‣ Data Analysis: Pig, Oink
‣ Data Products: Birdbrain
1 Community Open Source
2 Twitter Open Source (or soon)
4. My Background
‣ Studied Mathematics and Physics at Harvard, Physics at Stanford
‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣ Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, (soon) PBs of data
5. The Twitter Data Lifecycle
‣ Data Input: Scribe, Crane
‣ Data Storage
‣ Data Analysis
‣ Data Products
1 Community Open Source
2 Twitter Open Source
6. What Data?
‣ Two main kinds of raw data
‣ Logs
‣ Tabular data
8. Logs
‣ Started with syslog-ng
‣ As our volume grew, it didn’t scale
‣ Resources overwhelmed
‣ Lost data
9. Scribe
‣ Scribe daemon runs locally; reliable in network outage
‣ Nodes only know downstream writer; hierarchical, scalable
‣ Pluggable outputs, per category
[Diagram: FE nodes → Agg nodes → File / HDFS outputs]
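The per-category routing described above can be sketched in a few lines. This is an illustrative Python sketch, not Scribe's actual API: the `Router` class and handler functions are hypothetical stand-ins for Scribe's category-to-store configuration.

```python
# Illustrative sketch (not Scribe's actual API): per-category routing with
# pluggable outputs, mirroring Scribe's category -> store configuration.

class Router:
    def __init__(self):
        self.outputs = {}      # category -> list of handler functions
        self.default = []      # handlers for unconfigured categories

    def add_output(self, category, handler):
        self.outputs.setdefault(category, []).append(handler)

    def log(self, category, message):
        # Route each message to the handlers configured for its category,
        # falling back to the default handlers otherwise.
        handlers = self.outputs.get(category, self.default)
        for handler in handlers:
            handler(category, message)

# Example: "timeline" messages go to one sink, everything else to another.
timeline_sink, other_sink = [], []
router = Router()
router.add_output("timeline", lambda c, m: timeline_sink.append(m))
router.default.append(lambda c, m: other_sink.append((c, m)))

router.log("timeline", "home_timeline viewed")
router.log("search", "query=hadoop")
```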
10. Scribe at Twitter
‣ Solved our problem, opened new vistas
‣ Currently 57 different categories logged from multiple sources
‣ FE: JavaScript, Ruby on Rails
‣ Middle tier: Ruby on Rails, Scala
‣ Backend: Scala, Java, C++
‣ 7 TB/day into HDFS
‣ Log first, ask questions later.
11. Scribe at Twitter
‣ We’ve contributed to it as we’ve used it1
‣ Improved logging, monitoring, writing to HDFS, compression
‣ Added ZooKeeper-based config
‣ Continuing to work with FB on patches
‣ Also: working with Cloudera to evaluate Flume
1 http://github.com/traviscrawford/scribe
12. Tabular Data
‣ Most site data is in MySQL
‣ Tweets, users, devices, client applications, etc
‣ Need to move it between MySQL and HDFS
‣ Also between MySQL and HBase, or MySQL and MySQL
‣ Crane: configuration driven ETL tool
17. Crane
‣ Extract
‣ MySQL, HDFS, HBase, Flock, GA, Facebook Insights
‣ Transform
‣ IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic
‣ Load
‣ MySQL, Local file, Stdout, HDFS, HBase
‣ ZooKeeper coordination, intelligent date management
‣ Run all the time from multiple servers, self healing
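The extract/transform/load flow above can be sketched as a configuration-driven pipeline. Everything here is hypothetical and illustrative of the idea, not Crane's actual code or config format: `run_job`, `canonicalize_date`, and the config keys are invented for this sketch.

```python
# Hypothetical sketch of a configuration-driven ETL step in the spirit of
# Crane; names and structure are illustrative, not Crane's.
from datetime import datetime

def canonicalize_date(record):
    # Normalize a variety of input date formats to ISO 8601.
    for fmt in ("%m/%d/%Y", "%Y-%m-%d", "%d %b %Y"):
        try:
            parsed = datetime.strptime(record["created_at"], fmt)
            record["created_at"] = parsed.strftime("%Y-%m-%d")
            return record
        except ValueError:
            continue
    return record

def run_job(config, source_rows, sink):
    # Extract rows from the source, apply each configured transform in
    # order, then load the result into the sink.
    for row in source_rows:
        for transform in config["transforms"]:
            row = transform(row)
        sink.append(row)

config = {
    "source": "mysql://users",
    "transforms": [canonicalize_date],
    "sink": "hdfs://users",
}
sink = []
run_job(config, [{"id": 1, "created_at": "04/08/2010"}], sink)
```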
18. The Twitter Data Lifecycle
‣ Data Input
‣ Data Storage: Elephant Bird, HBase
‣ Data Analysis
‣ Data Products
1 Community Open Source
2 Twitter Open Source
19. Storage Basics
‣ Incoming data: 7 TB/day
‣ LZO encode everything
‣ Save 3-4x on storage at little CPU cost
‣ Splittable!1
‣ IO-bound jobs ==> 3-4x perf increase
1 http://www.github.com/kevinweil/hadoop-lzo
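To see why compressing logs pays off, here is a rough illustration using Python's `zlib` as a stand-in for LZO (LZO trades some compression ratio for much faster decompression, which is what makes the storage savings nearly free in CPU terms).

```python
# Rough illustration of log compression ratios; zlib stands in for LZO.
import zlib

log_line = '127.0.0.1 - - [08/Apr/2010:10:00:00] "GET /timeline HTTP/1.1" 200 5316\n'
raw = (log_line * 10000).encode()
compressed = zlib.compress(raw)

# Repetitive log data typically compresses far better than the
# 3-4x figure quoted above for real mixed traffic.
ratio = len(raw) / len(compressed)
```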
25. Elephant Bird
‣ We have data coming in as protocol buffers via Crane...
‣ Protobufs: codegen for efficient ser-de of data structures
‣ Why shouldn’t we just continue, and codegen more glue?
‣ InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig HBaseLoaders
‣ Also now does part of this with Thrift, soon Avro
‣ And JSON, W3C Logs
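The codegen idea behind Elephant Bird can be illustrated with a toy example: given a record schema, generate the repetitive glue rather than writing it by hand. Elephant Bird does this in Java for InputFormats, LoadFuncs, and so on; this Python sketch (the `generate_parser` helper and template are invented) only shows the principle.

```python
# Toy sketch of schema-driven codegen: given a record schema, generate a
# parser class from a template instead of hand-writing the glue.

TEMPLATE = """
class {name}Parser:
    FIELDS = {fields!r}

    def parse(self, line):
        # Split a tab-delimited line into a dict keyed by the schema's fields.
        return dict(zip(self.FIELDS, line.split('\\t')))
"""

def generate_parser(name, fields):
    # "Compile" the generated source and return the new class.
    namespace = {}
    exec(TEMPLATE.format(name=name, fields=fields), namespace)
    return namespace[name + "Parser"]

StatusParser = generate_parser("Status", ["id", "user_id", "text"])
record = StatusParser().parse("1\t42\thello world")
```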
27. Challenge: Mutable Data
‣ HDFS is write-once: no seek on write, no append (yet)
‣ Logs are easy.
‣ But our tables change.
‣ Handling rapidly changing data in HDFS: not trivial.
‣ Don’t worry about updated data
‣ Refresh entire dataset
‣ Download updates, tombstone old versions of data, ensure jobs only run over current versions of data, occasionally rewrite full dataset
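The tombstoning strategy above can be sketched as a compaction pass: every update (or delete) arrives as a new record, and readers must see only the latest live version of each row. The names here are illustrative, not Crane's or HDFS's.

```python
# Sketch of "tombstone old versions": compact a stream of versioned records
# down to the current live value per key.

def compact(records):
    # records: iterable of (key, version, value); value=None marks a tombstone.
    latest = {}
    for key, version, value in records:
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, value)
    # Drop tombstoned keys entirely; keep only the current value per key.
    return {k: v for k, (_, v) in latest.items() if v is not None}

rows = [
    ("user:1", 1, {"name": "ev"}),
    ("user:2", 1, {"name": "biz"}),
    ("user:1", 2, {"name": "evan"}),   # update supersedes version 1
    ("user:2", 2, None),               # tombstone: user:2 deleted
]
current = compact(rows)
```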
29. HBase
‣ Has already solved the update problem
‣ Bonus: low-latency query API
‣ Bonus: rich, BigTable-style data model based on column families
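The BigTable-style data model mentioned above can be sketched with plain dicts: rows are keyed, each row holds named column families, and each family maps qualifiers to values. (Real HBase also versions every cell by timestamp; that is omitted in this illustrative sketch.)

```python
# The BigTable/HBase data model as nested dicts:
# row key -> column family -> qualifier -> value.

table = {
    "status:12345": {
        "proto": {"bytes": b"\x0a\x03..."},          # serialized protobuf CF
        "meta":  {"user_id": "42", "lang": "en"},    # denormalized columns CF
    },
}

def get(table, row, family, qualifier):
    # Point lookup: the low-latency access pattern HBase adds on top of HDFS.
    return table.get(row, {}).get(family, {}).get(qualifier)

lang = get(table, "status:12345", "meta", "lang")
```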
30. HBase at Twitter
‣ Crane loads data directly into HBase
‣ One CF for protobuf bytes, one CF to denormalize columns for indexing or quicker batch access
‣ Processing updates is transparent, so we always have accurate data in HBase
‣ Pig Loader for HBase in Elephant Bird makes integration with existing analyses easy
32. The Twitter Data Lifecycle
‣ Data Input
‣ Data Storage
‣ Data Analysis: Pig, Oink
‣ Data Products
1 Community Open Source
2 Twitter Open Source
33. Enter Pig
‣ High level language
‣ Transformations on sets of records
‣ Process data one step at a time
‣ UDFs are first-class citizens
‣ Easier than SQL?
34. Why Pig?
‣ Because I bet you can read the following script.
35. A Real Pig Script
‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
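The Pig script and its MapReduce equivalent were shown as slide images; as a stand-in, here is the same step-at-a-time style sketched in Python, with each step mirroring one Pig statement (LOAD, FILTER, GROUP, FOREACH ... GENERATE). The data and field names are invented for illustration.

```python
# Pig's step-at-a-time style, sketched in Python: filter, group, aggregate.

requests = [
    {"url": "/home", "status": 200, "latency_ms": 120},
    {"url": "/home", "status": 200, "latency_ms": 80},
    {"url": "/search", "status": 500, "latency_ms": 300},
]

# FILTER requests BY status == 200;
ok = [r for r in requests if r["status"] == 200]

# GROUP ok BY url;
grouped = {}
for r in ok:
    grouped.setdefault(r["url"], []).append(r)

# FOREACH grouped GENERATE group, AVG(latency_ms);
avg_latency = {
    url: sum(r["latency_ms"] for r in rs) / len(rs)
    for url, rs in grouped.items()
}
```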
37. Pig Democratizes Large-scale Data Analysis
‣ The Pig version is:
‣ 5% of the code
‣ 5% of the time
‣ Within 30% of the execution time.
‣ Innovation increasingly driven from large-scale data analysis
‣ Need fast iteration to understand the right questions
‣ More minds contributing = more value from your data
38. Pig Examples
‣ Using the HBase Loader
‣ Using the protobuf loaders
39. Pig Workflow
‣ Oink: framework around Pig for loading, combining, running, post-processing
‣ Everyone I know has one of these
‣ Points to an opening for innovation; discussion beginning
‣ Something we’re looking at: Ruby DSL for Pig, Piglet1
1 http://github.com/ningliang/piglet
40. Counting Big Data
‣ standard counts, min, max, std dev
‣ How many requests do we serve in a day?
‣ What is the average latency? 95% latency?
‣ Group by response code. What is the hourly distribution?
‣ How many searches happen each day on Twitter?
‣ How many unique queries, how many unique users?
‣ What is their geographic distribution?
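The counting questions above reduce to a handful of aggregates. Here is a small Python sketch computing request count, average latency, and 95th-percentile latency grouped by response code; in practice these would be Pig jobs over HDFS, and the data here is invented.

```python
# Count, average, and p95 latency per response code (nearest-rank percentile).

def percentile(values, p):
    # Nearest-rank percentile over a sorted copy of the values.
    ordered = sorted(values)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

latencies_by_code = {
    200: [80, 90, 100, 110, 120, 130, 140, 150, 160, 170],
    500: [300, 450],
}

stats = {
    code: {
        "count": len(ls),
        "avg": sum(ls) / len(ls),
        "p95": percentile(ls, 95),
    }
    for code, ls in latencies_by_code.items()
}
```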
41. Correlating Big Data
‣ probabilities, covariance, influence
‣ How does usage differ for mobile users?
‣ How about for users with 3rd party desktop clients?
‣ Cohort analyses
‣ Site problems: what goes wrong at the same time?
‣ Which features get users hooked?
‣ Which features do successful users use often?
‣ Search corrections, search suggestions
‣ A/B testing
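One of the simple correlation measures mentioned above is covariance between two usage metrics, say days active and tweets sent, computed over a cohort. The toy data and function here are illustrative; at scale this would be a Pig job.

```python
# Sample covariance between two usage metrics over a toy cohort.

def covariance(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance: positive when the metrics move together.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

days_active = [1, 2, 3, 4, 5]
tweets_sent = [2, 4, 6, 8, 10]
cov = covariance(days_active, tweets_sent)
```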
42. Research on Big Data
‣ prediction, graph analysis, natural language
‣ What can we tell about a user from their tweets?
‣ From the tweets of those they follow?
‣ From the tweets of their followers?
‣ From the ratio of followers/following?
‣ What graph structures lead to successful networks?
‣ User reputation
43. Research on Big Data
‣ prediction, graph analysis, natural language
‣ Sentiment analysis
‣ What features get a tweet retweeted?
‣ How deep is the corresponding retweet tree?
‣ Long-term duplicate detection
‣ Machine learning
‣ Language detection
‣ ... the list goes on.
44. The Twitter Data Lifecycle
‣ Data Input
‣ Data Storage
‣ Data Analysis
‣ Data Products: Birdbrain
1 Community Open Source
2 Twitter Open Source
45. Data Products
‣ Ad Hoc Analyses
‣ Answer questions to keep the business agile, do research
‣ Online Products
‣ Name search, other upcoming products
‣ Company Dashboard
‣ Birdbrain
46. Questions? Follow me at twitter.com/kevinweil
‣ P.S. We’re hiring. Help us build the next step: realtime big data analytics.