Hadoop at Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter




The Twitter Data Lifecycle
‣   Data Input: Scribe, Crane
‣   Data Storage: Elephant Bird, HBase
‣   Data Analysis: Pig, Oink
‣   Data Products: Birdbrain



1 Community Open Source
2 Twitter Open Source (or soon)
My Background
‣   Studied Mathematics and Physics at Harvard, Physics at
    Stanford
‣   Tropos Networks (city-wide wireless): mesh routing algorithms,
    GBs of data
‣   Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣   Twitter: Hadoop, Pig, machine learning, visualization, social
    graph analysis, (soon) PBs of data
The Twitter Data Lifecycle
‣   Data Input: Scribe, Crane
‣   Data Storage
‣   Data Analysis
‣   Data Products



What Data?
‣   Two main kinds of raw data
‣   	   Logs
‣   	   Tabular data
Logs
‣   Started with syslog-ng
‣   As our volume grew, it didn’t scale
‣   Resources overwhelmed
‣   Lost data
Scribe
‣   Scribe daemon runs locally; reliable in network outage
‣   Nodes only know downstream writer; hierarchical, scalable
‣   Pluggable outputs, per category

[Diagram: FE, FE, FE -> Agg, Agg -> File, HDFS]
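The store-and-forward behaviour described on this slide can be sketched roughly as follows. This is an illustrative Python model, not Scribe's actual API: the `LocalDaemon` and `Aggregator` classes, the `receive` method, and the `search` category are all hypothetical names chosen for the example.

```python
from collections import defaultdict, deque


class LocalDaemon:
    """Toy model of a local log daemon that only knows its downstream writer."""

    def __init__(self, downstream):
        # downstream: callable(category, message) that raises ConnectionError
        # when the next hop is unreachable (hypothetical interface).
        self.downstream = downstream
        self.buffer = deque()

    def log(self, category, message):
        # Always enqueue first so nothing is lost if the network is down.
        self.buffer.append((category, message))
        self.flush()

    def flush(self):
        # Forward buffered entries in order; stop at the first failure.
        while self.buffer:
            category, message = self.buffer[0]
            try:
                self.downstream(category, message)
            except ConnectionError:
                return
            self.buffer.popleft()


class Aggregator:
    """Toy aggregator with one output list per category."""

    def __init__(self):
        self.outputs = defaultdict(list)  # category -> stored messages
        self.online = True

    def receive(self, category, message):
        if not self.online:
            raise ConnectionError("aggregator unreachable")
        self.outputs[category].append(message)


agg = Aggregator()
fe = LocalDaemon(agg.receive)
fe.log("search", "q=hadoop")
agg.online = False           # simulate a network outage
fe.log("search", "q=pig")    # buffered locally, not lost
agg.online = True
fe.flush()                   # delivered once the aggregator is back
```

The front end never needs to know what sits behind the aggregator, which is what lets the hierarchy grow by adding layers.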
Scribe at Twitter
‣   Solved our problem, opened new vistas
‣   Currently 57 different categories logged from multiple sources
‣     FE: Javascript, Ruby on Rails
‣     Middle tier: Ruby on Rails, Scala
‣     Backend: Scala, Java, C++
‣   7 TB/day into HDFS


‣   Log first, ask questions later.
Scribe at Twitter
‣   We’ve contributed to it as we’ve used it¹
‣       Improved logging, monitoring, writing to HDFS, compression
‣       Added ZooKeeper-based config
‣       Continuing to work with FB on patches
‣   Also: working with Cloudera to evaluate Flume

1 http://github.com/traviscrawford/scribe
Tabular Data
‣   Most site data is in MySQL
‣     Tweets, users, devices, client applications, etc
‣   Need to move it between MySQL and HDFS
‣     Also between MySQL and HBase, or MySQL and MySQL


‣   Crane: configuration driven ETL tool
Crane

[Diagram: Driver (Configuration/Batch Management, ZooKeeper Registration); Source -> Extract -> Transform -> Load -> Sink; Protobuf P1 and Protobuf P2 label the data passed between stages]
Crane
‣   Extract
‣   	   MySQL, HDFS, HBase, Flock, GA, Facebook Insights
‣   Transform
‣   	   IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic
‣   Load
‣   	   MySQL, Local file, Stdout, HDFS, HBase
‣   ZooKeeper coordination, intelligent date management
‣   	   Run all the time from multiple servers, self healing
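A configuration-driven extract/transform/load pass of the kind listed above might look roughly like this. Crane's real interfaces are not shown in the slides, so every name here (`run_job`, `TRANSFORMS`, the `config` keys, the `created` field) is an assumption made for illustration, and plain Python lists stand in for MySQL and HDFS.

```python
def extract(rows):
    # Stand-in for a source such as a MySQL table: yields dict records.
    yield from rows


def canonicalize_date(record):
    # Example transform: normalise "YYYY/MM/DD" to "YYYY-MM-DD".
    out = dict(record)
    out["created"] = record["created"].replace("/", "-")
    return out


def make_list_loader(sink):
    # Stand-in for a sink such as HDFS or HBase: appends to a list.
    def load(record):
        sink.append(record)
    return load


# Registry of named transforms that a configuration can refer to.
TRANSFORMS = {"canonicalize_date": canonicalize_date}


def run_job(config, source_rows, sink):
    """Run one extract -> transform -> load pass described by `config`."""
    load = make_list_loader(sink)
    steps = [TRANSFORMS[name] for name in config["transforms"]]
    for record in extract(source_rows):
        for step in steps:
            record = step(record)
        load(record)


config = {"transforms": ["canonicalize_date"]}
rows = [{"id": 1, "created": "2010/06/29"}]
sink = []
run_job(config, rows, sink)
```

Because the job is described by data rather than code, the same driver can be pointed at different sources, transforms, and sinks by changing only the configuration.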
The Twitter Data Lifecycle
‣   Data Input
‣   Data Storage: Elephant Bird, HBase
‣   Data Analysis
‣   Data Products



Storage Basics
‣   Incoming data: 7 TB/day
‣   LZO encode everything
‣       Save 3-4x on storage, pay little CPU
‣       Splittable!¹
‣       IO-bound jobs ==> 3-4x perf increase

1 http://www.github.com/kevinweil/hadoop-lzo
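As a rough worked example of the storage saving, assume the 7 TB/day figure is the uncompressed volume (the slide does not say) and apply the quoted 3-4x ratio. The function name below is hypothetical.

```python
def stored_tb_per_day(incoming_tb, ratio):
    """Daily on-disk volume after compressing `incoming_tb` by `ratio`."""
    return incoming_tb / ratio


incoming_tb_per_day = 7.0        # from the slide
compression_ratios = (3.0, 4.0)  # "3-4x" from the slide

stored_range = tuple(
    stored_tb_per_day(incoming_tb_per_day, r) for r in compression_ratios
)

for ratio, stored in zip(compression_ratios, stored_range):
    print(f"{ratio:.0f}x compression: {stored:.2f} TB/day on disk")
```

Under that assumption the daily footprint falls to somewhere between about 1.75 and 2.33 TB before HDFS replication is taken into account.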
http://www.flickr.com/photos/jagadish/3072134867/




Elephant Bird




1 http://github.com/kevinweil/elephant-bird
Elephant Bird
‣   We have data coming in as protocol buffers via Crane...
‣   Protobufs: codegen for efficient ser-de of data structures
‣   Why shouldn’t we just continue, and codegen more glue?
‣     InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig
    HBaseLoaders
‣     Also now does part of this with Thrift, soon Avro
‣     And JSON, W3C Logs
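The idea of generating glue code from a schema can be illustrated with a toy generator. This is not how Elephant Bird is implemented: `generate_loader`, the `Status` message, and the tab-separated input format are all assumptions made to keep the sketch small.

```python
def generate_loader(message_name, fields):
    """Return Python source for a loader class that parses tab-separated lines."""
    lines = [
        f"class {message_name}Loader:",
        f"    FIELDS = {list(fields)!r}",
        "",
        "    def parse(self, line):",
        "        values = line.rstrip('\\n').split('\\t')",
        "        return dict(zip(self.FIELDS, values))",
    ]
    return "\n".join(lines) + "\n"


# Generate the source for a hypothetical "Status" message, then compile it.
source = generate_loader("Status", ["id", "user_id", "text"])
namespace = {}
exec(source, namespace)

loader = namespace["StatusLoader"]()
parsed = loader.parse("1\t7\thello\n")
```

One schema description yields a ready-made loader, so adding a new message type requires no hand-written parsing code.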
Challenge: Mutable Data
‣   HDFS is write-once: no seek on write, no append (yet)
‣       Logs are easy.
‣       But our tables change.
‣   Handling rapidly changing data in HDFS: not trivial.
‣   Don’t worry about updated data
‣   Refresh entire dataset
‣   Download updates, tombstone old versions of data, ensure jobs
    only run over current versions of data, occasionally rewrite full dataset
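The third approach (apply updates, tombstone old versions, read only current versions) can be sketched as a merge step. The record layout `(key, version, value)` and the use of `None` as a tombstone are assumptions for this example, not a format described in the talk.

```python
def current_view(base, updates):
    """Merge a base snapshot with later updates, honouring tombstones.

    Each record is (key, version, value); value None marks a tombstone.
    """
    latest = {}
    for key, version, value in list(base) + list(updates):
        # Keep only the highest version seen for each key.
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, value)
    # Drop keys whose newest version is a tombstone.
    return {k: v for k, (_, v) in latest.items() if v is not None}


base = [("user:1", 1, "alice"), ("user:2", 1, "bob")]
updates = [("user:1", 2, "alice_w"), ("user:2", 2, None)]
view = current_view(base, updates)
```

Periodically rewriting the full dataset corresponds to persisting `view` as the new base and discarding the accumulated updates.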
HBase
‣   Has already solved the update problem
‣     Bonus: low-latency query API
‣     Bonus: rich, BigTable-style data model based on column families
HBase at Twitter
‣   Crane loads data directly into HBase
‣     One column family (CF) for protobuf bytes, one CF to denormalize
      columns for indexing or quicker batch access
‣     Updates are processed transparently, so we always have accurate
      data in HBase
‣   Pig Loader for HBase in Elephant Bird makes integration with
    existing analyses easy
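The two-column-family layout can be pictured with nested dictionaries. This is only a model of the row shape: the family names `raw` and `idx`, the row key format, and the user fields are hypothetical, and JSON stands in for the protobuf bytes.

```python
import json


def to_row(record):
    """Build a row with two column families, modelled as nested dicts."""
    return {
        # 'raw' family: the whole record as opaque bytes
        # (JSON stands in for protobuf here).
        "raw": {"bytes": json.dumps(record, sort_keys=True).encode("utf-8")},
        # 'idx' family: selected fields copied out for direct access.
        "idx": {"screen_name": record["screen_name"], "lang": record["lang"]},
    }


table = {}
user = {"id": 42, "screen_name": "example", "lang": "en"}
table[f"user:{user['id']}"] = to_row(user)

# Batch-style scan that reads only the denormalized family.
langs = [row["idx"]["lang"] for row in table.values()]

# Full record recovered from the raw family when needed.
restored = json.loads(table["user:42"]["raw"]["bytes"])
```

Scans that need one or two fields avoid deserializing the full record, while the raw family keeps the complete message available.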
The Twitter Data Lifecycle
‣   Data Input
‣   Data Storage
‣   Data Analysis: Pig, Oink
‣   Data Products



Enter Pig

‣   High-level language
‣   Transformations on sets of records
‣   Process data one step at a time
‣   UDFs are first-class citizens
‣   Easier than SQL?
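The "one step at a time" style can be mimicked in plain Python, with each statement producing a new named set of records. The data and field names are made up; the comments name the Pig Latin operators that each step loosely corresponds to.

```python
from collections import defaultdict

# Hypothetical records: (user, client)
records = [("ann", "web"), ("bob", "mobile"), ("ann", "mobile"), ("cy", "web")]

# Step 1: keep only mobile records (like Pig's FILTER ... BY).
mobile = [r for r in records if r[1] == "mobile"]

# Step 2: group by user (like GROUP ... BY).
grouped = defaultdict(list)
for user, client in mobile:
    grouped[user].append(client)

# Step 3: count records per group (like FOREACH ... GENERATE COUNT(...)).
counts = {user: len(clients) for user, clients in grouped.items()}
```

Each intermediate result has a name and can be inspected on its own, which is the property the slide contrasts with a single nested SQL statement.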
Why Pig?
‣   Because I bet you can read the following script.
A Real Pig Script




‣   Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
No, seriously.
Pig Democratizes Large-scale Data Analysis
‣   The Pig version is:
‣     5% of the code
‣     5% of the time
‣     Within 30% of the execution time.
‣   Innovation increasingly driven from large-scale data analysis
‣     Need fast iteration to understand the right questions
‣     More minds contributing = more value from your data
Pig Examples
‣   Using the HBase Loader




‣   Using the protobuf loaders
Pig Workflow
‣   Oink: framework around Pig for loading, combining, running,
    post-processing
‣   	   Everyone I know has one of these
‣   	   Points to an opening for innovation; discussion beginning
‣   Something we’re looking at: Ruby DSL for Pig, Piglet¹

1 http://github.com/ningliang/piglet
Counting Big Data
‣   standard counts, min, max, std dev
‣   How many requests do we serve in a day?
‣   What is the average latency? 95% latency?
‣   Group by response code. What is the hourly distribution?
‣   How many searches happen each day on Twitter?
‣   How many unique queries, how many unique users?
‣   What is their geographic distribution?
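On a small scale, the counting questions above reduce to a few aggregations. The request log below is invented, and the percentile uses the nearest-rank definition, which is one of several in common use.

```python
import math
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request log: (hour, response_code, latency_ms)
requests = [
    (0, 200, 12), (0, 200, 15), (0, 500, 250),
    (1, 200, 11), (1, 404, 9),
]

total = len(requests)                                # requests served
latencies = [lat for _, _, lat in requests]
average_latency = sum(latencies) / total             # mean latency
p95_latency = percentile(latencies, 95)              # 95% latency
by_code = Counter(code for _, code, _ in requests)   # group by response code
by_hour = Counter(hour for hour, _, _ in requests)   # hourly distribution
```

At Twitter's volume the same aggregations would run as distributed jobs rather than in memory, but the shape of the computation is the same.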
Correlating Big Data
‣   probabilities, covariance, influence
‣   How does usage differ for mobile users?
‣   How about for users with 3rd party desktop clients?
‣   Cohort analyses
‣   Site problems: what goes wrong at the same time?
‣   Which features get users hooked?
‣   Which features do successful users use often?
‣   Search corrections, search suggestions
‣   A/B testing
Research on Big Data
‣   prediction, graph analysis, natural language
‣   What can we tell about a user from their tweets?
‣     From the tweets of those they follow?
‣     From the tweets of their followers?
‣     From the ratio of followers/following?
‣   What graph structures lead to successful networks?
‣   User reputation
Research on Big Data
‣   prediction, graph analysis, natural language
‣   Sentiment analysis
‣   What features get a tweet retweeted?
‣     How deep is the corresponding retweet tree?
‣   Long-term duplicate detection
‣   Machine learning
‣   Language detection
‣   ... the list goes on.
The Twitter Data Lifecycle
‣   Data Input
‣   Data Storage
‣   Data Analysis
‣   Data Products: Birdbrain



Data Products
‣   Ad Hoc Analyses
‣     Answer questions to keep the business agile, do research
‣   Online Products
‣     Name search, other upcoming products
‣   Company Dashboard
‣     Birdbrain
Questions? Follow me at twitter.com/kevinweil

‣   P.S. We’re hiring. Help us build the next step: realtime big data analytics.

Hadoop at Twitter (Hadoop Summit 2010)

  • 1. Hadoop at Twitter Kevin Weil -- @kevinweil Analytics Lead, Twitter TM
  • 2. The Twitter Data Lifecycle ‣ Data Input ‣ Data Storage ‣ Data Analysis ‣ Data Products
  • 3. The Twitter Data Lifecycle ‣ Data Input: Scribe, Crane ‣ Data Storage: Elephant Bird, HBase ‣ Data Analysis: Pig, Oink ‣ Data Products: Birdbrain 1 Community Open Source 2 Twitter Open Source (or soon)
  • 4. My Background ‣ Studied Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, (soon) PBs of data
  • 5. The Twitter Data Lifecycle ‣ Data Input: Scribe, Crane ‣ Data Storage ‣ Data Analysis ‣ Data Products 1 Community Open Source 2 Twitter Open Source
  • 6. What Data? ‣ Two main kinds of raw data ‣ Logs ‣ Tabular data
  • 8. Logs ‣ Started with syslog-ng ‣ As our volume grew, it didn’t scale ‣ Resources overwhelmed ‣ Lost data
  • 9. Scribe ‣ Scribe daemon runs locally; reliable in network outage ‣ Nodes only know downstream writer; hierarchical, scalable ‣ Pluggable outputs, per category [Diagram: three FE nodes, two Agg nodes, File and HDFS outputs]
  • 10. Scribe at Twitter ‣ Solved our problem, opened new vistas ‣ Currently 57 different categories logged from multiple sources ‣ FE: Javascript, Ruby on Rails ‣ Middle tier: Ruby on Rails, Scala ‣ Backend: Scala, Java, C++ ‣ 7 TB/day into HDFS ‣ Log first, ask questions later.
  • 11. Scribe at Twitter ‣ We’ve contributed to it as we’ve used it [1] ‣ Improved logging, monitoring, writing to HDFS, compression ‣ Added ZooKeeper-based config ‣ Continuing to work with FB on patches ‣ Also: working with Cloudera to evaluate Flume [1] http://github.com/traviscrawford/scribe
  • 12. Tabular Data ‣ Most site data is in MySQL ‣ Tweets, users, devices, client applications, etc ‣ Need to move it between MySQL and HDFS ‣ Also between MySQL and HBase, or MySQL and MySQL ‣ Crane: configuration driven ETL tool
  • 13. Crane [Architecture diagram; labels: Driver, Configuration/Batch Management, ZooKeeper Registration, Source, Extract, Protobuf P1, Transform, Protobuf P2, Load, Sink]
  • 17. Crane ‣ Extract ‣ MySQL, HDFS, HBase, Flock, GA, Facebook Insights ‣ Transform ‣ IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic ‣ Load ‣ MySQL, Local file, Stdout, HDFS, HBase ‣ ZooKeeper coordination, intelligent date management ‣ Run all the time from multiple servers, self healing
  • 18. The Twitter Data Lifecycle ‣ Data Input ‣ Data Storage: Elephant Bird, HBase ‣ Data Analysis ‣ Data Products 1 Community Open Source 2 Twitter Open Source
  • 19. Storage Basics ‣ Incoming data: 7 TB/day ‣ LZO encode everything ‣ Save 3-4x on storage, pay little CPU ‣ Splittable! [1] ‣ IO-bound jobs ==> 3-4x perf increase [1] http://www.github.com/kevinweil/hadoop-lzo
  • 25. Elephant Bird ‣ We have data coming in as protocol buffers via Crane... ‣ Protobufs: codegen for efficient ser-de of data structures ‣ Why shouldn’t we just continue, and codegen more glue? ‣ InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig HBaseLoaders ‣ Also now does part of this with Thrift, soon Avro ‣ And JSON, W3C Logs
  • 27. Challenge: Mutable Data ‣ HDFS is write-once: no seek on write, no append (yet) ‣ Logs are easy. ‣ But our tables change. ‣ Handling rapidly changing data in HDFS: not trivial. ‣ Don’t worry about updated data ‣ Refresh entire dataset ‣ Download updates, tombstone old versions of data, ensure jobs only run over current versions of data, occasionally rewrite full dataset
  • 28. Challenge: Mutable Data ‣ HDFS is write-once: no seek on write, no append (yet) ‣ Logs are easy. ‣ But our tables change. ‣ Handling changing data in HDFS: not trivial.
  • 29. HBase ‣ Has already solved the update problem ‣ Bonus: low-latency query API ‣ Bonus: rich, BigTable-style data model based on column families
  • 30. HBase at Twitter ‣ Crane loads data directly into HBase ‣ One CF for protobuf bytes, one CF to denormalize columns for indexing or quicker batch access ‣ Processing updates transparent, so we always have accurate data in HBase ‣ Pig Loader for HBase in Elephant Bird makes integration with existing analyses easy
  • 32. The Twitter Data Lifecycle ‣ Data Input ‣ Data Storage ‣ Data Analysis: Pig, Oink ‣ Data Products 1 Community Open Source 2 Twitter Open Source
  • 33. Enter Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ UDFs are first-class citizens ‣ Easier than SQL?
  • 34. Why Pig? ‣ Because I bet you can read the following script.
  • 35. A Real Pig Script ‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
  • 37. Pig Democratizes Large-scale Data Analysis ‣ The Pig version is: ‣ 5% of the code ‣ 5% of the time ‣ Within 30% of the execution time. ‣ Innovation increasingly driven from large-scale data analysis ‣ Need fast iteration to understand the right questions ‣ More minds contributing = more value from your data
  • 38. Pig Examples ‣ Using the HBase Loader ‣ Using the protobuf loaders
  • 39. Pig Workflow ‣ Oink: framework around Pig for loading, combining, running, post-processing ‣ Everyone I know has one of these ‣ Points to an opening for innovation; discussion beginning ‣ Something we’re looking at: Ruby DSL for Pig, Piglet [1] [1] http://github.com/ningliang/piglet
  • 40. Counting Big Data ‣ standard counts, min, max, std dev ‣ How many requests do we serve in a day? ‣ What is the average latency? 95% latency? ‣ Group by response code. What is the hourly distribution? ‣ How many searches happen each day on Twitter? ‣ How many unique queries, how many unique users? ‣ What is their geographic distribution?
  • 41. Correlating Big Data ‣ probabilities, covariance, influence ‣ How does usage differ for mobile users? ‣ How about for users with 3rd party desktop clients? ‣ Cohort analyses ‣ Site problems: what goes wrong at the same time? ‣ Which features get users hooked? ‣ Which features do successful users use often? ‣ Search corrections, search suggestions ‣ A/B testing
  • 42. Research on Big Data ‣ prediction, graph analysis, natural language ‣ What can we tell about a user from their tweets? ‣ From the tweets of those they follow? ‣ From the tweets of their followers? ‣ From the ratio of followers/following? ‣ What graph structures lead to successful networks? ‣ User reputation
  • 43. Research on Big Data ‣ prediction, graph analysis, natural language ‣ Sentiment analysis ‣ What features get a tweet retweeted? ‣ How deep is the corresponding retweet tree? ‣ Long-term duplicate detection ‣ Machine learning ‣ Language detection ‣ ... the list goes on.
  • 44. The Twitter Data Lifecycle ‣ Data Input ‣ Data Storage ‣ Data Analysis ‣ Data Products: Birdbrain 1 Community Open Source 2 Twitter Open Source
  • 45. Data Products ‣ Ad Hoc Analyses ‣ Answer questions to keep the business agile, do research ‣ Online Products ‣ Name search, other upcoming products ‣ Company Dashboard ‣ Birdbrain
  • 46. Questions? Follow me at twitter.com/kevinweil ‣ P.S. We’re hiring. Help us build the next step: realtime big data analytics.
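
The Crane slides describe a configuration-driven Extract / Transform / Load flow. The sketch below is not Crane's code, which the slides do not show. It is a minimal Python illustration of the same three stages, assuming an in-memory list stands in for the MySQL source and a plain callable stands in for the sink. All function names (`extract`, `canonicalize_date`, `clean`, `run_pipeline`) are hypothetical.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


def extract(rows: Iterable[Record]) -> Iterator[Record]:
    """Hypothetical source: yields copies of in-memory rows instead of querying MySQL."""
    for row in rows:
        yield dict(row)


def canonicalize_date(record: Record) -> Record:
    """Transform: convert a Unix timestamp into an ISO 8601 UTC string."""
    ts = record["created_at"]
    record["created_at"] = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
    return record


def clean(record: Record) -> Record:
    """Transform: strip surrounding whitespace from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def run_pipeline(
    source: Iterable[Record],
    transforms: List[Callable[[Record], Record]],
    sink: Callable[[Record], None],
) -> int:
    """Run each extracted record through the transforms, hand it to the sink, return the count."""
    count = 0
    for record in extract(source):
        for transform in transforms:
            record = transform(record)
        sink(record)
        count += 1
    return count


if __name__ == "__main__":
    rows = [{"id": 1, "text": "  hello ", "created_at": 0}]
    loaded: List[Record] = []
    print(run_pipeline(rows, [canonicalize_date, clean], loaded.append), loaded)
```

Keeping the source, transforms, and sink as interchangeable arguments mirrors the "configuration driven" idea: a different job is a different combination, not different pipeline code.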
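
The Storage Basics slide gives 7 TB/day of incoming data and a 3-4x saving from LZO. A quick back-of-the-envelope calculation shows what that implies for stored volume, assuming the 7 TB/day figure refers to uncompressed data (the slide does not say) and ignoring HDFS replication. The helper name is hypothetical.

```python
def compressed_tb_per_day(raw_tb_per_day: float, ratio: float) -> float:
    """Stored volume per day for a given raw volume and compression ratio."""
    return raw_tb_per_day / ratio


RAW_TB_PER_DAY = 7.0  # figure from the slide, assumed to be uncompressed

for ratio in (3.0, 4.0):
    stored = compressed_tb_per_day(RAW_TB_PER_DAY, ratio)
    print(f"{ratio:.0f}x compression: {stored:.2f} TB/day stored")
```

The same ratio explains the performance claim: an IO-bound job that reads roughly a third or a quarter of the bytes spends correspondingly less time on disk reads, provided decompression stays cheap.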
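
The last bullet of the Mutable Data slide outlines one way to handle changing tables on a write-once store: append updates, tombstone superseded data, read only current versions, and occasionally rewrite the full dataset. The following is a rough Python sketch of that idea, not Twitter's implementation. It assumes each row carries a version number and that a `None` value marks a deletion; `Row`, `current_view`, and `compact` are hypothetical names.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class Row(NamedTuple):
    key: str
    version: int
    value: Optional[str]  # None is a tombstone: the key was deleted at this version


def current_view(rows: Iterable[Row]) -> Dict[str, str]:
    """Keep only the highest-version row per key, then drop tombstoned keys."""
    latest: Dict[str, Row] = {}
    for row in rows:
        seen = latest.get(row.key)
        if seen is None or row.version > seen.version:
            latest[row.key] = row
    return {k: r.value for k, r in latest.items() if r.value is not None}


def compact(rows: Iterable[Row], next_version: int) -> List[Row]:
    """Occasional full rewrite: emit one fresh row per live key, discarding history."""
    return [Row(k, next_version, v) for k, v in sorted(current_view(rows).items())]


if __name__ == "__main__":
    base = [Row("u1", 1, "alice"), Row("u2", 1, "bob")]
    updates = [Row("u1", 2, "alice_b"), Row("u2", 2, None)]
    print(current_view(base + updates))
    print(compact(base + updates, 3))
```

Nothing is modified in place: updates are appended as new rows, which fits a store with no seek on write, and the cost moves to read time until the next compaction.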
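
The Counting Big Data slide asks for request counts, average latency, 95% latency, and a breakdown by response code. At Twitter these ran as Pig jobs over HDFS. The sketch below only illustrates the arithmetic on a small in-memory list in Python. The record fields (`status`, `latency_ms`) and the nearest-rank percentile definition are assumptions made for the example.

```python
import math
from collections import Counter
from statistics import mean
from typing import Dict, List, Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile of empty data")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests: List[Dict[str, float]]) -> Dict[str, object]:
    """Count requests, average and 95th-percentile latency, and requests per status code."""
    latencies = [r["latency_ms"] for r in requests]
    return {
        "count": len(requests),
        "avg_latency_ms": mean(latencies),
        "p95_latency_ms": percentile(latencies, 95),
        "by_status": Counter(r["status"] for r in requests),
    }


if __name__ == "__main__":
    sample = [
        {"status": 500 if i == 20 else 200, "latency_ms": float(i)}
        for i in range(1, 21)
    ]
    print(summarize(sample))
```

An exact percentile needs the full sorted list, which is easy here and expensive at 7 TB/day. That is one reason these questions are run as batch jobs.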

Editor's notes