Real-time Analytics at Facebook:
Data Freeway and Puma


Zheng Shao
12/2/2011
Agenda
 1   Analytics and Real-time

 2   Data Freeway

 3   Puma

 4   Future Works
Analytics and Real-time
what and why
Facebook Insights
• Use cases
▪   Websites/Ads/Apps/Pages
▪   Time series
▪   Demographic break-downs
▪   Unique counts/heavy hitters

• Major challenges
▪   Scalability
▪   Latency
Analytics based on Hadoop/Hive
[Pipeline diagram: HTTP → Scribe (seconds) → NFS (seconds) → Hourly Copier/Loader → Hive/Hadoop → Daily Pipeline Jobs → MySQL]

• 3000-node Hadoop cluster

• Copier/Loader: Map-Reduce hides machine failures

• Pipeline Jobs: Hive allows SQL-like syntax

• Good scalability, but poor latency! 24 – 48 hours.
How to Get Lower Latency?




• Small-batch Processing
▪   Run Map-reduce/Hive every hour, every 15 min, every 5 min, …
▪   How do we reduce per-batch overhead?

• Stream Processing
▪   Aggregate the data as soon as it arrives
▪   How to solve the reliability problem?
Decisions
• Stream Processing wins!



• Data Freeway
▪   Scalable Data Stream Framework

• Puma
▪   Reliable Stream Aggregation Engine
Data Freeway
scalable data stream
Scribe

[Diagram: Scribe Clients → Scribe Mid-Tier → Scribe Writers → NFS; NFS → Batch Copier → HDFS; NFS → tail/fopen → Log Consumer]

• Simple push/RPC-based logging system


• Open-sourced in 2008. 100 log categories at that time.

• Routing driven by static configuration.
Data Freeway
[Diagram: Scribe Clients → Calligraphus Mid-tier → Calligraphus Writers → HDFS → PTail → Log Consumer; Continuous Copier: HDFS → HDFS → PTail (in the plan); Zookeeper coordinates the Calligraphus tier]

• 9GB/sec at peak, 10 sec latency, 2500 log categories
Calligraphus
• RPC → File System
▪   Each log category is represented by 1 or more FS directories
▪   Each directory is an ordered list of files

• Bucketing support
▪   Application buckets are application-defined shards.
▪   Infrastructure buckets allow log streams from x B/s to x GB/s

• Performance
▪   Latency: Call sync every 7 seconds
▪   Throughput: Easily saturate 1Gbit NIC
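
To make the directory/file convention concrete, here is a minimal Java sketch of the idea: one directory per (category, infrastructure bucket), with date-stamped files that sort in order. The path layout and naming scheme are illustrative assumptions, not Calligraphus's actual on-disk format.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Illustrative sketch only: maps a (category, infrastructure bucket) pair to a
// directory and a date-stamped file name, mirroring "each directory is an ordered
// list of files". The path layout and naming scheme are hypothetical.
public class CategoryLayout {
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd_HHmmss");

    // One infrastructure bucket maps to one directory.
    static String directoryFor(String category, int infraBucket) {
        return String.format("/data_freeway/%s/bucket-%03d", category, infraBucket);
    }

    // File names sort lexicographically by timestamp, so consumers can tail them in order.
    static String nextFileName(String category, LocalDateTime now) {
        return category + "-" + now.format(TS);
    }

    public static void main(String[] args) {
        System.out.println(directoryFor("ad_impressions", 7));
        System.out.println(nextFileName("ad_impressions", LocalDateTime.now()));
    }
}
```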
Continuous Copier
• File System → File System

• Low latency and smooth network usage

• Deployment
▪   Implemented as long-running map-only job
▪   Can move to any simple job scheduler

• Coordination
▪   Use lock files on HDFS for now
▪   Plan to move to Zookeeper
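
A minimal sketch of the lock-file coordination idea, assuming atomic create-if-absent semantics (which HDFS file creation provides); the local filesystem stands in for HDFS here, and the names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the lock-file coordination idea (assumption: the real copier does this
// against HDFS; the local filesystem stands in here). Atomic create-if-absent means
// at most one worker claims a given source file at a time.
public class LockFileCoordinator {
    private final Path lockDir;

    LockFileCoordinator(Path lockDir) { this.lockDir = lockDir; }

    // Returns true if this worker acquired the copy task for the given file.
    boolean tryClaim(String sourceFileName) {
        try {
            Files.createFile(lockDir.resolve(sourceFileName + ".lock"));  // fails if it exists
            return true;
        } catch (IOException alreadyClaimed) {
            return false;                 // another worker is copying this file
        }
    }

    void release(String sourceFileName) throws IOException {
        Files.deleteIfExists(lockDir.resolve(sourceFileName + ".lock"));
    }

    public static void main(String[] args) throws IOException {
        LockFileCoordinator c = new LockFileCoordinator(Files.createTempDirectory("copier-locks"));
        System.out.println(c.tryClaim("ad_impressions-2011-12-02_000001"));  // true
        System.out.println(c.tryClaim("ad_impressions-2011-12-02_000001"));  // false
        c.release("ad_impressions-2011-12-02_000001");
    }
}
```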
PTail
[Diagram: PTail tails the ordered files of multiple directories; checkpoints record the position in each]

  • File System → Stream (→ RPC)

  • Reliability
  ▪   Checkpoints inserted into the data stream
  ▪   Can roll back to tail from any data checkpoints
  ▪   No data loss/duplicates
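
A minimal sketch of what a PTail-style checkpoint could look like: the current file and byte offset per tailed directory, serialized so it can be inserted into the data stream. The structure and encoding are illustrative, not PTail's actual format.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the PTail checkpoint idea: for each tailed directory, remember which
// file we are in and the byte offset inside it. Re-emitting the checkpoint into
// the output stream lets a downstream consumer roll back to exactly this point.
public class PTailCheckpoint {
    static final class Position {
        final String file;
        final long offset;
        Position(String file, long offset) { this.file = file; this.offset = offset; }
        @Override public String toString() { return file + "@" + offset; }
    }

    private final Map<String, Position> positions = new HashMap<>();

    void advance(String directory, String file, long offset) {
        positions.put(directory, new Position(file, offset));
    }

    // Serialize the checkpoint so it can be inserted into the data stream.
    String encode() {
        StringBuilder sb = new StringBuilder("CHECKPOINT");
        positions.forEach((dir, pos) -> sb.append(' ').append(dir).append('=').append(pos));
        return sb.toString();
    }

    public static void main(String[] args) {
        PTailCheckpoint cp = new PTailCheckpoint();
        cp.advance("/data_freeway/ad_impressions/bucket-007",
                   "ad_impressions-2011-12-02_000001", 1_048_576);
        System.out.println(cp.encode());
    }
}
```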
Channel Comparison
             Push / RPC   Pull / FS
Latency      1-2 sec      10 sec
Loss/Dups    Few          None
Robustness   Low          High
Complexity   Low          High

[Diagram: the four components convert between the two channels: Scribe (RPC → RPC), Calligraphus (RPC → FS), Continuous Copier (FS → FS), PTail + ScribeSend (FS → RPC)]
Puma
real-time aggregation/storage
Overview


[Diagram: Log Stream → Aggregations → Storage → Serving]

• ~ 1M log lines per second, but light read

• Multiple Group-By operations per log line

• The first key in Group By is always time/date-related

• Complex aggregations: Unique user count, most frequent
  elements
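
A small sketch of the fan-out described above: each log line produces several group-by keys, each led by a time bucket. Field names and key encoding are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: one log line fans out to several group-by keys, and the first component
// of every key is a time bucket (hour here), matching the description above.
// Field names ("adid", "gender", "age") are illustrative.
public class GroupByKeys {
    static List<String> keysFor(long epochSeconds, String adid, String gender, int age) {
        long hourBucket = epochSeconds / 3600;              // time is always the leading key
        return Arrays.asList(
                hourBucket + "|" + adid,                    // time series per ad
                hourBucket + "|" + adid + "|" + gender,     // demographic break-down
                hourBucket + "|" + adid + "|" + age);       // demographic break-down
    }

    public static void main(String[] args) {
        keysFor(1322870400L, "ad42", "f", 27).forEach(System.out::println);
    }
}
```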
MySQL and HBase: one page
                   MySQL                       HBase
Parallel           Manual sharding             Automatic load balancing
Fail-over          Manual master/slave switch  Automatic
Read efficiency    High                        Low
Write efficiency   Medium                      High
Columnar support   No                          Yes
Puma2 Architecture




[Diagram: PTail → Puma2 → HBase → Serving]

• PTail provides parallel data streams

• For each log line, Puma2 issues “increment” operations to
  HBase. Puma2 is symmetric (no sharding).

• HBase: single increment on multiple columns
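
A sketch of the Puma2 write path under these assumptions; HBaseStore is a hypothetical stand-in for the real HBase client, and the column names are invented. The point is that every log line becomes one multi-column increment call.

```java
import java.util.Map;

// Sketch of the Puma2 write path. HBaseStore is a hypothetical stand-in for the
// real HBase client; every log line turns into one "increment several columns of
// one row" call, which is why increments dominate the cost.
public class Puma2Writer {
    interface HBaseStore {
        // Atomically add the given deltas to the named columns of one row.
        void incrementColumns(String rowKey, Map<String, Long> columnDeltas);
    }

    private final HBaseStore store;
    Puma2Writer(HBaseStore store) { this.store = store; }

    void processLogLine(long epochSeconds, String adid, String gender) {
        long hourBucket = epochSeconds / 3600;
        String rowKey = hourBucket + "|" + adid;        // first key is time-related
        store.incrementColumns(rowKey, Map.of(
                "impressions", 1L,
                "impressions_" + gender, 1L));          // several group-bys, one call
    }

    public static void main(String[] args) {
        Puma2Writer w = new Puma2Writer((row, deltas) ->
                System.out.println("increment " + row + " -> " + deltas));
        w.processLogLine(1322870400L, "ad42", "f");
    }
}
```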
Puma2: Pros and Cons
• Pros
▪   Puma2 code is very simple.
▪   Puma2 service is very easy to maintain.

• Cons
▪   “Increment” operation is expensive.
▪   Does not support complex aggregations.
▪   Hacky implementation of “most frequent elements”.
▪   Can cause small data duplicates.
Improvements in Puma2
• Puma2
▪   Batching of requests. Didn't work well because of long-tail distribution.

• HBase
▪   “Increment” operation optimized by reducing locks.
▪   HBase region/HDFS file locality; short-circuited read.
▪   Reliability improvements under high load.

• Still not good enough!
Puma3 Architecture



[Diagram: PTail → Puma3 → HBase; Serving]

• Puma3 is sharded by aggregation key.

• Each shard is a hashmap in memory.

• Each entry in the hashmap is a pair of an aggregation key and a user-defined aggregation.

• HBase as persistent key-value storage.
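
A minimal sketch of one Puma3 shard as described above: an in-memory hashmap from aggregation key to a user-defined aggregation object. The Aggregation interface and the count implementation are illustrative, not Puma's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of one Puma3 shard: an in-memory hashmap from aggregation key to a
// user-defined aggregation. The Aggregation interface and the count implementation
// are illustrative; the real engine supports arbitrary user-defined aggregations.
public class Puma3Shard {
    interface Aggregation {
        void add(String value);   // fold one extracted value into the aggregate
        long result();
    }

    static final class CountAggregation implements Aggregation {
        private long count;
        public void add(String value) { count++; }
        public long result() { return count; }
    }

    final Map<String, Aggregation> entries = new HashMap<>();

    Aggregation getOrCreate(String key) {
        return entries.computeIfAbsent(key, k -> new CountAggregation());
    }

    public static void main(String[] args) {
        Puma3Shard shard = new Puma3Shard();
        shard.getOrCreate("367464|ad42").add("user7");
        shard.getOrCreate("367464|ad42").add("user9");
        System.out.println(shard.entries.get("367464|ad42").result());   // 2
    }
}
```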
Puma3 Architecture



[Diagram: PTail → Puma3 → HBase; Serving]

• Write workflow
▪   For each log line, extract the columns for key and value.
▪   Look up in the hashmap and call the user-defined aggregation.
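
A sketch of this write path, with a plain count standing in for the user-defined aggregation and made-up column positions.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the Puma3 write workflow: extract the key and value columns from a
// log line, look the key up in the in-memory map, and fold the value in. A plain
// count stands in for the user-defined aggregation; column positions are made up.
public class Puma3WritePath {
    private final Map<String, Long> counts = new HashMap<>();

    void processLogLine(String logLine) {
        String[] cols = logLine.split("\t");
        long hourBucket = Long.parseLong(cols[0]) / 3600;   // time-leading key
        String key = hourBucket + "|" + cols[1];
        counts.merge(key, 1L, Long::sum);                   // "call user-defined aggregation"
    }

    public static void main(String[] args) {
        Puma3WritePath p = new Puma3WritePath();
        p.processLogLine("1322870400\tad42\tuser7");
        p.processLogLine("1322870401\tad42\tuser9");
        System.out.println(p.counts);   // {367464|ad42=2}
    }
}
```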
Puma3 Architecture



[Diagram: PTail → Puma3 → HBase; Serving]

• Checkpoint workflow
▪   Every 5 min, save modified hashmap entries and the PTail checkpoint to HBase
▪   On startup (after node failure), load from HBase
▪   Get rid of items in memory once the time window has passed
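
A sketch of the checkpoint step under these assumptions; DurableStore is a hypothetical stand-in for HBase. Only modified entries are persisted, the PTail checkpoint is saved alongside them, and closed time windows are evicted from memory.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the Puma3 checkpoint workflow. DurableStore is a hypothetical stand-in
// for HBase: periodically, the modified in-memory entries and the current PTail
// checkpoint are persisted, then entries whose time window has closed are evicted.
public class Puma3Checkpointer {
    interface DurableStore {
        void put(String key, long value);
        void putPTailCheckpoint(String checkpoint);
    }

    final Map<String, Long> counts = new HashMap<>();   // aggregation state
    private final Set<String> dirty = new HashSet<>();  // keys modified since last checkpoint

    void increment(String key) {
        counts.merge(key, 1L, Long::sum);
        dirty.add(key);
    }

    void checkpoint(DurableStore store, String ptailCheckpoint, long currentHourBucket) {
        for (String key : dirty) {                      // save only modified entries
            store.put(key, counts.get(key));
        }
        store.putPTailCheckpoint(ptailCheckpoint);      // saved together with the data
        dirty.clear();
        // Evict closed time windows: the leading key component is the hour bucket.
        counts.keySet().removeIf(k ->
                Long.parseLong(k.substring(0, k.indexOf('|'))) < currentHourBucket);
    }

    public static void main(String[] args) {
        Puma3Checkpointer c = new Puma3Checkpointer();
        c.increment("367464|ad42");
        c.checkpoint(new DurableStore() {
            public void put(String k, long v) { System.out.println("put " + k + " = " + v); }
            public void putPTailCheckpoint(String cp) { System.out.println("put ptail_checkpoint = " + cp); }
        }, "CHECKPOINT file@offset", 367465);
        System.out.println(c.counts);   // {} : the closed hour bucket was evicted
    }
}
```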
Puma3 Architecture



[Diagram: PTail → Puma3 → HBase; Serving]

• Read workflow
▪   Read uncommitted: directly serve from the in-memory hashmap; load from HBase on miss.
▪   Read committed: read from HBase and serve.
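
A sketch of the two read modes; the lookup function stands in for an HBase get.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of the two Puma3 read modes. The lookup function stands in for an HBase
// get; values served from memory are "uncommitted" because they may not have been
// checkpointed yet.
public class Puma3Reader {
    private final Map<String, Long> inMemory = new HashMap<>();
    private final Function<String, Long> hbaseGet;     // stand-in for HBase

    Puma3Reader(Function<String, Long> hbaseGet) { this.hbaseGet = hbaseGet; }

    long readUncommitted(String key) {
        Long v = inMemory.get(key);                    // serve from memory first
        return v != null ? v : hbaseGet.apply(key);    // miss: the time window has passed
    }

    long readCommitted(String key) {
        return hbaseGet.apply(key);                    // only checkpointed values
    }

    public static void main(String[] args) {
        Puma3Reader r = new Puma3Reader(key -> 0L);    // pretend HBase has nothing yet
        r.inMemory.put("367464|ad42", 2L);
        System.out.println(r.readUncommitted("367464|ad42"));   // 2, from memory
        System.out.println(r.readCommitted("367464|ad42"));     // 0, not checkpointed yet
    }
}
```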
Puma3 Architecture



[Diagram: PTail → Puma3 → HBase; Serving]

• Join
▪   Static join table in HBase.
▪   Distributed hash lookup in a user-defined function (udf).
▪   Local cache improves the throughput of the udf a lot.
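
A sketch of the cached lookup-join UDF: a small LRU cache in front of the remote dimension-table lookup (the lookup function stands in for the static HBase table).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of the lookup-join UDF: a bounded LRU cache in front of the remote
// dimension-table lookup (the remote lookup stands in for the static HBase table).
public class CachedLookupJoin {
    private final Function<String, String> remoteLookup;
    private final Map<String, String> cache;

    CachedLookupJoin(Function<String, String> remoteLookup, int cacheSize) {
        this.remoteLookup = remoteLookup;
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {   // access-order LRU
            @Override protected boolean removeEldestEntry(Map.Entry<String, String> e) {
                return size() > cacheSize;
            }
        };
    }

    // Hits the local cache first, only goes remote on a miss.
    String lookup(String joinKey) {
        return cache.computeIfAbsent(joinKey, remoteLookup);
    }

    public static void main(String[] args) {
        CachedLookupJoin j = new CachedLookupJoin(adid -> "campaign_for_" + adid, 1000);
        System.out.println(j.lookup("ad42"));   // remote fetch, then cached
        System.out.println(j.lookup("ad42"));   // served from the local cache
    }
}
```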
Puma2 / Puma3 comparison
• Puma3 is much better in write throughput
▪   Use 25% of the boxes to handle the same load.
▪   HBase is really good at write throughput.

• Puma3 needs a lot of memory
▪   Use 60GB of memory per box for the hashmap
▪   SSD can scale to 10x per box.
Puma3 Special Aggregations
• Unique Counts Calculation
▪   Adaptive sampling
▪   Bloom filter (in the plan)

• Most frequent item (in the plan)
▪   Lossy counting
▪   Probabilistic lossy counting
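
Puma's own implementation of these was still planned at the time, but the lossy counting algorithm referenced here is standard; a minimal sketch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the classic lossy counting algorithm (Manku & Motwani) that
// the slide refers to for "most frequent elements". Not Puma's implementation.
public class LossyCounter {
    private static final class Entry { long count; long delta; }

    private final Map<String, Entry> table = new HashMap<>();
    private final long bucketWidth;   // w = ceil(1 / epsilon)
    private final double epsilon;
    private long n;                   // items seen so far

    LossyCounter(double epsilon) {
        this.epsilon = epsilon;
        this.bucketWidth = (long) Math.ceil(1.0 / epsilon);
    }

    void add(String item) {
        n++;
        long bucket = (long) Math.ceil((double) n / bucketWidth);
        Entry e = table.get(item);
        if (e != null) {
            e.count++;
        } else {
            e = new Entry();
            e.count = 1;
            e.delta = bucket - 1;      // maximum possible undercount
            table.put(item, e);
        }
        if (n % bucketWidth == 0) {    // prune at bucket boundaries
            table.values().removeIf(x -> x.count + x.delta <= bucket);
        }
    }

    // Items whose estimated frequency is at least `support` (fraction of the stream).
    List<String> frequentItems(double support) {
        List<String> out = new ArrayList<>();
        table.forEach((item, e) -> {
            if (e.count >= (support - epsilon) * n) out.add(item);
        });
        return out;
    }

    public static void main(String[] args) {
        LossyCounter lc = new LossyCounter(0.001);
        for (int i = 0; i < 100_000; i++) lc.add(i % 10 == 0 ? "heavy" : "item" + i);
        System.out.println(lc.frequentItems(0.05));   // [heavy]
    }
}
```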
PQL – Puma Query Language
• CREATE INPUT TABLE t ('time', 'adid', 'userid');

• CREATE VIEW v AS
    SELECT *, udf.age(userid)
    FROM t
    WHERE udf.age(userid) > 21

• CREATE HBASE TABLE h …

• CREATE LOGICAL TABLE l …

• CREATE AGGREGATION 'abc'
    INSERT INTO l (a, b, c)
    SELECT
      udf.hour(time),
      adid,
      age,
      count(1),
      udf.count_distinct(userid)
    FROM v
    GROUP BY
      udf.hour(time),
      adid,
      age;
Future Works
challenges and opportunities
Future Works
• Scheduler Support
▪   Just need simple scheduling because the workload is continuous

• Mass adoption
▪   Migrate most daily reporting queries from Hive

• Open Source
▪   Biggest bottleneck: Java Thrift dependency
▪   Will come one by one
Similar Systems
• STREAM from Stanford

• Flume from Cloudera

• S4 from Yahoo

• Rainbird/Storm from Twitter

• Kafka from LinkedIn
Key differences
• Scalable Data Streams
▪   9 GB/sec with < 10 sec of latency
▪   Both Push/RPC-based and Pull/File System-based
▪   Components to support arbitrary combination of channels

• Reliable Stream Aggregations
▪   Good support for Time-based Group By, Table-Stream Lookup Join
▪   Query Language:    Puma : Realtime-MR = Hive : MR
▪   No support for sliding window, stream joins
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0

Editor's Notes

  1. Good morning everyone. My name's Zheng Shao. Today I am going to talk about real-time analytics at Facebook.
  2. This is the agenda of the talk. We will start with why we need realtime analytics, then get into details of how we implemented it, and finally future works and comparisons with other systems.
  3. First of all, what is realtime analytics and why we want to do it.
  4. This is the main use case for our analytics. We have a product called Facebook Insights, which allows website owners, advertisers, Facebook application developers, and Facebook page owners to view the time series of impression/click/action counters, the counters broken down by demographics like gender and age, as well as the unique user counters and heavy hitters like most popular URLs. The major challenges of building the backend of this Insights product are twofold. On one hand, we have a huge amount of data coming from both Facebook and non-Facebook websites. On the other hand, customers of the Insights product really want to have low-latency summaries, so that they can immediately know how popular a new article or a new game is.
  5. We did have an existing, complete data warehouse solution at Facebook to handle the Insights workload. In short, log streams got generated from HTTP servers and transferred to NFS via a log collection framework called Scribe, all within seconds, and then got copied/loaded into Hadoop. Summaries got generated from daily pipeline jobs and eventually got loaded into MySQL for serving. Specifically, we have a 3000-node Hadoop cluster to handle the scalability issue. Copier/Loader are map-reduce jobs which handle machine failures automatically. And Pipeline Jobs are written in Hive, which has a SQL-like syntax. Pretty good scalability, until we hit the data center power limit. But latency is terrible.
  6. We got 2 ideas on how to improve the latency. The first one is small-batch processing. Instead of using a batch of 1 day, we can produce much smaller batches. The question is how to reduce per-batch overhead, so that tiny batches like 1 min or less make sense. The second one is stream processing. We can aggregate the data as soon as it arrives. This will produce near-realtime results. The question is how to make the system reliable against hardware failures. It turns out the per-batch overhead of Map-Reduce is so high that it's not practical to have even 5-minute batches on our Hadoop cluster, so we finally decided to go with stream processing.
  7. The rest of the talk will focus on two key systems that we built for realtime analytics. The first one, Data Freeway, is a scalable data stream framework on top of Scribe and HDFS. The second one, Puma, is a reliable stream aggregation engine on top of HBase.
  8. This was our old data stream framework. It has several layers of data transportation. The first transport from clients to the mid-tier is to reduce the fanout from tens of thousands to hundreds; the second transport is to shuffle the data based on log categories, so that one log category goes to a single writer. Then log data gets written into NFS, which is consumed by the batch copier, as well as by unix tail/fopen. In short, it's a simple push/RPC-based logging system. Scribe was open-sourced in 2008, when we had 100 log categories. It quickly got adopted by a lot of other companies. The routing is driven by static configuration, which is flexible but has two problems: 1. it is not scalable, because we need to maintain a config for each box in the writers, and a single writer is not scalable; 2. single point of failure in the writers.
  9. We came up with Data Freeway in 2011. Right now it's handling 9GB/sec of data at peak with 10 sec end-to-end latency, and has over 2500 log categories. It contains 4 major components. The first one is Scribe. It's used only at the client, responsible for sending out data via RPCs. The second one is called Calligraphus. It utilizes Zookeeper to manage the ownership of categories, shuffles the data and writes to HDFS. The third one is called Continuous Copier, which continuously copies files from one HDFS to another, as the file grows. The fourth one is called PTail, which in parallel tails multiple directories on HDFS and writes out to stdout. Right now we directly ptail from the HDFS written by Calligraphus, but we plan to tail from the HDFS written by Continuous Copier in the future. Let's get into details of these components.
  10. Calligraphus is responsible for getting log data from RPC and writing it to a file system. Each log category is represented by 1 or more FS directories. Each directory is an ordered list of files, with the date in the file name. The files can be compressed. This is a very simple protocol for storing log data. Probably the simplest that I can think of. The most interesting feature of Calligraphus is the bucketing support. We have application buckets, which are application-defined shards. These are used for sharded log consumers. Most of the big log consumers are sharded because their log stream is too big. We also support infrastructure buckets, which allow a single application bucket to have a throughput from several bytes per second to several gigabytes per second. Each infrastructure bucket is a directory. So big streams can go to multiple directories at the same time. Calligraphus has pretty high performance. We call File System sync every 7 seconds, which is the major source of data latency right now. The network throughput can easily saturate a 1Gbit NIC, and we are planning to use 10Gbit NICs some time soon.
  11. Continuous Copier is for continuous data transfer from one file system to another. Compared with the batch-based map-reduce copier, it provides much lower latency as well as smooth network usage. Right now it's implemented as a long-running map-only job, but it can be easily moved to any simple job scheduling system other than map-reduce. Right now it uses lock files in HDFS for coordination among different nodes, and we plan to move to Zookeeper very soon. The peak throughput of Continuous Copier in production is about 3GB/sec compressed right now.
  12. The last component in Data Freeway is PTail, which transfers data from a file system to an output stream. The key feature of PTail is the checkpoint. A PTail checkpoint contains the current files and the file offsets in each of the directories. This makes it possible for PTail to roll back to an earlier checkpoint and reproduce the data stream without any data loss/duplicates at the boundary.
  13. To wrap up Data Freeway, we support 2 channels for data transfers. Push via RPC has lower latency, can potentially have some loss/dups when the network has a problem, is less robust with respect to machine failures, and has very low complexity in code. Pull via FS has a longer latency, but it does not have any loss/dups, and is robust to machine failures. The problem is that the code of the file system, especially HDFS, can be pretty complex, and we still need to identify and fix some bugs there. Data Freeway consists of 4 components that allow data transfer between these 2 channels.
  14. This is the simplified architecture of a typical stream aggregation engine. Log streams get aggregated on a set of machines. The summaries are usually saved to storage for persistence. Online serving gets summaries from either the aggregations directly or from the storage. Usually the write throughput is much higher than the read, because analytics data is only viewed by the owners of the website, for example. In our environment, we have on the order of 1M log lines per second. For each of the log lines, we need to do multiple group-by operations, like by age or by gender. The first key in group by is always time/date-related, which means the summaries will become static after some time. Also, we need to support complex aggregations like unique counts and heavy hitters.
  15. Let's look at our storage choices first. We considered using either MySQL or HBase as our storage engine. HBase is much easier to manage in a distributed environment, which was the major reason that we chose HBase. It also has better write efficiency as well as columnar support. The read efficiency is inferior because HBase's cache has less memory space efficiency.
  16. The first architecture that we came up with is called Puma2. We run Puma2 on a set of machines, and use PTail to provide parallel data streams. For each log line, Puma2 issues “increment” operations to HBase. Note that Puma2 servers are all symmetric, which means the same row in HBase can be incremented by multiple Puma2 servers at the same time. HBase can do a single increment operation on multiple columns of the same row, so we can use a single increment operation in HBase to handle multiple Group-By's. Puma2 went into production in March 2011 and is handling 600K log lines on 100 boxes (Puma2 + HBase).
  17. Here are the pros and cons of the Puma2 architecture. The good thing about Puma2 is that it is extremely simple and easy to maintain. The root reason is that Puma2 servers are symmetric and almost stateless. The only state is the PTail checkpoint that is saved to HBase periodically. As a result, we can easily add more boxes or reboot a box if the box goes down. However, Puma2 also has its problems. First of all, the HBase increment operation is expensive because it's a read-and-write, and the read is expensive. It's also not possible to support aggregations other than counts, because that would need a lot of customized code in HBase. We did a hacky implementation of “most frequent elements” by multiple layers of “frequent element” tables. Finally, Puma2 can have small data duplicates because “increments” and checkpoint writes are not in a single transaction.
  18. We made some small improvements to Puma2. On the Puma2 service, an obvious idea is to batch the increment requests to reduce the load on HBase. However, it didn't work well because of the long-tail distribution of Group-By keys. It also made data less accurate because we cannot save checkpoints in the middle of a batch. On the HBase side, we first optimized the “increment” operation by reducing the number of locks. Another big efficiency improvement came from the short-circuited read from HBase directly to HDFS block files on the disk, instead of via the DataNode daemon. We also improved the HBase reliability under high load. All in all, we were still not happy with Puma2, especially when we tried to support unique counters. So we switched to a new architecture called Puma3.
  19. The biggest difference between Puma2 and Puma3 is that in Puma3, we do aggregations in the memory of the Puma3 process instead of in HBase. Local memory operations are much faster, so we can achieve a much higher throughput. In order to do in-memory aggregations, we made Puma3 sharded by aggregation key. That means the input PTail data stream has to be sharded as well. That is supported by the application bucketing feature of Calligraphus. Each shard of Puma3 is basically a hashmap in memory. Each entry of the hashmap is a pair of an aggregation key and a user-defined aggregation, which can be count, sum, avg, or anything. We use HBase as persistent storage but usually don't read from it.
  20. The write workflow for Puma3 is pretty simple. Basically, for each log line, we extract the columns for key and value. We use the key to look up the in-memory hashmap, and call the user-defined aggregation with the value. Note that, since the log streams are sharded by aggregation key, the same aggregation key won't appear in more than one Puma3 process. This is the key to making Puma3 work.
  21. We checkpoint the state of the Puma3 process into HBase every 5 minutes. Basically, we save all the modified hashmap entries as well as the PTail checkpoint. That means if Puma3 crashes and restarts, it can load the state from HBase via a sequential read, which is pretty fast in HBase. In order to save memory, we also get rid of hashmap entries from memory once the time window for the aggregation has passed, because we are not going to receive new log lines for that time window again.
  22. There are 2 choices for the read workflow. If we want to read uncommitted aggregations, which usually have about 10 seconds of latency, we directly serve from the in-memory hashmap. We go to HBase only for a miss, which will only happen if the time window of the aggregation has passed. If we want to read committed data, Puma3 will read from HBase and serve. Note that an uncommitted aggregation result can decrease in value if the Puma3 process dies before making the next checkpoint. We plan to have a cache layer between serving and Puma3 to make sure numbers don't decrease.
  23. Puma3 also supports joining with a static table in HBase. The join key has to be the row key in the static HBase table. It's implemented as a simple distributed hash lookup in a user-defined function. We have found that a local cache improves the throughput of the udf a lot.
  24. Comparing Puma2 and Puma3, we found that Puma3 is much better in write throughput. We only need to use 25% of the boxes to handle the same workload. The main reason is that HBase is really good at write throughput. At the same time, Puma3 needs a lot of memory. Basically, all aggregations that can still change need to be stored in memory, to ensure the log stream write throughput. Right now we use 60GB of memory per box for the hashmap. In the future, we may use SSDs that can easily scale to 10x more space per box.
  25. With Puma3, we can easily support these special aggregations, with some approximation. For unique counts, we have implemented a simple adaptive sampling algorithm that samples more aggressively as the number of unique items increases. We can also easily implement the standard bloom filter for counting. For the most frequent items, we plan to implement the classic lossy counting algorithm and the probabilistic lossy counting algorithm.
  26. The most important feature of Puma that distinguishes it from other stream processing projects is the language. We have built a SQL-like query language that allows us to define the input stream, the output table, as well as the query itself. Note that the query contains user-defined functions for the join as well as the aggregations. Puma3 is right now in the pre-production stage. We plan to push it out into production as soon as we have verified all the summaries against Puma2 and Hive.
  27. Here is a list of things we plan to do next. First is simple scheduling for Puma3. We just need very simple scheduling because the workload is continuous. Most likely we will reuse some existing frameworks. Second is mass adoption inside the company. We plan to migrate most daily reporting queries from Hive, as long as the query is simple enough to be supported by Puma. This will reduce the latency as well as improve the efficiency, because of the savings in compression/decompression. The third one is open source. Right now, the biggest bottleneck is Java Thrift, which has diverged between Facebook and open source. We plan to open-source the projects one by one, starting from Calligraphus.
  28. There are lots of similar systems in academia as well as other companies.
  29. Instead of comparing them one by one, I will end the presentation with a summary of the key differences. Data Freeway is a scalable data stream framework with 9GB/sec throughput and 10 sec latency. It supports both Push/RPC-based and Pull/File System-based channels. We have components to support arbitrary combinations of channels to adapt to the use case. Puma is a reliable stream aggregation engine. It has good support for time-window-based Group By as well as table-stream Lookup Join. It has a query language that makes Puma comparable to Hive when comparing Realtime-MR and MR. Puma has no support and no plan to support sliding windows and stream joins, because those are very hard problems that we don't see in our environment.