SlideShare ist ein Scribd-Unternehmen logo
1 von 29
Downloaden Sie, um offline zu lesen
Uber Business Metrics
Generation and Management
Through Apache Flink
Uber
Engineer at Uber Marketplace
Wenrui Meng
Miao Yu
2018 Flink Forward Asia
Uber Mission
We ignite opportunity by setting the world in motion.
我们构建交通服务平台,为全球点燃机遇之火.
● Marketplace
● Project background
● Architecture
● User case
● Challenges
● Future work
Outline
Marketplace Team
Marketplace
Platform
Marketplace
Data
Marketplace
Dynamics
Marketplace
Fare
Marketplace Data
● Mission: Empower teams throughout Uber to understand and improve the efficiency of
the marketplace by providing real time data for our Marketplace levers, modeling and
analysis across 15 teams
● Data sources: trip lifecycle, matching, rider, driver, fare, mobile across 100+ Apache
Kafka topics
● Data categories: trip, rider, driver, session, product, geo, time
Marketplace Data
Trip Flow Patterns
Demand Patterns
Human Consumption - Incentives
Human Consumption - Regional Metrics
Marketplace Data
attribution note
Real-Time Analytics
● Usage pattern
● Issues
● Solution
Project Background
● Metrics are aggregated on the fly
from raw data.
● Metric definitions have a DAG
structure.
● Metrics are usually aggregated
upon a limited set of dimensions,
e.g., time, geo, and etc.
Usage Pattern
● Resource waste
● Bad performance
● Inconsistency
● ...
Issues
Solution
● An E2E platform
○ Provides a unified DSL to define metrics for multiple datasources like Kafka, Elastic Search,
and etc.
○ Based on metric definitions, optimizes execution plan based on objectives including
performance, isolation, and etc.
○ Directly generates metrics, instead of generating an intermediate raw dataset and then
aggregating on the fly.
Architecture
● Formula: a DSL describing an aggregation or arithmetic operation over aggregation
metric_A + metric_B * metric_C - avg((table_A where (exists(column_A) AND (column_B != 0))).column_C)
● Dimensions: over which this metric will be aggregated
Time, geo, product type, and ...
● Sinks: physical storage of this metric
Elastic search, Cassandra, MySQL, and ...
Metric Definition
1. Parse all metric definitions
2. Build a DAG of the most basic Flink operations needed for generating
these metrics.
3. Merging nodes and edges in operation DAG to form a new DAG of
logical metric groups, to reduce duplication, promote isolation and
improve performance.
4. Generate metric group config files and feed them into execution plan
generator.
Execution Optimizer
1. Parse metric group config files.
2. Given configs, build a DAG of stream operators. Within each stream
operator, there could be also a DAG of event operators.
3. Applied pre-defined operator templates and generate the final Flink
execution plan.
Execution Plan Generator
● Given specific set of metric, export SQL for it.
● Given SQL query, analyze percentage overlap with existing data in the
platform
● Convert the adhoc query to leverage existing data in platform
● Auto debugging to find root cause of data abnormality
● Metric discovery
● ….
Functionalities
● Avoid inconsistency (only 1 source of truth)
● Avoid inefficiency in development and duplicate effort
● Avoid cluster resource waste by optimization framework proposed
● Avoid duplicate definition by detecting the duplication when adding or
updating metric
● Improve the metric definition (ex if there is existing metrics A+B to
define C, it’s not good to define from scratch)
● Decouple metrics definition and generation techniques, easy to adopt
new techniques
● Decouple platform from storage
Benefits
Driver status summary of one minute:
1.Driver utilization per hexagon on 1 minute
2.Driver utilization per city on 1 minute
3.Driver total total worked minutes per hour
4.Open cars of each hexagon at 1PM in San Francisco
Back-end
service
Apache Kafka
message
Pipeline
sinks
every 4
seconds
Use Case Walkthrough
There are two solutions that support these three use cases:
1. Store aggregation for each driver and hexagon based on one minute
into Elasticsearch. Different service queries with additional on-the-fly
aggregation.
2. Streaming jobs to calculate the exact data needed to feed into service
or model. Streaming job maintained by different teams.
Dimension Value
Hexagon id Hex_1
Driver id Driver_1
Status Driving_client
Others.. ..
Use Case Walkthrough
Driver team
Location-update
event
Driver summary
per minute
Metric group 1:
driver per hour
Metric group 2:
driver utilization
hexagon
Metric group 3:
driver utilization
city
Driver_id, hex_id,
status, others
Metric group 4:
total open car
hexagon
Pricing team
Surge team
Heath team
Use Case Walkthrough
Our new framework supports all these use cases but without any duplicate
computations or inconsistency.
Complex topology
● With the scale and size of Uber’s current metrics, we will encounter very
complex topology once most of the metrics are onboarded onto this
framework.
● We may need to cluster the tasks and split them into a few separate
jobs, connected to Apache Kafka.
Challenges
Unstable sinks
● Each metrics group may have a list of different sinks including, Apache
Hive, Apache Kafka, ELK, Cassandra, etc. These storage systems are
owned by different teams at Uber, with different SLAs.
● Slow and unstable storage could bring down the whole pipeline SLA. It’s
desirable to separate the sink from processor. At Uber, we have a
dedicated publisher pipeline to fulfill the sink tasks with the ability to turn
on/off specific sink.
Challenges
Metrics evolution
● User will add new metric and update the existing metric. How do we
support this with minimum effort?
○ We need distinguish the golden sets from ad hoc cases. For the
golden set, a bach of metrics changes are reviewed and put into
pipeline.
○ For ad hoc use cases, a user can run the pipeline by themselves but
won’t receive much support.
Challenges
Onboarding
● At Uber, there are 15 teams and 20+ services that use Gairos. It’s a big
challenge to ask each client to change their way to consume data.
● We host our gateway to query data in Gairos. It’’s desirable to build
conversion in gateway to convert the old way consuming data to new
approach.
Challenges
● Improve execution optimizer
● Improve tooling
Future Work
Uber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache Flink

Weitere ähnliche Inhalte

Was ist angesagt?

Google Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneGoogle Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneDataWorks Summit
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalSub Szabolcs Feczak
 
5 Simple Steps to Migrate to AWS – Zerto
  5 Simple Steps to Migrate to AWS – Zerto  5 Simple Steps to Migrate to AWS – Zerto
5 Simple Steps to Migrate to AWS – ZertoAmazon Web Services
 
How YugaByte DB Implements Distributed PostgreSQL
How YugaByte DB Implements Distributed PostgreSQLHow YugaByte DB Implements Distributed PostgreSQL
How YugaByte DB Implements Distributed PostgreSQLYugabyte
 
Planning A Cloud Implementation
Planning A Cloud ImplementationPlanning A Cloud Implementation
Planning A Cloud ImplementationRex Wang
 
금융 서비스 패러다임의 전환 가속화 시대, 신한금융투자의 Cloud First 전략 - 신중훈 AWS 솔루션즈 아키텍트 / 최성봉 클라우...
금융 서비스 패러다임의 전환 가속화 시대, 신한금융투자의 Cloud First 전략  - 신중훈 AWS 솔루션즈 아키텍트 / 최성봉 클라우...금융 서비스 패러다임의 전환 가속화 시대, 신한금융투자의 Cloud First 전략  - 신중훈 AWS 솔루션즈 아키텍트 / 최성봉 클라우...
금융 서비스 패러다임의 전환 가속화 시대, 신한금융투자의 Cloud First 전략 - 신중훈 AWS 솔루션즈 아키텍트 / 최성봉 클라우...Amazon Web Services Korea
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 
AWS Partner Data Analytics on AWS_Handout.pdf
AWS Partner Data Analytics on AWS_Handout.pdfAWS Partner Data Analytics on AWS_Handout.pdf
AWS Partner Data Analytics on AWS_Handout.pdfSrinjoySaha12
 
Manage Queries, and Audit Usage & Control Costs at Scale on Amazon Athena (AN...
Manage Queries, and Audit Usage & Control Costs at Scale on Amazon Athena (AN...Manage Queries, and Audit Usage & Control Costs at Scale on Amazon Athena (AN...
Manage Queries, and Audit Usage & Control Costs at Scale on Amazon Athena (AN...Amazon Web Services
 
Microsoft Azure IaaS and Terraform
Microsoft Azure IaaS and TerraformMicrosoft Azure IaaS and Terraform
Microsoft Azure IaaS and TerraformAlex Mags
 
VMware Tanzu Introduction
VMware Tanzu IntroductionVMware Tanzu Introduction
VMware Tanzu IntroductionVMware Tanzu
 
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowUsing LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowDataWorks Summit
 
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?Kai Wähner
 
Cloud Native, Cloud First and Hybrid: How Different Organizations are Approac...
Cloud Native, Cloud First and Hybrid: How Different Organizations are Approac...Cloud Native, Cloud First and Hybrid: How Different Organizations are Approac...
Cloud Native, Cloud First and Hybrid: How Different Organizations are Approac...Amazon Web Services
 
Kubernetes design principles, patterns and ecosystem
Kubernetes design principles, patterns and ecosystemKubernetes design principles, patterns and ecosystem
Kubernetes design principles, patterns and ecosystemSreenivas Makam
 

Was ist angesagt? (20)

Google Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneGoogle Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better One
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - final
 
5 Simple Steps to Migrate to AWS – Zerto
  5 Simple Steps to Migrate to AWS – Zerto  5 Simple Steps to Migrate to AWS – Zerto
5 Simple Steps to Migrate to AWS – Zerto
 
How YugaByte DB Implements Distributed PostgreSQL
How YugaByte DB Implements Distributed PostgreSQLHow YugaByte DB Implements Distributed PostgreSQL
How YugaByte DB Implements Distributed PostgreSQL
 
Gcp dataflow
Gcp dataflowGcp dataflow
Gcp dataflow
 
Planning A Cloud Implementation
Planning A Cloud ImplementationPlanning A Cloud Implementation
Planning A Cloud Implementation
 
금융 서비스 패러다임의 전환 가속화 시대, 신한금융투자의 Cloud First 전략 - 신중훈 AWS 솔루션즈 아키텍트 / 최성봉 클라우...
금융 서비스 패러다임의 전환 가속화 시대, 신한금융투자의 Cloud First 전략  - 신중훈 AWS 솔루션즈 아키텍트 / 최성봉 클라우...금융 서비스 패러다임의 전환 가속화 시대, 신한금융투자의 Cloud First 전략  - 신중훈 AWS 솔루션즈 아키텍트 / 최성봉 클라우...
금융 서비스 패러다임의 전환 가속화 시대, 신한금융투자의 Cloud First 전략 - 신중훈 AWS 솔루션즈 아키텍트 / 최성봉 클라우...
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
AWS Partner Data Analytics on AWS_Handout.pdf
AWS Partner Data Analytics on AWS_Handout.pdfAWS Partner Data Analytics on AWS_Handout.pdf
AWS Partner Data Analytics on AWS_Handout.pdf
 
HDInsight for Architects
HDInsight for ArchitectsHDInsight for Architects
HDInsight for Architects
 
Manage Queries, and Audit Usage & Control Costs at Scale on Amazon Athena (AN...
Manage Queries, and Audit Usage & Control Costs at Scale on Amazon Athena (AN...Manage Queries, and Audit Usage & Control Costs at Scale on Amazon Athena (AN...
Manage Queries, and Audit Usage & Control Costs at Scale on Amazon Athena (AN...
 
Google Cloud Dataflow
Google Cloud DataflowGoogle Cloud Dataflow
Google Cloud Dataflow
 
Microsoft Azure IaaS and Terraform
Microsoft Azure IaaS and TerraformMicrosoft Azure IaaS and Terraform
Microsoft Azure IaaS and Terraform
 
VMware Tanzu Introduction
VMware Tanzu IntroductionVMware Tanzu Introduction
VMware Tanzu Introduction
 
AWS vs. Azure
AWS vs. AzureAWS vs. Azure
AWS vs. Azure
 
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowUsing LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache Arrow
 
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
 
Cloud Native, Cloud First and Hybrid: How Different Organizations are Approac...
Cloud Native, Cloud First and Hybrid: How Different Organizations are Approac...Cloud Native, Cloud First and Hybrid: How Different Organizations are Approac...
Cloud Native, Cloud First and Hybrid: How Different Organizations are Approac...
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Kubernetes design principles, patterns and ecosystem
Kubernetes design principles, patterns and ecosystemKubernetes design principles, patterns and ecosystem
Kubernetes design principles, patterns and ecosystem
 

Ähnlich wie Uber Business Metrics Generation and Management Through Apache Flink

Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger AnalyticsItzhak Kameli
 
Exploring Neo4j Graph Database as a Fast Data Access Layer
Exploring Neo4j Graph Database as a Fast Data Access LayerExploring Neo4j Graph Database as a Fast Data Access Layer
Exploring Neo4j Graph Database as a Fast Data Access LayerSambit Banerjee
 
Slack in the Age of Prometheus
Slack in the Age of PrometheusSlack in the Age of Prometheus
Slack in the Age of PrometheusGeorge Luong
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Dataconomy Media
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Apache Apex
 
Modernizing Cloud and Hyperconverged Infrastructure monitoring
Modernizing Cloud and Hyperconverged Infrastructure monitoringModernizing Cloud and Hyperconverged Infrastructure monitoring
Modernizing Cloud and Hyperconverged Infrastructure monitoringManageEngine, Zoho Corporation
 
Adaptive Scaling of Microgateways on Kubernetes
Adaptive Scaling of Microgateways on KubernetesAdaptive Scaling of Microgateways on Kubernetes
Adaptive Scaling of Microgateways on KubernetesWSO2
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uberconfluent
 
Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare
Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra TagareActionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare
Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra TagareApache Apex
 
Kafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetupKafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetupMingmin Chen
 
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving SystemsPRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving SystemsNECST Lab @ Politecnico di Milano
 
Autoscaling in kubernetes v1
Autoscaling in kubernetes v1Autoscaling in kubernetes v1
Autoscaling in kubernetes v1JurajHantk
 
2022-05-23-DevOps pro Europe - Managing Apps at scale.pdf
2022-05-23-DevOps pro Europe - Managing Apps at scale.pdf2022-05-23-DevOps pro Europe - Managing Apps at scale.pdf
2022-05-23-DevOps pro Europe - Managing Apps at scale.pdfŁukasz Piątkowski
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformApache Apex
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerFederico Palladoro
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Etu Solution
 
A Low-Cost IoT Application for the Urban Traffic of Vehicles, Based on Wirele...
A Low-Cost IoT Application for the Urban Traffic of Vehicles, Based on Wirele...A Low-Cost IoT Application for the Urban Traffic of Vehicles, Based on Wirele...
A Low-Cost IoT Application for the Urban Traffic of Vehicles, Based on Wirele...Fatima Qayyum
 
Unleashing Apache Kafka and TensorFlow in the Cloud

Unleashing Apache Kafka and TensorFlow in the Cloud
Unleashing Apache Kafka and TensorFlow in the Cloud

Unleashing Apache Kafka and TensorFlow in the Cloud
Kai Wähner
 
Auto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningAuto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningDatabricks
 

Ähnlich wie Uber Business Metrics Generation and Management Through Apache Flink (20)

Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger Analytics
 
Exploring Neo4j Graph Database as a Fast Data Access Layer
Exploring Neo4j Graph Database as a Fast Data Access LayerExploring Neo4j Graph Database as a Fast Data Access Layer
Exploring Neo4j Graph Database as a Fast Data Access Layer
 
Slack in the Age of Prometheus
Slack in the Age of PrometheusSlack in the Age of Prometheus
Slack in the Age of Prometheus
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex
 
Modernizing Cloud and Hyperconverged Infrastructure monitoring
Modernizing Cloud and Hyperconverged Infrastructure monitoringModernizing Cloud and Hyperconverged Infrastructure monitoring
Modernizing Cloud and Hyperconverged Infrastructure monitoring
 
Adaptive Scaling of Microgateways on Kubernetes
Adaptive Scaling of Microgateways on KubernetesAdaptive Scaling of Microgateways on Kubernetes
Adaptive Scaling of Microgateways on Kubernetes
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
 
Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare
Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra TagareActionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare
Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare
 
Kafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetupKafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetup
 
Autoscaling in Kubernetes
Autoscaling in KubernetesAutoscaling in Kubernetes
Autoscaling in Kubernetes
 
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving SystemsPRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
 
Autoscaling in kubernetes v1
Autoscaling in kubernetes v1Autoscaling in kubernetes v1
Autoscaling in kubernetes v1
 
2022-05-23-DevOps pro Europe - Managing Apps at scale.pdf
2022-05-23-DevOps pro Europe - Managing Apps at scale.pdf2022-05-23-DevOps pro Europe - Managing Apps at scale.pdf
2022-05-23-DevOps pro Europe - Managing Apps at scale.pdf
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on docker
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析
 
A Low-Cost IoT Application for the Urban Traffic of Vehicles, Based on Wirele...
A Low-Cost IoT Application for the Urban Traffic of Vehicles, Based on Wirele...A Low-Cost IoT Application for the Urban Traffic of Vehicles, Based on Wirele...
A Low-Cost IoT Application for the Urban Traffic of Vehicles, Based on Wirele...
 
Unleashing Apache Kafka and TensorFlow in the Cloud

Unleashing Apache Kafka and TensorFlow in the Cloud
Unleashing Apache Kafka and TensorFlow in the Cloud

Unleashing Apache Kafka and TensorFlow in the Cloud

 
Auto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningAuto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine Learning
 

Kürzlich hochgeladen

Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 

Kürzlich hochgeladen (20)

Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 

Uber Business Metrics Generation and Management Through Apache Flink

  • 1. Uber Business Metrics Generation and Management Through Apache Flink Uber Engineer at Uber Marketplace Wenrui Meng Miao Yu 2018 Flink Forward Asia
  • 2. Uber Mission We ignite opportunity by setting the world in motion. 我们构建交通服务平台,为全球点燃机遇之火.
  • 3. ● Marketplace ● Project background ● Architecture ● User case ● Challenges ● Future work Outline
  • 5. Marketplace Data ● Mission: Empower teams throughout Uber to understand and improve the efficiency of the marketplace by providing real time data for our Marketplace levers, modeling and analysis across 15 teams ● Data sources: trip lifecycle, matching, rider, driver, fare, mobile across 100+ Apache Kafka topics ● Data categories: trip, rider, driver, session, product, geo, time
  • 6. Marketplace Data Trip Flow Patterns Demand Patterns Human Consumption - Incentives Human Consumption - Regional Metrics
  • 8.
  • 10. ● Usage pattern ● Issues ● Solution Project Background
  • 11. ● Metrics are aggregated on the fly from raw data. ● Metric definitions have a DAG structure. ● Metrics are usually aggregated upon a limited set of dimensions, e.g., time, geo, and etc. Usage Pattern
  • 12. ● Resource waste ● Bad performance ● Inconsistency ● ... Issues
  • 13. Solution ● An E2E platform ○ Provides a unified DSL to define metrics for multiple datasources like Kafka, Elastic Search, and etc. ○ Based on metric definitions, optimizes execution plan based on objectives including performance, isolation, and etc. ○ Directly generates metrics, instead of generating an intermediate raw dataset and then aggregating on the fly.
  • 15. ● Formula: a DSL describing an aggregation or arithmetic operation over aggregation metric_A + metric_B * metric_C - avg((table_A where (exists(column_A) AND (column_B != 0))).column_C) ● Dimensions: over which this metric will be aggregated Time, geo, product type, and ... ● Sinks: physical storage of this metric Elastic search, Cassandra, MySQL, and ... Metric Definition
  • 16. 1. Parse all metric definitions 2. Build a DAG of the most basic Flink operations needed for generating these metrics. 3. Merging nodes and edges in operation DAG to form a new DAG of logical metric groups, to reduce duplication, promote isolation and improve performance. 4. Generate metric group config files and feed them into execution plan generator. Execution Optimizer
  • 17. 1. Parse metric group config files. 2. Given configs, build a DAG of stream operators. Within each stream operator, there could be also a DAG of event operators. 3. Applied pre-defined operator templates and generate the final Flink execution plan. Execution Plan Generator
  • 18. ● Given specific set of metric, export SQL for it. ● Given SQL query, analyze percentage overlap with existing data in the platform ● Convert the adhoc query to leverage existing data in platform ● Auto debugging to find root cause of data abnormality ● Metric discovery ● …. Functionalities
  • 19. ● Avoid inconsistency (only 1 source of truth) ● Avoid inefficiency in development and duplicate effort ● Avoid cluster resource waste by optimization framework proposed ● Avoid duplicate definition by detecting the duplication when adding or updating metric ● Improve the metric definition (ex if there is existing metrics A+B to define C, it’s not good to define from scratch) ● Decouple metrics definition and generation techniques, easy to adopt new techniques ● Decouple platform from storage Benefits
  • 20. Driver status summary of one minute: 1.Driver utilization per hexagon on 1 minute 2.Driver utilization per city on 1 minute 3.Driver total total worked minutes per hour 4.Open cars of each hexagon at 1PM in San Francisco Back-end service Apache Kafka message Pipeline sinks every 4 seconds Use Case Walkthrough
  • 21. There are two solutions that support these three use cases: 1. Store aggregation for each driver and hexagon based on one minute into Elasticsearch. Different service queries with additional on-the-fly aggregation. 2. Streaming jobs to calculate the exact data needed to feed into service or model. Streaming job maintained by different teams. Dimension Value Hexagon id Hex_1 Driver id Driver_1 Status Driving_client Others.. .. Use Case Walkthrough
  • 22. Driver team Location-update event Driver summary per minute Metric group 1: driver per hour Metric group 2: driver utilization hexagon Metric group 3: driver utilization city Driver_id, hex_id, status, others Metric group 4: total open car hexagon Pricing team Surge team Heath team Use Case Walkthrough Our new framework supports all these use cases but without any duplicate computations or inconsistency.
  • 23. Complex topology ● With the scale and size of Uber’s current metrics, we will encounter very complex topology once most of the metrics are onboarded onto this framework. ● We may need to cluster the tasks and split them into a few separate jobs, connected to Apache Kafka. Challenges
  • 24. Unstable sinks ● Each metrics group may have a list of different sinks including, Apache Hive, Apache Kafka, ELK, Cassandra, etc. These storage systems are owned by different teams at Uber, with different SLAs. ● Slow and unstable storage could bring down the whole pipeline SLA. It’s desirable to separate the sink from processor. At Uber, we have a dedicated publisher pipeline to fulfill the sink tasks with the ability to turn on/off specific sink. Challenges
  • 25. Metrics evolution ● User will add new metric and update the existing metric. How do we support this with minimum effort? ○ We need distinguish the golden sets from ad hoc cases. For the golden set, a bach of metrics changes are reviewed and put into pipeline. ○ For ad hoc use cases, a user can run the pipeline by themselves but won’t receive much support. Challenges
  • 26. Onboarding ● At Uber, there are 15 teams and 20+ services that use Gairos. It’s a big challenge to ask each client to change their way to consume data. ● We host our gateway to query data in Gairos. It’’s desirable to build conversion in gateway to convert the old way consuming data to new approach. Challenges
  • 27. ● Improve execution optimizer ● Improve tooling Future Work