5. Marketplace Data
● Mission: Empower teams throughout Uber to understand and improve marketplace efficiency by providing real-time data for our marketplace levers, modeling, and analysis across 15 teams
● Data sources: trip lifecycle, matching, rider, driver, fare, and mobile, across 100+ Apache Kafka topics
● Data categories: trip, rider, driver, session, product, geo, time
6. Marketplace Data
● Trip flow patterns
● Demand patterns
● Human consumption - incentives
● Human consumption - regional metrics
11. ● Metrics are aggregated on the fly from raw data.
● Metric definitions have a DAG structure.
● Metrics are usually aggregated over a limited set of dimensions, e.g., time, geo, etc.
Usage Pattern
13. Solution
● An E2E platform that:
○ Provides a unified DSL to define metrics over multiple data sources such as Kafka, Elasticsearch, etc.
○ Based on the metric definitions, optimizes the execution plan against objectives such as performance, isolation, etc.
○ Directly generates metrics, instead of generating an intermediate raw dataset and then aggregating on the fly.
15. ● Formula: a DSL expression describing an aggregation, or an arithmetic operation over aggregations
metric_A + metric_B * metric_C - avg((table_A where (exists(column_A) AND (column_B != 0))).column_C)
● Dimensions: the dimensions over which this metric is aggregated
Time, geo, product type, ...
● Sinks: the physical storage for this metric
Elasticsearch, Cassandra, MySQL, ...
Metric Definition
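As a rough illustration, a definition like the one above could be held in memory as a name, a formula, dimensions, and sinks. The Java sketch below is illustrative only; the class, field names, and example metric are assumptions, not Gairos internals.

    // Hypothetical in-memory form of a metric definition: a formula over other metrics or
    // tables, the dimensions to aggregate over, and the sinks the result is written to.
    import java.util.List;

    public final class MetricDefinition {
        final String name;
        final String formula;          // an expression in the DSL shown above
        final List<String> dimensions; // e.g. time, geo, product type
        final List<String> sinks;      // e.g. Elasticsearch, Cassandra, MySQL

        MetricDefinition(String name, String formula, List<String> dimensions, List<String> sinks) {
            this.name = name;
            this.formula = formula;
            this.dimensions = dimensions;
            this.sinks = sinks;
        }

        public static void main(String[] args) {
            MetricDefinition m = new MetricDefinition(
                "driver_utilization",
                "sum(minutes_with_client) / sum(minutes_online)",
                List.of("time_1m", "geo_hexagon", "product_type"),
                List.of("elasticsearch", "cassandra"));
            System.out.println(m.name + " over " + m.dimensions + " -> " + m.sinks);
        }
    }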
16. 1. Parse all metric definitions.
2. Build a DAG of the most basic Flink operations needed to generate these metrics.
3. Merge nodes and edges in the operation DAG to form a new DAG of logical metric groups, to reduce duplication, promote isolation, and improve performance (see the sketch below).
4. Generate metric group config files and feed them into the execution plan generator.
Execution Optimizer
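A minimal Java sketch of the merging in step 3, assuming operations that read the same source and aggregate over the same dimensions can share one metric group; all class, source, and metric names are illustrative.

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class GroupOperations {
        // One basic operation in the operation DAG: the metric it serves, the source it
        // reads, and the dimensions it aggregates over.
        record Operation(String metric, String source, List<String> dimensions) {}

        public static void main(String[] args) {
            List<Operation> ops = List.of(
                new Operation("driver_utilization_hexagon", "driver_status", List.of("time_1m", "hexagon")),
                new Operation("open_cars_hexagon",          "driver_status", List.of("time_1m", "hexagon")),
                new Operation("driver_utilization_city",    "driver_status", List.of("time_1m", "city")));

            // Operations with the same (source, dimensions) key fall into one logical metric group.
            Map<String, List<Operation>> groups = ops.stream()
                .collect(Collectors.groupingBy(op -> op.source() + "|" + op.dimensions()));

            groups.forEach((key, members) -> System.out.println(
                "metric group [" + key + "]: " +
                members.stream().map(Operation::metric).collect(Collectors.joining(", "))));
        }
    }

Here the first two operations collapse into a single group, which is exactly the duplication the optimizer removes.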
17. 1. Parse the metric group config files.
2. Given the configs, build a DAG of stream operators; within each stream operator there may also be a DAG of event operators.
3. Apply pre-defined operator templates and generate the final Flink execution plan (see the sketch below).
Execution Plan Generator
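A self-contained Flink DataStream sketch of what a generated plan for one metric group might look like (open cars per hexagon per minute). The in-memory source, tuple layout, and print sink are stand-ins for the configured Kafka source and sinks; the real plan is generated from the config files, not hand-written.

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class MetricGroupJobSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Stand-in for a Kafka source of (hexagon_id, open_car_count) events; with the
            // real, unbounded source the 1-minute window fires continuously.
            env.fromElements(Tuple2.of("hex_1", 1), Tuple2.of("hex_1", 1), Tuple2.of("hex_2", 1))
               .keyBy(t -> t.f0)                                          // dimension from the config: hexagon
               .window(TumblingProcessingTimeWindows.of(Time.minutes(1))) // window from the config: 1 minute
               .sum(1)                                                    // aggregation from the operator template
               .print();                                                  // stand-in for the configured sink
            env.execute("metric-group-sketch");
        }
    }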
18. ● Given a specific set of metrics, export SQL for it
● Given a SQL query, analyze the percentage of overlap with data already in the platform (see the sketch below)
● Convert an ad hoc query to leverage existing data in the platform
● Auto-debugging to find the root cause of a data abnormality
● Metric discovery
● ...
Functionalities
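A minimal sketch of the overlap analysis, assuming both the ad hoc query and the platform's existing data can be summarized as sets of referenced fields; in practice this would work on the parsed SQL, and all names here are illustrative.

    import java.util.HashSet;
    import java.util.Set;

    public class OverlapAnalysis {
        // Percentage of the query's referenced fields that the platform already materializes.
        static double overlapPercent(Set<String> queryFields, Set<String> platformFields) {
            if (queryFields.isEmpty()) return 0.0;
            Set<String> common = new HashSet<>(queryFields);
            common.retainAll(platformFields);
            return 100.0 * common.size() / queryFields.size();
        }

        public static void main(String[] args) {
            Set<String> query    = Set.of("driver_status.hexagon", "driver_status.status", "trips.fare");
            Set<String> platform = Set.of("driver_status.hexagon", "driver_status.status");
            System.out.println(overlapPercent(query, platform) + "% of the ad hoc query is already covered");
        }
    }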
19. ● Avoid inconsistency (only one source of truth)
● Avoid inefficiency in development and duplicated effort
● Avoid cluster resource waste through the proposed optimization framework
● Avoid duplicate definitions by detecting duplication when a metric is added or updated
● Improve metric definitions (e.g., if metrics A and B already exist and C = A + B, C should not be defined from scratch)
● Decouple metric definition from generation techniques, so new techniques are easy to adopt
● Decouple the platform from storage
Benefits
20. Driver status summary over one minute:
1. Driver utilization per hexagon per minute
2. Driver utilization per city per minute
3. Driver total worked minutes per hour
4. Open cars in each hexagon at 1 PM in San Francisco
[Diagram: back-end service → Apache Kafka message every 4 seconds → pipeline → sinks]
Use Case Walkthrough
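A plain-Java sketch of what metric 1 above means, under the assumption that "utilization" is the share of online drivers in a hexagon who currently have a client during the minute; the record and status values are made up for illustration.

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class UtilizationPerHexagon {
        record DriverStatus(String driverId, String hexagonId, String status) {}

        public static void main(String[] args) {
            // One minute's worth of driver status records (normally read from Kafka).
            List<DriverStatus> oneMinute = List.of(
                new DriverStatus("driver_1", "hex_1", "driving_client"),
                new DriverStatus("driver_2", "hex_1", "open"),
                new DriverStatus("driver_3", "hex_2", "driving_client"));

            // Utilization per hexagon = fraction of drivers in that hexagon with a client.
            Map<String, Double> utilization = oneMinute.stream()
                .collect(Collectors.groupingBy(DriverStatus::hexagonId,
                    Collectors.averagingDouble(s -> s.status().equals("driving_client") ? 1.0 : 0.0)));

            utilization.forEach((hex, u) -> System.out.println(hex + " utilization = " + u));
        }
    }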
21. There are two existing solutions that support these use cases:
1. Store per-driver, per-hexagon aggregations at one-minute granularity in Elasticsearch; different services query them with additional on-the-fly aggregation.
2. Run streaming jobs that calculate exactly the data needed to feed each service or model; each streaming job is maintained by a different team.
Dimension  | Value
Hexagon id | Hex_1
Driver id  | Driver_1
Status     | Driving_client
Others     | ...
Use Case Walkthrough
22. [Diagram: location-update events (driver_id, hex_id, status, others) are rolled up into a driver summary per minute, which feeds metric group 1 (driver per hour), metric group 2 (driver utilization per hexagon), metric group 3 (driver utilization per city), and metric group 4 (total open cars per hexagon), consumed by the Driver, Pricing, Surge, and Health teams]
Use Case Walkthrough
Our new framework supports all these use cases without any duplicate computation or inconsistency (see the sketch below).
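A self-contained Flink sketch of that fan-out: the location-update stream is parsed once and two metric groups attach their own keying, window, and sink to the same stream, so nothing is computed twice. The tuple layout, data, and print sinks are stand-ins for the real sources and sinks.

    import org.apache.flink.api.java.tuple.Tuple3;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class FanOutSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // (hexagon_id, city_id, open_cars) -- stand-in for the parsed location-update stream.
            DataStream<Tuple3<String, String, Integer>> status = env.fromElements(
                Tuple3.of("hex_1", "sf", 1), Tuple3.of("hex_2", "sf", 1), Tuple3.of("hex_1", "sf", 1));

            // Metric group: open cars per hexagon per minute.
            status.keyBy(t -> t.f0)
                  .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                  .sum(2)
                  .print("per-hexagon");

            // Metric group: open cars per city per minute, reusing the same parsed stream.
            status.keyBy(t -> t.f1)
                  .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                  .sum(2)
                  .print("per-city");

            env.execute("fan-out-sketch");
        }
    }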
23. Complex topology
● At the scale of Uber's current metrics, we will encounter a very complex topology once most of the metrics are onboarded onto this framework.
● We may need to cluster the tasks and split them into a few separate jobs, connected through Apache Kafka.
Challenges
24. Unstable sinks
● Each metric group may have a list of different sinks, including Apache Hive, Apache Kafka, ELK, Cassandra, etc. These storage systems are owned by different teams at Uber, with different SLAs.
● A slow or unstable storage system could bring down the SLA of the whole pipeline, so it is desirable to separate the sink from the processor. At Uber, we have a dedicated publisher pipeline that fulfills the sink tasks and can turn specific sinks on or off (see the sketch below).
Challenges
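A minimal Java sketch of that decoupling, assuming the processor hands finished metrics to a publisher that writes to each enabled sink; the interface and sink names are illustrative, not the real publisher pipeline.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class PublisherSketch {
        interface Sink { void write(String record); }

        // Enabled flag per sink; flipping it to false detaches an unstable store from the pipeline.
        private final Map<String, Boolean> enabled = new LinkedHashMap<>();
        private final Map<String, Sink> sinks = new LinkedHashMap<>();

        void register(String name, Sink sink) { sinks.put(name, sink); enabled.put(name, true); }
        void setEnabled(String name, boolean on) { enabled.put(name, on); }

        void publish(String record) {
            sinks.forEach((name, sink) -> {
                if (enabled.get(name)) sink.write(record);
            });
        }

        public static void main(String[] args) {
            PublisherSketch publisher = new PublisherSketch();
            publisher.register("elasticsearch", r -> System.out.println("ES <- " + r));
            publisher.register("cassandra",     r -> System.out.println("Cassandra <- " + r));

            publisher.publish("hex_1,utilization=0.67");
            publisher.setEnabled("cassandra", false);   // Cassandra is slow today: turn it off
            publisher.publish("hex_2,utilization=0.50");
        }
    }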
25. Metrics evolution
● Users will add new metrics and update existing ones. How do we support this with minimum effort?
○ We need to distinguish the golden set from ad hoc cases. For the golden set, a batch of metric changes is reviewed and then put into the pipeline.
○ For ad hoc use cases, a user can run the pipeline by themselves but won't receive much support.
Challenges
26. Onboarding
● At Uber, 15 teams and 20+ services use Gairos. It is a big challenge to ask each client to change the way it consumes data.
● We host a gateway for querying data in Gairos; it is desirable to build conversion into the gateway so that the old way of consuming data is translated to the new approach.
Challenges