Zipline is Airbnb’s data management platform designed specifically for ML use cases. Previously, ML practitioners at Airbnb spent roughly 60% of their time collecting data and writing transformations for machine learning tasks. Zipline reduces this effort from months to days by making the process declarative: it allows data scientists to easily define features in a simple configuration language. The framework then provides access to point-in-time correct features for both offline model training and online inference. In this talk we will describe the architecture of our system and the algorithm that makes efficient point-in-time correct feature generation tractable.
The attendee will learn:
The importance of point-in-time correct features for achieving better ML model performance
The importance of using change data capture for generating feature views
An algorithm to efficiently generate features over change data. We use interval trees to efficiently compress time-series features, and the algorithm generates feature aggregates over this compressed representation.
A lambda architecture that enables using the above algorithm for online feature generation
A framework, based on category theory, for understanding how feature aggregations can be distributed and independently composed
While the talk is fairly technical, we will introduce all the concepts from first principles with examples. A basic understanding of data-parallel distributed computation and machine learning may help, but is not required.
4. Machine Learning
● Goal: make a prediction about the world given incomplete data
● Labels: prediction target
● Features: known information to learn from
● Training output: model weights/parameters
● Serving: online feature
● Assumption: training and serving distributions are the same (consistency)
6. ML applications
Unstructured (image classification, chat apps, NLP, object detection):
● Most of the data is available at once: the full image
● Features are automatically extracted from few (often one) data streams:
○ words from a text
○ pixels from an image
Structured (fraud, customer LTV, credit scores, ads, personalized search):
● Data arrives steadily as the user interacts with the platform
● Features are extracted from many event streams:
○ logins
○ clicks
○ bookings
○ page views, etc.
● Iterative manual feature engineering
[Figure: applications arranged along an axis by number of data sources]
8. Offline Batch vs Online Real-time
● Offline batch (email marketing)
○ Does not require serving features in production
○ Online/offline consistency is not a problem
● Online real-time (personalized search)
○ Does require serving features in production
○ Online/offline consistency is a problem
13. Log-based training is great †
● Easy to implement
● Any production-available data point can be used for training and scoring
● The log can be used for audit and debug purposes
● Consistency is guaranteed
† May capture accidental data distribution shifts, requires upfront implementation of new features in production, may slow down the feature iteration cycle, prevents feature sharing between models, increases the product experimentation cycle, severely limits your ability to react to incidents, fixing production issues might degrade model performance, may decrease sleep time during on-call rotations. Consult with your architect before adopting a log-based training approach.
14. The Fine Print up close
● Sharing features is hard
● Testing new features requires production implementation
● May capture accidental data shifts (bugs, downed services)
● Slows down the iteration cycle
● Limits agility in reacting to production incidents
15. Slowdown of experimentation
[Figure: timeline of feature values (F1, F2, F3), labels (L), and predictions (P1, P2) for a training data set, layered over user behavior & business processes, product, and problem. A newly added feature has no logged history, so training data for it only accumulates going forward.]
16. Why is that a problem?
● Some models are time-dependent (seasonality)
● For some problems, label maturity is on the order of months
● Production incidents lead to dirty data in training
● Labels are scarce and expensive to acquire
→ Months-long iteration cycles
→ Hard to maintain models in production
→ Cannot address shifts in data quickly
17. What do we want?
● Backfill features
○ Quickly!
● A single feature definition for production and training
● Automatic pipelines for training and scoring
19. Zipline: feature management system
[Diagram: a single Feature Definition feeds both a Training Pipeline (producing the Training Set for the Model) and a Serving Pipeline (producing the Online Scoring Vector), with consistency checked between the two. Fast backfills run in the data warehouse; low-latency serving runs in the online environment.]
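As a purely illustrative sketch (not Zipline's actual configuration syntax), a declarative feature definition of the kind the diagram describes might look like the following; all field names here are hypothetical. The key idea is that one definition drives both the backfill/training pipeline and the low-latency serving pipeline.

```python
# Hypothetical declarative feature definition (illustrative only).
# A single definition like this would be compiled into both a batch
# backfill job and an online serving aggregation.
feature = {
    "name": "user_bookings_count_30d",  # hypothetical feature name
    "source": "bookings",               # upstream Hive table or event stream
    "key": "user_id",                   # entity the feature is keyed on
    "operation": "count",               # one of the supported operations
    "window": "30d",                    # time window applied to the operation
}
```

Because the definition is data rather than code, the framework can generate point-in-time correct training sets and serve the same feature online without a second implementation.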
25. Feature philosophy
● Complex features:
○ Only worth it if the gain is huge
○ Require complex computations
○ Harder to interpret
○ Harder to maintain
● Simple features:
○ Easier to maintain
○ Faster to compute
○ Cumulatively provide a huge gain for the model
26. Supported operations
● Sum, Count
● Min, Max
● First, Last
● Last N
● Statistical moments
● Approx unique count
● Approx percentile
● Bloom filters
+ time windows for all operations!
27. Operation requirements
● Commutative: a ⊕ b = b ⊕ a
● Associative: (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
● Additional optimizations:
○ Reversible: given c = a ⊕ b and b, we can recover a
● Must be O(1) in compute ⇒ must be O(1) in space
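A minimal sketch of these requirements (not Zipline's actual API), using sum as the aggregation: merge is commutative and associative, so partial aggregates can be combined in any order across machines, and the operation is reversible, so a sliding window can subtract expired events instead of recomputing from scratch. The state is O(1).

```python
# Sum as an aggregator meeting the requirements above (illustrative sketch).
class SumAggregator:
    def init(self):
        return 0                       # identity element

    def update(self, acc, value):
        return acc + value             # fold one event into the aggregate

    def merge(self, a, b):
        return a + b                   # commutative + associative

    def reverse(self, acc, value):
        return acc - value             # undo one event (window eviction)

agg = SumAggregator()
acc = agg.init()
for v in [3, 5, 2]:
    acc = agg.update(acc, v)
assert acc == 10
acc = agg.reverse(acc, 3)              # the oldest event slides out of the window
assert acc == 7
```

Operations like approximate unique count are commutative and associative but not reversible, which is why reversibility is listed as an optional optimization rather than a requirement.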
29. Data skew: large number of events
Page views (one user accounts for ~50% of events):
user | ts
1 | 2019-10-01 00:00:01
1 | 2019-10-01 00:00:02
... | ...
1 | 2019-10-01 23:59:59
2 | 2019-10-02 15:20:30
3 | 2019-10-12 16:11:44
Use aggregateByKey to ensure data is locally combined in the first stage before being sent to the final merge.
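The map-side combine that Spark's aggregateByKey performs can be sketched in plain Python (partitioning is simulated with a list of chunks here): each partition first folds its own events into one partial aggregate per key, so a hot key with millions of page views ships only one value per partition to the final merge instead of every raw event.

```python
from collections import defaultdict

def combine_locally(partition):
    """seqOp: fold a partition's events into one partial count per key."""
    partial = defaultdict(int)
    for user, _ts in partition:
        partial[user] += 1
    return partial

def merge_partials(partials):
    """combOp: merge the small per-partition partial aggregates."""
    result = defaultdict(int)
    for partial in partials:
        for user, count in partial.items():
            result[user] += count
    return dict(result)

partitions = [
    [(1, "00:00:01"), (1, "00:00:02"), (2, "15:20:30")],  # user 1 is hot
    [(1, "23:59:59"), (3, "16:11:44")],
]
counts = merge_partials(combine_locally(p) for p in partitions)
assert counts == {1: 3, 2: 1, 3: 1}
```

This works only because the operation is commutative and associative, which is exactly what the operation requirements above guarantee.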
32. Data skew: large number of examples
Training examples (one ip accounts for ~50%):
ip | ts
127.0.0.1 | 2019-10-15 05:03:20
127.0.0.1 | 2019-10-15 12:32:11
127.0.0.1 | 2019-10-15 09:55:29
... | ...
1.2.3.4 | 2019-10-15 03:22:21
1.2.3.5 | 2019-10-15 19:10:59
Page views (the same ip accounts for ~50%):
ip | ts
127.0.0.1 | 2019-10-01 00:00:01
127.0.0.1 | 2019-10-01 00:00:02
... | ...
1.2.3.4 | 2019-10-01 23:59:59
1.2.3.5 | 2019-10-02 15:20:30
1.2.3.6 | 2019-10-12 16:11:44
33. Large number of timestamps: naive solution
● Keep one aggregate per (key, driver timestamp)
● For every event:
○ Find the corresponding key
○ For every driver timestamp of that key:
■ If the event occurred prior to the timestamp, produce: ((key, driver timestamp), data)
● Use aggregateByKey
● Problem: O(N_ts × N_e)
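The naive join above can be sketched directly (shown for count): every event emits a contribution to every driver timestamp of its key that it precedes, so the work per key is O(N_ts × N_e), which blows up for hot keys.

```python
def naive_counts(driver_timestamps, event_timestamps):
    """Naive point-in-time counts: every event scans every timestamp."""
    aggregates = {ts: 0 for ts in driver_timestamps}
    for e in event_timestamps:
        for ts in driver_timestamps:   # O(N_ts) work per event
            if e <= ts:
                aggregates[ts] += 1
    return aggregates

# Events at t=6 and t=9 against the driver timestamps of one key.
result = naive_counts([1, 3, 7, 8, 10, 15, 18, 20], [6, 9])
assert result == {1: 0, 3: 0, 7: 1, 8: 1, 10: 2, 15: 2, 18: 2, 20: 2}
```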
35. Non-windowed case (optimized)
Timestamps for one key: 1 3 7 8 10 15 18 20
Apply each event to the first affected aggregate only; at the end, compute a cumulative sum of the values.
Event at t=6 → corresponding values: 0 0 1 0 0 0 0 0
Event at t=9 → corresponding values: 0 0 1 0 1 0 0 0
Result (cumulative sum): 0 0 1 1 2 2 2 2
O(N_e + N_ts)
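A sketch of this optimization (shown for count): with both lists sorted, a single two-pointer merge credits each event to the first driver timestamp at or after it, and a final cumulative sum turns those per-slot deltas into point-in-time counts, for O(N_e + N_ts) total work per key.

```python
def optimized_counts(driver_timestamps, event_timestamps):
    """Point-in-time counts via first-affected-aggregate + cumulative sum.

    Assumes driver_timestamps is sorted; events are sorted on entry.
    """
    deltas = [0] * len(driver_timestamps)
    i = 0
    for e in sorted(event_timestamps):
        # Advance to the first driver timestamp the event affects.
        while i < len(driver_timestamps) and driver_timestamps[i] < e:
            i += 1
        if i < len(driver_timestamps):
            deltas[i] += 1            # credit the first affected aggregate only
    # Cumulative sum: each timestamp sees all events at or before it.
    counts, running = [], 0
    for d in deltas:
        running += d
        counts.append(running)
    return counts

ts = [1, 3, 7, 8, 10, 15, 18, 20]
assert optimized_counts(ts, [6, 9]) == [0, 0, 1, 1, 2, 2, 2, 2]
```

The two-pointer `i` never moves backwards because events are processed in sorted order, which is what keeps the merge linear rather than O(N_e × N_ts).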
36. Data skew: windowed case
Timestamps for one key: 1 3 7 8 10 15 18 20
Timestamp indices: 0 1 2 3 4 5 6 7; window size = 5
[Diagram: a binary interval tree over the timestamp indices, with leaf intervals 0-1, 2-3, 4-5, 6-7, internal nodes 0-3 and 4-7, and root 0-7. The window ending at timestamp 10 covers events 7, 8, 10 (indices 2-4), assembled from node 2-3 plus index 4.]
O(N_e × log(N_ts))
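The windowed case can be sketched with such an interval tree (shown here for count; any associative operation works the same way): event aggregates sit at the leaves, internal nodes hold pre-merged aggregates over their intervals, and the aggregate over any window (t − W, t] is assembled from O(log N_ts) node aggregates instead of rescanning raw events.

```python
import bisect

class IntervalTree:
    """Binary interval tree over sorted event timestamps (count aggregate)."""

    def __init__(self, event_timestamps):
        self.events = sorted(event_timestamps)
        n = len(self.events)
        self.size = 1
        while self.size < n:
            self.size *= 2
        # Leaves hold one event each (count = 1); parents hold merged counts.
        self.tree = [0] * (2 * self.size)
        for i in range(n):
            self.tree[self.size + i] = 1
        for i in range(self.size - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def count_in_window(self, t, window):
        """Count events with timestamp in (t - window, t]."""
        lo = bisect.bisect_right(self.events, t - window)
        hi = bisect.bisect_right(self.events, t)
        return self._query(1, 0, self.size, lo, hi)

    def _query(self, node, node_lo, node_hi, lo, hi):
        if hi <= node_lo or node_hi <= lo:
            return 0
        if lo <= node_lo and node_hi <= hi:
            return self.tree[node]      # whole node interval is covered
        mid = (node_lo + node_hi) // 2
        return (self._query(2 * node, node_lo, mid, lo, hi)
                + self._query(2 * node + 1, mid, node_hi, lo, hi))

tree = IntervalTree([1, 3, 7, 8, 10, 15, 18, 20])
assert tree.count_in_window(10, 5) == 3   # events 7, 8, 10 fall in (5, 10]
```

For plain counts a binary search alone would suffice; the tree earns its keep for aggregates like sum, min, or max, where each window query still touches only O(log N_ts) precomputed nodes.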
37. Feature sources
● Hive table produced upstream
● Jitney: Airbnb’s event bus
● Databases via data warehouse export and CDC