Building a Versatile Analytics Pipeline
On Top Of Apache Spark
Misha Chernetsov, Grammarly
Spark Summit 2017
June 6, 2017
It is common for consumer Internet companies to start off with popular third-party tools for analytics needs. Then, when the user base and the company grow, they end up building their own analytics data pipeline and query engine to cope with their data scale, satisfy custom data enrichment and reporting needs, and achieve high quality of their data. That's exactly the path that was taken at Grammarly, the popular online proofreading service.

In this session, Grammarly will share how they improved business and marketing analytics, previously done with Mixpanel, by building their own in-house analytics engine and application on top of Apache Spark. Chernetsov will touch upon several Spark tweaks and gotchas that they experienced along the way:

– Outputting data to several storage systems in a single Spark job
– Dealing with the Spark memory model and building a custom spillable data structure for data traversal
– Implementing a custom query language with parser combinators on top of the Spark SQL parser
– A custom query optimizer and analyzer for when you want not-quite-SQL
– Flexible-schema storage and querying of multi-schema data with schema conflicts
– Custom aggregation functions in Spark SQL

  1. 1. Building a Versatile Analytics Pipeline On Top Of Apache Spark Misha Chernetsov, Grammarly Spark Summit 2017 June 6, 2017
  2. 2. Data Team Lead @ Grammarly Building Analytics Pipelines (5 years) Coding on JVM (12 years), Scala + Spark (3 years) About Me: Misha Chernetsov @chernetsov
  3. 3. Tool that helps us better understand: ● Who are our users? ● How do they interact with the product? ● How do they get in, engage, pay, and how long do they stay? Analytics @ Consumer Product Company
  4. 4. We want our decisions to be data-driven Everyone: product managers, marketing, engineers, support... Analytics @ Consumer Product Company
  5. 5. data analytics report Analytics @ Consumer Product Company
  6. 6. Example Report 1 – Daily Active Users: number of unique active users by calendar day (dummy data!)
  7. 7. Example Report 2 – Comparison of Cohort Retention Over Time (dummy data!)
  8. 8. Example Report 3 – Payer Conversions By Traffic Source: number of users who bought a subscription, split by traffic source type (where the user came from: Ads, Email, Social), by calendar day (dummy data!)
  9. 9. ● Landing page visit ○ URL with UTM tags ○ Referrer ● Subscription purchased ○ Is first in subscription Example: Data
  10. 10. Everything is an Event Example: Data { "eventName": "page-visit", "url": "...?utm_medium=ad", … } { "eventName": "subscribe", "period": "12 months", … }
  11. 11. Enrich and/or Join Example: Data { "eventName": "page-visit", "url": "...?utm_medium=ad", … } { "eventName": "subscribe", "period": "12 months", … } Slice by Plot
  12. 12. capture enrich index query Analytics @ Consumer Product Company
  13. 13. capture enrich index query Use 3rd Party? 1. Integrated Event Analytics 2. UI over your DB Reports are not tailored for your needs, limited capability. Pre-aggregation / enriching is still on you. Hard to achieve accuracy and trust.
  14. 14. capture enrich index query Build Step 1: Capture
  15. 15. ● Always up, resilient ● Spikes / back pressure ● Buffer for delayed processing Kafka Capture REST { "eventName": "page-visit", "url": "...?utm_medium=paid", … }
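The capture service itself is not shown in the deck; as a hedged sketch, the Kafka-facing side of such an endpoint could be as small as a single producer that the REST layer calls once per event (topic name, serializers and acks settings are assumptions):

      import java.util.Properties
      import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

      object CaptureProducer {
        // Hypothetical configuration; the deck does not show the real capture service.
        private val props = new Properties()
        props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("acks", "all") // prefer not to lose events

        private val producer = new KafkaProducer[String, String](props)

        // Called by the REST layer for every incoming event (raw JSON payload).
        def capture(eventJson: String): Unit =
          producer.send(new ProducerRecord[String, String]("raw-events", eventJson))
      }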
  16. 16. Save To Long-Term Storage (diagram: Kafka → Stream → Long-term Storage)
  17. 17. Save To Long-Term Storage (diagram: Kafka → micro-batch → Cassandra)
      val rdd = KafkaUtils.createRDD[K, V](...)
      rdd.saveToCassandra("raw")
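Spelled out against the spark-streaming-kafka-0-10 and Spark Cassandra Connector APIs, that micro-batch save might look roughly like this (the connector takes a keyspace plus a table; names, columns and the offset range below are assumptions for the sketch):

      import org.apache.spark.SparkContext
      import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}
      import com.datastax.spark.connector._          // adds saveToCassandra to RDDs

      object RawEventSink {
        // Keyspace, table, columns and offsets are assumptions for the sketch.
        case class RawEvent(id: String, payload: String)

        def saveRawBatch(sc: SparkContext, kafkaParams: java.util.Map[String, Object]): Unit = {
          val offsets = Array(OffsetRange("raw-events", 0, 0L, 100000L))  // hypothetical offsets

          val rdd = KafkaUtils.createRDD[String, String](
            sc, kafkaParams, offsets, LocationStrategies.PreferConsistent)

          rdd
            .map(record => RawEvent(record.key, record.value))
            .saveToCassandra("analytics", "raw")
        }
      }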
  18. 18. capture enrich index query Build Step 2: Enrich
  19. 19. Enrichment 1: User Attribution
  20. 20. Enrichment 1: User Attribution { "eventName": "page-visit", "url": "...?utm_medium=ad", "fingerprint": "abc", … } { "eventName": "subscribe", "userId": 123, "fingerprint": "abc", … }
  21. 21. Enrichment 1: User Attribution { "eventName": "page-visit", "url": "...?utm_medium=ad", "fingerprint": "abc", … } { "eventName": "subscribe", "userId": 123, "fingerprint": "abc", … } "attributedUserId": 123,
  22. 22. t Non-authenticated: userId = null Authenticated: userId = 123 fingerprint = abc (All events from a given browser) Enrichment 1: User Attribution
  23. 23. t Non-authenticated: userId = null attributedUserId = 123 Authenticated: userId = 123 fingerprint = abc (All events from a given browser) Enrichment 1: User Attribution
  24. 24. Authenticated: userId = 756 Enrichment 1: User Attribution tfingerprint = abc (All events from a given browser) Authenticated: userId = 123 Heuristics to attribute those
  26. 26. Enrichment 1: User Attribution
      rdd.mapPartitions { iterator =>
        val buffer = new ArrayBuffer()
        iterator
          .takeWhile(_.userId.isEmpty)
          .foreach(buffer.append)
        val userId = iterator.head.userId
        buffer.map(_.setAttributedUserId(userId)) ++ iterator
      }
  30. 30. Enrichment 1: User Attribution
      rdd.mapPartitions { iterator =>
        val buffer = new ArrayBuffer()
        iterator
          .takeWhile(_.userId.isEmpty)
          .foreach(buffer.append)
        val userId = iterator.head.userId
        buffer.map(_.setAttributedUserId(userId)) ++ iterator
      }
      Can grow big and OOM your worker for outliers who use Grammarly without ever registering
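The snippet on the slide is schematic: a plain Scala Iterator has no head, and the buffered prefix has to be stitched back in front of the remainder. A self-contained version of the same idea, with a simplified Event type, could look like this:

      import scala.collection.mutable.ArrayBuffer

      object UserAttribution {
        case class Event(userId: Option[Long], attributedUserId: Option[Long] = None)

        // Within one fingerprint's (time-sorted) slice of events: buffer everything until
        // the first authenticated event, then stamp the buffered events with that userId.
        def attribute(events: Iterator[Event]): Iterator[Event] = {
          val it = events.buffered            // BufferedIterator exposes head without consuming
          val buffer = new ArrayBuffer[Event]()
          while (it.hasNext && it.head.userId.isEmpty) buffer.append(it.next())

          val userId = if (it.hasNext) it.head.userId else None
          buffer.iterator.map(_.copy(attributedUserId = userId)) ++ it
        }

        // rdd.mapPartitions(attribute)  // still unbounded: the ArrayBuffer can OOM on outliers
      }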
  31. 31. Spark & Memory: by default we operate in User Memory (100% − spark.memory.fraction = 25%); Spark Memory is spark.memory.fraction = 75%. Let's get into Spark Memory and use its safety features.
  32. 32. Enrichment 1: User Attribution
      rdd.mapPartitions { iterator =>
        val buffer = new SpillableBuffer()
        iterator
          .takeWhile(_.userId.isEmpty)
          .foreach(buffer.append)
        val userId = iterator.head.userId
        buffer.map(_.setAttributedUserId(userId)) ++ iterator
      }
      Can safely grow in mem while enough free Spark Mem. Spills to disk otherwise.
  33. 33. Spark Memory Manager & Spillable Collection (animation across slides 33–41: the collection grows in memory, requests ×2 its current allocation from the memory manager, and spills to disk when the request cannot be granted)
  42. 42. SizeTracker
      trait SizeTracker {
        def afterUpdate(): Unit = { … }
        def estimateSize(): Long = { … }
      }
      Call on every append. Periodically estimates size and saves samples. Extrapolates.
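SizeTracker is a private[spark] trait, but the idea is reproducible with the public SizeEstimator: pay for a full size estimate only every few appends and reuse the last sample in between (the fixed sampling interval and the lack of extrapolation here are simplifications):

      import org.apache.spark.util.SizeEstimator

      class SampledSizeTracker(sampleEvery: Int = 32) {
        private var updates = 0L
        private var lastSample = 0L

        // Call after every append; only occasionally pays for a real size estimate.
        def afterUpdate(collection: AnyRef): Unit = {
          updates += 1
          if (updates % sampleEvery == 0) lastSample = SizeEstimator.estimate(collection)
        }

        // Cheap between samples; the real SizeTracker also extrapolates growth per update.
        def estimateSize(): Long = lastSample
      }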
  43. 43. Spillable – call on every append to collection
      trait Spillable {
        abstract def spill(inMemCollection: C): Unit
        def maybeSpill(currentMemory: Long, inMemCollection: C) {
          // try ×2 if needed
        }
      }
  44. 44. TaskMemoryManager
      public long acquireExecutionMemory(long required, …)
      public void releaseExecutionMemory(long size, …)
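Stitched together, the contract is: when the tracked size reaches what has been granted so far, ask the memory manager for roughly twice the current size, and spill only if the grant falls short. A simplified sketch against a generic acquire/release interface (the real Spillable and TaskMemoryManager signatures differ in detail):

      trait MemoryGrants {
        def acquire(bytes: Long): Long   // may grant less than requested
        def release(bytes: Long): Unit
      }

      abstract class SimpleSpillable[C](memory: MemoryGrants) {
        protected def spill(inMemCollection: C): Unit   // write the collection out to disk

        private var granted = 0L

        // Call on every append with the current estimated size of the collection.
        def maybeSpill(currentSize: Long, inMemCollection: C): Boolean = {
          if (currentSize <= granted) return false
          // Try to roughly double our allocation before giving up and spilling.
          granted += memory.acquire(2 * currentSize - granted)
          if (currentSize >= granted) {
            spill(inMemCollection)
            memory.release(granted)
            granted = 0L
            true
          } else false
        }
      }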
  45. 45. Custom Spillable Collection: ● Be safe with outliers ● Get outside User Memory (25%), use Spark Memory (75%) ● Spark APIs could be a bit friendlier and more high-level
  46. 46. Enrichment 2: Calculable Props
  47. 47. Enrichment Phase 2: Calculable Props { "eventName": "page-visit", "url": "...?utm_medium=ad", "fingerprint": "abc", "attributedUserId": 123, … } { "eventName": "subscribe", "userId": 123, "fingerprint": "abc", … }
  48. 48. Enrichment Phase 2: Calculable Props { "eventName": "page-visit", "url": "...?utm_medium=ad", "fingerprint": "abc", "attributedUserId": 123, … } { "eventName": "subscribe", "userId": 123, "fingerprint": "abc", "firstUtmMedium": "ad", … }
  49. 49. Enrichment Phase 2: Calculable Props Engine & DSL
      val firstUtmMedium: CalcProp[String] =
        (E "url").as[Url]
          .map(_.param("utm_medium"))
          .forEvent("page-visit")
          .first
  50. 50. Enrichment Phase 2: Calculable Props Engine & DSL ● Type-safe, functional, composable ● Familiar: similar to Scala collections API ● Batch & Stream (incremental)
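The CalcProp engine and DSL are Grammarly's own and are not shown beyond this slide; purely as a hypothetical illustration, a DSL with these properties (type-safe, composable, collection-like) could be modeled as composable descriptions evaluated over an event sequence:

      // Hypothetical sketch only; not the Grammarly implementation.
      case class RawEvent(name: String, fields: Map[String, String])

      trait CalcProp[A] { self =>
        def compute(events: Seq[RawEvent]): Option[A]

        def map[B](f: A => B): CalcProp[B] = new CalcProp[B] {
          def compute(events: Seq[RawEvent]): Option[B] = self.compute(events).map(f)
        }
      }

      object CalcProp {
        // First value of a field, restricted to one event type.
        def firstField(eventName: String, field: String): CalcProp[String] = new CalcProp[String] {
          def compute(events: Seq[RawEvent]): Option[String] =
            events.filter(_.name == eventName).flatMap(_.fields.get(field)).headOption
        }
      }

      // Usage, mirroring the slide: CalcProp.firstField("page-visit", "utm_medium")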
  51. 51. Enrichment Pipeline with Spark
  52. 52. Spark Pipeline (diagram): Raw Kafka → Stream: Save Raw → Kafka → Stream: User Attr. → User-attributed Kafka → Stream: Calc Props → Enriched and Queryable; in parallel: Batch: User Attr. and Batch: Calc Props
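One streaming leg of that pipeline, written against the DStream + Kafka 0-10 APIs (the topic, keyspace/table names and the injected enrichment step are assumptions for the sketch):

      import org.apache.spark.rdd.RDD
      import org.apache.spark.streaming.StreamingContext
      import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
      import com.datastax.spark.connector._                 // saveToCassandra on RDDs

      object EnrichmentStream {
        case class Enriched(id: String, payload: String)    // hypothetical output row

        // Kafka -> enrichment step -> Cassandra, one micro-batch at a time.
        def run(ssc: StreamingContext,
                kafkaParams: Map[String, Object],
                enrich: RDD[String] => RDD[Enriched]): Unit = {
          val stream = KafkaUtils.createDirectStream[String, String](
            ssc,
            LocationStrategies.PreferConsistent,
            ConsumerStrategies.Subscribe[String, String](Seq("raw-events"), kafkaParams))

          stream
            .map(_.value())                                 // raw JSON payloads
            .transform(enrich)                              // e.g. the user-attribution step
            .foreachRDD(_.saveToCassandra("analytics", "user_attributed"))
        }
      }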
  53. 53. Spark Pipeline (diagram): Kafka → micro-batches → Cassandra and Kafka at each stage; batch jobs read/write Cassandra, Kafka, and Parquet on AWS S3
  54. 54. Spark Pipeline: ● Connectors for everything ● Great for batch ○ Shuffle with spilling ○ Failure recovery ● Great for streaming ○ Fast ○ Low overhead
  55. 55. Spark Pipeline (diagram repeated from slide 53)
  56. 56. Multiple Output Destinations (diagram: a single job writing to two Kafka topics and two Cassandra tables)
  57. 57. Multiple Output Destinations: Try 1
      val rdd: RDD[T]
      rdd.sendToKafka("topic_x")
      rdd.saveToCassandra("table_foo")
      rdd.saveToCassandra("table_bar")
  58. 58. Multiple Output Destinations: Try 1
      rdd.saveToCassandra(...)
        → rdd.forEachPartition(...)
          → sc.runJob(...)
  59. 59. Multiple Output Destinations: Try 1 (diagram: one logical job writing to Kafka and Cassandra)
  60. 60. Multiple Output Destinations: Try 1 = 3 Jobs (diagram: the Kafka source is re-read for each output – one job for the Kafka send and one per Cassandra table)
  61. 61. Multiple Output Destinations: Try 2
      val rdd: RDD[T]
      rdd.cache()
      rdd.sendToKafka("topic_x")
      rdd.saveToCassandra("table_foo")
      rdd.saveToCassandra("table_bar")
  62. 62. Multiple Output Destinations: Try 2 = Read Once, 3 Jobs (diagram: job 1 reads Kafka and writes Kafka; jobs 2 and 3 read the cache and write Table 1 and Table 2)
  63. 63. Writer
      rdd.foreachPartition { iterator =>
        val writer = new BufferedWriter(
          new OutputStreamWriter(new FileOutputStream(...))
        )
        iterator.foreach { el => writer.write(el) }
        writer.close() // makes sure this writes
      }
  66. 66. Writer: ● Buffer ● Non-blocking ● Idempotent / Dedupe
      rdd.foreachPartition { iterator =>
        val writer = new BufferedWriter(
          new OutputStreamWriter(new FileOutputStream(...))
        )
        iterator.foreach { el => writer.write(el) }
        writer.close() // makes sure this writes
      }
  67. 67. AndWriter
      andWriteToX = rdd.mapPartitions { iterator =>
        val writer = new XWriter()
        val writingIterator = iterator.map { el =>
          writer.write(el); el          // write as a side effect, pass the element through
        }.closing(() => writer.close())
        writingIterator
      }
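The .closing(...) helper above is part of their own iterator extension; one way to get the same pass-through effect with plain Scala is to wrap the partition iterator so that every element is written as it is pulled and the writer is closed once the partition is exhausted (XWriter stands in for a Kafka / Cassandra / HDFS writer):

      import scala.reflect.ClassTag
      import org.apache.spark.rdd.RDD

      object AndWrite {
        trait XWriter[T] extends AutoCloseable {    // stand-in for a Kafka/Cassandra/HDFS writer
          def write(el: T): Unit
        }

        // Write each element as a side effect while passing it on within the same job.
        def andWriteToX[T: ClassTag](rdd: RDD[T], newWriter: () => XWriter[T]): RDD[T] =
          rdd.mapPartitions { iterator =>
            val writer = newWriter()
            new Iterator[T] {
              private var closed = false
              def hasNext: Boolean = {
                val more = iterator.hasNext
                if (!more && !closed) { writer.close(); closed = true }  // close at end of partition
                more
              }
              def next(): T = {
                val el = iterator.next()
                writer.write(el)
                el
              }
            }
          }
      }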
  70. 70. Multiple Output Destinations: Try 3
      val rdd: RDD[T]
      rdd.andSaveToCassandra("table_foo")
        .andSaveToCassandra("table_bar")
        .sendToKafka("topic_x")
  71. 71. Multiple Output Destinations: Try 3 (diagram: one job writing to both Kafka topics and both Cassandra tables)
  72. 72. And Writer: ● Kafka ● Cassandra ● HDFS. Important! Each andWriter will consume resources: ● Memory (buffers) ● IO
  73. 73. capture enrich index query Build Step 3: Index
  74. 74. Index ● Parquet on AWS S3 ● Custom partitioning: by eventName and time interval ● Append changes, compact + merge on the fly when querying ● Randomized names to maximize S3 parallelism ● Use s3a for max performance and tweak for S3 read-after-write consistency ● Support flexible schema, even with conflicts!
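The merge-on-read and naming tricks are custom, but the basic shape of the partitioned Parquet layout on S3 via s3a can be sketched with plain DataFrame APIs (the bucket path and partition column names are assumptions):

      import org.apache.spark.sql.{DataFrame, SaveMode}

      object IndexWriter {
        // Append enriched events as Parquet, partitioned by event name and day.
        def appendToIndex(events: DataFrame): Unit =
          events.write
            .mode(SaveMode.Append)
            .partitionBy("eventName", "day")     // assumed partition columns
            .parquet("s3a://analytics-index/events")
      }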
  75. 75. Some Stats ● Thousands of events per second ● Terabytes of compressed data
  76. 76. capture enrich index query Build Step 4: Query
  77. 77. Hardcore Query: ● DataFrames ● Spark SQL Scala DSL / pure SQL ● Zeppelin
  78. 78. Casual Query: ● Plot by day ● Unique visitors ● Filter by country ● Split by traffic source (top 20) ● Time from 2 weeks ago to today
  79. 79. Option 1: SQL Quickly gets complex
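To make "quickly gets complex" concrete: even the casual query from the previous slide (daily uniques over two weeks, filtered by country, split by the top-20 traffic sources) needs a ranking sub-query plus a join when written against the raw events. A hedged sketch in the Spark Scala DSL, with assumed table and column names:

      import org.apache.spark.sql.{DataFrame, SparkSession}
      import org.apache.spark.sql.functions.{broadcast, col, countDistinct, current_date, date_sub}

      object CasualQuery {
        // "events", the column names and the 14-day window are assumptions for the sketch.
        def dailyUniquesByTopSource(spark: SparkSession): DataFrame = {
          val recent = spark.table("events")
            .where(col("eventName") === "page-visit" &&
                   col("country") === "US" &&
                   col("day") >= date_sub(current_date(), 14))

          val topSources = recent
            .groupBy(col("trafficSource"))
            .agg(countDistinct(col("attributedUserId")).as("uniques"))
            .orderBy(col("uniques").desc)
            .limit(20)
            .select(col("trafficSource"))

          recent
            .join(broadcast(topSources), Seq("trafficSource"))
            .groupBy(col("day"), col("trafficSource"))
            .agg(countDistinct(col("attributedUserId")).as("uniques"))
            .orderBy(col("day"))
        }
      }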
  80. 80. Option 2: UI – too expensive to build, extend and support
  81. 81. Option 3: Custom Query Language
      SEGMENT "eventName"
      WHERE foo = "bar" AND x.y IN ("a", "b", "c")
      UNIQUE
      BY m IS NOT NULL
      TIME from 2 months ago to today
      STEP 1 month
      SPAN 1 week
  82. 82. Option 3: Custom Query Language – same query, with the Expressions highlighted
  83. 83. Option 3: Custom Query Language ● Segment, Funnel, Retention ● UI & as DataFrame in Zeppelin ● Spark <= 1.6 – Scala Parser Combinators ● Reuse most complex part of expression parser ● Relatively extensible
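A toy illustration of that approach with Scala Parser Combinators, parsing only the SEGMENT/WHERE skeleton (the real grammar, expression parser and AST are Grammarly's own):

      import scala.util.parsing.combinator.RegexParsers

      // Minimal, hypothetical subset of the query language from the previous slides.
      object MiniQueryParser extends RegexParsers {
        case class Segment(event: String, where: Option[Condition])
        case class Condition(field: String, value: String)

        private def quoted: Parser[String] = "\"" ~> """[^"]+""".r <~ "\""
        private def ident: Parser[String]  = """[a-zA-Z_][a-zA-Z0-9_.]*""".r

        private def condition: Parser[Condition] =
          ident ~ ("=" ~> quoted) ^^ { case f ~ v => Condition(f, v) }

        private def segment: Parser[Segment] =
          ("SEGMENT" ~> quoted) ~ opt("WHERE" ~> condition) ^^ { case e ~ w => Segment(e, w) }

        def parseQuery(q: String): Either[String, Segment] =
          parseAll(segment, q) match {
            case Success(result, _) => Right(result)
            case failure: NoSuccess => Left(failure.msg)
          }
      }

      // MiniQueryParser.parseQuery("SEGMENT \"page-visit\" WHERE foo = \"bar\"")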
  84. 84. Option 3: Custom Query Language
  85. 85. Conclusion ● Custom versatile analytics is doable and enjoyable ● Spark is a great platform to build analytics on top of ○ Enrichment Pipeline: Batch / Streaming, Query, ML ● It would be great to see even the deep internals made slightly more extensible
  86. 86. We are hiring! olivia@grammarly.com https://www.grammarly.com/jobs
  87. 87. Thank you! Questions?