The Fast Path to Building Operational Applications with Spark
1. Nikita Shamgunov, CTO and Co-founder of MemSQL
Spark Summit East | Boston | 9 February 2017
3. An Insider’s View at Facebook
▪ Every piece of technology is scalable
▪ Analyzing data from hundreds of thousands of machines
▪ Delivering immense value in real-time
• Real-time code deployment
• Detecting anomalies
• A/B testing results
▪ Fundamentally making the business faster by providing data at your fingertips
8. MemSQL - Hybrid Cloud Data Warehouse
▪ Scalable and elastic
• Petabyte scale
• High concurrency
• System of record
▪ Real-time
• Operational
▪ Compatible
• ETL
• Business Intelligence
• Kafka
• Spark
▪ Deployment
• Managed service in the cloud
• On-premises
▪ Community Edition
• Unlimited scale
• Limited high availability and security features
9. Product or Services Scores for Operational Data Warehouse
Critical Capabilities for Data Warehouse and Data Management Solutions for Analytics
Gartner, July 2016
12. Easy Deployment of Real-Time Data Pipelines
▪ High-throughput distributed messaging system
▪ Publish and subscribe to Kafka “topics”
▪ In-memory execution engine
▪ High level operators for procedural and programmatic analytics
▪ Hybrid Cloud Data Warehouse
▪ Full transactions and complete durability
Amazon Kinesis
13. Use Spark and Operational Databases Together
                      | Spark              | Operational Databases
Interface             | Programmatic       | Declarative
Execution Environment | Job Scheduler      | SQL Engine and Query Optimizer
Persistent Storage    | Use another system | Built-in
16. MemSQL and Spark Use Cases
▪ Operationalize Models Built in Spark
▪ Stream and Event Processing
▪ Extend MemSQL Analytics
▪ Live Dashboards and Automated Reports
17. Operationalize Models Built in Spark
[Diagram labels: Data into Spark; Model Creation; Model Persistence; Results Set; Enterprise Consumption; Cluster]
18. Stream and Event Processing
[Diagram labels: Real-Time Streaming Data; Data Transformation; Persistent, Queryable Format; Enterprise Consumption; Cluster]
20. Live Dashboards and Automated Reports
[Diagram labels: Live Dashboards; Custom Reporting; Access to Live Production Data; SQL Transactions and Analytics; Cluster]
21. MemSQL Spark Connector via Spark Packages
The memsql-spark-connector is now available via Spark Packages:
http://spark-packages.org/
https://spark-packages.org/package/memsql/memsql-spark-connector
You can use it with any Spark command:
> $SPARK_HOME/bin/spark-shell --packages com.memsql:memsql-connector_2.11:2.0.1
Also available on Maven
http://search.maven.org/#artifactdetails%7Ccom.memsql%7Cmemsql-connector_2.11%7C2.0.1%7Cjar
And in the GitHub repository
https://github.com/memsql/memsql-spark-connector
23. Reducing delay in “freshness of data” from two hours to 10 minutes
https://www.enterprisetech.com/2016/12/09/managing-30b-bid-requests/
24. TECHNICAL BENEFITS
▪ 10x faster data refresh, from hours to minutes
▪ Run ad-hoc queries on log-level data within seconds
THE MANAGE REAL-TIME ARCHITECTURE
[Diagram labels: Real-Time Inputs; Real-Time Analytics]
25. Goldman Sachs at Kafka Summit, April 2016
http://www.confluent.io/kafka-summit-2016-users-real-time-analytics-visualized-with-kafka
Real-Time Analytics Visualized w/ Kafka+Spark+MemSQL+ZoomData
27. Problem Statement
Employees have many opportunities to take advantage of their insider knowledge and position of trust within a company. These include:
▪ Preferential treatment of family or friends
▪ Fraud under someone else’s name
In many cases, the people through whom they proxy these activities share one common trait: proximity.
MemSQL can quickly process the massive volume of calculations needed to identify these relationships and iterate on new algorithms.
28. Problem Size
Target Group (100,000) x Population (50 million) = Comparisons (5 trillion)
Parallelize
● filters
● projections
● entity resolution
Distributed, in-memory, massively
parallel processing
From 5 trillion to 50 million
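The arithmetic behind this slide can be checked with a short Python sketch. The variable names are illustrative, and reading "from 5 trillion to 50 million" as the number of candidate pairs left after filtering is an assumption about what the slide means.

```python
# Figures taken from the slide; variable names are illustrative only.
TARGET_GROUP = 100_000        # people in the target group
POPULATION = 50_000_000       # people in the wider population

# Comparing every target against every member of the population.
naive_comparisons = TARGET_GROUP * POPULATION

# Assumption: "50 million" is the number of candidate pairs that
# survive the filters, i.e. the pairs that still need full comparison.
after_filtering = 50_000_000

reduction_factor = naive_comparisons // after_filtering
avg_candidates_per_target = after_filtering / TARGET_GROUP

print(f"naive comparisons:        {naive_comparisons:,}")
print(f"after filtering:          {after_filtering:,}")
print(f"reduction factor:         {reduction_factor:,}x")
print(f"candidates per target:    {avg_candidates_per_target:,.0f}")
```

Under that reading, filtering cuts the work by five orders of magnitude and leaves about 500 candidates per member of the target group.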
29. Examples for Demo
Example 1
▪ MemSQL: geospatial filter (50 meters)
▪ Duke (Spark): comparisons on email and name (Levenshtein, SoundEx, Metaphone)
▪ Results: rank probabilities (relationship, similar entity)
Example 2
▪ MemSQL: index filter (last names are equal)
▪ Duke (Spark): comparisons on email and name (Levenshtein, SoundEx, Metaphone)
▪ Results: rank probabilities (relationship, similar entity)
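Two of the comparators named on this slide can be illustrated with a minimal, standalone Python sketch. This is not Duke's implementation: the function names and sample strings are hypothetical, the similarity normalization is one common choice among several, and Metaphone is left out for brevity.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale the distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# American Soundex letter codes.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Four-character phonetic code, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0].upper()]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # vowels reset the run, so repeats are kept
    return "".join(out)[:4].ljust(4, "0")


# Hypothetical sample values for illustration.
print(levenshtein("Jon Smith", "John Smith"))      # 1
print(soundex("Smith") == soundex("Smyth"))        # True
```

Levenshtein catches typos and small spelling edits, while Soundex matches names that sound alike but are spelled differently, which is why the two are used side by side.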
31. Cluster Size
Cluster size: 8 machines, c4.8xlarge, 36 cores, 60 GB RAM
• 2 leaf nodes per machine, each with 9 partitions
• this gives us ~2 cores per partition in the cluster - one core is going to be at 100% CPU during the computation, the other is used for Spark + Duke + Misc
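The "~2 cores per partition" figure follows directly from the numbers on this slide, as the short Python check below shows. The variable names are illustrative.

```python
# Figures from the slide; names are illustrative.
MACHINES = 8
CORES_PER_MACHINE = 36        # c4.8xlarge
LEAVES_PER_MACHINE = 2
PARTITIONS_PER_LEAF = 9

total_cores = MACHINES * CORES_PER_MACHINE
total_leaves = MACHINES * LEAVES_PER_MACHINE
total_partitions = total_leaves * PARTITIONS_PER_LEAF
cores_per_partition = total_cores / total_partitions

print(f"cores: {total_cores}, leaves: {total_leaves}, "
      f"partitions: {total_partitions}, "
      f"cores per partition: {cores_per_partition:.1f}")
```

With 144 partitions across 288 cores, each partition can keep one core busy with the computation and leave one for Spark, Duke and everything else.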
32. Conclusion
▪ Speed in covering massive search space
• In-memory (on commodity hardware)
• Parallelization
▪ Scales linearly
▪ Huge value in running all of this natively in MemSQL
33. How does MemSQL do it?
▪ Push down the in-memory proximity filter to each of the leaves
▪ Leverage indexes
▪ Stream results in parallel to Duke Entity Resolution
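The idea of filtering through an index before running expensive comparisons can be sketched in plain Python. This is a single-process illustration of the "last names are equal" filter only, not MemSQL's execution model: the record layout, function name and sample data are hypothetical, and the distribution across leaves is not modeled.

```python
from collections import defaultdict


def candidate_pairs(targets, population, key):
    """Yield only (target, person) pairs that share the same key.

    Building a hash index over the population replaces a full
    target-by-population scan with one lookup per target.
    """
    index = defaultdict(list)
    for person in population:
        index[key(person)].append(person)
    for target in targets:
        for person in index.get(key(target), ()):
            yield target, person


# Hypothetical sample records.
targets = [
    {"first": "Ann", "last": "Lee"},
    {"first": "Bob", "last": "Kim"},
]
population = [
    {"first": "Anne", "last": "Lee"},
    {"first": "Dan", "last": "Lee"},
    {"first": "Rob", "last": "Kim"},
    {"first": "Eve", "last": "Park"},
    {"first": "Sam", "last": "Cho"},
]

pairs = list(candidate_pairs(targets, population, key=lambda r: r["last"]))
naive = len(targets) * len(population)
print(f"{len(pairs)} candidate pairs instead of {naive}")
```

Only the surviving pairs would then be handed to the entity-resolution stage, which is where the bulk of the saving comes from.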
34. Duke Entity Resolution
▪ Using Metaphone, SoundEx, and Levenshtein algorithms to compare first name, last name and email
▪ Duke supports many more comparisons, and makes it very easy to create new ones
▪ With a training dataset, Duke can use a genetic algorithm to optimize comparator weights
▪ https://github.com/larsga/Duke
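To show what "comparator weights" might look like in practice, here is a hedged Python sketch that turns per-field similarities into a single match probability using a naive-Bayes style combination. It is an illustration under stated assumptions, not Duke's code: the linear mapping, the weight values and all names are hypothetical, and Duke's own documentation should be consulted for its actual scoring rules.

```python
from functools import reduce


def combine(p1: float, p2: float) -> float:
    """Naive-Bayes style merge of two match probabilities.

    0.5 is neutral: combine(0.5, p) == p. Inputs must stay strictly
    between 0 and 1 so the denominator cannot become zero.
    """
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


def property_probability(similarity: float, low: float, high: float) -> float:
    """Map a similarity in [0, 1] onto [low, high] (hypothetical linear map)."""
    return low + (high - low) * similarity


# Hypothetical per-field weights as (low, high) probabilities. These are
# the kind of values a tuning process, such as the genetic algorithm
# mentioned on the slide, would adjust against a training dataset.
WEIGHTS = {
    "first_name": (0.4, 0.7),
    "last_name": (0.3, 0.8),
    "email": (0.2, 0.95),
}


def match_probability(similarities: dict, weights: dict = WEIGHTS) -> float:
    """Fold the per-field probabilities into one score, starting neutral."""
    probs = [property_probability(similarities[name], low, high)
             for name, (low, high) in weights.items()]
    return reduce(combine, probs, 0.5)


strong = match_probability({"first_name": 1.0, "last_name": 1.0, "email": 1.0})
weak = match_probability({"first_name": 0.0, "last_name": 0.0, "email": 0.0})
print(f"all fields match: {strong:.3f}, no fields match: {weak:.3f}")
```

Fields with a wide (low, high) range, such as email here, move the final probability more than fields with a narrow range, which is what makes the weights worth optimizing.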