Spark is an open source cluster computing framework that can outperform Hadoop by 30x through a combination of in-memory computation and a richer execution engine. Shark is a port of Apache Hive onto Spark, which provides a similar speedup for SQL queries, allowing interactive exploration of data in existing Hive warehouses. This talk will cover how both Spark and Shark are being used at various companies to accelerate big data analytics, the architecture of the systems, and where they are heading. We will also discuss the next major feature we are developing, Spark Streaming, which adds support for low-latency stream processing to Spark, giving users a unified interface for batch and real-time analytics.
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
1. Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li, Antonio Lupher, Justin Ma, Murphy McCauley, Scott Shenker, Ion Stoica, Reynold Xin
UC Berkeley
spark-project.org
2. What is Spark?
Not a modified version of Hadoop
Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop
Compatible with Hadoop’s storage APIs
» Can read/write to any Hadoop-supported system,
including HDFS, HBase, SequenceFiles, etc
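A minimal sketch of reading Hadoop-backed data (the paths and job name here are hypothetical; textFile and sequenceFile are standard SparkContext methods):
import spark.SparkContext   // package name in Spark 0.x; later versions use org.apache.spark
import SparkContext._       // implicit conversions for Hadoop Writable types
val sc = new SparkContext("local", "StorageDemo")
// Plain text stored in HDFS (any Hadoop-supported URI scheme works here)
val logs = sc.textFile("hdfs://namenode:9000/data/logs")
// A Hadoop SequenceFile of (String, Int) records
val counts = sc.sequenceFile[String, Int]("hdfs://namenode:9000/data/counts")
println(logs.count + " log lines, " + counts.count + " count records")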
3. What is Shark?
Port of Apache Hive to run on Spark
Compatible with existing Hive data, metastores,
and queries (HiveQL, UDFs, etc)
Similar speedups of up to 40x
4. Project History
Spark project started in 2009, open sourced 2010
Shark started summer 2011, alpha April 2012
In use at Berkeley, Princeton, Klout, Foursquare,
Conviva, Quantifind, Yahoo! Research & others
250+ member meetup, 500+ watchers on GitHub
6. Why a New Programming Model?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (graph
algorithms, machine learning)
» More interactive ad-hoc queries
» More real-time online processing
All three of these apps require fast data sharing
across parallel jobs
7. Data Sharing in MapReduce
[Diagram: in MapReduce, each iteration reads its input from HDFS and writes its output back to HDFS, and every ad-hoc query re-reads the full input from HDFS]
Slow due to replication, serialization, and disk IO
8. Data Sharing in Spark
[Diagram: in Spark, one-time processing loads the input into distributed memory; subsequent iterations and queries operate on the in-memory data]
10-100× faster than network and disk
9. Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached
in memory across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure
Interface
» Clean language-integrated API in Scala
» Can be used interactively from Scala console
10. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver sends tasks to workers; each worker reads one block of the file and keeps its partition of cachedMsgs in its cache]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
11. Fault Tolerance
RDDs track the series of transformations used to
build them (their lineage) to recompute lost data
E.g.: messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))
Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
12. Example: Logistic Regression
val data = spark.textFile(...).map(readPoint).cache()   // load data in memory once
var w = Vector.random(D)                                // initial parameter vector

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient                                         // repeated MapReduce steps do gradient descent
}

println("Final w: " + w)
13. Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1-30). Hadoop: 127 s per iteration; Spark: 174 s for the first iteration, 6 s for each further iteration]
14. Supported Operators
map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
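A hedged sketch combining a few of these operators (parseUser and parseOrder are hypothetical parsers producing key-value pairs; assumes a SparkContext sc):
// users: (userId, name); orders: (userId, amount)
val users  = sc.textFile("hdfs://.../users").map(parseUser)
val orders = sc.textFile("hdfs://.../orders").map(parseOrder)

val totals = orders.reduceByKey(_ + _)                     // total amount per user
val joined = users.join(totals)                            // (userId, (name, total))
val big    = joined.filter { case (_, (_, total)) => total > 100 }
big.take(10)                                               // pull a small sample to the driver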
15. Other Engine Features
General graphs of operators (→ efficiency)
[Diagram: a job DAG over RDDs A-G, split into Stage 1 (groupBy), Stage 2 (map, join), and Stage 3 (union), with cached RDDs marked]
16. Other Engine Features
Controllable data partitioning to minimize
communication
[Chart: PageRank iteration time (s). Hadoop: 171; basic Spark: 72; Spark + controlled partitioning: 23]
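A minimal sketch of this on PageRank (parseLinks is hypothetical; assumes a SparkContext sc): hash-partitioning the link table once keeps links and ranks for the same URL on the same node, so later joins need not re-shuffle the large link dataset.
val links = sc.textFile("hdfs://.../links")
  .map(parseLinks)                          // (url, neighbors)
  .partitionBy(new HashPartitioner(64))     // fix the partitioning once
  .cache()

var ranks = links.mapValues(_ => 1.0)       // starts co-partitioned with links
for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {    // links never move across the network
    case (url, (neighbors, rank)) =>
      neighbors.map(dest => (dest, rank / neighbors.size))
  }
  ranks = contribs.reduceByKey(_ + _)
}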
19. Applications
In-memory analytics & anomaly detection (Conviva)
Interactive queries on data streams (Quantifind)
Exploratory log analysis (Foursquare)
Traffic estimation w/ GPS data (Mobile Millennium)
Twitter spam classification (Monarch)
...
20. Conviva GeoReport
[Chart: response time (hours). Hive: 20; Spark: 0.5]
Group aggregations on many keys w/ same filter
40× gain over Hive from avoiding repeated
reading, deserialization and filtering
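A hedged sketch of that pattern (parseSession and the field names are hypothetical): apply the shared filter once, cache the result, then run each group aggregation on the in-memory data.
val sessions = sc.textFile("hdfs://.../sessions")
  .map(parseSession)                  // hypothetical parser
  .filter(_.date == "2012-02-01")     // the shared filter, applied once
  .cache()                            // read, deserialize, and filter only once

// Each aggregation reuses the cached, pre-filtered dataset
val byCountry = sessions.map(s => (s.country, 1)).reduceByKey(_ + _)
val byIsp     = sessions.map(s => (s.isp, s.bufferTime)).reduceByKey(_ + _)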
21. Quantifind Feed Analysis
[Diagram: data feeds → parsed documents → extracted entities → in-memory time series; a web app queries the Spark app over the in-memory tables]
Load data feeds, extract entities, and compute
in-memory tables every few minutes
Let users drill down interactively from AJAX app
22. Mobile Millennium Project
Estimate city traffic from crowdsourced GPS data
Iterative EM algorithm
scaling to 160 nodes
Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
24. Motivation
Hive is great, but Hadoop’s execution engine
makes even the smallest queries take minutes
Scala is good for programmers, but many data
users only know SQL
Can we extend Hive to run on Spark?
25. Hive Architecture
[Diagram: Hive architecture. Client (CLI, JDBC) → Driver (SQL parser, query optimizer, physical plan, execution), backed by the metastore → MapReduce → HDFS]
26. Shark Architecture
[Diagram: Shark architecture. The same stack, with a cache manager added to the driver and Spark replacing MapReduce as the execution engine over HDFS]
[Engle et al, SIGMOD 2012]
27. Efficient In-Memory Storage
Simply caching Hive records as Java objects is
inefficient due to high per-object overhead
Instead, Shark employs column-oriented
storage using arrays of primitive types
Row Storage: [1, john, 4.1] [2, mike, 3.5] [3, sally, 6.4]
Column Storage: [1, 2, 3] [john, mike, sally] [4.1, 3.5, 6.4]
28. Efficient In-Memory Storage
Benefit: similarly compact size to serialized data, but >5x faster to access
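A small Scala sketch of the contrast (illustrative schema): row storage allocates one boxed object per record, while column storage keeps one primitive array per field.
// Row storage: one Java object per record, with pointer and header overhead
case class User(id: Int, name: String, rating: Double)
val rows = Array(User(1, "john", 4.1), User(2, "mike", 3.5), User(3, "sally", 6.4))

// Column storage: one array per column; Int and Double stay unboxed primitives
val ids     = Array(1, 2, 3)
val names   = Array("john", "mike", "sally")
val ratings = Array(4.1, 3.5, 6.4)

// Scanning a column walks a compact primitive array instead of chasing objects
val avgRating = ratings.sum / ratings.length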
29. Using Shark
CREATE TABLE mydata_cached AS SELECT …
Run standard HiveQL on it, including UDFs
» A few esoteric features are not yet supported
Can also call from Scala to mix with Spark
Early alpha release at shark.cs.berkeley.edu
30. Benchmark Query 1
SELECT * FROM grep WHERE field LIKE '%XYZ%';
Execution time (secs): Shark (cached): 12; Shark: 182; Hive: 207
31. Benchmark Query 2
SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R JOIN userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1;
Execution time (secs): Shark (cached): 126; Shark: 270; Hive: 447
34. Streaming Spark
Many key big data apps must run in real time
» Live event reporting, click analysis, spam filtering, …
Event-passing systems (e.g. Storm) are low-level
» Users must worry about fault tolerance, state, and consistency
» Programming model is different from batch, so each
app must be written twice
Can we give streaming a Spark-like interface?
35. Our Idea
Run streaming computations as a series of very
short (<1 second) batch jobs
» “Discretized stream processing”
Keep state in memory as RDDs (automatically
recover from any failure)
Provide a functional API similar to Spark
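A toy sketch of the discretization idea (readNextBatch is hypothetical, and this is not the actual Spark Streaming implementation): state lives in an RDD, and each sub-second batch is an ordinary small Spark job folded into that state.
var counts = sc.parallelize(Seq.empty[(String, Int)])    // streaming state as an RDD
while (true) {
  val batch = readNextBatch()                            // hypothetical: URLs seen in the last second
  val increments = sc.parallelize(batch).map(url => (url, 1))
  counts = counts.union(increments).reduceByKey(_ + _).cache()
  Thread.sleep(1000)
}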
36. Spark Streaming API
Functional operators on discretized streams
New “stateful” operators for windowing
pageViews = readStream("...", "1s")
ones = pageViews.map(ev => (ev.url, 1))
counts = ones.runningReduce(_ + _)
sliding = ones.reduceByWindow("5s", _ + _, _ - _)   // sliding window reduce with "add" and "subtract" functions

[Diagram: at each time step (t = 1, t = 2, …), the pageViews, ones, and counts D-streams each add one RDD, linked by map and reduce transformations; each RDD is split into partitions]
37. Streaming + Batch + Ad-Hoc
Combining D-streams with historical data:
pageViews.join(historicCounts).map(...)
Interactive ad-hoc queries on stream state:
counts.slice("21:00", "21:05").topK(10)
38. How Fast Can It Go?
Can process 4 GB/s (42M records/s) of data on 100
nodes at sub-second latency
Recovers from failures within 1 sec
[Chart: Grep and TopKWords cluster throughput (GB/s) vs. cluster size up to 100 nodes, at 1 sec and 2 sec latency targets]
Maximum throughput possible with 1s or 2s latency
39. Performance vs Storm
[Chart: Grep and TopK throughput (MB/s/node), Spark vs. Storm, at record sizes of 100 and 10,000 bytes]
Storm limited to 10,000 records/s/node
Also tried S4: 7000 records/s/node
41. Conclusion
Spark & Shark speed up your interactive, complex,
and (soon) streaming analytics on Hadoop data
Download and docs: www.spark-project.org
» Easy local mode and deploy scripts for EC2
User meetup: meetup.com/spark-users
Training camp at Berkeley in August!
matei@berkeley.edu / @matei_zaharia
42. Behavior with Not Enough RAM
[Chart: iteration time (s) vs. % of working set in memory. Cache disabled: 68.8; 25%: 58.1; 50%: 40.7; 75%: 29.7; fully cached: 11.5]
43. Software Stack
[Diagram: software stack. Shark (Hive on Spark), Bagel (Pregel on Spark), Spark Streaming, and other frameworks run on the Spark engine, which runs in local mode or on Apache Mesos, EC2, and YARN]
Speaker notes
Each iteration is, for example, a MapReduce job
Key idea: add "variables" to the "functions" in functional programming
This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)
This will show interactive search on 50 GB of Wikipedia data
Query planning is also better in Shark due to (1) more optimizations and (2) use of more optimized Spark operators such as hash-based join
Streaming Spark offers similar speed while providing FT and consistency guarantees that these systems lack