What's New in Apache Spark 2.3 & Why Should You Care

What's New in Apache Spark™ 2.3
Jules S. Damji
BASM Meetup, Bloomberg
@2twitme
May 15, 2018

Spark Community & Developer Advocate @ Databricks
Program Chair, Spark + AI Summit
Developer Advocate @ Hortonworks
Software engineering @ Sun Microsystems, Netscape, @Home, VeriSign, Scalix, Centrify, LoudCloud/Opsware, ProQuest
https://www.linkedin.com/in/dmatrix

Databricks’ Unified Analytics Platform
DATABRICKS RUNTIME
COLLABORATIVE WORKSPACE
Delta SQL Streaming
Powered by
Data Engineers Data Scientists
CLOUD NATIVE SERVICE
Unifies Data Engineers and Data Scientists
Unifies Data and AI Technologies
Eliminates infrastructure complexity

Major Features in Apache Spark 2.3
• Continuous Processing
• Data Source API V2
• Stream-stream Join
• Spark on Kubernetes
• History Server V2
• UDF Enhancements
• Various SQL Features
• PySpark Performance
• Native ORC Support
• Stable Codegen
• Image Reader
• ML on Streaming
Over 1,400 issues resolved!

This Talk: Apache Spark 2.3
• Continuous Processing
• Spark on Kubernetes
• PySpark Performance
• Streaming ML + Image Reader
• Stream-stream Join

This Talk
• Continuous Processing
• Stream-stream Join

Building robust stream processing apps is hard

Complexities in stream processing
COMPLEX DATA
Diverse data formats (json, avro, binary, …)
Data can be dirty, late, out-of-order
COMPLEX SYSTEMS
Diverse storage systems (Kafka, S3, Kinesis, RDBMS, …)
System failures
COMPLEX WORKLOADS
Combining streaming with interactive queries
Joining two Streams
Machine learning with Streams

Structured Streaming
stream processing on Spark SQL engine
fast, scalable, fault-tolerant
rich, unified, high level APIs
deal with complex data and complex workloads
rich ecosystem of data sources
integrate with many storage systems

Structured Streaming – Continuous Processing Mode

Structured Streaming Processing Modes

Structured Streaming – Continuous Mode

Structured Streaming
Introduced in Spark 2.0 and production-ready in Spark 2.2
Among Databricks customers:
- 10X more usage than DStream
- Processed 100+ trillion records in production

Continuous Processing Execution Mode
Continuous Processing (since 2.3 release) [SPARK-20928]
• A new streaming execution mode (experimental)
• Low (~1 ms) end-to-end latency
• At-least-once guarantees
Micro-batch Processing (since 2.0 release)
• End-to-end latencies as low as ~100 ms
• Exactly-once fault-tolerance guarantees

Continuous Processing
The only change you need is the query trigger!
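A minimal PySpark sketch of that single change, assuming an existing streaming DataFrame from a supported source. The helper names trigger_kwargs and start_console_query are hypothetical; DataStreamWriter.trigger() and its processingTime and continuous keywords are the PySpark API as of 2.3.

```python
def trigger_kwargs(mode, interval):
    """Keyword arguments for DataStreamWriter.trigger() for the chosen mode."""
    if mode == "continuous":
        # In continuous mode the interval is the checkpoint interval,
        # not a batch interval.
        return {"continuous": interval}
    if mode == "micro-batch":
        return {"processingTime": interval}
    raise ValueError("unknown mode: %r" % mode)


def start_console_query(stream_df, mode="continuous", interval="1 second"):
    """Start a streaming query that writes to the console sink.

    stream_df is assumed to be a streaming DataFrame, for example one
    read from the Kafka or rate source.
    """
    writer = stream_df.writeStream.format("console")
    return writer.trigger(**trigger_kwargs(mode, interval)).start()
```

Switching between the two modes changes only the keyword passed to trigger(); the rest of the query stays the same.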
Rating/Benchmarking

Continuous Processing
Supported Operations:
• Map-like Dataset operations
  • projections
  • selections
• All SQL functions, except current_timestamp(), current_date() and aggregation functions
Supported Sources:
• Kafka source
• Rate source
Supported Sinks:
• Kafka sink
• Memory sink
• Console sink
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Stream-Stream Joins

Inner Join + Time constraints + Watermarks
Time constraints
- Impressions can be 2 hours late
- Clicks can be 3 hours late
- A click can occur within 1 hour after the corresponding impression
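import org.apache.spark.sql.functions.expr

// Bound how late each stream's events may arrive (event-time watermarks)
val impressionsWithWatermark = impressions
  .withWatermark("impressionTime", "2 hours")
val clicksWithWatermark = clicks
  .withWatermark("clickTime", "3 hours")

// Inner join: same ad, and the click falls within 1 hour after the impression
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """)
)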
Join: clickAdId = impressionAdId
Range Join: clickTime between impressionTime and impressionTime + 1 hour
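To make the join condition concrete, here is a plain-Python model of it over small in-memory batches. This is only an illustration of the predicate, not the Spark API: the helpers matches and join_batches are hypothetical, and watermark-based state cleanup is not modeled.

```python
from datetime import datetime, timedelta

# A click can occur within 1 hour after the corresponding impression.
MAX_CLICK_DELAY = timedelta(hours=1)


def matches(impression, click):
    """Plain-Python version of the join condition from the slide."""
    return (
        click["clickAdId"] == impression["impressionAdId"]
        and click["clickTime"] >= impression["impressionTime"]
        and click["clickTime"] <= impression["impressionTime"] + MAX_CLICK_DELAY
    )


def join_batches(impressions, clicks):
    """Inner join of two in-memory batches using the condition above."""
    return [(i, c) for i in impressions for c in clicks if matches(i, c)]


impression = {"impressionAdId": 1, "impressionTime": datetime(2018, 5, 15, 12, 0)}
clicks = [
    {"clickAdId": 1, "clickTime": datetime(2018, 5, 15, 12, 30)},  # matches
    {"clickAdId": 1, "clickTime": datetime(2018, 5, 15, 13, 30)},  # too late
    {"clickAdId": 2, "clickTime": datetime(2018, 5, 15, 12, 10)},  # other ad
]
joined = join_batches([impression], clicks)
```

In the real streaming join, the watermarks and the 1-hour range together let the engine decide when buffered rows can no longer match and can be dropped from state.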

Stream-stream Joins
• Inner: Supported, optionally specify watermark on both sides + time constraints for state cleanup
• Left: Conditionally supported, must specify watermark on right + time constraints for correct results, optionally specify watermark on left for all state cleanup
• Right: Conditionally supported, must specify watermark on left + time constraints for correct results, optionally specify watermark on right for all state cleanup
• Full: Not supported.


Streaming Machine Learning
Model transformation/prediction on batch and streaming data with a unified API.
After fitting a model or Pipeline, you can deploy it in a streaming job.
val streamOutput = transformer.transform(streamDF)

Demo: Structured Streaming + MLlib
https://dbricks.co/ss_mllib_py

Image Support in Spark
Spark Image data source [SPARK-21866]:
• Defines a standard API in Spark for loading and reading images
• Deep learning frameworks can rely on this
val df = ImageSchema.readImages("/data/images")


PySpark
Introduced in Spark 0.7 (~2013); became a first-class citizen in the DataFrame API in Spark 1.3 (~2015)
Much slower than Scala/Java with user-defined functions (UDFs), due to serialization and Python interpreter overhead
Note: Most PyData tooling, e.g., Pandas and NumPy, has its performance-critical code written in C or Cython

PySpark Performance
Fast data serialization and execution using vectorized formats [SPARK-22216] [SPARK-21187]
• Conversion from/to Pandas
  df.toPandas()
  createDataFrame(pandas_df)
• Pandas/Vectorized UDFs: UDFs that use Pandas to process data
  • Scalar Pandas UDFs
  • Grouped Map Pandas UDFs

Pandas UDF
Scalar Pandas UDFs:
• Used with functions such as select and withColumn.
• The Python function should take pandas.Series as inputs and return a pandas.Series of the same length.
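A minimal sketch of a scalar Pandas UDF, assuming pyspark, pandas and pyarrow are installed. The names plus_one, as_scalar_pandas_udf and add_plus_one_column are hypothetical; pandas_udf and PandasUDFType.SCALAR are the PySpark 2.3 API. The Spark imports are deferred so the plain function can be defined and tried without a Spark installation.

```python
def plus_one(v):
    """Receives a pandas.Series and returns a Series of the same length."""
    return v + 1


def as_scalar_pandas_udf(fn, return_type="double"):
    """Wrap fn as a scalar Pandas UDF (requires pyspark, pandas, pyarrow)."""
    # Imported lazily so defining plus_one does not require Spark.
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    return pandas_udf(fn, returnType=return_type,
                      functionType=PandasUDFType.SCALAR)


def add_plus_one_column(df, column="x"):
    """Hypothetical usage: df is assumed to be a Spark DataFrame whose
    `column` holds doubles."""
    plus_one_udf = as_scalar_pandas_udf(plus_one)
    return df.withColumn(column + "_plus_one", plus_one_udf(df[column]))
```

Because the function works on a whole pandas.Series at a time, the arithmetic runs vectorized rather than once per row.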

Pandas UDF
Grouped Map Pandas UDFs:
• Split-apply-combine
• A Python function that defines the computation for each group.
• The output schema
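The split-apply-combine pattern can be sketched in plain Python. This is a model of the idea only, not the Spark API: grouped_map and subtract_group_mean are hypothetical helpers working on lists of dicts, whereas in PySpark 2.3 the per-group function receives a pandas.DataFrame and is applied with df.groupby(...).apply(...).

```python
from collections import defaultdict


def subtract_group_mean(rows):
    """Computation for one group: center the 'v' values on the group mean."""
    mean = sum(r["v"] for r in rows) / len(rows)
    return [{"id": r["id"], "v": r["v"] - mean} for r in rows]


def grouped_map(rows, key, fn):
    """Split rows by key, apply fn to each group, combine the results."""
    groups = defaultdict(list)
    for row in rows:                      # split
        groups[row[key]].append(row)
    combined = []
    for group_rows in groups.values():    # apply
        combined.extend(fn(group_rows))   # combine
    return combined


rows = [
    {"id": 1, "v": 1.0}, {"id": 1, "v": 2.0},
    {"id": 2, "v": 3.0}, {"id": 2, "v": 5.0}, {"id": 2, "v": 10.0},
]
centered = grouped_map(rows, "id", subtract_group_mean)
```

Each group is processed independently, which is what lets Spark distribute the per-group function across executors.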

PySpark Performance
Blog: "Introducing Pandas UDFs for PySpark" (Two Sigma)
http://dbricks.co/2rMwmW0
Apache Arrow Columnar Format for Data Exchange
https://arrow.apache.org

Demo
Try importing this notebook at home:
https://dbricks.co/pandas_udf


Native Spark App in K8S
• New Spark scheduler backend
• Driver runs in a Kubernetes pod created by the submission client and creates pods that run the executors in response to requests from the Spark scheduler. [K8S-34377] [SPARK-18278]
• Makes direct use of Kubernetes clusters for multi-tenancy and sharing through Namespaces and Quotas, as well as administrative features such as Pluggable Authorization and Logging.

Spark on Kubernetes
Supported:
• Kubernetes 1.6 and up
• Cluster mode only
• Static resource allocation only
• Java and Scala applications
• Container-local and remote dependencies that are downloadable
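As a rough sketch of what a cluster-mode submission looks like, the helper below assembles a spark-submit command line. The function k8s_submit_args and the API server address, image name and jar path in the usage are hypothetical placeholders; the flags and the spark.kubernetes.container.image and spark.executor.instances settings are those documented for Spark 2.3 on Kubernetes.

```python
def k8s_submit_args(api_server, image, main_class, app_jar,
                    executors=2, name="spark-app"):
    """Build the spark-submit argument list for a Kubernetes cluster-mode job."""
    return [
        "bin/spark-submit",
        "--master", "k8s://" + api_server,   # Kubernetes API server URL
        "--deploy-mode", "cluster",          # only cluster mode in 2.3
        "--name", name,
        "--class", main_class,
        # Static allocation: a fixed number of executor pods.
        "--conf", "spark.executor.instances=%d" % executors,
        "--conf", "spark.kubernetes.container.image=" + image,
        app_jar,
    ]


# Placeholder values for illustration only.
args = k8s_submit_args(
    api_server="https://example.invalid:6443",
    image="example/spark:2.3.0",
    main_class="org.apache.spark.examples.SparkPi",
    app_jar="local:///opt/spark/examples/jars/spark-examples.jar",
)
```

The local:// scheme marks a jar that is already present inside the container image, which corresponds to the container-local dependencies listed above.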
In roadmap (2.4):
• Client mode
• Dynamic resource allocation + external shuffle service
• Python and R support
• Submission client local dependencies + Resource staging server (RSS)
• Non-secured and Kerberized HDFS access (injection of Hadoop configuration)

General Client Mode
https://databricks.com/blog/2018/03/06/apache-spark-2-3-with-native-kubernetes-support.html

Try these on Databricks Runtime 4.0 today!
Community edition:
https://community.cloud.databricks.com/

Save $300: JulesPicks https://tinyurl.com/saissf18

Thank You
@2twitme
jules@databricks.com
