Let's talk about the issues we have seen our customers run into as they try to get Spark into production.
In scope
- Focus on operational issues
- Not on building the code itself
Experience from our customer support tickets
Spark makes building a proof of concept with a subset of data relatively easy.
But then things go wrong
Plug for my talk at Hadoop Summit
Let's start with an example program in Spark.
The sum() call launches a job
A chunk of data somewhere
Could be on Hadoop File System (HDFS)
Could be cached in Spark
Defines the degree of parallelism
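To make the partition idea concrete, here is a minimal plain-Python sketch (not Spark code). The helper `split_into_partitions` is hypothetical and only illustrates that the number of partitions is the number of independent tasks available to run at once.

```python
def split_into_partitions(data, num_partitions):
    """Slice data into num_partitions roughly equal chunks (hypothetical helper)."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


partitions = split_into_partitions(list(range(1, 101)), 4)

# One independent unit of work per partition: none of these partial sums
# needs data from another partition, so up to 4 could run concurrently.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)

print(len(partitions), partial_sums, total)
```

With only one partition there is only one task, no matter how many cores the cluster has.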
Describes a way of generating input and output partitions
Immutable – very important!
RDDs can depend on other RDDs
Most have single parent
Joins have multiple parents
Lineage over replication for fault tolerance
https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
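The points above can be sketched with a toy class in plain Python. `ToyRDD` and all of its methods are made up for illustration and are not Spark's API. It shows three things, under those assumptions: an RDD is a recipe for producing partitions, nothing runs until an action such as sum() is called, and a lost partition can be rebuilt from its parent instead of from a replica.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's implementation)."""

    def __init__(self, num_partitions, compute, parents=()):
        self.num_partitions = num_partitions
        self._compute = compute          # recipe: partition index -> list of values
        self.parents = tuple(parents)    # lineage: the RDDs this one depends on

    @classmethod
    def from_partitions(cls, partitions):
        frozen = tuple(tuple(p) for p in partitions)  # immutable source data
        return cls(len(frozen), lambda i: list(frozen[i]))

    def map(self, fn):
        # Returns a NEW ToyRDD with a single parent; nothing is computed here.
        parent = self
        return ToyRDD(parent.num_partitions,
                      lambda i: [fn(x) for x in parent.compute(i)],
                      parents=(parent,))

    def compute(self, i):
        return self._compute(i)

    def sum(self):
        # The "action": only now is every partition actually computed.
        return sum(sum(self.compute(i)) for i in range(self.num_partitions))


calls = []

def double(x):
    calls.append(x)   # record every real invocation
    return x * 2

base = ToyRDD.from_partitions([[1, 2], [3, 4]])
doubled = base.map(double)
calls_before_action = len(calls)   # still zero: map only recorded the recipe

total = doubled.sum()              # the action runs the recipe on all partitions
rebuilt = doubled.compute(1)       # "lost" partition 1 is recomputed from its parent

print(calls_before_action, total, rebuilt)
```

Immutability is what makes the recompute safe: the parent's data cannot have changed since the first run.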
Let's review the general Spark architecture
A driver
Where the DAG scheduler lives
Drives the show
Single point of failure
Executors
Communicate with the driver
Run the tasks created by the driver
Think of each one as a ThreadPoolExecutor in Java
Pluggable cluster managers
YARN, Mesos, standalone
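The driver and executor split can be sketched with a thread pool. This is an analogy only, using Python's `concurrent.futures.ThreadPoolExecutor` in place of the Java class; `run_job` and `executor_slots` are hypothetical names, and a real Spark executor is a separate process rather than a thread pool inside the driver.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(partitions, task, executor_slots):
    """Toy 'driver': creates one task per partition and hands them to a pool."""
    with ThreadPoolExecutor(max_workers=executor_slots) as executor:
        # More tasks than slots is fine: the extra tasks wait in the queue.
        futures = [executor.submit(task, p) for p in partitions]
        # Results are collected in partition order, not completion order.
        return [f.result() for f in futures]


# Three tasks, two slots: the third task runs once a slot frees up.
results = run_job([[1, 2, 3], [4, 5], [6]], task=len, executor_slots=2)
print(results)
```

Note that all the bookkeeping (which tasks exist, which have finished) lives in `run_job`. If that function dies, the job is gone, which is the single-point-of-failure property of the driver.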
Control plane
File distribution
Block Manager
User UI / REST API
Data-at-rest (shuffle files)