7. In Memory Data Grid
Ignite vs. Spark
Apache Spark is a data-analytics and ML-centric system that ingests data from HDFS or another distributed file
system and processes it in memory. Ignite is an In-Memory Data Fabric that is
data-source agnostic and provides both a Hadoop-like computation engine
(MapReduce) and many other computing paradigms, such as MPP, MPI, and stream
processing. Ignite also includes first-class support for cluster management and operations, cluster-aware
messaging, and zero-deployment technologies, as well as full ACID transactions
spanning memory and optional data sources. In short, Ignite is a broader in-memory system that is less
focused on Hadoop, while Apache Spark is geared towards analytics and ML and
focused on MR-specific payloads.
https://wiki.apache.org/incubator/IgniteProposal
8. In Memory Database
And many more!
http://en.wikipedia.org/wiki/List_of_in-memory_databases
10. What is Spark?
• Developed at UC Berkeley in 2009. Open
sourced in 2010.
• Fast and general engine for large-scale data
processing
• Run programs up to 100x faster than Hadoop
MapReduce in memory, or 10x faster on disk
• Multi-step Directed Acyclic Graphs (DAGs).
A job can have many stages, compared with
Hadoop's fixed Map and Reduce steps.
• Rich Scala, Java and Python APIs. R too!
(SparkR to be merged into 1.3.0)
• Interactive Shell
• Active development
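To make the DAG bullet concrete, here is a minimal sketch in plain Python. It is not Spark's scheduler: the stage names are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for the ordering logic. It only shows how a job with several dependent stages can be ordered so that every stage runs after the stages it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each stage maps to the set of stages it depends on.
# A classic MapReduce job would have exactly two steps; here there are six.
stages = {
    "load_logs": set(),
    "load_users": set(),
    "parse": {"load_logs"},
    "join": {"parse", "load_users"},
    "aggregate": {"join"},
    "save": {"aggregate"},
}

# static_order() yields each stage only after all of its dependencies.
order = list(TopologicalSorter(stages).static_order())

for step, name in enumerate(order, start=1):
    print(step, name)
```

The two load stages have no dependencies, so either may come first; only the relative order along each dependency edge is guaranteed.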
18. Resilient Distributed Datasets (RDDs)
• Basic abstraction in Spark. Fault-tolerant collection of
elements that can be operated on in parallel
• RDDs can be created from local file system, HDFS,
Cassandra, HBase, Amazon S3, SequenceFiles, and
any other Hadoop InputFormat.
• Different levels of caching: MEMORY_ONLY,
MEMORY_AND_DISK, DISK_ONLY, OFF_HEAP, etc
• Rich APIs for Transformations and Actions
• Data Locality: PROCESS_LOCAL -> NODE_LOCAL ->
RACK_LOCAL
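The difference between transformations, actions and caching can be sketched with a toy class in plain Python. `ToyRDD` is a hypothetical stand-in written for this illustration, not Spark's RDD API: transformations only record work, an action triggers it, and `cache()` keeps the computed elements in memory for reuse (roughly the idea behind MEMORY_ONLY).

```python
class ToyRDD:
    """Toy stand-in for an RDD: lazy transformations, eager actions."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function returning an iterator
        self._persist = False     # set by cache()
        self._cache = None        # filled on first evaluation if persisted

    @classmethod
    def parallelize(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    def _iter(self):
        if self._cache is not None:
            return iter(self._cache)            # reuse cached elements
        if self._persist:
            self._cache = list(self._compute())
            return iter(self._cache)
        return self._compute()                  # recompute from the parent

    # Transformations: return a new ToyRDD, run nothing yet.
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self._iter()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._iter() if pred(x)))

    def cache(self):
        self._persist = True
        return self

    # Actions: force evaluation and return a plain Python value.
    def collect(self):
        return list(self._iter())

    def count(self):
        return sum(1 for _ in self._iter())


calls = []  # records every invocation of square()

def square(x):
    calls.append(x)
    return x * x

squares = ToyRDD.parallelize(range(5)).map(square).cache()
calls_before_action = len(calls)             # 0: map() alone ran nothing

evens = squares.filter(lambda x: x % 2 == 0)
result = evens.collect()                     # action: evaluates and fills the cache
total = squares.count()                      # served from the cache

print(result)       # [0, 4, 16]
print(total)        # 5
print(len(calls))   # 5: square() ran once per element despite two actions
```

Without the `cache()` call, the second action would recompute every element, which is why persisting an RDD that is reused across several actions usually pays off.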
19. RDD Operations
map          flatMap     sortByKey
filter       union       reduce
sample       join        count
groupByKey   distinct    saveAsTextFile
reduceByKey  mapValues   first
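A word count is the usual way to see several of these operations working together. The sketch below uses plain-Python stand-ins, not PySpark: `flat_map` and `reduce_by_key` are hypothetical helpers that mimic what flatMap and reduceByKey do on a single machine, and the input lines are made up.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain

def flat_map(f, items):
    """Apply f to each item and flatten the resulting sequences (like flatMap)."""
    return list(chain.from_iterable(f(x) for x in items))

def reduce_by_key(f, pairs):
    """Combine all values that share a key using f (like reduceByKey)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(f, values) for key, values in groups.items()}

lines = ["spark is fast", "ignite is in memory", "spark is general"]

words = flat_map(str.split, lines)                   # flatMap: lines -> words
pairs = [(word, 1) for word in words]                # map: word -> (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)    # reduceByKey: sum per word

# Sort by descending count, then alphabetically, and take the first entry.
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

print(top[0])   # ('is', 3)
```

In Spark the same pipeline would run on partitions spread across the cluster, with reduceByKey combining values locally before shuffling them.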
27. Other Spark Projects (1)
• Spark Job Server
"Spark as a Service": Simple REST interface for all
aspects of job, context management
https://github.com/ooyala/spark-jobserver
• Hue support for Spark
http://gethue.com/a-new-spark-web-ui-spark-app/
28. Other Spark Projects (2)
• Tachyon
“Tachyon is a memory-centric distributed file system
enabling reliable file sharing at memory-speed
across cluster frameworks”
http://tachyon-project.org/
• Spark Notebook
“Use Apache Spark straight from the Browser”
https://github.com/andypetrella/spark-notebook/
35. Spark in Enterprise
Data Sources
• Traditional Data Sources (RDBMS, EDW, MPP, Excel, CSV)
• New Sources (Web, Logs, Social Media, IoT)
Data Processing
• In Memory = Fast!
• Real Time Streaming + Analytics + Spark SQL
• Batch Processing
• HiveQL
Applications
• BI Tools
• Enterprise Applications
• Custom Applications