1. From Hadoop to Spark
Introduction
Hadoop and Spark Comparison
From Hadoop to Spark
2. Hi, I'm Sujee Maniyam
- Founder / Principal @ ElephantScale
- Consulting & Training in Big Data
- Spark / Hadoop / NoSQL / Data Science
- Author
  - "Hadoop Illuminated" open source book
  - "HBase Design Patterns"
- Open Source contributor: github.com/sujee
- sujee@elephantscale.com
- www.ElephantScale.com
Spark Training available!
(c) ElephantScale.com 2015
3. Webinar Audience
- I am already using Hadoop. Should I move to Spark?
- I am considering Hadoop. Should I skip Hadoop and go straight to Spark?
4. Webinar Outline
- Intro: what is Hadoop and what is Spark?
- Capabilities and advantages of Spark & Hadoop
- Best use cases for Spark / Hadoop
- From Hadoop to Spark: how to?
6. Hadoop in 20 Seconds
- "The Original" big data platform
- Very well field-tested
- Scales to petabytes of data
- Enables analytics at massive scale
10. Hadoop: Use Cases
- Two modes: batch & real time
- Batch use case
  - Analytics at large scale (terabytes to petabytes)
  - Analysis times can be minutes / hours, depending on
    - size of the data being analyzed
    - type of query
  - Examples:
    - Large ETL workloads
    - "Analyze clickstream data and calculate top page visits"
    - "Combine purchase data and click data and figure out discounts to apply"
11. Hadoop Use Cases
- Real-time use cases do not rely on MapReduce
- Instead we use HBase
  - A real-time NoSQL datastore built on Hadoop
- Example: tracking sensor data
  - Store data from millions of sensors
  - Could be billions of data points
  - "Find the latest reading from a sensor"
  - This query must be answered in real time (milliseconds)
- "Needle in a haystack" scenarios
  - We look for one or a few records among billions
14. Big Data Analytics Evolution (v1)
- Decision times: batch (hours / days)
- Use cases:
  - Modeling
  - ETL
  - Reporting
15. Moving Towards Fast Data (v2)
- Decision time: (near) real time
  - seconds (or milliseconds)
- Use cases:
  - Alerts (medical / security)
  - Fraud detection
16. Current Big Data Processing Challenges
- Processing needs are outpacing 1st-generation tools
- Beyond batch
  - Not everyone has terabytes of data to process
  - Small-to-medium data sets (a few hundred gigs) are more prevalent
  - Data may not be on disk
    - In memory
    - Coming via streaming channels
- MapReduce (MR)'s limitations
  - Batch processing doesn't fit all needs
  - Not effective for "iterative programming" (machine learning algorithms, etc.)
  - High latency for streaming needs
- Spark is a 2nd-generation tool addressing these needs
17. What is Spark?
- Open source cluster computing engine
  - Very fast: in-memory ops up to 100x faster than MR; on-disk ops up to 10x faster than MR
  - General purpose: MR-style batch, SQL, streaming, machine learning, analytics
  - Compatible: runs on Hadoop / YARN, Mesos, or standalone
    - Works with HDFS, S3, Cassandra, HBase, ...
  - Easier to code: word count in 2 lines
- Spark's roots:
  - Came out of the Berkeley AMP Lab
  - Now a top-level Apache project
  - Version 1.5 released in Sept 2015
"First Big Data platform to integrate batch, streaming and interactive computations in a unified framework" - stratio.com
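The "word count in 2 lines" claim can be illustrated. No live SparkContext is assumed here, so this is a pure-Python stand-in for the RDD pipeline (flatMap, then a count per key); the corresponding PySpark one-liner is sketched in a comment.

```python
from collections import Counter

# Stand-in for sc.textFile(...): a tiny in-memory "file"
lines = ["to be or not to be", "to see or not to see"]

# flatMap(lambda line: line.split()) -> one flat stream of words
words = (word for line in lines for word in line.split())

# reduceByKey(add) / countByValue() -> per-word counts
counts = Counter(words)

print(counts["to"])  # 4

# The PySpark equivalent (assumes a SparkContext `sc` and an input path):
# counts = sc.textFile("input.txt").flatMap(str.split).countByValue()
```

The pure-Python version mirrors the shape of the Spark code; the difference is that Spark runs the same two logical steps partitioned across a cluster.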
18. Spark Illustrated
- Components on top of Spark Core:
  - Spark SQL (schema / SQL)
  - Spark Streaming (real time)
  - MLlib (machine learning)
  - GraphX (graph processing)
- Cluster managers: Standalone, YARN, Mesos
- Data storage: HDFS, S3, Cassandra, ???
19. Spark Core
- Basic building blocks of the distributed compute engine
  - Task scheduling and memory management
  - Fault recovery (recovers missing pieces on node failure)
  - Storage system interfaces
- Defines the Spark API and data model
- Data model: RDD (Resilient Distributed Dataset)
  - Distributed collection of items
  - Can be worked on in parallel
  - Easily created from many data sources (any HDFS InputSource)
- Spark API: Scala, Python, and Java
  - Compact API for working with RDDs and interacting with Spark
  - Much easier to use than the MapReduce API
Session 2: Introduction to Spark
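The "distributed collection worked on in parallel" idea behind RDDs can be sketched in plain Python. This is a toy stand-in, not Spark's implementation: split a collection into partitions and map over each partition as a separate task.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(items, n):
    """Split a collection into n roughly equal partitions."""
    size = (len(items) + n - 1) // n
    return [items[i:i + size] for i in range(0, len(items), size)]

def parallel_map(fn, items, num_partitions=4):
    """Apply fn to every item, one task per partition (a toy RDD.map)."""
    parts = partition(items, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        mapped = pool.map(lambda part: [fn(x) for x in part], parts)
    # flatten the per-partition results back together (a toy collect())
    return [x for part in mapped for x in part]

print(parallel_map(lambda x: x * x, list(range(10))))
```

In Spark the partitions live on different machines and a lost partition is recomputed from lineage; here they simply run on different threads.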
20. Spark Components
- Spark SQL: structured data
  - Supports SQL and HQL (Hive Query Language)
  - Data sources include Hive tables, JSON, CSV, Parquet
- Spark Streaming: live streams of data in (near) real time
  - Low latency, high throughput (1000s of events / sec)
  - Log files, stock ticks, sensor data / IoT (Internet of Things), ...
- MLlib: machine learning at scale
  - Classification / regression, collaborative filtering, ...
  - Model evaluation and data import
- GraphX: graph manipulation, graph-parallel computation
  - Social network friendships, link data, ...
  - Graph manipulation, operations, and common algorithms
21. Spark: 'Unified' Stack
- Spark components support multiple programming models
  - MapReduce-style batch processing
  - Streaming / real-time processing
  - Querying via SQL
  - Machine learning
- All modules are tightly integrated
  - Facilitates rich applications
- Spark can be the only stack you need!
  - No need to run multiple clusters (Hadoop cluster, Storm cluster, etc.)
29. Comparison With Hadoop

Hadoop                                    | Spark
----------------------------------------- | -----------------------------------------
Distributed storage + distributed compute | Distributed compute only
MapReduce framework                       | Generalized computation
Usually data on disk (HDFS)               | On disk / in memory
Not ideal for iterative work (machine learning, etc.) | Great at iterative workloads
Batch processing                          | Up to 10x faster for data on disk; up to 100x faster for data in memory
Mostly Java                               | Compact code; Java, Python, Scala supported
No unified shell                          | Shell for ad-hoc exploration
30. Spark Is a Better Fit for Iterative Workloads
32. Is Spark Replacing Hadoop?
- Spark runs on Hadoop / YARN
  - Can access data in HDFS
  - Uses YARN for clustering
- The Spark programming model is more flexible than MapReduce
- Spark really shines when data fits in memory (a few hundred gigs)
- Spark is "storage agnostic" (see next slide)
37. Going from Hadoop to Spark
Introduction
Hadoop and Spark Comparison
Going from Hadoop to Spark
38. Why Move From Hadoop to Spark?
- Spark is "easier" than Hadoop
- "Friendlier" for data scientists / analysts
  - Interactive shell
    - Fast development cycles
    - Ad-hoc exploration
- API supports multiple languages
  - Java, Scala, Python
- Great for small (gigs) to medium (100s of gigs) data
39. Spark: "Unified" Stack
- Spark supports multiple programming models
  - MapReduce-style batch processing
  - Streaming / real-time processing
  - Querying via SQL
  - Machine learning
- All modules are tightly integrated
  - Facilitates rich applications
- Spark can be the only stack you need!
  - No need to run multiple clusters (Hadoop cluster, Storm cluster, etc.)
40. Migrating From Hadoop → Spark

Functionality       | Hadoop                          | Spark
------------------- | ------------------------------- | -------------------------------------------------
Distributed storage | HDFS; cloud storage (Amazon S3) | HDFS; cloud storage (Amazon S3); distributed file systems (NFS / Ceph); distributed NoSQL (Cassandra); Tachyon (in-memory)
SQL querying        | Hive                            | Spark SQL (DataFrames)
ETL workflow        | Pig                             | Spork (Pig on Spark); mix of Spark SQL + RDD programming
Machine learning    | Mahout                          | MLlib
NoSQL DB            | HBase                           | ???
41. Things to Consider When Moving From Hadoop to Spark
1. Data size
2. File system
3. Analytics
   A. SQL
   B. ETL
   C. Machine learning
42. Data Size: "You Don't Have Big Data"
43. Data Size (T-shirt sizing)
- Sizes run from < a few GB through 10 GB+, 100 GB+, 1 TB+, 100 TB+, and PB+
- Spark fits the smaller end of that range
- Hadoop / Spark both fit the larger end (TB+ and up)
Image credit: blog.trumpi.co.za
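The T-shirt sizing reads as a rule of thumb, and can be encoded as a toy helper. The tiers are the slide's; the exact 1 TB boundary between "Spark" and "Hadoop / Spark" is my assumption, since the chart does not pin it down.

```python
def suggest_engine(data_gb):
    """Rule-of-thumb engine choice by data size.

    Tiers follow the T-shirt sizing slide; the 1 TB cutoff
    between 'Spark' and 'Hadoop / Spark' is an assumption.
    """
    if data_gb < 1024:            # small-to-medium data: Spark alone
        return "Spark"
    return "Hadoop / Spark"       # TB+ and beyond: either fits

print(suggest_engine(100))       # Spark
print(suggest_engine(50_000))    # Hadoop / Spark
```

The later slides refine this: even at the small end, the interesting question is whether the working set fits in cluster memory.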
44. Data Size
- Lots of Spark adoption at SMALL-to-MEDIUM scale
  - Good fit
  - Data might fit in memory!
- Applications
  - Iterative workloads (machine learning, etc.)
  - Streaming
49. File Systems For Spark

              | HDFS                   | NFS               | Amazon S3
------------- | ---------------------- | ----------------- | -----------------
Data locality | High (best)            | Local enough (ok) | None
Throughput    | High (best)            | Medium (good)     | Low (ok)
Latency       | Low (best)             | Low               | High
Reliability   | Very high (replicated) | Low               | Very high
Cost          | Varies                 | Varies            | $30 / TB / month
50. File Systems Throughput Comparison
- Data: 10 GB+ (11.3 GB)
- Each file: ~1+ GB (x 10)
- 400 million records total
- Partition size: 128 MB
- On HDFS & S3
- Cluster:
  - 8 nodes on Amazon m3.xlarge (4 CPUs, 15 GB mem, 40 GB SSD)
  - Hadoop cluster: Hortonworks HDP v2.2
  - Spark: on the same 8 nodes, standalone, v1.2
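Those numbers imply the task count directly; a quick back-of-the-envelope check, assuming the full 11.3 GB is split evenly into 128 MB partitions:

```python
import math

data_mb = 11.3 * 1024    # ~11.3 GB of input data, in MB
partition_mb = 128       # partition (HDFS block) size from the slide

# Each partition becomes one task, so ~91 tasks spread across 8 nodes
num_partitions = math.ceil(data_mb / partition_mb)
print(num_partitions)    # 91
```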
51. HDFS Vs. S3 (lower is better)
52. HDFS Vs. S3 (lower is better)
53. HDFS Vs. S3: Conclusions

HDFS                                   | S3
-------------------------------------- | --------------------------------------
Data locality → much higher throughput | Data is streamed → lower throughput
Need to maintain a Hadoop cluster      | No Hadoop cluster to maintain → convenient
Large data sets (TB+)                  | Good use case: smallish data sets (a few gigs); load once, cache, and re-use
54. Decision: File Systems
- Already have Hadoop? YES → use HDFS
- NO → choose based on needs:
  - HDFS
  - S3
  - NFS (Ceph)
  - Cassandra (real time)
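That decision chart can be written out as a tiny helper. The choices are the slide's; the parameterization (what question drives each branch) is my reading of it.

```python
def choose_storage(have_hadoop, need_realtime=False, in_cloud=False):
    """Toy encoding of the file-system decision chart.

    Branch order is an assumption: an existing Hadoop cluster wins,
    then real-time needs, then cloud convenience.
    """
    if have_hadoop:
        return "HDFS"            # already there: use it
    if need_realtime:
        return "Cassandra"       # real-time NoSQL option
    if in_cloud:
        return "S3"              # no cluster to maintain
    return "NFS (Ceph)"          # plain distributed file system

print(choose_storage(True))                    # HDFS
print(choose_storage(False, in_cloud=True))    # S3
```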
55. Next Decision: SQL
"We use SQL heavily for data mining. We are using Hive / Impala on Hadoop. Is Spark right for us?"
56. SQL in Hadoop / Spark

                 | Hadoop                                                        | Spark
---------------- | ------------------------------------------------------------- | -----------------------------------------------
Engine           | Hive (on MapReduce, or Tez on Hortonworks); Impala (Cloudera) | Spark SQL using DataFrames; HiveContext
Language         | HiveQL                                                        | HiveQL; RDD programming in Java / Python / Scala
Scale            | Terabytes / petabytes                                         | Gigabytes / terabytes / petabytes
Interoperability | Data stored in HDFS                                           | Hive tables; file system
Formats          | CSV, JSON, Parquet                                            | CSV, JSON, Parquet
57. DataFrames Vs. RDDs
- RDDs have data
- DataFrames also have a schema
- DataFrames used to be called "SchemaRDD"
- Unified way to load / save data in multiple formats
- Provides high-level operations
  - Count / sum / average
  - Select columns & filter them

// load JSON data
val df = sqlContext.read
  .format("json")
  .load("/data/data.json")

// save as Parquet (faster queries)
// note: save() takes a path; saveAsTable() expects a table name
df.write
  .format("parquet")
  .save("/data/datap/")
60. Querying Using SQL
- A DataFrame can be registered as a temporary table
  - You can then use SQL to query it, as shown below
  - This is handled similarly to DSL queries: building up an AST and sending it to Catalyst

scala> df.registerTempTable("people")
scala> sqlContext.sql("select * from people").show()
name age
John 35
Jane 40
Mike 20
Sue  52

scala> sqlContext.sql("select * from people where age > 35").show()
name age
Jane 40
Sue  52
61. Going From Hive → Spark
- Spark natively supports querying data stored in Hive tables!
- Handy in an existing Hadoop cluster!

HIVE
hive> select customer_id, SUM(cost) as total from billing group by customer_id order by total DESC LIMIT 10;

SPARK
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
val top10 = hiveCtx.sql(
  "select customer_id, SUM(cost) as total from billing " +
  "group by customer_id order by total DESC LIMIT 10")
top10.collect()
62. Spark SQL Vs. Hive
Fast on the same HDFS data!
63. Spark SQL Vs. Hive
Fast on the same data on HDFS
64. Decision: SQL
- Using Hive? YES → Spark using HiveContext
- NO → Spark SQL with DataFrames
65. Next Decision: ETL
"We do a lot of ETL work on our Hadoop cluster, using tools like Pig / Cascading. Can we use Spark?"
66. ETL on Hadoop / Spark

ETL       | Hadoop                  | Spark
--------- | ----------------------- | ---------------------------------------------------------
ETL tools | Pig, Cascading, Oozie   | Native RDD programming (Scala, Java, Python); Cascading?
Pig       | High-level ETL workflow | Spork: Pig on Spark
Cascading | High level              | spark-scalding
Cask      | Works                   | Works
67. Data Transformation on Spark
- DataFrames are great for high-level manipulation of data
  - High-level operations: join / union, etc.
  - Joining / merging disparate data sets
  - Can read and understand a multitude of data formats (JSON / Parquet, etc.)
  - Very easy to program
- RDD APIs allow low-level programming
  - Complex manipulations
  - Lookups
  - Supports multiple languages (Java / Scala / Python)
- High-level libraries are emerging
  - Tresata
  - CASK
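The "joining disparate data sets" step can be sketched in plain Python, echoing the earlier "combine purchase data and click data" example. This is a stand-in for what a DataFrame join does; the record layout and field names are invented for illustration.

```python
# Hypothetical data sets; field names are invented for illustration.
purchases = [{"user": "jane", "spend": 120}, {"user": "mike", "spend": 40}]
clicks    = [{"user": "jane", "clicks": 30}, {"user": "sue",  "clicks": 12}]

# Inner join on "user": build a lookup, then merge matching rows.
# (Roughly what joining two DataFrames on a shared column performs,
# minus the distributed shuffle.)
clicks_by_user = {row["user"]: row["clicks"] for row in clicks}
joined = [
    {**p, "clicks": clicks_by_user[p["user"]]}
    for p in purchases
    if p["user"] in clicks_by_user
]
print(joined)  # [{'user': 'jane', 'spend': 120, 'clicks': 30}]
```

The DataFrame version expresses only the intent (join on a column) and leaves the lookup and shuffle strategy to the engine, which is why it is the easier API for this kind of work.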
68. Decisions: ETL

Current ETL             | Spark options
----------------------- | ------------------------------------
Pig                     | Spork; RDD API; DataFrames
Cascading               | Cascading on Spark; RDD; DataFrames
Java MapReduce / custom | RDD; DataFrames
69. Decision: Machine Learning
"Can we use Spark for Machine Learning?" - YES
70. Machine Learning: Hadoop / Spark

                     | Hadoop | Spark
-------------------- | ------ | ---------------------
Tool                 | Mahout | MLlib
API                  | Java   | Java / Scala / Python
Iterative algorithms | Slower | Very fast (in memory)
In-memory processing | No     | Yes

- Mahout runs on Hadoop or on Spark; latest news: Mahout only accepts new code that runs on Spark
- MLlib is a new and young library
- Future: Mahout & MLlib both on Spark? Many opinions
71. Decision: In-Memory Processing
"How can we do in-memory processing using Spark?"
72. Numbers Everyone Should Know (Jeff Dean, Fellow @ Google)

Operation                           | Cost (nanoseconds)
----------------------------------- | -------------------------
L1 cache reference                  | 0.5
Branch mispredict (CPU)             | 5
L2 cache reference                  | 7
Mutex lock/unlock                   | 100
Main memory reference               | 100
Compress 1 KB with Zippy            | 10,000
Send 2 KB over 1 Gbps network       | 20,000
Read 1 MB sequentially from memory  | 250,000 (0.25 ms)
Round trip within same datacenter   | 500,000 (0.5 ms)
Disk seek                           | 10,000,000 (10 ms)
Read 1 MB sequentially from network | 10,000,000 (10 ms)
Read 1 MB sequentially from disk    | 30,000,000 (30 ms)
Send packet CA → Netherlands → CA   | 150,000,000 (150 ms)
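The table above explains Spark's in-memory advantage directly: sequential reads from memory are two orders of magnitude faster than from disk. A quick calculation with the table's numbers (the 10 GB data-set size is just an example):

```python
READ_1MB_MEMORY_NS = 250_000     # from the table: 0.25 ms per MB
READ_1MB_DISK_NS = 30_000_000    # from the table: 30 ms per MB

speedup = READ_1MB_DISK_NS / READ_1MB_MEMORY_NS
print(speedup)  # 120.0 -> memory is ~120x faster per MB read

# Scanning a 10 GB data set, cached vs. re-read from disk each pass:
mb = 10 * 1024
print(mb * READ_1MB_MEMORY_NS / 1e9)  # ~2.6 s from memory
print(mb * READ_1MB_DISK_NS / 1e9)    # ~307 s from disk
```

For iterative workloads that make many passes over the same data, that per-pass gap is exactly what Spark's caching exploits.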
73. Spark Caching
- Caching is pretty effective (small / medium data sets)
- Cached data cannot be shared across applications (each application executes in its own sandbox)
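The payoff of caching can be sketched in plain Python; a toy stand-in for rdd.cache(), not Spark's implementation: materialize an expensive result on the first action and reuse the in-memory copy afterwards.

```python
class CachedDataset:
    """Toy stand-in for an RDD with .cache(): the expensive load runs
    once, and later 'actions' reuse the in-memory copy."""

    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._data = None
        self.loads = 0               # how many times we hit "disk"

    def collect(self):
        if self._data is None:       # first action: materialize & cache
            self._data = self._load_fn()
            self.loads += 1
        return self._data

ds = CachedDataset(lambda: [x * x for x in range(5)])
ds.collect()
ds.collect()
print(ds.loads)  # 1 -> the expensive load ran only once
```

The sandbox caveat holds here too: this cache lives inside one process, just as Spark's cache lives inside one application, which motivates the sharing options on the next slide.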
76. Sharing Cached Data
- By default Spark applications cannot share cached data
  - Each runs in isolation
- Option 1: "Spark Job Server"
  - Multiplexer: all requests are executed through the same "context"
  - Provides a web-service interface
- Option 2: Tachyon
  - Distributed in-memory file system
  - Memory is the new disk!
  - Out of the AMP Lab, Berkeley
  - Early stages (very promising)
80. How to Get Spark?
81. Getting Spark
- Running Hadoop? YES → install Spark on the Hadoop cluster
- NO → Need HDFS?
  - YES → install HDFS + YARN + Spark
  - NO → Environment?
    - Production → Spark + Mesos + S3
    - Testing → Spark (standalone) + NFS or S3
82. Spark Cluster Setup 1: Simple
- Great for POCs / experimentation
- No dependencies
- Uses Spark's "standalone" cluster manager
83. Spark Cluster Setup 2: Production
- Works well with the Hadoop ecosystem (HDFS / Hive, etc.)
- Best way to adopt Spark on Hadoop
- Uses YARN as the cluster manager
84. Spark Cluster Setup 3: Production
- Uses Mesos as the cluster manager
86. Use Case 1: Moving to the Cloud
87. Use Case 1: Lessons Learned
- Size
  - Small Hadoop cluster (8 nodes)
  - Smallish data: 50 GB - 300 GB
  - Data for processing: a few gigs per query
- Good!
  - Only one moving part: Spark
  - No Hadoop cluster to maintain
  - S3 was dependable (passive) storage
  - Query response times went from minutes to seconds (because we went from MR → Spark)
- Not so good
  - We lost the data locality of HDFS (ok for small/medium data sets)
88. Use Case 2: Persistent Caching in Spark
- Can we improve latency in this setup?
- Caching will help
- However, in Spark cached data cannot be shared across applications
89. Use Case 2: Persistent Caching in Spark
- Spark Job Server to the rescue!
90. Final Thoughts
- Already on Hadoop?
  - Try Spark side by side
  - Process some data in HDFS
  - Try Spark SQL on Hive tables
- Contemplating Hadoop?
  - Try Spark (standalone)
  - Choose an NFS or S3 file system
- Take advantage of caching
  - Iterative loads
  - Spark Job Server
  - Tachyon
91. Thanks and Questions?
Sujee Maniyam
Founder / Principal @ ElephantScale
Expert consulting + training in Big Data technologies
sujee@elephantscale.com
ElephantScale.com
Sign up for upcoming trainings: ElephantScale.com/training
(c) ElephantScale.com 2015