Introduction to Spark

By Tsai Li Ming
Spark Meetup (SG) - 26 Nov 2014

Who am I?
• Started with High Performance Computing -> Grid -> Cloud and now Big Data
• Member of PUGS committee
• http://about.me/tsailiming

Speed matters
http://web.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf

http://hblok.net/blog/storage/

In Memory Data Grid
Ignite vs. Spark
Apache Spark is a data-analytic and ML centric system that ingest data from HDFS or another distributed file
system and performs in-memory processing of this data. Ignite is an In-Memory Data Fabric that is
data source agnostic and provides both Hadoop-like computation engine
(MapReduce) as well as many other computing paradigms like MPP, MPI, Streaming
processing. Ignite also includes first-class level support for cluster management and operations, cluster-aware
messaging and zero-deployment technologies. Ignite also provides support for full ACID transactions
spanning memory and optional data sources. Ignite is a broader in-memory system that is less
focused on Hadoop. Apache Spark is more inclined towards analytics and ML, and
focused on MR-specific payloads.
https://wiki.apache.org/incubator/IgniteProposal

In Memory Database
And many more!
http://en.wikipedia.org/wiki/List_of_in-memory_databases

What is Spark?
• Developed at UC Berkeley in 2009. Open sourced in 2010.
• Fast and general engine for large-scale data processing
• Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• Multi-step Directed Acyclic Graphs (DAGs): a job can have many stages, compared with Hadoop's Map and Reduce only
• Rich Scala, Java and Python APIs. R too! (SparkR to be merged into 1.3.0)
• Interactive Shell
• Active development

Active Development
*June 2014*
http://radar.oreilly.com/2014/06/a-growing-number-of-applications-are-being-built-with-spark.html

http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012

Logistic Regression Performance
http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012

Spark Execution Flow
YARN
Standalone

Resilient Distributed Datasets (RDDs)
• Basic abstraction in Spark. Fault-tolerant collection of elements that can be operated on in parallel
• RDDs can be created from local file system, HDFS, Cassandra, HBase, Amazon S3, SequenceFiles, and any other Hadoop InputFormat.
• Different levels of caching: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, OFF_HEAP, etc
• Rich APIs for Transformations and Actions
• Data Locality: PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL
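The split between transformations and actions can be illustrated without a cluster. The sketch below is plain Python, not Spark: ToyRDD and its methods are hypothetical names that only mimic the idea that transformations (map, filter) are recorded lazily and an action (collect, count) triggers the actual work.

```python
# Toy illustration only: ToyRDD is a hypothetical class, not part of PySpark.
class ToyRDD(object):
    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)   # recorded transformations, not yet applied

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, func):
        return ToyRDD(self._data, self._steps + (("map", func),))

    def filter(self, pred):
        return ToyRDD(self._data, self._steps + (("filter", pred),))

    # Actions: run the recorded steps and return a result to the caller.
    def collect(self):
        items = self._data
        for kind, func in self._steps:
            if kind == "map":
                items = [func(x) for x in items]
            else:
                items = [x for x in items if func(x)]
        return items

    def count(self):
        return len(self.collect())


calls = []

def add_one(x):
    calls.append(x)      # record that the function actually ran
    return x + 1

rdd = ToyRDD([1, 2, 3, 4, 5]).map(add_one).filter(lambda x: x > 3)
calls_before_action = len(calls)   # no element has been processed yet
result = rdd.collect()             # the action triggers the computation
```

In this toy, add_one is not called until collect() runs, which is the behaviour the real API relies on to plan a whole chain of transformations before executing it.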

RDD Operations
map flatMap sortByKey
filter union reduce
sample join count
groupByKey distinct saveAsTextFile
reduceByKey mapValues first

RDDs Transformations Examples (1)
map(func)
    >>> myList = [1,2,3,4,5]
    >>> myRDD = sc.parallelize(myList)
    >>> myRDD.map(lambda x: x+1).collect()
    [2, 3, 4, 5, 6]
filter(func)
    >>> myRDD.filter(lambda x: x > 3).collect()
    [4, 5]

RDDs Transformations Examples (2)
reduceByKey(func, [numTasks])
    >>> myList = [1,2,2,3,3,3,4,4,4,4,5,5,5,5,5]
    >>> myRDD = sc.parallelize(myList)
    >>> pairs = myRDD.map(lambda x: (x,1))
    >>> pairs.collect()[:3]
    [(1, 1), (2, 1), (2, 1)]
    >>> counts = pairs.reduceByKey(lambda a, b: a + b)
    >>> sorted(counts.collect())
    [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
groupByKey([numTasks])
    >>> for key, values in sorted(pairs.groupByKey().collect()):
    ...     print "%s : %s" % (key, list(values))
    1 : [1]
    2 : [1, 1]
    3 : [1, 1, 1]
    4 : [1, 1, 1, 1]
    5 : [1, 1, 1, 1, 1]
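To make the difference between the two operations concrete, here is a minimal local sketch in plain Python (no Spark involved). The helper names reduce_by_key and group_by_key are hypothetical; they only imitate what reduceByKey and groupByKey return for a small list of pairs.

```python
from collections import defaultdict

def reduce_by_key(pairs, func):
    """Merge the values of each key pairwise with func (local imitation of reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return sorted(merged.items())

def group_by_key(pairs):
    """Collect all values of each key into a list (local imitation of groupByKey)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

my_list = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]
pairs = [(x, 1) for x in my_list]

counts = reduce_by_key(pairs, lambda a, b: a + b)   # one merged value per key
groups = group_by_key(pairs)                        # every value kept per key
```

reduce_by_key keeps a single running value per key, while group_by_key holds on to every value, which is why a reduceByKey-style merge is usually preferred when only an aggregate is needed.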

RDDs Actions Examples (1)
reduce(func)
    >>> myList = [1,2,3,4,5]
    >>> myRDD = sc.parallelize(myList)
    >>> myRDD.reduce(lambda x,y: x+y)
    15
collect()
    >>> myList = [1,2,3,4,5]
    >>> myRDD.collect()
    [1, 2, 3, 4, 5]

RDDs Actions Examples (2)
count()
    >>> myList = [1,2,3,4,5]
    >>> myRDD.count()
    5
take(3)
    >>> myList = [1,2,3,4,5]
    >>> myRDD.take(3)
    [1, 2, 3]

Fault Recovery
• An RDD is an immutable, deterministically re-computable, distributed dataset.
• Each RDD tracks lineage information so that lost data can be rebuilt
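The lineage idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not Spark's implementation: LineageNode, lose_data and compute are hypothetical names, and the "lost" data is simply a discarded in-memory list.

```python
# Toy illustration only: LineageNode is a hypothetical class, not a Spark API.
class LineageNode(object):
    """Remembers its parent and the function that derives it from the parent."""

    def __init__(self, parent=None, func=None, source=None):
        self.parent = parent      # where this dataset comes from
        self.func = func          # deterministic function applied to the parent
        self.source = source      # only set for the root dataset
        self.cached = None        # in-memory copy; may be lost at any time

    def map(self, func):
        return LineageNode(parent=self, func=func)

    def compute(self):
        # Rebuild from the lineage whenever the in-memory copy is missing.
        if self.cached is None:
            if self.parent is None:
                self.cached = list(self.source)
            else:
                self.cached = [self.func(x) for x in self.parent.compute()]
        return self.cached

    def lose_data(self):
        self.cached = None        # simulate losing the in-memory copy


base = LineageNode(source=[1, 2, 3, 4, 5])
squared = base.map(lambda x: x * x)

first = squared.compute()
squared.lose_data()               # the computed data disappears
rebuilt = squared.compute()       # recomputed from parent + function
```

Because the function is deterministic and the parent data never changes, the rebuilt result equals the original one, which is why immutability and deterministic recomputation appear together in the definition above.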

Types of Deployment
• Standalone Deployment
    $ sbin/start-all.sh
• Amazon EC2
    $ ec2/spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
• Apache Mesos
• Hadoop YARN

Other Spark Projects (1)
• Spark Job Server
    "Spark as a Service": Simple REST interface for all aspects of job, context management
    https://github.com/ooyala/spark-jobserver
• Hue support for Spark
    http://gethue.com/a-new-spark-web-ui-spark-app/

Other Spark Projects (2)
• Tachyon
    “Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks”
    http://tachyon-project.org/
• Spark Notebook
    “Use Apache Spark straight from the Browser”
    https://github.com/andypetrella/spark-notebook/

Wordcount Example (1)
Hadoop MapReduce
    //package org.myorg;
    import java.io.IOException;
    import java.util.*;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;
    public class WordCount {
        // Mapper: emits (word, 1) for every token in the input line
        public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();
            public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, one);
                }
            }
        }
        // Reducer: sums the counts emitted for each word
        public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }
        // Driver: configures and submits the job
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            //conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }
Spark Scala
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")
Spark Python
    file = spark.textFile("hdfs://...")
    counts = file.flatMap(lambda line: line.split(" ")) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("hdfs://...")

$ bin/pyspark --master local[2]
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/
Using Python version 2.7.3 (v2.7.3:70274d53c1dd, Apr 9 2012 20:52:43)
SparkContext available as sc.
>>>
--driver-memory MEM     Memory for driver (e.g. 1000M, 2G) (Default: 512M).
--master MASTER_URL     spark://host:port, mesos://host:port, yarn, or local.
--executor-memory MEM   Memory per executor (e.g. 1000M, 2G) (Default: 1G).

>>> file = sc.textFile("file:///Users/ltsai/Downloads/spark-1.1.0-bin-hadoop2.4/README.md").cache()
14/11/25 15:33:14 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=327410, maxMem=278302556
14/11/25 15:33:14 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 159.9 KB, free 264.9 MB)
>>> file.count()
14/11/25 15:33:34 INFO DAGScheduler: Stage 0 (count at <stdin>:1) finished in 2.185 s
14/11/25 15:33:34 INFO SparkContext: Job finished: count at <stdin>:1, took 2.301952 s
141
>>> counts = file.flatMap(lambda line: line.split()) \
...     .map(lambda word: (word, 1)) \
...     .reduceByKey(lambda a, b: a + b)
>>> counts.take(5)
[(u'email,', 1), (u'Contributing', 1), (u'webpage', 1), (u'when', 3), (u'<http://spark.apache.org/documentation.html>.', 1)]
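The three steps of the session above (flatMap, map, reduceByKey) can be traced on a tiny input with plain Python. This is only a local sketch for following the data flow: word_count and sample_lines are hypothetical names, and nothing here runs on Spark.

```python
from itertools import chain

def word_count(lines):
    """Local imitation of flatMap -> map -> reduceByKey for counting words."""
    # flatMap: split every line and flatten the results into one stream of words
    words = chain.from_iterable(line.split() for line in lines)
    # map: turn each word into a (word, 1) pair
    pairs = ((word, 1) for word in words)
    # reduceByKey: add up the 1s of each distinct word
    counts = {}
    for word, one in pairs:
        counts[word] = counts.get(word, 0) + one
    return counts

# Hypothetical sample input standing in for the lines of a text file
sample_lines = ["to be or not to be", "to see or not to see"]
counts = word_count(sample_lines)
```

Splitting on whitespace keeps punctuation attached to words, which matches keys such as u'email,' in the output above.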

Spark in Enterprise
Data Sources:
• Traditional Data Sources (RDBMS, EDW, MPP, Excel, CSV)
• New Sources (Web, Logs, Social Media, IOT)
Data Processing:
• In Memory = Fast!
• Real Time Streaming + Analytics + Spark SQL
• Batch Processing
• HiveQL
Applications:
• BI Tools
• Enterprise Applications
• Custom Applications

Enterprise Ready?
• Authentication? Kerberos? AD? LDAP?
• Encryption? SSL? TLS?
• Audit Trail?
• Commercial Support?
• ...?

Use Cases: #1
Music Recommendations at Scale with Spark
http://spark-summit.org/2014/talk/music-recommendations-at-scale-with-spark
