Introduction to Spark
By Tsai Li Ming 
Spark Meetup (SG) - 26 Nov 2014
Who am I?
• Started with High Performance Computing -> Grid -> Cloud and now Big Data
• Member of PUGS committee
• http://about.me/tsailiming
Why In Memory Computing?
Speed matters 
http://web.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf
http://hblok.net/blog/storage/
18:1 server consolidation
In Memory Data Grid 
Ignite vs. Spark
Apache Spark is a data-analytic and ML centric system that ingest data from HDFS or another distributed file system and performs in-memory processing of this data. Ignite is an In-Memory Data Fabric that is data source agnostic and provides both Hadoop-like computation engine (MapReduce) as well as many other computing paradigms like MPP, MPI, Streaming processing. Ignite also includes first-class level support for cluster management and operations, cluster-aware messaging and zero-deployment technologies. Ignite also provides support for full ACID transactions spanning memory and optional data sources. Ignite is a broader in-memory system that is less focused on Hadoop. Apache Spark is more inclined towards analytics and ML, and focused on MR-specific payloads.
https://wiki.apache.org/incubator/IgniteProposal
In Memory Database 
And many more! 
http://en.wikipedia.org/wiki/List_of_in-memory_databases
What is Spark?
What is Spark? 
• Developed at UC Berkeley in 2009. Open sourced in 2010.
• Fast and general engine for large-scale data processing
• Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• Multi-step Directed Acyclic Graphs (DAGs): a job can have many stages, compared to Hadoop's Map and Reduce only.
• Rich Scala, Java and Python APIs. R too! (SparkR to be merged into 1.3.0)
• Interactive Shell
• Active development
Spark Stack
Active Development 
*June 2014* 
http://radar.oreilly.com/2014/06/a-growing-number-of-applications-are-being-built-with-spark.html
http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012
Logistic Regression Performance 
http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012
How Spark works
Spark Execution Flow 
YARN 
Standalone
Resilient Distributed Datasets (RDDs) 
• Basic abstraction in Spark. Fault-tolerant collection of elements that can be operated on in parallel
• RDDs can be created from the local file system, HDFS, Cassandra, HBase, Amazon S3, SequenceFiles, and any other Hadoop InputFormat.
• Different levels of caching: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, OFF_HEAP, etc.
• Rich APIs for Transformations and Actions
• Data Locality: PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL
RDD Operations 
map flatMap sortByKey 
filter union reduce 
sample join count 
groupByKey distinct saveAsTextFile 
reduceByKey mapValues first
RDDs Transformations Examples (1) 
map(func)
    >>> myList = [1,2,3,4,5]
    >>> myRDD = sc.parallelize(myList)
    >>> myRDD.map(lambda x: x+1).collect()
    [2, 3, 4, 5, 6]

filter(func)
    >>> myRDD.filter(lambda x: x > 3).collect()
    [4, 5]
RDDs Transformations Examples (2) 
reduceByKey(func, [numTasks])
    >>> myList = [1,2,2,3,3,3,4,4,4,4,5,5,5,5,5]
    >>> myRDD = sc.parallelize(myList)
    >>> pairs = myRDD.map(lambda x: (x,1))
    >>> pairs.collect()[:3]
    [(1, 1), (2, 1), (2, 1)]
    >>> counts = pairs.reduceByKey(lambda a, b: a + b)
    >>> counts.collect()
    [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]

groupByKey([numTasks])
    >>> for key, values in pairs.groupByKey().collect():
    ...     print("%s : %s" % (key, list(values)))
    1 : [1]
    2 : [1, 1]
    3 : [1, 1, 1]
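To make the difference between the two operations above concrete, here is a plain-Python sketch of what each one computes, with no Spark required. The helper names `reduce_by_key` and `group_by_key` are hypothetical; they model only the results, not Spark's partitioning or shuffle behaviour, and the output is sorted here purely to make it easy to compare.

```python
from collections import defaultdict


def reduce_by_key(pairs, func):
    """Fold all values that share a key into a single value using func."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return sorted(result.items())


def group_by_key(pairs):
    """Collect all values that share a key into a list, without combining them."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


my_list = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]
pairs = [(x, 1) for x in my_list]

counts = reduce_by_key(pairs, lambda a, b: a + b)
groups = group_by_key(pairs)

print(counts)      # [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
print(groups[:3])  # [(1, [1]), (2, [1, 1]), (3, [1, 1, 1])]
```

Summing each group gives the same counts as the reduce, which is why reduceByKey is usually enough when only the combined value is needed.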
RDDs Actions Examples (1) 
reduce(func)
    >>> myList = [1,2,3,4,5]
    >>> myRDD = sc.parallelize(myList)
    >>> myRDD.reduce(lambda x,y: x+y)
    15

collect()
    >>> myList = [1,2,3,4,5]
    >>> myRDD = sc.parallelize(myList)
    >>> myRDD.collect()
    [1, 2, 3, 4, 5]
RDDs Actions Examples (2) 
count()
    >>> myList = [1,2,3,4,5]
    >>> myRDD = sc.parallelize(myList)
    >>> myRDD.count()
    5

take(3)
    >>> myList = [1,2,3,4,5]
    >>> myRDD = sc.parallelize(myList)
    >>> myRDD.take(3)
    [1, 2, 3]
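Transformations such as map and filter are lazy in Spark: they only describe a new RDD, and nothing is computed until an action such as collect, count or take runs. The sketch below models that split in plain Python. `LazyDataset` is a hypothetical class written for illustration, not a Spark API, and it ignores partitions and distribution entirely.

```python
import itertools


class LazyDataset:
    """Hypothetical stand-in for an RDD: records steps, runs them only on demand."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset and compute nothing.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Chain the recorded steps as lazy iterators over the source.
        items = iter(self._source)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return items

    # Actions: trigger evaluation and return a plain Python value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(itertools.islice(self._run(), n))


calls = []


def add_one(x):
    calls.append(x)  # record every time the function really runs
    return x + 1


ds = LazyDataset([1, 2, 3, 4, 5]).map(add_one).filter(lambda x: x > 3)
calls_before = len(calls)  # still 0: only transformations so far
result = ds.collect()      # the action runs the whole chain

print(calls_before)  # 0
print(result)        # [4, 5, 6]
```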
Fault Recovery 
• An RDD is an immutable, deterministically re-computable, distributed dataset.
• Each RDD tracks lineage information, which is used to rebuild lost data
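The lineage idea can be sketched in plain Python: each dataset remembers its parent and the function that derived it, so a lost partition can be rebuilt by replaying that function on the matching parent partition. The class and method names below (`Partitioned`, `lose`) are hypothetical, and the model leaves out scheduling, shuffles and real failure detection.

```python
class Partitioned:
    """Hypothetical model of an RDD: partitions plus the lineage to rebuild them."""

    def __init__(self, partitions=None, parent=None, func=None):
        self.parent = parent  # dataset this one was derived from
        self.func = func      # per-element function applied to the parent
        if partitions is not None:
            self._cache = dict(enumerate(partitions))
            self.num_partitions = len(partitions)
        else:
            self._cache = {}
            self.num_partitions = parent.num_partitions

    def map(self, func):
        return Partitioned(parent=self, func=func)

    def partition(self, i):
        # Recompute a missing partition by replaying the lineage.
        if i not in self._cache:
            if self.parent is None:
                raise RuntimeError("source partition %d lost; nothing to replay" % i)
            self._cache[i] = [self.func(x) for x in self.parent.partition(i)]
        return self._cache[i]

    def lose(self, i):
        """Simulate a node failure that drops one computed partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source = Partitioned(partitions=[[1, 2], [3, 4], [5]])
doubled = source.map(lambda x: x * 2)

before = doubled.collect()
doubled.lose(1)            # partition 1 of the derived dataset disappears
after = doubled.collect()  # rebuilt from source partition 1 via the lineage

print(before)  # [2, 4, 6, 8, 10]
print(after)   # [2, 4, 6, 8, 10]
```

Because the data is immutable and the function is deterministic, the rebuilt partition is identical to the lost one.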
Types of Deployment 
• Standalone Deployment
    $ sbin/start-all.sh
• Amazon EC2
    $ ec2/spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
• Apache Mesos
• Hadoop YARN
Other Spark Projects
Other Spark Projects (1) 
• Spark Job Server
  "Spark as a Service": Simple REST interface for all aspects of job, context management
  https://github.com/ooyala/spark-jobserver
• Hue support for Spark
  http://gethue.com/a-new-spark-web-ui-spark-app/
Other Spark Projects (2) 
• Tachyon
  “Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks”
  http://tachyon-project.org/
• Spark Notebook
  “Use Apache Spark straight from the Browser”
  https://github.com/andypetrella/spark-notebook/
Spark Example
Wordcount Example (1) 
Hadoop MapReduce
    //package org.myorg;
    import java.io.IOException;
    import java.util.*;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, one);
                }
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            //conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }

Spark Scala
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")

Spark Python
    file = spark.textFile("hdfs://...")
    counts = file.flatMap(lambda line: line.split(" ")) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("hdfs://...")
Wordcount Example (2) 
$ bin/pyspark --master local[2]

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Python version 2.7.3 (v2.7.3:70274d53c1dd, Apr 9 2012 20:52:43)
SparkContext available as sc.
>>>

--driver-memory MEM      Memory for driver (e.g. 1000M, 2G) (Default: 512M).
--master MASTER_URL      spark://host:port, mesos://host:port, yarn, or local.
--executor-memory MEM    Memory per executor (e.g. 1000M, 2G) (Default: 1G).
Wordcount Example (3) 
>>> file = sc.textFile("file:///Users/ltsai/Downloads/spark-1.1.0-bin-hadoop2.4/README.md").cache()
14/11/25 15:33:14 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=327410, maxMem=278302556
14/11/25 15:33:14 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 159.9 KB, free 264.9 MB)

>>> file.count()
14/11/25 15:33:34 INFO DAGScheduler: Stage 0 (count at <stdin>:1) finished in 2.185 s
14/11/25 15:33:34 INFO SparkContext: Job finished: count at <stdin>:1, took 2.301952 s
141

>>> counts = file.flatMap(lambda line: line.split()) \
...              .map(lambda word: (word, 1)) \
...              .reduceByKey(lambda a, b: a + b)

>>> counts.take(5)
[(u'email,', 1), (u'Contributing', 1), (u'webpage', 1), (u'when', 3), (u'<http://spark.apache.org/documentation.html>.', 1)]
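As a cross-check on the pipeline above, the same three steps can be mirrored in plain Python on a small in-memory sample, which is handy for verifying the logic before running it on a cluster. The sample lines are made up for illustration, and `word_count` is a hypothetical helper rather than part of Spark.

```python
from collections import Counter
from itertools import chain


def word_count(lines):
    """Mirror flatMap -> map -> reduceByKey on a local list of lines."""
    # flatMap: split every line and flatten the results into one stream of words
    words = chain.from_iterable(line.split() for line in lines)
    # map: pair every word with the count 1
    pairs = ((word, 1) for word in words)
    # reduceByKey: add up the counts that share a word
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return dict(totals)


sample = ["to be or not to be", "that is the question"]
counts = word_count(sample)

print(counts["to"], counts["be"], counts["question"])  # 2 2 1
```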
Actual Demo
Spark in Enterprises
Spark in Enterprise 
[Diagram: Data Sources -> Data Processing -> Applications]
Data Sources
• Traditional Data Sources (RDBMS, EDW, MPP, Excel, CSV)
• New Sources (Web, Logs, Social Media, IoT)
Data Processing (In Memory = Fast!)
• Real Time Streaming + Analytics + Spark SQL
• Batch Processing
• HiveQL
Applications
• BI Tools
• Enterprise Applications
• Custom Applications
Enterprise Ready? 
• Authentication? Kerberos? AD? LDAP?
• Encryption? SSL? TLS?
• Audit Trail?
• Commercial Support?
• …?
Use Cases: #1 
Music Recommendations at Scale with Spark 
http://spark-summit.org/2014/talk/music-recommendations-at-scale-with-spark
Thank You!

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 

Introduction to Spark

  • 1. Introduction to Spark. By Tsai Li Ming. Spark Meetup (SG) - 26 Nov 2014
  • 2. Who am I? • Started with High Performance Computing -> Grid -> Cloud and now Big Data • Member of PUGS committee • http://about.me/tsailiming
  • 3. Why In Memory Computing?
  • 7. In Memory Data Grid: Ignite vs. Spark. Apache Spark is a data-analytic and ML-centric system that ingests data from HDFS or another distributed file system and performs in-memory processing of this data. Ignite is an In-Memory Data Fabric that is data source agnostic and provides both a Hadoop-like computation engine (MapReduce) as well as many other computing paradigms like MPP, MPI, Streaming processing. Ignite also includes first-class level support for cluster management and operations, cluster-aware messaging and zero-deployment technologies. Ignite also provides support for full ACID transactions spanning memory and optional data sources. Ignite is a broader in-memory system that is less focused on Hadoop. Apache Spark is more inclined towards analytics and ML, and focused on MR-specific payloads. https://wiki.apache.org/incubator/IgniteProposal
  • 8. In Memory Database And many more! http://en.wikipedia.org/wiki/List_of_in-memory_databases
  • 10. What is Spark? • Developed at UC Berkeley in 2009. Open sourced in 2010. • Fast and general engine for large-scale data processing • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk • Multi-step Directed Acyclic Graphs (DAGs). Many stages, compared to Hadoop's Map and Reduce only. • Rich Scala, Java and Python APIs. R too! (SparkR to be merged into 1.3.0) • Interactive Shell • Active development
  • 12. Active Development *June 2014* http://radar.oreilly.com/2014/06/a-growing-number-of-applications-are-being-built-with-spark.html
  • 15. Logistic Regression Performance http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012
  • 17. Spark Execution Flow YARN Standalone
  • 18. Resilient Distributed Datasets (RDDs) • Basic abstraction in Spark. Fault-tolerant collection of elements that can be operated on in parallel. • RDDs can be created from the local file system, HDFS, Cassandra, HBase, Amazon S3, SequenceFiles, and any other Hadoop InputFormat. • Different levels of caching: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, OFF_HEAP, etc. • Rich APIs for Transformations and Actions • Data Locality: PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL
  • 19. RDD Operations: map, flatMap, sortByKey, filter, union, reduce, sample, join, count, groupByKey, distinct, saveAsTextFile, reduceByKey, mapValues, first
  • 20. RDDs Transformations Examples (1)
      map(func)
        >>> myList = [1,2,3,4,5]
        >>> myRDD = sc.parallelize(myList)
        >>> myRDD.map(lambda x: x+1).collect()
        [2, 3, 4, 5, 6]
      filter(func)
        >>> myRDD.filter(lambda x: x > 3).collect()
        [4, 5]
  • 21. RDDs Transformations Examples (2)
      reduceByKey(func, [numTasks])
        >>> myList = [1,2,2,3,3,3,4,4,4,4,5,5,5,5,5]
        >>> myRDD = sc.parallelize(myList)
        >>> pairs = myRDD.map(lambda x: (x,1))
        >>> pairs.collect()[:3]
        [(1, 1), (2, 1), (2, 1)]
        >>> counts = pairs.reduceByKey(lambda a, b: a + b)
        >>> counts.collect()
        [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
      groupByKey([numTasks])
        >>> for key, values in pairs.groupByKey().collect():
        ...     print "%s : %s" % (key, list(values))
        1 : [1]
        2 : [1, 1]
        3 : [1, 1, 1]
  • 22. RDDs Actions Examples (1)
      reduce(func)
        >>> myList = [1,2,3,4,5]
        >>> myRDD = sc.parallelize(myList)
        >>> myRDD.reduce(lambda x,y: x+y)
        15
      collect()
        >>> myList = [1,2,3,4,5]
        >>> myRDD = sc.parallelize(myList)
        >>> myRDD.collect()
        [1, 2, 3, 4, 5]
  • 23. RDDs Actions Examples (2)
      count()
        >>> myList = [1,2,3,4,5]
        >>> myRDD = sc.parallelize(myList)
        >>> myRDD.count()
        5
      take(3)
        >>> myList = [1,2,3,4,5]
        >>> myRDD = sc.parallelize(myList)
        >>> myRDD.take(3)
        [1, 2, 3]
  • 24. Fault Recovery • An RDD is an immutable, deterministically re-computable, distributed dataset. • An RDD tracks lineage information so that lost data can be rebuilt.
  • 25. Types of Deployment
      • Standalone Deployment
          $ sbin/start-all.sh
      • Amazon EC2
          $ ec2/spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
      • Apache Mesos
      • Hadoop YARN
  • 27. Other Spark Projects (1) • Spark Job Server: "Spark as a Service": Simple REST interface for all aspects of job, context management. https://github.com/ooyala/spark-jobserver • Hue support for Spark: http://gethue.com/a-new-spark-web-ui-spark-app/
  • 28. Other Spark Projects (2) • Tachyon: “Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks” http://tachyon-project.org/ • Spark Notebook: “Use Apache Spark straight from the Browser” https://github.com/andypetrella/spark-notebook/
  • 30. Wordcount Example (1)
      Hadoop MapReduce
        //package org.myorg;
        import java.io.IOException;
        import java.util.*;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.conf.*;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapred.*;
        import org.apache.hadoop.util.*;

        public class WordCount {

          // Mapper: emits (word, 1) for every token in the input line.
          public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
              }
            }
          }

          // Reducer: sums the counts collected for each word.
          public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
              int sum = 0;
              while (values.hasNext()) {
                sum += values.next().get();
              }
              output.collect(key, new IntWritable(sum));
            }
          }

          public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            //conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
          }
        }
      Spark Scala
        val file = spark.textFile("hdfs://...")
        val counts = file.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs://...")
      Spark Python
        file = spark.textFile("hdfs://...")
        counts = file.flatMap(lambda line: line.split(" ")) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda a, b: a + b)
        counts.saveAsTextFile("hdfs://...")
  • 31. Wordcount Example (2)
      $ bin/pyspark --master local[2]
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
            /_/
      Using Python version 2.7.3 (v2.7.3:70274d53c1dd, Apr 9 2012 20:52:43)
      SparkContext available as sc.
      >>>
      Options:
        --master MASTER_URL     spark://host:port, mesos://host:port, yarn, or local.
        --driver-memory MEM     Memory for driver (e.g. 1000M, 2G) (Default: 512M).
        --executor-memory MEM   Memory per executor (e.g. 1000M, 2G) (Default: 1G).
  • 32. Wordcount Example (3)
      >>> file = sc.textFile("file:///Users/ltsai/Downloads/spark-1.1.0-bin-hadoop2.4/README.md").cache()
      14/11/25 15:33:14 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=327410, maxMem=278302556
      14/11/25 15:33:14 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 159.9 KB, free 264.9 MB)
      >>> file.count()
      14/11/25 15:33:34 INFO DAGScheduler: Stage 0 (count at <stdin>:1) finished in 2.185 s
      14/11/25 15:33:34 INFO SparkContext: Job finished: count at <stdin>:1, took 2.301952 s
      141
      >>> counts = file.flatMap(lambda line: line.split()) \
      ...              .map(lambda word: (word, 1)) \
      ...              .reduceByKey(lambda a, b: a + b)
      >>> counts.take(5)
      [(u'email,', 1), (u'Contributing', 1), (u'webpage', 1), (u'when', 3), (u'<http://spark.apache.org/documentation.html>.', 1)]
  • 35. Spark in Enterprise Traditional Data Sources (RDBMS, EDW, MPP, Excel, CSV) New Sources (Web, Logs, Social Media, IoT) In Memory = Fast! Real Time Streaming + Analytics + Spark SQL Batch Processing HiveQL Data Sources Data Processing Applications BI Tools Enterprise Applications Custom Applications
  • 36. Enterprise Ready? • Authentication? Kerberos? AD? LDAP? • Encryption? SSL? TLS? • Audit Trail? • Commercial Support? • …?
  • 37. Use Cases: #1 Music Recommendations at Scale with Spark http://spark-summit.org/2014/talk/music-recommendations-at-scale-with-spark
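
The word count in slides 30 and 32 chains flatMap, map and reduceByKey. As a rough illustration of what each step computes, the following plain-Python sketch runs without Spark. The helper names (flat_map, reduce_by_key, word_count) are hypothetical and only mimic the semantics of the RDD methods on a local list; there is no partitioning, lazy evaluation or fault tolerance here.

```python
from collections import OrderedDict


def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one list."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Merge the values of each key with func (a local stand-in for reduceByKey)."""
    merged = OrderedDict()
    for key, value in pairs:
        # First value for a key is stored as-is; later values are folded in.
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


def word_count(lines):
    """Hypothetical local equivalent of the flatMap -> map -> reduceByKey chain."""
    words = flat_map(lambda line: line.split(), lines)   # one entry per word
    pairs = [(word, 1) for word in words]                # (word, 1) pairs
    return reduce_by_key(lambda a, b: a + b, pairs)      # sum counts per word


if __name__ == "__main__":
    sample = ["to be or not to be", "to spark or not"]
    for word, count in word_count(sample):
        print("%s : %d" % (word, count))
```

This local version returns keys in first-seen order. Spark's reduceByKey makes no promise about the order of keys in its result, so code that depends on ordering should sort explicitly (for example with sortByKey).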