Using Apache Spark
PRESENTER: SNEHA CHALLA
VENUE: GOOGLE, PITTSBURGH
DATE: AUGUST 25TH, 2015
What is Apache Spark?
Spark is an open-source computation engine that runs on top of the Hadoop ecosystem, typically reading data from the Hadoop Distributed File System (HDFS). It is a fast, general cluster computing engine for large-scale data processing.
Efficient:
• In-memory computing
• DAG execution engine
• Up to 10× faster than MapReduce on disk, 100× faster in memory
Usable:
• 2-5× less code
• Rich APIs in Java, Scala, Python
• Interactive shell
Why Spark? Runs Everywhere
• Standalone mode (private cluster)
• Apache Mesos (cluster manager)
• Hadoop YARN
• Amazon EC2 (prepared deployment)
Spark Community
Spark was initially developed at UC Berkeley's AMPLab and is being used and developed in a wide variety of companies.
• One of the most active open-source projects in big data
• More than 150 contributors in the past year
• 25+ companies contributing, and growing
Spark was designed both to make traditional MapReduce programming easier and to support new types of applications, with one of the earliest focus areas being machine learning. Spark can be used to build fast end-to-end machine learning workflows.
Elephant in the room – MapReduce
Why Apache Spark?
Spark vs. MapReduce
• Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
• Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
Why Spark?
Spark Installation
http://spark.apache.org/downloads.html
Spark Installation
• Extract the compressed folder spark-1.3.0-bin-hadoop2.4
• From a terminal, go to spark-1.3.0-bin-hadoop2.4/bin
• Run pyspark
• Run rdd = sc.parallelize([0, 1, 2]); rdd.map(lambda x: x*x).collect()
• Get the result [0, 1, 4]
• It’s that easy!
• Windows users might need to download and run an additional winutils.exe for applications to run smoothly
• Download winutils here: http://www.srccodes.com/p/article/39/error-util-shell-failed-locate-winutils-binary-hadoop-binary-path and add it to $HADOOP_HOME/bin
• Download a bigger zip (1.9 GB) from http://bit.ly/1FpZAXH
Interactive Shell & Spark Context
Interactive Shell:
The Fastest Way to Learn Spark
Available in Python and Scala
Runs as an application on an existing Spark cluster, or can run locally.
Spark Context:
Main entry point to Spark functionality
Available in shell as variable sc
In standalone programs, you create your own by constructing a SparkContext object (see the sketch below).
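A minimal sketch of a standalone PySpark program; the app name and the "local[2]" master URL are placeholders (use your own cluster URL as needed):
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("MyApp").setMaster("local[2]")  # app name and master are placeholders
sc = SparkContext(conf=conf)
rdd = sc.parallelize([1, 2, 3])
print(rdd.count())  # 3
sc.stop()  # release cluster resources when the program is done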
Key Concepts – RDD Distributional Model
RDD – Resilient Distributed Dataset
Programs are written in terms of transformations on these Distributed Datasets
• RDD = Resilient Distributed Dataset
• Transformations convert one RDD into another (no actual calculation happens)
• Actions force calculation of a result
• Lazy evaluation
RDDs – Motivation
RDDs are motivated by two types of applications that current data flow systems handle
inefficiently:
1) Iterative algorithms - Common in graph applications and Machine Learning
2) Interactive Data Mining Tools
To achieve fault tolerance efficiently - RDDs provide a highly restricted form of shared memory:
they are read-only datasets that can only be constructed through bulk operations on other
RDDs.
However, RDDs are expressive enough to capture a wide class of computations, including
MapReduce and specialized computations.
Spark Architecture
Basic Abstraction: RDDs
Immutability
Laziness
Transformations
Programming Model
Two types of operations on an RDD:
• transformations
• actions
Transformations are lazily evaluated – they are not executed when you issue the command.
The chain of transformations is actually computed only when an action is executed, as in the sketch below.
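A minimal sketch of lazy evaluation in the pyspark shell (assumes the shell's sc variable):
rdd = sc.parallelize(range(1, 6))
doubled = rdd.map(lambda x: x * 2)   # transformation only - no job runs yet
result = doubled.collect()           # action - Spark now schedules and runs the job
print(result)                        # [2, 4, 6, 8, 10]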
Data distribution/partitioning
RDD - read only collection of objects that are partitioned across a set of machines.
RDD Computational Model
• Operators on RDDs form a directed acyclic graph.
• If a partition is lost because a worker dies, it can be recomputed by retracing the operator DAG.
FIRST STEP IN DATA ANALYSIS :
Create an RDD
Read data from a text file on the local machine, S3 or HDFS into an RDD.
Give life to an RDD using the Spark Context.
# Turn a Python collection into an RDD
> sc.parallelize([7, 8, 9])
# Load a text file from the local FS, HDFS, or S3
> sc.textFile("textfile.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")
Transformations – RDD: map
Pass each element of an RDD through a function
>>> rdd = sc.parallelize(range(1, 8))
>>> result_rdd = rdd.map(lambda x: x % 3)
>>> result_rdd.collect()  # [1, 2, 0, 1, 2, 0, 1]
RDD: reduce (strictly an action, since it returns a value to the driver)
>>> from operator import add
>>> rdd = sc.parallelize(range(1, 8))
>>> rdd.reduce(add)  # 28
RDD Transformations: Reduce By Key
>>> from operator import add
>>> rdd = sc.parallelize([('Alice', 23), ('Bob', 17), ('Alice', 27)])
>>> result_rdd = rdd.reduceByKey(add)
>>> result_rdd.collect()  # e.g. [('Alice', 50), ('Bob', 17)]
Some more RDD Transformations
rdd.flatMap(f): Return a new RDD by first applying a function to all elements of this RDD, and
then flattening the results.
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())  # collect() is the action applied to the transformation
[1, 1, 1, 2, 2, 3]
• rdd.filter(f): Return a new RDD containing only the elements that satisfy a predicate.
>>> rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> rdd.filter(lambda x: x % 2 == 0).collect()
[2, 4]
Some more RDD Transformations
sortBy(keyfunc, ascending=True, numPartitions=None): Sorts this RDD by the given keyfunc.
>>> data = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
>>> rdd = sc.parallelize(data).sortBy(lambda x: x[0])
• rdd.cache(): Cache the RDD in memory for repeated use.
• countByKey(): Count the number of elements for each key.
>>> rdd = sc.parallelize([('a', 1), ('b', 1), ('a', 1)])
>>> rdd.countByKey().items()  # e.g. [('a', 2), ('b', 1)]
• join(other, numPartitions=None): Return an RDD containing all pairs of elements with matching keys in self and other.
Setting the level of parallelism
All the pair RDD operations take an optional second parameter for number of tasks.
> rdd.reduceByKey(lambda x, y: x + y, 5)
> rdd.groupByKey(5)
> rdd.join(pageViews, 5)
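A quick sketch to verify the resulting partition count (getNumPartitions() is available in PySpark 1.1+):
>>> rdd = sc.parallelize([('a', 1), ('b', 2), ('a', 3)])
>>> summed = rdd.reduceByKey(lambda x, y: x + y, 5)  # ask for 5 output partitions
>>> summed.getNumPartitions()
5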
Some RDD Actions
RDD transformations are lazily evaluated. Actions kick off the computation on the transformations, e.g. collect(), glom(), etc.
• rdd.collect(): Return the RDD content as a list.
rdd = sc.parallelize([1, 2, 3], 3)
rdd2 = rdd.map(lambda x: x*x)
rdd2.collect(): [1, 4, 9]
• rdd.glom(): Return an RDD with the elements of each partition coalesced into a list.
rdd = sc.parallelize([0, 1, 2], 3)
rdd2 = rdd.map(lambda x: x*x)
rdd2.glom().collect(): [[0], [1], [4]]
• saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system.
• take(n): Return an array with the first n elements of the dataset.
• first(): Return the first element of the dataset. Similar to take(1). (See the short sketch below.)
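A minimal sketch of these actions against a small in-memory RDD; the output path is a placeholder and must not already exist:
>>> rdd = sc.parallelize([10, 20, 30, 40])
>>> rdd.take(2)
[10, 20]
>>> rdd.first()
10
>>> rdd.saveAsTextFile("output_dir")  # writes part-xxxxx files under output_dir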
In MapReduce you get only two operators – map and reduce – whereas Spark offers 80+ operations!
Spark automatically parallelizes workflows: a whole series of individual tasks is expressed as a single program flow that is lazily evaluated, so the system has a complete picture of the execution graph.
Word Count
from pyspark import SparkContext
logFile = "hdfs://localhost:9000/user/bigdatavm/input"
sc = SparkContext("spark://bigdata-vm:7077", "WordCount")
textFile = sc.textFile(logFile)
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
wordCounts.saveAsTextFile("hdfs://localhost:9000/user/bigdatavm/output")
Fault tolerance and persistence
RDDs track lineage information that can be used to efficiently recompute lost data.
msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])
Spark will persist or cache RDD slices in memory on each node during operations.
You can mark an RDD to be persisted with persist() (optionally passing a storage level) or with cache(), which uses the default MEMORY_ONLY level; a short sketch follows.
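A minimal sketch of persisting with an explicit storage level (the file path is a placeholder):
from pyspark import StorageLevel
lines = sc.textFile("textfile.txt")
errors = lines.filter(lambda s: s.startswith("ERROR"))
# errors.cache() would use the default MEMORY_ONLY level; persist() lets you choose:
errors.persist(StorageLevel.MEMORY_AND_DISK)
print(errors.count())  # the first action materializes the RDD and caches it for reuse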
Spark Libraries
MLlib
• Spark subproject providing machine learning primitives. Initial contribution from the AMPLab @ UC Berkeley.
• Shipped with Spark since version 0.8; 35 contributors.
Highlights include:
• Basic statistics – summary statistics, correlation and stratified sampling, hypothesis testing, random data generation
• Linear models (logistic and linear regression, SVMs)
• Naive Bayes and decision tree classifiers, ensembles of trees
• Collaborative filtering with ALS
• K-Means clustering and Gaussian mixtures
• Stochastic gradient descent
• SVD (singular value decomposition) and PCA
MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as the underlying optimization primitives.
Running a Spark Application
Command: spark-submit <python_file_path>
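For example, assuming the K-Means code below were saved as kmeans_app.py (a hypothetical file name), it could be submitted with:
spark-submit kmeans_app.py
A master URL can also be passed explicitly, e.g. spark-submit --master local[4] kmeans_app.py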
Let’s see the implementation of
1) K-Means
2) Logistic Regression
K-Means
# Import the required pyspark functions
from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt
from pyspark import SparkContext
sc =SparkContext()
data = sc.textFile("C:UserssnehachallaDownloadsspark-1.4.1-bin-hadoop2.4spark-1.4.1-bin-hadoop2.4binkmeans_data.txt")
parsedData = data.map(lambda line:array([float(x) for x in line.split(' ')])).cache()
K-Means (Cont..)
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=1, initializationMode="k-means||")
# Evaluate clustering by computing the sum of squared errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))
cost = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Sum of squared error = " + str(cost))
Logistic Regression
from pyspark.mllib.classification import LogisticRegressionWithSGD
from numpy import array
# Load and parse the data
data = sc.textFile("mllib/data/sample_svm_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
# Build the model
model = LogisticRegressionWithSGD.train(parsedData)
# Evaluate the model on training data
labelsAndPreds = parsedData.map(lambda point: (int(point.item(0)), model.predict(point.take(range(1, point.size)))))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
Spark UI
Run Spark in local mode (pyspark).
The Spark UI is at http://localhost:4040.
There you can see RDD sizes and identify slow-running tasks.
MOVING TO A CLUSTER – EC2
Setting up an EMR Cluster
If your data is too large to compute on your local machine, then you're in the right place. An easy way to get Spark running is with EC2.
Create an account on aws.amazon.com
Get a key pair from the AWS Console (this is the security credential for your instance):
https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#KeyPairs:sort=keyName
• Create an EMR instance and configure the nodes:
https://console.aws.amazon.com/console/home?region=us-east-1#
• Launch the EMR instance:
https://console.aws.amazon.com/ec2/v2/home?region=us-east-1
EMR Cluster
https://console.aws.amazon.com/s3/home?region=us-east-1
Data can be uploaded to a bucket on Amazon S3.
For more info on Spark
• Website: http://spark.apache.org
• Tutorials: http://ampcamp.berkeley.edu
• Spark Summit: http://spark-summit.org
• GitHub: https://github.com/apache/spark
• Mailing lists: user@spark.apache.org, dev@spark.apache.org
• Python API documentation: http://spark.apache.org/docs/latest/api/python/
Questions?
THANK YOU
References
http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf
http://www.andrew.cmu.edu/user/amaurya/docs/spark_talk/presentation.pdf
http://www.slideshare.net/BenjaminBengfort/fast-data-analytics-with-spark-and-python
Editor's Notes
1. MapReduce has been around as the major framework for distributed computing for 10 years – this is pretty old in technology time! Well-known limitations include: (1) programmability – requires multiple chained MR steps and specialized systems for applications; (2) performance – writes to disk between each computational step, which is expensive for apps that "reuse" data (iterative algorithms, interactive analysis). Most machine learning algorithms are iterative…
2. Spark provides an efficient way of solving iterative algorithms by keeping the intermediate data in memory. This avoids the overhead of reading/writing the intermediate data to disk, as in the case of MR. Also, when running the same operation again and again, data can be cached and fetched from memory without performing the same operation again. MR is stateless: let's say a program/application in MR has been executed 10 times; then the whole data set has to be scanned 10 times.
  3. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf
  4. Comprehensive list of actions: http://spark.apache.org/docs/latest/programming-guide.html#actions
  5. When constructing a complex pipeline of MapReduce jobs, the task of correctly parallelizing the sequence of jobs is left to you. Thus, a scheduler tool such as Apache Oozie is often required to carefully construct this sequence. With Spark, a whole series of individual tasks is expressed as a single program flow that is lazily evaluated so that the system has a complete picture of the execution graph. This approach allows the core scheduler to correctly map the dependencies across different stages in the application, and automatically parallelize the flow of operators without user intervention.
  6. Spark allows you to access these operators in the context of a full programming language — thus, you can use control statements, functions, and classes as you would in a typical programming environment. Automatic Parallelization of Complex Flows When constructing a complex pipeline of MapReduce jobs, the task of correctly parallelizing the sequence of jobs is left to you. Thus, a scheduler tool such as Apache Oozie is often required to carefully construct this sequence. With Spark, a whole series of individual tasks is expressed as a single program flow that is lazily evaluated so that the system has a complete picture of the execution graph. This approach allows the core scheduler to correctly map the dependencies across different stages in the application, and automatically parallelize the flow of operators without user intervention. This capability also has the property of enabling certain optimizations to the engine while reducing the burden on the application developer. Win, and win again!
7. Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. It can access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source. The RDD data model and cached in-memory computing allow Spark to quickly and easily solve similar workflows and use cases that are part of Hadoop. Spark has a series of high-level tools at its disposal that are added as component libraries, not integrated into the general computing framework.
  8. Know more here: http://spark.apache.org/docs/latest/mllib-guide.html