2. What is Apache Spark?
Spark is an open source, fast and general cluster computing engine for large-scale data processing. It is commonly deployed alongside the Hadoop Distributed File System (HDFS), which it can read from and write to, but it does not require Hadoop.
• Efficient: in-memory computing and a DAG execution engine; up to 10× faster than MapReduce on disk and up to 100× faster in memory
• Usable: 2-5× less code, rich APIs in Java, Scala, and Python, and an interactive shell
4. Spark Community
Spark was initially developed at UC Berkeley's AMPLab and is now used and developed by a wide variety of companies.
One of the most active open source projects in big data:
• More than 150 contributors in the past year
• 25+ companies contributing, and the number keeps growing
Spark was designed both to make traditional MapReduce programming easier and to support new types of applications, with machine learning as one of the earliest focus areas. Spark can be used to build fast end-to-end machine learning workflows.
8. Spark vs. MapReduce
Run programs up to 100× faster than Hadoop MapReduce in memory, or 10× faster on disk.
Spark has an advanced DAG execution engine that supports
cyclic data flow and in-memory computing.
11. Spark Installation
• Extract the compressed folder spark-1.3.0-bin-hadoop2.4
• From a terminal, go to spark-1.3.0-bin-hadoop2.4/bin
• Run pyspark
• Run rdd = sc.parallelize([0, 1, 2]); rdd.map(lambda x: x*x).collect()
• Get the result [0, 1, 4]
• It's that easy!
Windows users might need to download winutils.exe for applications to run smoothly. Download it here:
http://www.srccodes.com/p/article/39/error-util-shell-failed-locate-winutils-binary-hadoop-binary-path
and add it to $HADOOP_HOME/bin.
• Download a bigger zip (1.9 GB) from http://bit.ly/1FpZAXH
12. Interactive Shell & Spark Context
Interactive Shell:
The fastest way to learn Spark.
Available in Python and Scala.
Runs as an application on an existing Spark cluster, or can run locally.
Spark Context:
Main entry point to Spark functionality.
Available in the shell as the variable sc.
In standalone programs, you make your own by constructing a SparkContext object (see later for details).
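As a minimal sketch of the standalone case (the app name "MyApp" and the local[2] master below are placeholder values, not from the slides), a program would build its own context like this:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("MyApp").setMaster("local[2]")   # placeholder app name and master
sc = SparkContext(conf=conf)
print(sc.parallelize([1, 2, 3]).count())   # 3
sc.stop()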
13. Key Concepts – RDD Distributional Model
RDD – Resilient Distributed Dataset
Programs are written in terms of transformations on these Distributed Datasets
• RDD = Resilient Distributed Dataset
• Transformations convert one RDD into another; no actual calculation happens yet
• Actions force calculation of a result
• Lazy evaluation
14. RDDs – Motivation
RDDs are motivated by two types of applications that current data flow systems handle
inefficiently:
1) Iterative algorithms - Common in graph applications and Machine Learning
2) Interactive Data Mining Tools
To achieve fault tolerance efficiently, RDDs provide a highly restricted form of shared memory: they are read-only datasets that can only be constructed through bulk operations on other RDDs.
However, RDDs are expressive enough to capture a wide class of computations, including
MapReduce and specialized computations.
16. Programming Model
Two types of operations on an RDD:
• transformations
• actions
Transformations are lazily evaluated: they are not executed at the moment you issue the command.
The RDD is actually computed (and, if not cached, recomputed) only when an action is executed.
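A small sketch of this laziness (the values are chosen just for illustration):
squares = sc.parallelize(range(5)).map(lambda x: x * x)   # nothing is computed yet
squares.collect()   # [0, 1, 4, 9, 16] -- the action triggers the computation
squares.count()     # 5 -- the lineage is recomputed, since the RDD was not cached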
18. RDD Computational Model
• Operators on RDDs form a directed acyclic graph.
• If any partition held by a failed worker is lost, it can be recomputed by retracing the operator DAG.
19. First Step in Data Analysis: Create an RDD
Read data from a text file on the local machine, S3, or HDFS into an RDD.
Give life to an RDD using the Spark Context.
# Turn a Python collection into an RDD
>>> sc.parallelize([7, 8, 9])
# Load a text file from the local FS, HDFS, or S3
>>> sc.textFile("textfile.txt")
>>> sc.textFile("directory/*.txt")
>>> sc.textFile("hdfs://namenode:9000/path/file")
20. Transformations – RDD: map
Pass each element of an RDD through a function
>>> rdd = sc.parallelize(range(1, 8))
>>> result_rdd = rdd.map(lambda x: x % 3)
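For reference, collecting the result of the sketch above gives:
>>> result_rdd.collect()
[1, 2, 0, 1, 2, 0, 1]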
23. Some more RDD Transformations
rdd.flatMap(f): Return a new RDD by first applying a function to all elements of this RDD, and
then flattening the results.
>>> rdd = sc.parallelize([2,3,4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())   # collect() is the action applied to the transformation
[1, 1, 1, 2, 2, 3]
rdd.filter(f) : Return a new RDD containing only the elements that satisfy a predicate.
>>> rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> rdd.filter(lambda x: x % 2 == 0).collect()
[2, 4]
24. Some more RDD Transformations
sortBy(self, keyfunc, ascending=True, numPartitions=None): Sorts this RDD by the given keyfunc.
>>> data = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
>>> rdd = sc.parallelize(data).sortBy(lambda x: x[0])
rdd.cache(): Cache the RDD in memory for repeated use.
countByKey(self): Count the number of elements for each key.
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> rdd.countByKey().items()
join(self, other, numPartitions=None): Return an RDD containing all pairs of elements with matching keys in self and other.
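As a hedged illustration of countByKey and join (the pair RDDs x and y below are made up for the join example):
>>> sorted(rdd.countByKey().items())
[('a', 2), ('b', 1)]
>>> x = sc.parallelize([("a", 1), ("b", 4)])
>>> y = sc.parallelize([("a", 2), ("a", 3)])
>>> sorted(x.join(y).collect())
[('a', (1, 2)), ('a', (1, 3))]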
25. Setting the level of parallelism
All the pair RDD operations take an optional second parameter for number of tasks.
>>> rdd.reduceByKey(lambda x, y: x + y, 5)
>>> rdd.groupByKey(5)
>>> rdd.join(pageViews, 5)
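A small sketch of the effect of that parameter (the input pairs are made up for illustration):
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
counts = pairs.reduceByKey(lambda x, y: x + y, 5)   # ask for 5 result partitions
counts.getNumPartitions()   # 5
sorted(counts.collect())    # [('a', 3), ('b', 1)]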
26. Some RDD Actions
RDD transformations are lazily evaluated; actions kick off the computation, e.g. collect(), glom(), etc.
rdd.collect(): Return the RDD content as a list.
>>> rdd = sc.parallelize([1, 2, 3], 3)
>>> rdd2 = rdd.map(lambda x: x*x)
>>> rdd2.collect()
[1, 4, 9]
rdd.glom(): Return an RDD with the elements of each partition coalesced into a list.
>>> rdd = sc.parallelize([0, 1, 2], 3)
>>> rdd2 = rdd.map(lambda x: x*x)
>>> rdd2.glom().collect()
[[0], [1], [4]]
saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a
given directory in the local filesystem, HDFS or any other Hadoop-supported file system.
take(n): Return an array with the first n elements of the dataset.
first(): Return the first element of the dataset; similar to take(1).
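A short sketch of these actions ("out_dir" is a placeholder path, not from the slides):
rdd = sc.parallelize(range(100))
rdd.take(3)     # [0, 1, 2]
rdd.first()     # 0
rdd.saveAsTextFile("out_dir")   # writes part-00000, part-00001, ... files under out_dir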
27. In MapReduce you get only two operators: map and reduce. Spark offers 80+ operations!
Workflows are automatically parallelized on Spark: a whole series of individual tasks is expressed as a single program flow that is lazily evaluated, so the system has a complete picture of the execution graph.
28. Word Count
from pyspark import SparkContext
logFile = "hdfs://localhost:9000/user/bigdatavm/input"
sc = SparkContext("spark://bigdata-vm:7077", "WordCount")   # master URL and application name
textFile = sc.textFile(logFile)
# Split lines into words, emit (word, 1) pairs, and sum the counts per word
wordCounts = textFile.flatMap(lambda line: line.split()) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda a, b: a + b)
wordCounts.saveAsTextFile("hdfs://localhost:9000/user/bigdatavm/output")
29. Fault Tolerance & Persistence
RDDs track lineage information that can be used to efficiently recompute lost data.
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])
Spark will persist or cache RDD slices in memory on each node during operations.
You can mark an RDD to be persisted with the persist() method, which takes a storage level, or with cache(), which uses the default MEMORY_ONLY level.
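A minimal sketch of marking an RDD for persistence (textFile is the RDD loaded above; MEMORY_AND_DISK is just one possible storage level):
from pyspark import StorageLevel
errors = textFile.filter(lambda s: s.startswith("ERROR"))
errors.cache()   # shorthand for the default MEMORY_ONLY level
# Alternatively, on an RDD whose level has not been set yet, pick a level explicitly:
# errors.persist(StorageLevel.MEMORY_AND_DISK)
errors.count()   # the first action materializes and caches the partitions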
31. MLlib
Spark subproject providing machine learning primitives. Initial contribution from the AMPLab at UC Berkeley.
Shipped with Spark since version 0.8; 35 contributors.
Highlights include:
• Basic statistics: summary statistics, correlation, stratified sampling, hypothesis testing, random data generation
• Linear models: linear regression, logistic regression, SVMs
• Naive Bayes and decision tree classifiers, ensembles of trees
• Collaborative filtering with ALS
• K-Means clustering and Gaussian mixtures
• Stochastic gradient descent
• SVD (singular value decomposition) and PCA
MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives.
32. Running a Spark Application
Command: spark-submit <python_file_path>
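For example (the script name and master URL below are placeholders, assuming a local test run):
spark-submit --master local[4] wordcount.py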
Let’s see the implementation of
1) K-Means
2) Logistic Regression
33. K-Means
# Import the required pyspark functions
from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt
from pyspark import SparkContext
sc = SparkContext()
data = sc.textFile(r"C:\Users\snehachalla\Downloads\spark-1.4.1-bin-hadoop2.4\spark-1.4.1-bin-hadoop2.4\bin\kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')])).cache()
34. K-Means (Cont..)
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=1, initializationMode="k-means||")
# Evaluate clustering by computing the sum of squared errors
def error(point):
center = clusters.centers[clusters.predict(point)]
return sqrt(sum([x**2 for x in (point - center)]))
cost = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Sum of squared error = " + str(cost))
35. Logistic Regression
from pyspark.mllib.classification import LogisticRegressionWithSGD
from numpy import array
# Load and parse the data (sc is the SparkContext created earlier)
data = sc.textFile("mllib/data/sample_svm_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
# Build the model
model = LogisticRegressionWithSGD.train(parsedData)
# Evaluate the model on the training data
labelsAndPreds = parsedData.map(lambda point: (int(point.item(0)),
                                model.predict(point.take(range(1, point.size)))))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
36. Spark UI
Run Spark in local mode (pyspark).
The Spark UI is at http://localhost:4040.
There you can see RDD sizes and identify slow-running tasks.
38. Setting up an EMR Cluster
If your data is too large to compute on your local machine, then you're in the right place. An easy way to get Spark running is on Amazon EMR, which runs on EC2.
• Create an account on aws.amazon.com
• Get a key pair from the AWS Console (this is the security credential for your instances):
https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#KeyPairs:sort=keyName
• Create an EMR cluster and configure the nodes:
https://console.aws.amazon.com/console/home?region=us-east-1#
• Launch the EMR cluster:
https://console.aws.amazon.com/ec2/v2/home?region=us-east-1
40. For more info on Spark
Website: http://spark.apache.org
Tutorials: http://ampcamp.berkeley.edu
Spark Summit: http://spark-summit.org
Github: https://github.com/apache/spark
Mailing lists: user@spark.apache.org, dev@spark.apache.org
Python API documentation: http://spark.apache.org/docs/latest/api/python/
MapReduce has been around as the major framework for distributed computing for 10 years, which is pretty old in technology time! Well-known limitations include:
1. Programmability
a. Requires multiple chained MR steps
b. Specialized systems for applications
2. Performance
a. Writes to disk between each computational step
b. Expensive for apps to "reuse" data
i. Iterative algorithms
ii. Interactive analysis
Most machine learning algorithms are iterative …
Spark provides an efficient way to run iterative algorithms by keeping the intermediate data in memory. This avoids the overhead of reading and writing intermediate results to disk, as happens in MapReduce. Also, when the same operation is run again and again, data can be cached in and fetched from memory instead of being recomputed. MapReduce is stateless: if a program or application has been executed 10 times, the whole dataset has to be scanned 10 times.
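A hedged sketch of the idea ("points.txt" is a placeholder input file, not from the slides):
data = sc.textFile("points.txt").cache()   # mark the dataset to be kept in memory
for i in range(10):
    # after the first pass, each iteration reads the cached partitions instead of re-reading the file
    n = data.filter(lambda line: str(i) in line).count()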
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf
Comprehensive list of actions:
http://spark.apache.org/docs/latest/programming-guide.html#actions
Spark allows you to access these operators in the context of a full programming language — thus, you can use control statements, functions, and classes as you would in a typical programming environment.
Automatic Parallelization of Complex Flows
When constructing a complex pipeline of MapReduce jobs, the task of correctly parallelizing the sequence of jobs is left to you. Thus, a scheduler tool such as Apache Oozie is often required to carefully construct this sequence.
With Spark, a whole series of individual tasks is expressed as a single program flow that is lazily evaluated so that the system has a complete picture of the execution graph. This approach allows the core scheduler to correctly map the dependencies across different stages in the application, and automatically parallelize the flow of operators without user intervention.
This capability also has the property of enabling certain optimizations to the engine while reducing the burden on the application developer. Win, and win again!
Spark runs standalone, on Hadoop YARN, on Apache Mesos, on EC2, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, Hive, Tachyon, S3, and any Hadoop data source.
The RDD data model and in-memory caching allow Spark to quickly and easily handle workflows and use cases that traditionally ran on Hadoop.
Spark also has a series of high-level tools at its disposal that are added as component libraries rather than being integrated into the general computing framework: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing).
Know more here: http://spark.apache.org/docs/latest/mllib-guide.html