© Copyright 2013 Pivotal. All rights reserved.
Intro to Apache Spark
Training – University College London
September 8th, 2014
Suhas Gogate: Architect, Pivotal Hadoop Engineering
Shivram Mani: Lead Engineer, Pivotal Hadoop Engineering
About Me: Suhas (https://www.linkedin.com/in/vgogate)
 Since 2008, active in Hadoop infrastructure and ecosystem components
– Worked with leading Hadoop technology companies
– Yahoo, Netflix, Hortonworks, EMC-Greenplum/Pivotal
 Founder and PMC member/committer of the Apache Ambari project
 Contributed Apache "Hadoop Vaidya" – performance diagnostics for M/R
 Prior to Hadoop,
– IBM Almaden Research (2000-2008), CS software & storage systems
– In the early days (1993) of my career, worked with the team that built the first Indian
supercomputer, PARAM (a Transputer-based MPP system), at the Centre for
Development of Advanced Computing (C-DAC, Pune)
About Me: Shivram (https://www.linkedin.com/in/shivrammani)
 Since 2009, active user of Hadoop
– Yahoo, EMC-Greenplum/Pivotal
 Built the Pivotal Command Center (Cluster Configuration/Management)
 Lead developer for
– Pivotal Extension Framework
– Unified Storage System
 Prior to Hadoop,
– Yahoo Web Search Federation
– Yahoo Vertical Search Relevance
Abstract
Apache Spark is one of the most exciting and talked about ASF projects
today, but how should enterprise architects view it, and what type of impact
might it have on our platforms? This talk will introduce Spark and its core
concepts, the ecosystem of services on top of it, types of problems it can
solve, similarities and differences from Hadoop, deployment topologies,
and possible uses in the enterprise. Concepts will be illustrated with a variety
of demos covering: the programming model, the development experience,
“realistic” infrastructure simulation with local virtual deployments, and
Spark cluster monitoring tools.
Day 1 (Sept 8th, 2014) – Agenda
 What is Spark
– What does it have to do with Big Data/Hadoop?
 Spark Programming Model
 Spark Internals:
– Execution, Shuffles, Tasks, Stages
 Spark Deployment models
 Demo & Hands-on exercise
 Q/A
What Is Spark?
What is Spark?
 Distributed compute engine for analysis of large data sets, like Hadoop M/R
– Inspired by deficiencies in Hadoop M/R batch processing
▪ Data replication, serialization, disk I/O, etc.
– Effectively uses distributed cluster memory for faster computations
 A common framework primarily designed for the following types of workloads:
– Iterative graph-processing algorithms (cf. Google Pregel)
– Iterative machine-learning algorithms like PageRank, K-means clustering, logistic regression, etc. (cf. HaLoop)
– Interactive data mining – running multiple ad-hoc queries on the same data set
– Along with "batch" workloads like Hadoop M/R on data in memory
 Implements the Resilient Distributed Dataset (RDD) abstraction in Scala
 Similar scalability and fault tolerance to Hadoop Map/Reduce
– Although it uses a different fault-tolerance model: lineage to reconstitute data instead of replication
 Programmatic (API) or interactive interface
– Scala, Java 7/8, Python
Spark is also …
 Came out of the AMPLab project at UCB
 An ASF Top Level project
– https://spark.apache.org/
– https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel
 An active community of ~100-200 contributors across 25-35 companies
– More active than Hadoop MapReduce
– 1,000 people (the venue maximum) attended Spark Summit
– http://spark-summit.org
 Hadoop Compatible
Spark is not …
 An OLTP data store
 A “permanent” data store
 Or an application cache
It’s also not as mature as Hadoop
– This is a good thing: lots of room to grow.
Spark is not Hadoop, but is compatible
 Often better than Hadoop
– M/R is fine for "data parallel" workloads, but awkward for some others
– Low-latency dispatch, iterative, streaming
 Natively accesses Hadoop data
– Data locality
 Spark runs as just another YARN job
– Utilizes current investments in Hadoop
– Brings Spark to the data
 It’s not OR … it’s AND!
Improvements over Map/Reduce
 Efficiency
– General Execution Graphs (not just map->reduce->store)
– In memory
 Usability
– Rich APIs in Scala, Java, Python
– Interactive
Can Spark be the R for Big Data?
Short History
 2009 Started as a research project at UCB
 2010 Open sourced
 January 2011 AMPLab created
 October 2012 0.6 release
– Java API, standalone cluster mode, Maven builds
 June 21 2013 Spark accepted into ASF Incubator
 Feb 27 2014 Spark becomes top level ASF project
 May 30 2014 Spark 1.0
 August 5th, 2014 Spark 1.0.2
Spark Programming
Model
RDDs in Detail
Spark Program Model (Scala)
 A driver program runs the user’s main function and executes a set of parallel
operations (transformations and actions) on a collection of
elements called a Resilient Distributed Dataset (RDD)

val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
val file = sc.textFile("hdfs://.../logfile")
val err_lines = file.filter(_.contains("ERROR"))
val err_msgs = err_lines.map(MyFunc.extractMessage)
err_msgs.cache()
err_msgs.saveAsTextFile("hdfs://.../errlogfile")
err_msgs.count()

(file, err_lines, and err_msgs are RDDs; filter and map are transformations; saveAsTextFile and count are actions)
Resilient Distributed Dataset (RDD)
 A new data type supported by the Spark framework as an extension to
– Scala, Java, Python
 A read-only collection of records partitioned across cluster nodes
– Does not support insert/update/delete of records in an RDD
 Created by
– Reading file(s) from HDFS
– Parallelizing existing collections (lists, arrays, maps, etc.)
– Executing transformations on existing RDDs
 RDDs can be persisted with the following options:
– In memory – serialized or non-serialized objects (optionally replicated)
– On disk – serialized (optionally replicated)
– In an in-memory file system like Tachyon
 RDDs store lineage information
– Supports coarse-grained recovery of a whole partition upon node failure
Two Categories of Operations on RDD
 Transformations
– Create from stable storage (HDFS, Tachyon, etc.)
– Generate an RDD from another RDD (map, filter, groupBy)
– Lazy operations that build a DAG of tasks
– Once Spark knows your transformations, it can build a plan
 Actions
– Return a result or write to storage (count, collect, save, etc.)
– Actions cause the DAG to execute (like Apache Pig)
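The lazy/eager split above can be illustrated with plain Python generators (a sketch of the semantics only, not Spark's API; no cluster involved):

```python
# Transformations are lazy: building the pipeline touches no data.
data = range(1, 11)                       # stand-in for an RDD's records
mapped = (x * 2 for x in data)            # "map" - nothing computed yet
filtered = (x for x in mapped if x > 10)  # "filter" - still nothing computed

# Actions are eager: only now does the whole pipeline actually run.
result = list(filtered)                   # "collect"
print(result)                             # [12, 14, 16, 18, 20]
```

Just as in Spark, no work happens until the "action" at the end forces the chain to execute.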
Transformations and Actions
 Transformations
– map
– filter
– flatMap
– sample
– groupByKey
– reduceByKey
– union
– join
– sort
 Actions
– count
– collect
– reduce
– lookup
– save
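To make two of the listed transformations concrete, here is a hedged plain-Python sketch of what groupByKey and reduceByKey compute over key-value pairs (Spark runs this same logic partition by partition across the cluster; the dictionaries below only mimic the semantics):

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# groupByKey: collect all values for each key
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)

# reduceByKey: merge values per key with a function (here, addition),
# which lets Spark combine locally before shuffling
sums = defaultdict(int)
for k, v in pairs:
    sums[k] += v

print(dict(groups))  # {'a': [1, 3], 'b': [2, 4]}
print(dict(sums))    # {'a': 4, 'b': 6}
```

The difference matters in practice: reduceByKey ships only partial sums across the network, while groupByKey ships every value.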
Spark Shared Variables (between Tasks & Driver)
 Broadcast variables
– Read-only variable cached once on each node (not shipped with each task)
– Multiple tasks running on that node can refer to it
– The broadcast variable should be used in the program after it is created
– The original variable should not be modified after broadcast
val broadcastVar = sc.broadcast(Array(1, 2, 3))
 Accumulators
– Accumulator variables are like counters in M/R (i.e., tasks can add to them)
– Tasks cannot read the value of an accumulator; only the driver program can
scala> val accum = sc.accumulator(0)
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
scala> accum.value
 Local variables of a transformation function are shipped along with the function to each node
– They cannot be accessed by the driver program
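As a minimal sketch of the accumulator contract (tasks add, only the driver reads), here is a plain-Python analogue; the Accumulator class is invented for illustration and is not Spark's implementation:

```python
class Accumulator:
    """Tasks may only add; reading .value is reserved for the driver."""
    def __init__(self, initial=0):
        self._value = initial

    def add(self, amount):   # the only operation a task is allowed
        self._value += amount

    @property
    def value(self):         # what only the driver program should call
        return self._value

accum = Accumulator(0)
for x in [1, 2, 3, 4]:       # stands in for foreach over the partitions
    accum.add(x)
print(accum.value)           # 10
```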
RDDs from External & Internal data sets
 Parallelize an existing collection to create an RDD
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data, num_partitions)
 Create an RDD from an external data source
– Supports local FS, HDFS, Cassandra, HBase, Amazon S3
– Supports text files, sequence files, Hadoop InputFormats
– A local FS path must be available at the same location on all nodes
– Reading directories is supported
– textFile() by default makes one partition for each HDFS file block
scala> val distFile = sc.textFile("data.txt", num_partitions)
RDD Persistence
 Persisting a result RDD after a series of transformations is recommended
– This makes further actions on that RDD much faster (no recomputation)
– RDD.cache() – in-memory persistence
 RDD persistence options
– MEMORY_ONLY
▪ Store as deserialized Java objects. If it does not fit in memory, some partitions will be recomputed when needed
– MEMORY_AND_DISK
▪ Store as deserialized Java objects. If it does not fit in memory, store the remaining partitions on disk
– MEMORY_ONLY_SER
▪ Store as serialized Java objects (one byte array per partition). More space-efficient
– MEMORY_AND_DISK_SER
▪ Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk
– DISK_ONLY
▪ Store the RDD partitions only on disk
– MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
▪ Same as the levels above, but replicate each partition on two cluster nodes
– OFF_HEAP (experimental)
▪ Store the RDD in serialized format in Tachyon; reduces GC overhead; shares RDDs across apps
How to choose RDD Persistence level
 Persistence levels trade off memory usage against CPU efficiency
 If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), keep it
 If not, try MEMORY_ONLY_SER and select a fast serialization library
– Spilling to disk is costly
– Spark by default uses Java serialization; Python uses the Pickle library by default
– Scala/Java can use the Kryo library, which is faster than default Java serialization
 Use the replicated storage levels if you want fast fault recovery
 In environments with high amounts of memory or multiple applications, the
experimental OFF_HEAP mode has several advantages:
– It allows multiple executors to share the same pool of memory in Tachyon.
– It significantly reduces garbage collection costs.
– Cached data is not lost if individual executors crash.
Spark Configuration
 Spark properties
– Control most application parameters and can be set by using a SparkConf
object, or through Java system properties
– Precedence (highest first)
▪ SparkConf in the driver program
▪ spark-submit options
▪ Default conf file in the Spark conf directory
– http://spark.apache.org/docs/latest/configuration.html
 Environment variables
– Can be used to set per-machine settings, such as the IP address, through the
conf/spark-env.sh script on each node.
 Logging
– can be configured through log4j.properties
Spark Application Monitoring
 Web Interface
– Includes
▪ A list of scheduler stages and tasks
▪ A summary of RDD sizes and memory usage
▪ Environmental information.
▪ Information about the running executors
– Provides a web UI for the running application at
▪ http://<driver-node>:4040
– Provides the ability to view application information after it has finished
▪ Set spark.eventLog.enabled = true
▪ Set spark.eventLog.dir = file:///tmp/spark-events
▪ Run History Server: ./sbin/start-history-server.sh file:///tmp/spark-events
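Taken together, the two event-log settings above correspond to a conf/spark-defaults.conf fragment along these lines (the /tmp path is just the example directory from the slide):

```
spark.eventLog.enabled   true
spark.eventLog.dir       file:///tmp/spark-events
```

With those set, the History Server started against the same directory serves the finished applications' UIs.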
How Spark Runs
DAGs, shuffles, tasks, stages, etc.
Sample
What happens
 Create RDDs
 Pipeline operations as much as possible
– When a result doesn’t depend on other results, we can pipeline
– But when data needs to be reorganized, pipelining stops
 A stage is a set of pipelined (merged) operations
 Each stage gets a set of tasks
 A task is data plus computation
RDDs and Stages
Tasks
Stages running
 The number of partitions matters for concurrency
 Rule of thumb: at least 2x the number of cores
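The rule of thumb can be expressed as a quick sizing helper (plain arithmetic; suggested_partitions is an invented name, not a Spark API):

```python
def suggested_partitions(nodes, cores_per_node, factor=2):
    """At least `factor` x total cores, so every core stays busy
    and slow tasks can be balanced across the cluster."""
    return factor * nodes * cores_per_node

print(suggested_partitions(nodes=10, cores_per_node=8))  # 160
```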
The Shuffle
 Redistributes data among partitions
– Hash keys into buckets
– Pull, not push
– Writes intermediate files to disk
– Becoming pluggable
 Optimizations:
– Avoided when possible, if data is already properly partitioned
– Partial aggregation reduces data movement
Other thoughts on Memory
 By default Spark owns 90% of the memory
 Partitions don’t have to fit in memory, but some things do
– E.g., the values for a large set in a groupBy must fit in memory
 Shuffle memory is 30%
– If it goes over that, it’ll spill the data to disk
– Shuffle always writes to disk
 Turn on compression to keep objects serialized
– Saves space, but costs compute to serialize/deserialize
Spark Deployment
modes
Spark Topology/Deployment modes
 Local
– Great for Dev
 Spark Cluster (master/slaves)
– Improving rapidly
 Cluster Resource Managers
– YARN
– Mesos
Spark Application Architecture
Yarn Cluster Mode
Yarn Client Mode
Comparison of Deployment Modes
Spark Hands-on
Spark Hands-on
 How to build Spark?
 Spark in a dev environment (Mac)
 Spark on AWB
 PySpark & lambda functions
 Spark examples using PySpark
 Web UI/Debugging
Spark Code/Build
 Spark Git Repository
– https://github.com/apache/spark
 Download Spark
– git clone https://github.com/apache/spark
 Build Spark
– Maven
– Shell script
Build Spark with Maven
 Setting up Maven’s Memory Usage
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
 Specifying the Hadoop Version
# Apache Hadoop 2.2.X
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
# Apache Hadoop 2.3.X
mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
# Apache Hadoop 2.4.X
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
Build Spark using Script
 Use the make-distribution script
 An abstraction over Maven
– ./make-distribution.sh --skip-java-test -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0
– Creates a dist/ folder containing the Spark artifacts
– --tgz creates a Spark distribution tarball
Spark on AWB
 Access
– ssh manis2@acs04.analyticsworkbench.com -p 45326
 Topology
– https://portal.analyticsworkbench.com/projects/awbhome/wiki/Cluster_Topology
– Spark Admin: http://access3.ic.analyticsworkbench.com:4040
 Directory
– /usr/share/spark
 Spark using Yarn Cluster
Using Spark Submit
 Suitable for yarn-cluster mode

export SPARK_SUBMIT_CLASSPATH=/usr/share/spark/jars/spark-assembly-1.0.0-hadoop2.2.0-gphd-3.0.1.0.jar
export HADOOP_CONF_DIR=/etc/gphd/hadoop/conf

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 1g \
  --executor-memory 2g \
  --executor-cores 1 \
  lib/spark-examples*.jar 10
Using Spark Shell
 Suitable for yarn-client mode
 Interactive shell, suitable for debugging

export SPARK_SUBMIT_CLASSPATH=/usr/share/spark/jars/spark-assembly-1.0.0-hadoop2.2.0-gphd-3.0.1.0.jar

bin/pyspark --master yarn-client --num-executors 2
Spark Logging
 Spark Logs contains the following information
– Number of partitions
– Size of tasks
– Which nodes tasks are running on
– Progress of tasks
 Yarn Container Logs
– Available locally
– Aggregated on HDFS
Spark Web Interface
 Web UI: http://<driver-node>:4040
 One port for each application (aka SparkContext)
 Shows the following information
– List of scheduler stages & tasks
– Summary of RDD sizes & memory usage
– Environmental information
– Information about running executors
 Provides the ability to view application information after it has finished
– Set spark.eventLog.enabled = true
– Set spark.eventLog.dir = file:///tmp/spark-events
– Run History Server: ./sbin/start-history-server.sh file:///tmp/spark-events
Spark Examples
 Line with most words
 Lines with a particular word
 Word count
 Sorting
 PageRank
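Word count, the canonical example in this list, is a flatMap -> map -> reduceByKey pipeline; a plain-Python sketch of the same computation (Counter plays the role of reduceByKey here, and the two-line corpus is invented for illustration):

```python
from collections import Counter

lines = ["to be or not to be", "to see or not to see"]

# flatMap: line -> words; map: word -> (word, 1); reduceByKey: sum the counts
words = [w for line in lines for w in line.split()]
counts = Counter(words)

print(counts["to"])   # 4
print(counts["be"])   # 2
```

In Spark the same pipeline would run over an RDD of lines, with the per-key summation distributed across partitions.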
Berkeley Data Stack -
Related Projects
Things that use Spark Core
Berkeley Data Analytics Stack (BDAS)
Supports
 Batch
 Streaming
 Interactive
Makes it easy to compose them
https://amplab.cs.berkeley.edu/software/
Spark SQL
 A library in Spark core that models RDDs as relations
– SchemaRDD
 Replaces Shark
– A lighter-weight version with no code from Hive
 Import/Export in different Storage formats
– Parquet, learn schema from existing Hive warehouse
 Takes columnar storage from Shark
Spark Streaming
 Extend Spark to do large scale stream processing
– 100s of nodes with second scale end to end latency
 Simple, batch like API with RDDs
 A single set of semantics for both real-time and high-latency (batch) processing
 Other features
– Window-based Transformations
– Arbitrary join of streams
Streaming (cont)
 Input is broken up into batches that become RDDs
 RDDs are composed into DAGs to generate output
 Raw data is replicated in memory for fault tolerance
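The batch-of-RDDs model above can be sketched in plain Python (a hedged illustration of the semantics only; micro_batches is an invented helper, not a Spark Streaming API):

```python
def micro_batches(stream, batch_size):
    """Split an incoming sequence into fixed-size batches (stand-ins for RDDs)."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch                # final partial batch

# Run the same (batch-style) logic on every micro-batch.
stream = range(7)
results = [sum(b) for b in micro_batches(stream, batch_size=3)]
print(results)  # [3, 12, 6]  (0+1+2, 3+4+5, 6)
```

Spark Streaming does essentially this at cluster scale: each time interval's data becomes an RDD, and the same transformations used in batch jobs apply to each one.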
GraphX (Alpha)
 Graph processing library
– Replaces Spark Bagel
 Graph Parallel not Data Parallel
– Reason in the context of neighbors
– GraphLab API
 Graph Creation => Algorithm => Post Processing
– Existing systems mainly deal with the Algorithm and not interactive
– Unify collection and graph models
MLbase
 Machine Learning toolset
– Library and higher level abstractions
 The general tool in this space is MATLAB
– Difficult for end users to learn, debug, and scale solutions
 Starting with MLlib
– Low level Distributed Machine Learning Library
 Many different Algorithms
– Classification, Regression, Collaborative Filtering, etc.
Thanks!
Data Science Platform
[Architecture diagram: Spark (RDD/M-R, Streaming, SparkSQL, MLbase) under a cluster manager (YARN, Mesos) on a data lake (Hadoop HDFS, Isilon, virtual storage); alongside an app data platform (SQL, objects, JSON, GemFireXD, IMDG, MPP SQL, stream server) on the application platform; serving data scientists/analysts, app dev/ops, and end users, with data arriving from legacy systems.]
Backup Slides
PHD – General Solution Pipeline
[Pipeline diagram: machine data flows in as stream messages (RabbitMQ source/transport) through streaming ingest and a message transformer, then fans out to an HDFS sink, a GemFire (IMDB) tap, and analytics taps (counters and gauges); results are exposed via SQL and a REST API and drive a dashboard.]
PHD – Where’s Spark?
[Same pipeline diagram, asking where Spark fits: stream message source and transport, message transformer, HDFS sink, GemFire (IMDB) tap, analytics taps (counters and gauges), SQL/REST API, and dashboard.]
Slides by Shivram
 How to download/build and run the Spark examples
– Build with a specific version of Hadoop (PHD?)
▪ http://spark.apache.org/docs/latest/building-with-maven.html
 Spark cluster deployment modes
– YARN & single node (EC2, Mesos, standalone)
– http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
– http://spark.apache.org/docs/latest/running-on-yarn.html
– Submitting apps vs. the interactive shell
 Explain a simple Spark example and run it
▪ run-example, spark-shell/pyspark, spark-submit
▪ http://spark.apache.org/docs/latest/quick-start.html
Deployment modes (+YARN slide)
 Local
– Great for Dev
 Spark Cluster (master/slaves)
– Improving rapidly
 Cluster Resource Managers
– YARN
– Mesos
 Intro to Spark based Projects
– Spark SQL
– Spark Streaming
– MLBase
– GraphX

Weitere ähnliche Inhalte

Was ist angesagt?

The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
J Singh
 

Was ist angesagt? (20)

Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
OpenStack Trove Day (19 Aug 2014, Cambridge MA) - Sahara
OpenStack Trove Day (19 Aug 2014, Cambridge MA)  - SaharaOpenStack Trove Day (19 Aug 2014, Cambridge MA)  - Sahara
OpenStack Trove Day (19 Aug 2014, Cambridge MA) - Sahara
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Hadoop Ecosystem Overview
Hadoop Ecosystem OverviewHadoop Ecosystem Overview
Hadoop Ecosystem Overview
 
Apache Hadoop YARN 2015: Present and Future
Apache Hadoop YARN 2015: Present and FutureApache Hadoop YARN 2015: Present and Future
Apache Hadoop YARN 2015: Present and Future
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBuilding a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
 
Hadoop Operations at LinkedIn
Hadoop Operations at LinkedInHadoop Operations at LinkedIn
Hadoop Operations at LinkedIn
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
 
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
 
SQOOP - RDBMS to Hadoop
SQOOP - RDBMS to HadoopSQOOP - RDBMS to Hadoop
SQOOP - RDBMS to Hadoop
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 

Andere mochten auch

Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
Mohit Tare
 

Andere mochten auch (20)

Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
Spark fundamentals i (bd095 en) version #1: updated: april 2015
Spark fundamentals i (bd095 en) version #1: updated: april 2015Spark fundamentals i (bd095 en) version #1: updated: april 2015
Spark fundamentals i (bd095 en) version #1: updated: april 2015
 
Unikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystemUnikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystem
 
Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2
 
Type Checking Scala Spark Datasets: Dataset Transforms
Type Checking Scala Spark Datasets: Dataset TransformsType Checking Scala Spark Datasets: Dataset Transforms
Type Checking Scala Spark Datasets: Dataset Transforms
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
臺灣高中數學講義 - 第一冊 - 數與式
臺灣高中數學講義 - 第一冊 - 數與式臺灣高中數學講義 - 第一冊 - 數與式
臺灣高中數學講義 - 第一冊 - 數與式
 
Think Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use CaseThink Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use Case
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
 
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?
 
IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark Basics
 
Think Like Spark
Think Like SparkThink Like Spark
Think Like Spark
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
 
Deployment and Management of Hadoop Clusters
Deployment and Management of Hadoop ClustersDeployment and Management of Hadoop Clusters
Deployment and Management of Hadoop Clusters
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Hadoop to spark_v2
Hadoop to spark_v2Hadoop to spark_v2
Hadoop to spark_v2
 
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & CassandraEscape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
 
Spark in 15 min
Spark in 15 minSpark in 15 min
Spark in 15 min
 

Ähnlich wie Apache Spark Introduction @ University College London

Ähnlich wie Apache Spark Introduction @ University College London (20)

Spark forplainoldjavageeks svforum_20140724
Spark forplainoldjavageeks svforum_20140724Spark forplainoldjavageeks svforum_20140724
Spark forplainoldjavageeks svforum_20140724
 
Spark For Plain Old Java Geeks (June2014 Meetup)
Spark For Plain Old Java Geeks (June2014 Meetup)Spark For Plain Old Java Geeks (June2014 Meetup)
Spark For Plain Old Java Geeks (June2014 Meetup)
 
Spark 101
Spark 101Spark 101
Spark 101
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
 
Apache spark
Apache sparkApache spark
Apache spark
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]
 
Why Spark over Hadoop?
Why Spark over Hadoop?Why Spark over Hadoop?
Why Spark over Hadoop?
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
SparkPaper
SparkPaperSparkPaper
SparkPaper
 
Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache Spark
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With Spark
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 


Apache Spark Introduction @ University College London

  • 1. 1© Copyright 2013 Pivotal. All rights reserved. 1© Copyright 2013 Pivotal. All rights reserved. Intro to Apache Spark Training – University College London September 8th, 2014 Suhas Gogate : Architect, Pivotal Hadoop Engg Shivram Mani: Lead Engineer, Pivotal Hadoop Engg.
  • 2. 2© Copyright 2013 Pivotal. All rights reserved. About Me: Suhas (https://www.linkedin.com/in/vgogate)  Since 2008, active in Hadoop infrastructure and ecosystem components – Worked with lead Hadoop technology based companies – Yahoo, Netflix, Hortonworks, EMC-Greenplum/Pivotal  Founder and PMC member/committer of the Apache Ambari project  Contributed Apache “Hadoop Vaidya” – Performance diagnostics for M/R  Prior to Hadoop, – IBM Almaden Research (2000-2008), CS Software & Storage systems. – In early days (1993) of my career, worked with a team that built first Indian super computer, PARAM (Transputer based MPP system) at Center for Development of Advance Computing (CDAC, Pune)
  • 3. 3© Copyright 2013 Pivotal. All rights reserved. About Me: Shivram (https://www.linkedin.com/in/shivrammani)  Since 2009, active user of Hadoop – Yahoo, EMC-Greenplum/Pivotal  Built the Pivotal Command Center (Cluster Configuration/Management)  Lead developer for – Pivotal Extension Framework – Unified Storage System  Prior to Hadoop, – Yahoo Web Search Federation – Yahoo Vertical Search Relevance
  • 4. 4© Copyright 2013 Pivotal. All rights reserved. Abstract Apache Spark is one of the most exciting and talked about ASF projects today, but how should enterprise architects view it, and what type of impact might it have on our platforms? This talk will introduce Spark and its core concepts, the ecosystem of services on top of it, types of problems it can solve, similarities and differences from Hadoop, deployment topologies, and possible uses in enterprise. Concepts will be illustrated with a variety of demos covering: the programming model, the development experience, “realistic” infrastructure simulation with local virtual deployments, and Spark cluster monitoring tools.
  • 5. 5© Copyright 2013 Pivotal. All rights reserved. Day 1 (Sept 8th, 2014) – Agenda  What is Spark, – What does it have to do with Big Data/Hadoop?  Spark Programming Model  Spark Internals: – Execution, Shuffles, Tasks, Stages  Spark Deployment models  Demo & Hands-on exercise  Q/A
  • 6. 6© Copyright 2013 Pivotal. All rights reserved. 6© Copyright 2013 Pivotal. All rights reserved. What Is Spark?
  • 7. 7© Copyright 2013 Pivotal. All rights reserved. What is Spark?  Distributed Compute Engine for analysis of large data sets, like Hadoop M/R – Inspired by deficiencies in Hadoop M/R batch processing ▪ Data Replication, Serialization, Disk I/O etc. – Effectively uses distributed cluster memory for faster computations  A common framework primarily designed for following types of workloads, – Iterative graph processing algorithms (Google Pregel) – Iterative machine learning algorithms like Page Rank, K-means clustering, Logistic regression etc (HALoop) – Interactive data mining – run multiple ad-hoc queries on the same data set – Along with “batch” workloads like Hadoop M/R on data in memory  Implementation of Resilient Distributed Dataset (RDD) in Scala  Similar scalability and fault tolerance as Hadoop Map/Reduce – Although uses different fault tolerance model of lineage to reconstitute data instead of replication  Programmatic interface via API or Interactive – Scala, Java7/8, Python
  • 8. 8© Copyright 2013 Pivotal. All rights reserved. Spark is also …  Came out of AMPLab project at UCB  An ASF Top Level project – https://spark.apache.org/ – https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira- projects-plugin:summary-panel  An active community of ~100-200 contributors across 25-35 companies – More active than Hadoop MapReduce – 1000 people (the max) attended Spark Summit – http://spark-summit.org  Hadoop Compatible
  • 9. 9© Copyright 2013 Pivotal. All rights reserved. Spark is not …  An OLTP data store  A “permanent” data store  Or an application cache It’s also not as mature as Hadoop – This is a good thing. Lots of room to grow.
  • 10. 10© Copyright 2013 Pivotal. All rights reserved. Spark is not Hadoop, but is compatible  Often better than Hadoop – M/R is fine for “Data Parallel”, but awkward for some workloads – Low latency dispatch, Iterative, Streaming  Natively accesses Hadoop data – Data Locality  Spark just another YARN job – Utilizes current investments in Hadoop – Brings Spark to the Data  It’s not OR … it’s AND!
  • 11. 11© Copyright 2013 Pivotal. All rights reserved. Improvements over Map/Reduce  Efficiency – General Execution Graphs (not just map->reduce->store) – In memory  Usability – Rich APIs in Scala, Java, Python – Interactive Can Spark be the R for Big Data?
  • 12. 12© Copyright 2013 Pivotal. All rights reserved. Short History  2009 Started as research project at UCB  2010 Open Sourced  January 2011 AMPLab Created  October 2012 0.6 – Java, Stand alone cluster, maven  June 21 2013 Spark accepted into ASF Incubator  Feb 27 2014 Spark becomes top level ASF project  May 30 2014 Spark 1.0  August 5th, 2014 Spark 1.0.2
  • 13. 13© Copyright 2013 Pivotal. All rights reserved. 13© Copyright 2013 Pivotal. All rights reserved. Spark Programming Model RDDs in Detail
  • 14. 14© Copyright 2013 Pivotal. All rights reserved. Spark Program Model (Scala)  A driver program runs the user’s main function and executes a set of parallel operations (transformations and actions) on a collection of elements called a Resilient Distributed Dataset (RDD) val conf = new SparkConf().setAppName(appName).setMaster(master) val sc = new SparkContext(conf) val file = sc.textFile(“hdfs://.../logfile”) val err_lines = file.filter(_.contains(“ERROR”)) val err_msgs = err_lines.map(MyFunc.extractMessage) err_msgs.cache() val errFile = err_msgs.saveAsTextFile(“hdfs://…/errlogfile”) err_msgs.count() [Slide callouts: RDDs, Transformations, Actions]
  • 15. 15© Copyright 2013 Pivotal. All rights reserved. Resilient Distributed Dataset (RDD)  A new data type supported under the Spark framework as an extension to – Scala, Java, Python  A read-only collection of records partitioned across cluster nodes – Does not support insert/update/delete of records from RDD  Created by – Reading file(s) from HDFS – Parallelizing existing collections (lists, arrays, maps, etc.) – By executing transformations on existing RDDs  RDDs can be persisted w/ following options – In memory – Serialized or non-serialized object (optionally replicated) – On disk – Serialized (optionally replicated) – In-memory file system like Tachyon  RDDs store lineage information – Support coarse-grain recovery of a whole partition upon node failure
  • 16. 16© Copyright 2013 Pivotal. All rights reserved. Two Categories of Operations on RDD  Transforms – Create from stable storage (hdfs, tachyon, etc.) – Generate RDD from other RDD (map, filter, groupBy) – Lazy Operations that build a DAG of Tasks – Once Spark knows your transformations it can build a plan  Actions – Return a result or write to storage (count, collect, save, etc.) – Actions cause the DAG to execute (like Apache PIG)
  • 17. 17© Copyright 2013 Pivotal. All rights reserved. Transformation and Actions  Transformations – map – filter – flatMap – sample – groupByKey – reduceByKey – union – join – sort  Actions – count – collect – reduce – lookup – save
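The transformations listed above are lazy: they only record what should happen, and nothing runs until an action consumes the result. A minimal plain-Python sketch of that model, using generators — no Spark required; `lazy_map` and `lazy_filter` are illustrative names, not Spark API:

```python
# Plain-Python analogue of Spark's lazy transformations vs. eager actions.
# Generators record the pipeline; nothing executes until an action consumes it.

def lazy_map(func, data):          # analogue of RDD.map
    return (func(x) for x in data)

def lazy_filter(pred, data):       # analogue of RDD.filter
    return (x for x in data if pred(x))

lines = ["INFO start", "ERROR disk full", "ERROR net down", "INFO done"]

# Build the "DAG": no line has been touched yet.
errs = lazy_filter(lambda l: "ERROR" in l, lines)
msgs = lazy_map(lambda l: l.split(" ", 1)[1], errs)

# "Action": counting forces execution of the whole pipeline.
count = sum(1 for _ in msgs)
print(count)  # 2
```

Because the plan is built before anything runs, Spark can optimize it (e.g. fuse map and filter into one pass), which is exactly the point of the lazy/eager split.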
  • 18. 18© Copyright 2013 Pivotal. All rights reserved. Spark Shared Variables (btw’n Tasks & Driver)  Broadcast variables – Read-only variable cached once on each node (not shipped w/ each task) – Multiple tasks that run on the node can refer to it – The broadcast variable should be used in the program after it is created – The original variable should not be modified after broadcast val broadcastVar = sc.broadcast(Array(1, 2, 3))  Accumulators – Accumulator variables are like counters in M/R (i.e. tasks can add to them) – Tasks cannot read the value of an accumulator variable; only the driver program can read it scala> val accum = sc.accumulator(0) scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x) scala> accum.value  Local variables of the transformation function are shipped along with the function to each node – They cannot be accessed by the driver program
  • 19. 19© Copyright 2013 Pivotal. All rights reserved. RDDs from External & Internal data sets  Parallelize existing collection to create RDD val data = Array(1, 2, 3, 4, 5) val distData = sc.parallelize(data, num_partitions)  Create RDD from external data source – Supports local FS, HDFS, Cassandra, HBase, Amazon S3 – Supports text files, sequence files, Hadoop InputFormats – A local FS file path should be available and the same on all nodes – Reading directories is supported – textFile() by default makes one partition for each HDFS file block scala> val distFile = sc.textFile("data.txt”, num_partitions)
  • 20. 20© Copyright 2013 Pivotal. All rights reserved. RDD Persistence  Persisting the result RDD after a series of transformations is recommended – This makes further actions on this result RDD much faster (no recomputation) – RDD.cache() – in-memory persistence  RDD Persistence options – MEMORY_ONLY ▪ Store as deserialized Java objects. If it does not fit in memory, some partitions will be recomputed – MEMORY_AND_DISK ▪ Store as deserialized Java objects. If it does not fit in memory, store the additional partitions on disk – MEMORY_ONLY_SER ▪ Store as serialized Java objects (one byte array per partition). More space-efficient – MEMORY_AND_DISK_SER ▪ Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk – DISK_ONLY ▪ Store the RDD partitions only on disk. – MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: ▪ Same as the levels above, but replicate each partition on two cluster nodes. – OFF_HEAP (experimental): ▪ Store RDD in serialized format in Tachyon, reduce GC overhead, share RDDs across apps
  • 21. 21© Copyright 2013 Pivotal. All rights reserved. How to choose RDD Persistence level  Persistence levels trade off memory usage against CPU efficiency.  If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way  If not, try using MEMORY_ONLY_SER and selecting a fast serialization library – Spilling to disk is costly (Spark by default uses Java serialization) – Python by default uses the Pickle library – Scala/Java can use the Kryo library, faster than default Java serialization  Use the replicated storage levels if you want fast fault recovery  In environments with high amounts of memory or multiple applications, the experimental OFF_HEAP mode has several advantages: – It allows multiple executors to share the same pool of memory in Tachyon. – It significantly reduces garbage collection costs. – Cached data is not lost if individual executors crash.
  • 22. 22© Copyright 2013 Pivotal. All rights reserved. Spark Configuration  Spark properties – Control most application parameters and can be set by using a SparkConf object, or through Java system properties – Precedence ▪ SparkConf in driver program, ▪ spark submit options, ▪ Default conf file in spark conf directory – http://spark.apache.org/docs/latest/configuration.html  Environment variables – Can be used to set per-machine settings, such as the IP address, through the conf/spark-env.sh script on each node.  Logging – can be configured through log4j.properties
  • 23. 23© Copyright 2013 Pivotal. All rights reserved. Spark Application Monitoring  Web Interface – Includes ▪ A list of scheduler stages and tasks ▪ A summary of RDD sizes and memory usage ▪ Environmental information. ▪ Information about the running executors – Provide web-ui for running application at ▪ http://<driver-node>:4040 – Provide ability to view application information after it is finished ▪ Set spark.eventLog.enabled = true ▪ Set spark.eventLog.dir = file:///tmp/spark-events ▪ Run History Server: ./sbin/start-history-server.sh file:///tmp/spark-events
  • 24. 24© Copyright 2013 Pivotal. All rights reserved. 24© Copyright 2013 Pivotal. All rights reserved. How Spark Runs DAGs, shuffle’s, tasks, stages, etc.
  • 25. 25© Copyright 2013 Pivotal. All rights reserved. Sample
  • 26. 26© Copyright 2013 Pivotal. All rights reserved. What happens  Create RDDs  Pipeline operations as much as possible – When a result doesn’t depend on other results, we can pipeline – But when data needs to be reorganized, we can no longer pipeline  A stage is a merged operation  Each stage gets a set of tasks  A task is data plus computation
  • 27. 27© Copyright 2013 Pivotal. All rights reserved. RDDs and Stages
  • 28. 28© Copyright 2013 Pivotal. All rights reserved. Tasks
  • 29. 29© Copyright 2013 Pivotal. All rights reserved. Stages running  Number of partitions matter for concurrency  Rule of thumb is at least 2x number of cores
  • 30. 30© Copyright 2013 Pivotal. All rights reserved. The Shuffle  Redistributes data among partitions – Hash keys into buckets – Pull, not push – Writes intermediate files to disk – Becoming pluggable  Optimizations: – Avoided when possible, if data is already properly partitioned – Partial aggregation reduces data movement
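The "hash keys into buckets" step above can be sketched in a few lines of plain Python (illustrative only — a real Spark shuffle writes these buckets to intermediate files on disk, which reduce-side tasks then pull):

```python
# Sketch of a shuffle's partitioning step: map-side records are hashed by key
# into one of num_partitions buckets; each reduce task then pulls the bucket
# it owns (pull, not push).

def partition_by_key(records, num_partitions):
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
buckets = partition_by_key(records, 2)

# Every record with the same key lands in the same bucket, which is what
# makes per-key aggregation (e.g. reduceByKey) possible after the shuffle.
same = [b for b in buckets if ("a", 1) in b]
print(("a", 3) in same[0])  # True
```

The same-key-same-bucket guarantee is also why a shuffle can be skipped when the data is already partitioned by the right key, as the slide notes.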
  • 31. 31© Copyright 2013 Pivotal. All rights reserved. Other thoughts on memory  By default Spark owns 90% of the memory  Partitions don’t have to fit in memory, but some things do – E.g. values for large sets in groupBy’s must fit in memory  Shuffle memory is 30% – If it goes over that, it’ll spill the data to disk – Shuffle always writes to disk  Turn on compression to keep objects serialized – Saves space, but takes compute to serialize/de-serialize
  • 32. 32© Copyright 2013 Pivotal. All rights reserved. 32© Copyright 2013 Pivotal. All rights reserved. Spark Deployment modes
  • 33. 33© Copyright 2013 Pivotal. All rights reserved. Spark Topology/Deployment modes  Local – Great for Dev  Spark Cluster (master/slaves) – Improving rapidly  Cluster Resource Managers – YARN – MESOS
  • 34. 34© Copyright 2013 Pivotal. All rights reserved. Spark Application Architecture
  • 35. 35© Copyright 2013 Pivotal. All rights reserved. Yarn Cluster Mode
  • 36. 36© Copyright 2013 Pivotal. All rights reserved. Yarn Client Mode
  • 37. 37© Copyright 2013 Pivotal. All rights reserved. Comparison of Deployment Modes
  • 38. 38© Copyright 2013 Pivotal. All rights reserved. 38© Copyright 2013 Pivotal. All rights reserved. Spark Hands-on
  • 39. 39© Copyright 2013 Pivotal. All rights reserved. Spark Hands-on  How to build Spark?  Spark on Dev environment (mac)  Spark on AWB  Pyspark & Lambda Functions  Spark Examples using Pyspark  Web UI/Debugging
  • 40. 40© Copyright 2013 Pivotal. All rights reserved. Spark Code/Build  Spark Git Repository – https://github.com/apache/spark  Download Spark – git clone https://github.com/apache/spark  Build Spark – Maven – Shell script
  • 41. 41© Copyright 2013 Pivotal. All rights reserved. Build Spark with Maven  Setting up Maven’s Memory Usage export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"  Specifying the Hadoop Version # Apache Hadoop 2.2.X mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package # Apache Hadoop 2.3.X mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package # Apache Hadoop 2.4.X mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
  • 42. 42© Copyright 2013 Pivotal. All rights reserved. Build Spark using Script  Use the make-distribution script  A wrapper around Maven – ./make-distribution.sh --skip-java-test -Pyarn -Phadoop-2.2 -Phadoop.version=2.2.0 – Creates a dist/ folder containing Spark artifacts – --tgz creates a Spark distribution tarball
  • 43. 43© Copyright 2013 Pivotal. All rights reserved. Spark on AWB  Access – ssh manis2@acs04.analyticsworkbench.com -p 45326  Topology – https://portal.analyticsworkbench.com/projects/awbhome/wiki/Cluste r_Topology – Spark Admin: http://access3.ic.analyticsworkbench.com:4040  Directory – /usr/share/spark  Spark using Yarn Cluster
  • 44. 44© Copyright 2013 Pivotal. All rights reserved. Using Spark Submit  Suitable for yarn-cluster mode  export SPARK_SUBMIT_CLASSPATH=/usr/share/spark/jars/spark-assembly-1.0.0-hadoop2.2.0-gphd-3.0.1.0.jar  export HADOOP_CONF_DIR=/etc/gphd/hadoop/conf  ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 1g --executor-memory 2g --executor-cores 1 lib/spark-examples*.jar 10
  • 45. 45© Copyright 2013 Pivotal. All rights reserved. Using Spark Shell  Suitable for yarn-client mode  Interactive shell suitable for debugging  export SPARK_SUBMIT_CLASSPATH=/usr/share/spark/jars/spark-assembly-1.0.0-hadoop2.2.0-gphd-3.0.1.0.jar  bin/pyspark --master yarn-client --num-executors 2
  • 46. 46© Copyright 2013 Pivotal. All rights reserved. Spark Logging  Spark Logs contains the following information – # of Partitions – Size of tasks – Nodes tasks are running – Progress of tasks  Yarn Container Logs – Available locally – Aggregated on HDFS
  • 47. 47© Copyright 2013 Pivotal. All rights reserved. Spark Web Interface  Web UI: http://<driver-node>:4040  One port for each application (aka SparkContext)  Shows the following information – List of scheduler stages & tasks – Summary of RDD sizes & memory usage – Environmental information – Information about running executors  Provide ability to view application information after it is finished – Set spark.eventLog.enabled = true – Set spark.eventLog.dir = file:///tmp/spark-events – Run History Server: ./sbin/start-history-server.sh file:///tmp/spark-events
  • 48. 48© Copyright 2013 Pivotal. All rights reserved. Spark Examples  Line with most words  Lines with a particular word  Word count  Sorting  PageRank
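As a taste of what these examples look like, here is word count expressed as the canonical Spark pipeline (flatMap → map → reduceByKey), simulated in plain Python so it runs without a cluster — `itertools.groupby` after a sort stands in for the shuffle and is not how Spark implements it:

```python
from itertools import groupby

# Word count written in the shape of the classic Spark pipeline:
# flatMap (split lines) -> map (to (word, 1)) -> reduceByKey (sum counts).

lines = ["to be or not to be", "to do or not to do"]

words = [w for line in lines for w in line.split()]         # flatMap
pairs = [(w, 1) for w in words]                             # map to (word, 1)
pairs.sort(key=lambda kv: kv[0])                            # shuffle stand-in
counts = {k: sum(v for _, v in grp)                         # reduceByKey
          for k, grp in groupby(pairs, key=lambda kv: kv[0])}

print(counts["to"])  # 4
print(counts["or"])  # 2
```

In pyspark the same shape would read `textFile(...).flatMap(...).map(...).reduceByKey(...)`, with the sort-and-group replaced by a real shuffle across partitions.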
  • 49. 49© Copyright 2013 Pivotal. All rights reserved. 49© Copyright 2013 Pivotal. All rights reserved. Berkeley Data Stack - Related Projects Things that use Spark Core
  • 50. 50© Copyright 2013 Pivotal. All rights reserved. Berkeley Data Analytics Stack (BDAS) Supports  Batch  Streaming  Interactive Make it easy to compose them https://amplab.cs.berkeley.edu/software/
  • 51. 51© Copyright 2013 Pivotal. All rights reserved. Spark SQL  Lib in Spark Core that models RDDs as relations – SchemaRDD  Replaces Shark – Lighter weight version with no code from Hive  Import/Export in different Storage formats – Parquet, learn schema from existing Hive warehouse  Takes columnar storage from Shark
  • 52. 52© Copyright 2013 Pivotal. All rights reserved. Spark Streaming  Extend Spark to do large scale stream processing – 100s of nodes with second scale end to end latency  Simple, batch like API with RDDs  Single semantics for both real time and high latency  Other features – Window-based Transformations – Arbitrary join of streams
  • 53. 53© Copyright 2013 Pivotal. All rights reserved. Streaming (cont)  Input is broken up into Batches that become RDDs  RDD’s are composed into DAGs to generate output  Raw data is replicated in-memory for FT
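The micro-batch idea above — an unbounded stream cut into small batches, each processed like an ordinary RDD — can be sketched in plain Python (illustrative only; Spark Streaming cuts batches by time interval, not by record count as here):

```python
# Sketch of Spark Streaming's micro-batch model: the input stream is cut
# into small batches, and each batch is processed like an ordinary RDD.

def micro_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch        # one "RDD" per batch interval
            batch = []
    if batch:
        yield batch            # flush the final partial batch

stream = iter(["e1", "e2", "e3", "e4", "e5"])
results = [len(b) for b in micro_batches(stream, 2)]  # per-batch "count" action
print(results)  # [2, 2, 1]
```

Because each batch is just an RDD, the same transformations (and the same fault-tolerance story) apply to streaming and batch code alike — which is the "single semantics" point on the previous slide.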
  • 54. 54© Copyright 2013 Pivotal. All rights reserved. GraphX (Alpha)  Graph processing library – Replaces Spark Bagel  Graph Parallel not Data Parallel – Reason in the context of neighbors – GraphLab API  Graph Creation => Algorithm => Post Processing – Existing systems mainly deal with the Algorithm and not interactive – Unify collection and graph models
  • 55. 55© Copyright 2013 Pivotal. All rights reserved. MLbase  Machine Learning toolset – Library and higher level abstractions  General tool in space is MatLab – Difficult for end users to learn, debug, scale solutions  Starting with MLlib – Low level Distributed Machine Learning Library  Many different Algorithms – Classification, Regression, Collaborative Filtering, etc.
  • 56. 56© Copyright 2013 Pivotal. All rights reserved. 56© Copyright 2013 Pivotal. All rights reserved. Thanks!
  • 57. 57© Copyright 2013 Pivotal. All rights reserved. Data Science Platform IMDG Cluster Manager RDDM/R Application Platform Stream Server MPP SQL Data Lake / HDFS / Virtual Storage App Data Platform SQL Objects JSON GemFireX D ...ETC Hadoop HDFS Isilon App Dev / Ops YARN Mesos MLbaseStreaming Legacy Systems Legacy Data Scientists/AnalystsData Sources End Users SparkSQL
  • 58. 58© Copyright 2013 Pivotal. All rights reserved. 58© Copyright 2013 Pivotal. All rights reserved. Backup Slides
  • 59. 59© Copyright 2013 Pivotal. All rights reserved. [Pipeline diagram: PHD general solution pipeline — machine data enters via a stream message source (RabbitMQ transport), passes through a message transformer, and fans out to an HDFS sink, a GemFire (IMDB) tap, and analytics taps (counters and gauges), surfaced through SQL and a REST API to a dashboard]
  • 60. 60© Copyright 2013 Pivotal. All rights reserved. [Same PHD pipeline diagram, posing the question “Where’s Spark?” in this architecture]
  • 61. 61© Copyright 2013 Pivotal. All rights reserved. Slides by Shivram  How to download/build and run Spark examples – Build w/ specific version of Hadoop (PHD?) ▪ http://spark.apache.org/docs/latest/building-with-maven.html  Spark cluster deployment modes – YARN & single node (EC2, Mesos, Standalone) – http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ – http://spark.apache.org/docs/latest/running-on-yarn.html – Submit apps vs interactive shell  Explain a simple Spark example and run it ▪ run-example, spark-shell/pyspark, spark-submit ▪ http://spark.apache.org/docs/latest/quick-start.html
  • 62. 62© Copyright 2013 Pivotal. All rights reserved. Deployment modes (+YARN slide)  Local – Great for Dev  Spark Cluster (master/slaves) – Improving rapidly  Cluster Resource Managers – YARN – MESOS
  • 63. 63© Copyright 2013 Pivotal. All rights reserved.  Intro to Spark based Projects – Spark SQL – Spark Streaming – MLBase – GraphX