Spark Camp @ Strata CA

An Intro to Apache Spark with Hands-on Tutorials
Wed Feb 18, 2015 9:00am–5:00pm
strataconf.com/big-data-conference-ca-2015/
Spark Camp @ Strata + Hadoop World
A day-long, hands-on introduction to the Spark platform, including
Spark Core, the Spark Shell, Spark Streaming, Spark SQL, MLlib,
GraphX, and more:

• an overview of use cases, plus a demonstration of writing simple Spark applications
• coverage of each of the main components of the Spark stack
• a series of technical talks targeted at developers who are new to Spark
• periods of hands-on lab work intermixed with the talks
Spark Camp @ Strata + Hadoop World
Strata NY @ NYC, 2014-10-15, ~450 people
Strata EU @ Barcelona, 2014-11-19, ~250 people
Spark Camp: Ask Us Anything
Fri, Feb 20 2:20pm-3:00pm
strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40701

Join the Spark team for an informal question and answer session. Several of the Spark committers, trainers, etc., from Databricks will be on hand to field a wide range of detailed questions.

Even if you don’t have a specific question, join in to hear what others are asking!
Apache Spark Advanced Training
Feb 17-19 9:00am-5:00pm
strataconf.com/big-data-conference-ca-2015/public/schedule/detail/39399

Sameer Farooqui leads this new 3-day training program offered by Databricks and O’Reilly Media at Strata + Hadoop World events worldwide.

Participants will also receive limited free-tier accounts on Databricks Cloud.

Note: this sold out early, so if you want to attend it at Strata EU, sign up quickly!
Spark Developer Certification

Fri Feb 20, 2015 10:40am-12:40pm
• http://oreilly.com/go/sparkcert
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise
• 40 multiple-choice questions, 90 minutes
• mostly structured as choices among code blocks
• expect some Python, Java, Scala, SQL
• understand theory of operation
• identify best practices
• recognize code that is more parallel, less memory constrained

Overall, you need to write Spark apps in practice
Developer Certification: Overview
Even More Apache Spark!

Feb 17-20, 2015
Keynote: New Directions for Spark in 2015
Fri Feb 20 9:15am-9:25am

strataconf.com/big-data-conference-ca-2015/public/schedule/detail/39547

As the Apache Spark userbase grows, the developer community is working to adapt it for ever-wider use cases. 2014 saw fast adoption of Spark in the enterprise and major improvements in its performance, scalability and standard libraries. In 2015, we want to make Spark accessible to a wider set of users, through new high-level APIs for data science: machine learning pipelines, data frames, and R language bindings. In addition, we are defining extension points to let Spark grow as a platform, making it easy to plug in data sources, algorithms, and external packages. Like all work on Spark, these APIs are designed to plug seamlessly into Spark applications, giving users a unified platform for streaming, batch and interactive data processing.

Matei Zaharia – started the Spark project at UC Berkeley, currently CTO of Databricks, Spark VP at Apache, and an assistant professor at MIT
Databricks Spark Talks @ Strata + Hadoop World
Thu Feb 19 10:40am-11:20am
strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38237
Lessons from Running Large Scale Spark Workloads
Reynold Xin, Matei Zaharia

Thu Feb 19 4:00pm–4:40pm
strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38518
Spark Streaming - The State of the Union, and Beyond
Tathagata Das
Databricks Spark Talks @ Strata + Hadoop World
Fri Feb 20 11:30am-12:10pm
strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38237
Tuning and Debugging in Apache Spark
Patrick Wendell

Fri Feb 20 4:00pm–4:40pm
strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38391
Everyday I’m Shuffling - Tips for Writing Better Spark Programs
Vida Ha, Holden Karau
A Brief History
A Brief History: Functional Programming for Big Data
circa late 1990s:
explosive growth in e-commerce and machine data implied that workloads could not fit on a single computer anymore…

notable firms led the shift to horizontal scale-out on clusters of commodity hardware, especially for machine learning use cases at scale
A Brief History: Functional Programming for Big Data
2002: MapReduce @ Google
2004: MapReduce paper
2006: Hadoop @ Yahoo!
2008: Hadoop Summit
2010: Spark paper
2014: Apache Spark top-level
circa 2002:
mitigate risk of large distributed workloads lost due to disk failures on commodity hardware…

Google File System
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung
research.google.com/archive/gfs.html

MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean, Sanjay Ghemawat
research.google.com/archive/mapreduce.html
A Brief History: MapReduce
MR doesn’t compose well for large applications, and so specialized systems emerged as workarounds

General Batch Processing: MapReduce
Specialized Systems (iterative, interactive, streaming, graph, etc.): Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4, F1, MillWheel
A Brief History: MapReduce
Developed in 2009 at UC Berkeley AMPLab, then open sourced in 2010, Spark has since become one of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
spark.apache.org

“Organizations that are looking at big data challenges – including collection, ETL, storage, exploration and analytics – should consider Spark for its in-memory performance and the breadth of its model. It supports advanced analytics solutions on Hadoop clusters, including the iterative model required for machine learning and graph analysis.”
Gartner, Advanced Analytics and Data Science (2014)
A Brief History: Spark
A Brief History: Spark
Spark is one of the most active Apache projects
ohloh.net/orgs/apache
TL;DR: Sustained Exponential Growth
databricks.com/blog/2015/01/27/big-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-traction.html
TL;DR: Spark Survey 2015 by Databricks + Typesafe
databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
TL;DR: Smashing the Previous Petabyte Sort Record
oreilly.com/data/free/2014-data-science-salary-survey.csp
TL;DR: Spark Expertise Tops Median Salaries within Big Data
Unifying the Pieces
WordCount in 3 lines of Spark
WordCount in 50+ lines of Java MR
Simple Spark Apps: WordCount
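The code shown on this slide did not survive extraction. As a rough sketch only, assuming the slide used the usual flatMap / map / reduceByKey pipeline, a Spark WordCount in Scala might look like the following. The object name WordCountSketch, the helper wordCount, the local[2] master setting, and the sample sentences are all illustrative assumptions, not taken from the deck.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark 1.2 and earlier

// Hypothetical sketch: names and sample data are assumptions for illustration.
object WordCountSketch {
  // Split each line into words, pair each word with 1, then sum the 1s per word.
  def wordCount(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    // Run locally with two worker threads; a real job would get its master from spark-submit.
    val conf = new SparkConf().setAppName("WordCountSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      wordCount(sc, Seq("to be or not to be", "to see or not to see")).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

In the Spark shell, where sc already exists, the same pipeline could start from sc.textFile on an input path instead of sc.parallelize.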
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support all the
// normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
Data Workflows: Spark SQL
// http://spark.apache.org/docs/latest/streaming-programming-guide.html

import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

// sparkConf, serverIP, and serverPort are assumed to be defined elsewhere

// create a StreamingContext with a SparkConf configuration
val ssc = new StreamingContext(sparkConf, Seconds(10))

// create a DStream that will connect to serverIP:serverPort
val lines = ssc.socketTextStream(serverIP, serverPort)

// split each line into words
val words = lines.flatMap(_.split(" "))

// count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// print a few of the counts to the console
wordCounts.print()

ssc.start() // start the computation
ssc.awaitTermination() // wait for the computation to terminate
Data Workflows: Spark Streaming
spark.apache.org/docs/latest/mllib-guide.html

Key Points:
• framework vs. library
• scale, parallelism, sparsity
• building blocks for long-term approach

MLI: An API for Distributed Machine Learning
Evan Sparks, Ameet Talwalkar, et al.
International Conference on Data Mining (2013)
http://arxiv.org/abs/1310.5426
Data Workflows: MLlib
Community

Resources
community: spark.apache.org/community.html
events worldwide: goo.gl/2YqJZK
video+preso archives: spark-summit.org
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training
books:
Fast Data Processing with Spark
Holden Karau
Packt (2013)
shop.oreilly.com/product/9781782167068.do

Spark in Action
Chris Fregly
Manning (2015*)
sparkinaction.com/

Learning Spark
Holden Karau, Andy Konwinski, Matei Zaharia
O’Reilly (2015*)
shop.oreilly.com/product/0636920028512.do
http://spark-summit.org/
20% discount: SSEDBFRIEND20

Weitere ähnliche Inhalte

Was ist angesagt?

Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningPaco Nathan
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingPaco Nathan
 
Introduction to Spark Training
Introduction to Spark TrainingIntroduction to Spark Training
Introduction to Spark TrainingSpark Summit
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesPaco Nathan
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with SparkKrishna Sankar
 
Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future TensePaco Nathan
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXKrishna Sankar
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks DataWorks Summit/Hadoop Summit
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan
 
Functional programming
 for optimization problems 
in Big Data
Functional programming
  for optimization problems 
in Big DataFunctional programming
  for optimization problems 
in Big Data
Functional programming
 for optimization problems 
in Big DataPaco Nathan
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphXAndy Petrella
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™Databricks
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Spark Summit
 
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaSpark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaDatabricks
 
New Directions for Spark in 2015 - Spark Summit East
New Directions for Spark in 2015 - Spark Summit EastNew Directions for Spark in 2015 - Spark Summit East
New Directions for Spark in 2015 - Spark Summit EastDatabricks
 

Was ist angesagt? (20)

Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine Learning
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
Introduction to Spark Training
Introduction to Spark TrainingIntroduction to Spark Training
Introduction to Spark Training
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future Tense
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About Data
 
Functional programming
 for optimization problems 
in Big Data
Functional programming
  for optimization problems 
in Big DataFunctional programming
  for optimization problems 
in Big Data
Functional programming
 for optimization problems 
in Big Data
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
 
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaSpark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
 
New Directions for Spark in 2015 - Spark Summit East
New Directions for Spark in 2015 - Spark Summit EastNew Directions for Spark in 2015 - Spark Summit East
New Directions for Spark in 2015 - Spark Summit East
 

Andere mochten auch

Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?Paco Nathan
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on MesosPaco Nathan
 
OSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine LearningOSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine LearningPaco Nathan
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingPaco Nathan
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your EyesDemi Ben-Ari
 
Big data & data science challenges and opportunities
Big data & data science   challenges and opportunitiesBig data & data science   challenges and opportunities
Big data & data science challenges and opportunitiesJose Quesada
 
Future of data science as a profession
Future of data science as a professionFuture of data science as a profession
Future of data science as a professionJose Quesada
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnJosef A. Habdank
 

Andere mochten auch (10)

Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos
 
OSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine LearningOSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine Learning
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your Eyes
 
Big data & data science challenges and opportunities
Big data & data science   challenges and opportunitiesBig data & data science   challenges and opportunities
Big data & data science challenges and opportunities
 
Future of data science as a profession
Future of data science as a professionFuture of data science as a profession
Future of data science as a profession
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
 

Ähnlich wie Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More

Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...BigDataEverywhere
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Michael Rys
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNHortonworks
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Spark Will Replace Hadoop ! Know Why
Spark Will Replace Hadoop ! Know Why Spark Will Replace Hadoop ! Know Why
Spark Will Replace Hadoop ! Know Why Edureka!
 
Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0Stratio
 
BDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of sparkBDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of sparkJerry Wen
 
Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?Edureka!
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
Scalable Machine Learning with PySpark
Scalable Machine Learning with PySparkScalable Machine Learning with PySpark
Scalable Machine Learning with PySparkLadle Patel
 
A short introduction to Spark and its benefits
A short introduction to Spark and its benefitsA short introduction to Spark and its benefits
A short introduction to Spark and its benefitsJohan Picard
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Rajeev kumar apache_spark &amp; scala developer
Rajeev kumar apache_spark &amp; scala developerRajeev kumar apache_spark &amp; scala developer
Rajeev kumar apache_spark &amp; scala developerRajeev Kumar
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 

Ähnlich wie Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More (20)

Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Spark Will Replace Hadoop ! Know Why
Spark Will Replace Hadoop ! Know Why Spark Will Replace Hadoop ! Know Why
Spark Will Replace Hadoop ! Know Why
 
Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0
 
BDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of sparkBDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of spark
 
Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Scalable Machine Learning with PySpark
Scalable Machine Learning with PySparkScalable Machine Learning with PySpark
Scalable Machine Learning with PySpark
 
Hadoop to spark_v2
Hadoop to spark_v2Hadoop to spark_v2
Hadoop to spark_v2
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
A short introduction to Spark and its benefits
A short introduction to Spark and its benefitsA short introduction to Spark and its benefits
A short introduction to Spark and its benefits
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Rajeev kumar apache_spark &amp; scala developer
Rajeev kumar apache_spark &amp; scala developerRajeev kumar apache_spark &amp; scala developer
Rajeev kumar apache_spark &amp; scala developer
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 

Mehr von Paco Nathan

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with MLPaco Nathan
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLPaco Nathan
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLPaco Nathan
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIPaco Nathan
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryPaco Nathan
 
Computable Content
Computable ContentComputable Content
Computable ContentPaco Nathan
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons LearnedPaco Nathan
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEPaco Nathan
 

Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More

  • 1. Spark Camp @ Strata CA
 An Intro to Apache Spark with Hands-on Tutorials Wed Feb 18, 2015 9:00am–5:00pm strataconf.com/big-data-conference-ca-2015/
  • 2. Spark Camp @ Strata + Hadoop World
 A day-long hands-on introduction to the Spark platform including Spark Core, the Spark Shell, Spark Streaming, Spark SQL, MLlib, GraphX, and more…
 • overview of use cases and demonstrate writing simple Spark applications
 • cover each of the main components of the Spark stack
 • a series of technical talks targeted at developers who are new to Spark
 • intermixed with the talks will be periods of hands-on lab work
  • 3. Spark Camp @ Strata + Hadoop World
 Strata NY @ NYC, 2014-10-15, ~450 people
 Strata EU @ Barcelona, 2014-11-19, ~250 people
  • 4. Spark Camp: Ask Us Anything
 Fri, Feb 20 2:20pm-3:00pm
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40701
 Join the Spark team for an informal question and answer session. Several of the Spark committers, trainers, etc., from Databricks will be on hand to field a wide range of detailed questions. Even if you don’t have a specific question, join in to hear what others are asking!
  • 5. Apache Spark Advanced Training
 Feb 17-19 9:00am-5:00pm
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/39399
 Sameer Farooqui leads this new 3-day training program offered by Databricks and O’Reilly Media at Strata + Hadoop World events worldwide. Participants will also receive limited free-tier accounts on Databricks Cloud.
 Note: this sold out early, so if you want to attend it at Strata EU, sign up quickly!
  • 6. Spark Developer Certification
 Fri Feb 20, 2015 10:40am-12:40pm
 • http://oreilly.com/go/sparkcert
 • defined by Spark experts @Databricks
 • assessed by O’Reilly Media
 • establishes the bar for Spark expertise
  • 7. Developer Certification: Overview
 • 40 multiple-choice questions, 90 minutes
 • mostly structured as choices among code blocks
 • expect some Python, Java, Scala, SQL
 • understand theory of operation
 • identify best practices
 • recognize code that is more parallel, less memory constrained
 Overall, you need to write Spark apps in practice
  • 8. Even More Apache Spark!
 Feb 17-20, 2015
  • 9. Keynote: New Directions for Spark in 2015
 Fri Feb 20 9:15am-9:25am
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/39547
 As the Apache Spark userbase grows, the developer community is working to adapt it for ever-wider use cases. 2014 saw fast adoption of Spark in the enterprise and major improvements in its performance, scalability and standard libraries. In 2015, we want to make Spark accessible to a wider set of users, through new high-level APIs for data science: machine learning pipelines, data frames, and R language bindings. In addition, we are defining extension points to let Spark grow as a platform, making it easy to plug in data sources, algorithms, and external packages. Like all work on Spark, these APIs are designed to plug seamlessly into Spark applications, giving users a unified platform for streaming, batch and interactive data processing.
 Matei Zaharia: started the Spark project at UC Berkeley, currently CTO of Databricks, Spark VP at Apache, and an assistant professor at MIT
  • 10. Databricks Spark Talks @ Strata + Hadoop World
 Thu Feb 19 10:40am-11:20am
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38237
 Lessons from Running Large Scale Spark Workloads
 Reynold Xin, Matei Zaharia
 Thu Feb 19 4:00pm–4:40pm
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38518
 Spark Streaming - The State of the Union, and Beyond
 Tathagata Das
  • 11. Databricks Spark Talks @ Strata + Hadoop World
 Fri Feb 20 11:30am-12:10pm
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38237
 Tuning and Debugging in Apache Spark
 Patrick Wendell
 Fri Feb 20 4:00pm–4:40pm
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38391
 Everyday I’m Shuffling - Tips for Writing Better Spark Programs
 Vida Ha, Holden Karau
  • 13. A Brief History: Functional Programming for Big Data
 circa late 1990s: explosive growth in e-commerce and machine data implied that workloads could not fit on a single computer anymore…
 notable firms led the shift to horizontal scale-out on clusters of commodity hardware, especially for machine learning use cases at scale
  • 14. A Brief History: Functional Programming for Big Data
 Timeline, 2002 to 2014:
 • 2002: MapReduce @ Google
 • 2004: MapReduce paper
 • 2006: Hadoop @ Yahoo!
 • 2008: Hadoop Summit
 • 2010: Spark paper
 • 2014: Apache Spark top-level
  • 15. A Brief History: MapReduce
 circa 2002: mitigate risk of large distributed workloads lost due to disk failures on commodity hardware…
 Google File System
 Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung
 research.google.com/archive/gfs.html
 MapReduce: Simplified Data Processing on Large Clusters
 Jeffrey Dean, Sanjay Ghemawat
 research.google.com/archive/mapreduce.html
  • 16. A Brief History: MapReduce
 MR doesn’t compose well for large applications, and so specialized systems emerged as workarounds
 MapReduce: General Batch Processing
 Specialized Systems (iterative, interactive, streaming, graph, etc.): Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4, F1, MillWheel
  • 17. A Brief History: Spark
 Developed in 2009 at UC Berkeley AMPLab, then open sourced in 2010, Spark has since become one of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
 spark.apache.org
 “Organizations that are looking at big data challenges – including collection, ETL, storage, exploration and analytics – should consider Spark for its in-memory performance and the breadth of its model. It supports advanced analytics solutions on Hadoop clusters, including the iterative model required for machine learning and graph analysis.”
 Gartner, Advanced Analytics and Data Science (2014)
  • 19. TL;DR: Sustained Exponential Growth
 Spark is one of the most active Apache projects
 ohloh.net/orgs/apache
  • 24. Simple Spark Apps: WordCount
 WordCount in 3 lines of Spark
 WordCount in 50+ lines of Java MR
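 The transcript does not include the WordCount code itself. As a rough illustration of the three-step pattern such a program typically follows (flatMap to split lines into words, map to (word, 1) pairs, reduceByKey to sum per word), here is a minimal plain-Python sketch. It does not use the Spark API and runs on a single machine; `reduce_by_key` and `word_count` are hypothetical helper names chosen for this example.

```python
from itertools import chain


def reduce_by_key(pairs, combine):
    """Merge the values that share a key, using the two-argument `combine`."""
    merged = {}
    for key, value in pairs:
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged


def word_count(lines):
    # Step 1 (flatMap): split every line and flatten into one stream of words.
    words = chain.from_iterable(line.split() for line in lines)
    # Step 2 (map): pair each word with an initial count of 1.
    pairs = ((word, 1) for word in words)
    # Step 3 (reduceByKey): add up the counts for each distinct word.
    return reduce_by_key(pairs, lambda a, b: a + b)


if __name__ == "__main__":
    sample = ["to be or not to be", "that is the question"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

 Because the combine function is associative and commutative, a distributed engine can apply the same logic per partition and then merge the partial results; this sketch only shows the single-process shape of that idea.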
  • 25. Data Workflows: Spark SQL

     val sqlContext = new org.apache.spark.sql.SQLContext(sc)
     import sqlContext._

     // Define the schema using a case class.
     case class Person(name: String, age: Int)

     // Create an RDD of Person objects and register it as a table.
     val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))

     people.registerTempTable("people")

     // SQL statements can be run by using the sql methods provided by sqlContext.
     val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

     // The results of SQL queries are SchemaRDDs and support all the
     // normal RDD operations.
     // The columns of a row in the result can be accessed by ordinal.
     teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
  • 26. Data Workflows: Spark Streaming

     // http://spark.apache.org/docs/latest/streaming-programming-guide.html

     import org.apache.spark.streaming._
     import org.apache.spark.streaming.StreamingContext._

     // create a StreamingContext with a SparkConf configuration
     val ssc = new StreamingContext(sparkConf, Seconds(10))

     // create a DStream that will connect to serverIP:serverPort
     val lines = ssc.socketTextStream(serverIP, serverPort)

     // split each line into words
     val words = lines.flatMap(_.split(" "))

     // count each word in each batch
     val pairs = words.map(word => (word, 1))
     val wordCounts = pairs.reduceByKey(_ + _)

     // print a few of the counts to the console
     wordCounts.print()

     ssc.start()             // start the computation
     ssc.awaitTermination()  // wait for the computation to terminate
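 One detail of the streaming example is easy to miss: the counts are computed separately for each batch (10 seconds of input in the slide), not accumulated across batches. A minimal plain-Python sketch of that behaviour follows, assuming each batch is simply a list of text lines. It does not use Spark Streaming, and `count_words_per_batch` is a hypothetical name for illustration only.

```python
def count_words_per_batch(batches):
    """Yield one {word: count} dict per batch; nothing carries over between batches."""
    for batch in batches:
        counts = {}  # fresh state for every batch
        for line in batch:
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
        yield counts


if __name__ == "__main__":
    # Three simulated batches; the last one received no input.
    incoming = [["a b a"], ["b b", "c"], []]
    for index, counts in enumerate(count_words_per_batch(incoming)):
        print("batch", index, counts)
```

 Keeping a running total across batches would need explicit state that survives from one batch to the next, which this sketch deliberately leaves out.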
  • 27. Data Workflows: MLlib
 spark.apache.org/docs/latest/mllib-guide.html
 Key Points:
 • framework vs. library
 • scale, parallelism, sparsity
 • building blocks for long-term approach
 MLI: An API for Distributed Machine Learning
 Evan Sparks, Ameet Talwalkar, et al.
 International Conference on Data Mining (2013)
 http://arxiv.org/abs/1310.5426
  • 29. community: spark.apache.org/community.html
 events worldwide: goo.gl/2YqJZK
 video+preso archives: spark-summit.org
 resources: databricks.com/spark-training-resources
 workshops: databricks.com/spark-training
  • 30. books:
 Fast Data Processing with Spark, Holden Karau, Packt (2013), shop.oreilly.com/product/9781782167068.do
 Spark in Action, Chris Fregly, Manning (2015*), sparkinaction.com/
 Learning Spark, Holden Karau, Andy Konwinski, Matei Zaharia, O’Reilly (2015*), shop.oreilly.com/product/0636920028512.do