SlideShare ist ein Scribd-Unternehmen logo
1 von 53
Downloaden Sie, um offline zu lesen
Apache Spark and the Emerging
Technology Landscape for Big Data
Universidade da Coruña

2015-05-27"
Paco Nathan @pacoid

slides http://goo.gl/A0WL8y
Big Data: Intentions
Getting started working
with Big Data may seem 

like going after a big fish…
Big Data: Realities
However, learning a variety of complex Big Data
frameworks feels more like the perspective of 

a tuna caught in the labyrinth of La Almadraba
MR doesn’t compose well for large applications, 

and so specialized systems emerged as workarounds
MapReduce
General Batch Processing Specialized Systems:
iterative, interactive, streaming, graph, etc.
Pregel Giraph
Dremel Drill
Tez
Impala
GraphLab
StormS4
F1
MillWheel
Big Data: Many Specialized Systems
Big Data: UnifiedWorkflows based on Apache Spark
Big Data: What is Spark?
• leverages current generation of commodity hardware	

• organizes data as Resilient Distributed Datasets (RDD) 	

• provides fault tolerance and parallel processing at scale	

• lazy eval of DAG optimizes pipelining in cluster
computing	

• functional programming simplifies SQL, Streaming, 

ML, Graphs, etc.	

• unified engine removes need for many specialized
systems
WordCount in 3 lines of Spark
WordCount in 50+ lines of Java MR
Big Data: What is Spark?
Big Data: What is Spark? – Logical Architecture
A:
stage 1
B:
C:
stage 2
D:
stage 3
E:
map() map()
map() map()
join()
cached
partition
RDD
Big Data: What is Spark? – Physical Architecture
Cluster ManagerDriver Program
SparkContext
Worker Node 1
Executor 1 cache
tasktask
block 1
Worker Node 2
Executor 2 cache
tasktask
block 2
databricks.com/blog/2014/11/05/spark-officially-
sets-a-new-record-in-large-scale-sorting.html
Results: Gray Sort Challenge –World Record
twitter.com/dberkholz/status/
568561792751771648
Results: Spark on StackOverflow
Big Data: Personal Observation
Apache Spark represents an opportunity for
collaboration between industry and academia:	

More so than previous Big Data frameworks,
Spark is pure open source and successful

use cases at scale leverage algorithms and
mathematics in a more fundamental way	

• academia needs real-world data from industry	

• industry needs experts in latest techniques 

from academia
Backstory: themes 

and primary sources
1930s
Lambda
Calculus
Combinators
1970s
Graph
Reduction
Algebra for
Applicative
Systems
query
optimization
functional
programming
parallel
processing
Backstory: 1930s to 1970s
Early work in the theory of computation led to 

foundations for functional programming, parallel
processing, and query optimization methods…
Backstory: 1930s to 1970s –Themes
• abstraction layers provided a structured context 

for defining functions	

• algebraic properties of functions allowed for 

improved compiler optimizations	

• parallelism leveraged semigroup structure of 

the data	

• lazy evaluation of a graph, used in functional
programming	

• lacked the needed compute power
Wikipedia
Backstory: 1930s to 1970s – Primary Sources
“Computability and λ-Definability”

Alan Turing

The Journal of Symbolic Logic 2 (4): (1937), 153-163

10.2307/2268280	

“Can Programming Be Liberated from the von Neumann

Style? A Functional Style and Its Algebra of Programs”

John Backus

ACMTuring Award (1977)

stanford.edu/class/cs242/readings/backus.pdf	

“A new implementation technique for applicative languages”

David Turner

Softw: Pract. Exper., 9: (1979), 31-49

10.1002/spe.4380090105
Backstory: 1990s to early 2000s
Initial successes in e-commerce led to the origins 

of algorithmic modeling, horizontal scale-out, the 

Big Data “flywheel”, then subsequently
MapReduce…
2001
functional
programming
1990s
GFS +
MapReduce
2004
Origins of
Big Data
Algorithmic
Modeling
parallel
processing
Backstory: 1990s to early 2000s –Themes
• AMZN, EBAY, GOOG,YHOO avoided paying $$ 

to IBM, ORCL, etc., by scaling out clusters of 

commodity hardware	

• Big Data “flywheel” pushed horizontal scale-out	

• culture of “Data Modeling” shifted to a culture of
“Algorithmic Modeling” at scale	

• GOOG required fault-tolerance for large ML jobs	

• circa 2002 hardware drove MapReduce design
Machine
Data
Algorithmic
Modeling Ecommerce
Use Cases
Social Interaction
History
Aggregation
at Scale
Data
Products
Backstory: 1990s to early 2000s – Primary Sources
“Social information filtering”

Upendra Shardanand, Pattie Maes

CHI (1995), 210-217

dl.acm.org/citation.cfm?id=223931	

“Early Amazon: Splitting the website”

Greg Linden

glinden.blogspot.com/2006/02/early-amazon-splitting-website.html	

“The eBay Architecture”

Randy Shoup, Dan Pritchett

addsimplicity.com/downloads/eBaySDForum2006-11-29.pdf	

“Inktomi’s Wild Ride”

Erik Brewer (0:05:31 ff)

youtu.be/E91oEn1bnXM	

“Statistical Modeling:The Two Cultures”

Leo Breiman

Statist. Sci. 16:3 (2001), 199-231

10.1214/ss/1009213726	

MapReduce: Simplified Data Processing on Large Clusters 	

Jeffrey Dean, Sanjay Ghemawat

OSDI (2004)

research.google.com/archive/mapreduce.html
Backstory: late 2000s
Projects built on top of MapReduce, with Apache
Hadoop gaining significant traction in industry…
2008
functional
programming
Hadoop
@Yahoo!
DryadLINQ
query
optimization
2004
GFS +
MapReduce
2006
Backstory: late 2000s –Themes
• batch jobs using MapReduce on clusters of commodity
hardware proved valuable in industry	

• specialized systems emerged because many use cases
had requirements beyond batch (SQL, real-time, etc.)	

• abstraction layers emerged because it is difficult to hire
enough engineers to “think” in MapReduce	

• functional programming reduced software engineering
costs for Big Data apps
Apache Hadoop
Backstory: late 2000s – Primary Sources
“Hadoop, a brief history”

Doug Cutting

Yahoo! (2006)

research.yahoo.com/files/cutting.pdf	

“Improving MapReduce Performance in Heterogeneous Environments”

Matei Zaharia, et al.

OSDI: (2008), 29-42

dl.acm.org/citation.cfm?id=1855744	

“DryadLINQ:A System for General-Purpose Distributed 

Data-Parallel Computing Using a High-Level Language”

Yuan Yu, et al.

OSDI: (2008) 1-4

research.microsoft.com/en-us/projects/DryadLINQ/
Backstory: early 2010s
Apache Spark emerged from the Apache Mesos
project, leveraging Scala, with the capability to
subsume many of the specialized systems that 

had become popular for Big Data…
2010
Hadoop
@Yahoo!
DryadLINQ
Spark
paper
Scala
Mesos
2000s
Backstory: early 2010s –Themes
• generalized patterns led to a unified engine for many 

use cases	

• lazy evaluation of the lineage graph reduced wait states,
improved pipelining (less synchronization barriers)	

• generational differences in hardware, e.g., off-heap use 

of large memory spaces	

• functional programming improved ease of use, reduced
costs to maintain large apps	

• lower overhead for starting jobs, less expensive shuffles
Backstory: early 2010s – Primary Sources
“The Origins of Scala”

Bill Venners, Frank Sommers

Scalazine (2009-05-04)

artima.com/scalazine/articles/origins_of_scala.html	

“Spark: Cluster Computing with Working Sets”

Matei Zaharia, et al.

HotCloud: (2010)

dl.acm.org/citation.cfm?id=1863103.1863113	

“Mesos:A Platform for Fine-Grained Resource Sharing in the Data Center”

Benjamin Hindman, et al.

NSDI: (2011), 295-308

dl.acm.org/citation.cfm?id=1972488	

“Resilient Distributed Datasets:A Fault-Tolerant Abstraction for In-Memory Cluster Computing”

Matei Zaharia, et al.

NSDI: (2012)

dl.acm.org/citation.cfm?id=2228301
2014
Spark
Core
Apache Spark
release 1.0
Spark
SQL
2015
DataFrames
2008
Pandas
R
2013
Parquet
Tungsten
Backstory: mid 2010s
Apache Spark became one of the most popular
frameworks for Big Data…
Backstory: mid 2010s –Themes
• DataFrames provide an excepted metaphor, as a higher
abstraction layer than RDDs, leveraging Catalyst optimizer	

• Parquet as a best practice: columnar store, excellent
compression, preserves schema, pushdown predicates, etc.	

• Tungsten: application semantics help mitigate the overhead 

of JVM object model and GC; cache-aware computation
leverages memory hierarchy; code generation exploits
modern compilers and CPUs
Backstory: mid 2010s – Primary Sources
Python for Data Analysis: DataWrangling with Pandas, NumPy, and IPython

Wes McKinney

O'Reilly Media (2012)

http://shop.oreilly.com/product/0636920023784.do	

“Parquet: Columnar storage for the people”

Julien Le Dem

Strata + Hadoop World, NewYork (2013)

parquet.apache.org/presentations/	

“Spark SQL: Manipulating Structured Data Using Spark”

Michael Armbrust, Reynold Xin

databricks.com/blog/2014/03/26/Spark-SQL-manipulating-
structured-data-using-Spark.html	

“Introducing DataFrames in Spark for Large Scale Data Science”

Reynold Xin, Michael Armbrust, Davies Liu

databricks.com/blog/2015/02/17/introducing-dataframes-in-
spark-for-large-scale-data-science.html	

“Project Tungsten: Bringing Spark Closer to Bare Metal”

Reynold Xin, Josh Rosen

databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-
closer-to-bare-metal.html
Disruptive Patterns
Three suggested areas for R&D, among the 

most disruptive innovations based on Spark:	

1. Streaming analytics, especially stateful apps that leverage
approximation algorithms – with demand driven by IoT	

2. Generalized ML workflows, especially large-scale matrix 

factorization and convex optimization for industry uses	

3. Cloud-based notebooks, building on containers, IPython, 

and DataFrames, as tools for collaboration and teaching 

– ultimately as disruptive as spreadsheets in the 1980s
Disruptive Patterns:
Three suggested areas for R&D, among the
most disruptive innovations based on Spark:	

1. Streaming analytics, especially stateful apps that leverage
approximation algorithms – with demand driven by IoT
2. Generalized ML workflows, especially large-scale matrix
factorization and convex optimization for industry uses
3. Cloud-based notebooks, building on containers, IPython,
and DataFrames, as tools for collaboration and teaching
– ultimately as disruptive as spreadsheets in the 1980s
Disruptive Patterns:
What is the story here?
Run a streaming computation in Spark as: 

a series of very small, deterministic batch jobs	

!
• Chop up the live stream into 

batches of X seconds 	

• Spark treats each batch of 

data as RDDs and processes 

them using RDD operations	

• Finally, the processed results 

of the RDD operations are 

returned in batches
Spark Streaming: micro-batch approach
import sys!
from pyspark import SparkContext!
from pyspark.streaming import StreamingContext!
!
def updateFunc (new_values, last_sum):!
return sum(new_values) + (last_sum or 0)!
!
sc = SparkContext(appName="PyStreamNWC", master="local[*]")!
ssc = StreamingContext(sc, 5)!
ssc.checkpoint("checkpoint")!
!
lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))!
!
counts = lines.flatMap(lambda line: line.split(" ")) !
.map(lambda word: (word, 1)) !
.updateStateByKey(updateFunc) !
.transform(lambda x: x.sortByKey())!
!
counts.pprint()!
!
ssc.start()!
ssc.awaitTermination()
Spark Streaming: example – stateful streaming app
Spark Streaming: Virdata tutorial – tuning
Tuning Spark Streaming forThroughput	

Gerard Maas, 2014-12-22	

virdata.com/tuning-spark/
Spark Streaming: Netflix tutorial – resiliency
Can Spark Streaming survive Chaos Monkey?	

Bharat Venkat, Prasanna Padmanabhan, 

Antony Arokiasamy, Raju Uppalapati	

techblog.netflix.com/2015/03/can-spark-streaming-
survive-chaos-monkey.html
Spark Streaming: resiliency illustrated
backpressure

(flow control is a hard problem)	

reliable receiver	

in-memory replication

write ahead log (data)	

driver restart

checkpoint (metadata)	

multiple masters	

worker relaunch

executor relaunch
storage
framework
driver
worker
worker
worker
worker
worker
worker
receiver
receiver
source
sender
source
sender
source
sender
receiver
masters
21c. shift towards modeling based on probabilistic
approximations: trade bounded errors for greatly
reduced resource costs
highlyscalable.wordpress.com/2012/05/01/
probabilistic-structures-web-analytics-
data-mining/
A Big Picture…
21c. shift towards modeling based on probabil
approximations: trade bounded errors for greatly
reduced resource costs
highlyscalable.wordpress.com/2012/05/01/
probabilistic-structures-web-analytics-
data-mining/
A Big Picture…
Twitter catch-phrase: 	

“Hash, don’t sample”
Streaming Algorithms: example usage
algorithm use case example
Count-Min Sketch frequency summaries code
HyperLogLog set cardinality code
Bloom Filter set membership
MinHash	

 set similarity
DSQ streaming quantiles
SkipList ordered sequence search
This pattern of Kafka, Spark, Cassandra, and
generally Mesos is sometimes called “Team Apache”
– frameworks distinct from the Hadoop stack and
its vendors, displacing them as industry demands
real-time insights at scale, leading to an IoT
“flywheel” effect…
Streaming Algorithms:Themes
20111999 2012
MillWheel
Streaming
Algorithms
Spark
Streaming
Spark
Core
2000s
Key/Value
Stores
Algebird
Streaming Algorithms: Primary Sources
“The Space Complexity of Approximating the Frequency Moments”

Noga Alon, Yossi Matias, Mario Szegedy

JCSS 58:1, (Feb 1999), 137-147

10.1006/jcss.1997.1545	

“Cassandra - A Decentralized Structured Storage System”

Avinash Lakshman, Prashant Malik

ACM SIGOPS 44:2 (Apr 2010), 35-40

10.1145/1773912.1773922	

Algebird

Avi Bryant, Oscar Boykin, et al. 

Twitter (2012)

engineering.twitter.com/opensource/projects/algebird	

“Discretized Streams:A Fault-Tolerant Model for Scalable Stream Processing”

Matei Zaharia, Tathagata Das, et al.

Berkeley EECS (2012-12-14)

www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf	

“MillWheel: Fault-Tolerant Stream Processing at Internet Scale”

Tyler Akidau, et al.

VLDB (2013)

research.google.com/pubs/pub41378.html
Add ALL theThings:

Abstract Algebra Meets Analytics

infoq.com/presentations/abstract-algebra-analytics

Avi Bryant, Strange Loop (2013)	

• grouping doesn’t matter (associativity)	

• ordering doesn’t matter (commutativity)	

• zeros get ignored	

In other words, while partitioning data at
scale is quite difficult, you can let the math
allow your code to be flexible at scale
Avi Bryant

@avibryant
Streaming Algorithms: performance at scale
Spark Streaming: system integrator
Stratio Streaming: a new approach to 

Spark Streaming	

David Morales, Oscar Mendez	

2014-06-30	

spark-summit.org/2014/talk/stratio-streaming-
a-new-approach-to-spark-streaming
• Stratio Streaming is the union of a real-time
messaging bus with a complex event processing
engine using Spark Streaming	

• allows creation of streams and queries on the fly	

• paired with Siddhi CEP engine and Apache Kafka	

• added global features to the engine such as auditing
and statistics	

• use cases: large banks, retail, travel, etc.	

• using Apache Mesos
Spark Streaming: neuroscience research
Analytics +Visualization for Neuroscience:
Spark,Thunder, Lightning	

Jeremy Freeman

2015-01-29	

youtu.be/cBQm4LhHn9g?t=28m55s
• neuroscience studies: zebrafish, rats, etc.	

• see http://codeneuro.org/	

• real-time ML for laser control	

• 2 TB/hour per fish	

• 80 HPC nodes
Spark Streaming: geospatial analytics
Plot all the data – interactive visualization 

of massive datasets 	

Rob Harper, Nathan Kronenfeld

2015-05-20	

uncharted.software/spark-summit-east-2015-presentation/
• geospatial analytics – especially given sensor data,
remote sensing from micro satellites, etc.	

• sophisticated interactive visualization	

• open source http://aperturetiles.com/	

• see http://pantera.io/
Further Resources
Spark Developer Certification

• go.databricks.com/spark-certified-developer
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise
community:
spark.apache.org/community.html
events worldwide: goo.gl/2YqJZK
YouTube channel: goo.gl/N5Hx3h
!
video+preso archives: spark-summit.org
resources: databricks.com/spark/developer-resources
workshops: databricks.com/spark/training
MOOCs:
Anthony Joseph

UC Berkeley	

begins Jun 2015	

edx.org/course/uc-berkeleyx/uc-
berkeleyx-cs100-1x-
introduction-big-6181
Ameet Talwalkar

UCLA	

begins Jun 2015	

edx.org/course/uc-berkeleyx/
uc-berkeleyx-cs190-1x-
scalable-machine-6066
confs: Strata EU

London, May 5-7

strataconf.com/big-data-conference-uk-2015
GOTO Chicago

Chicago, May 11-14

gotocon.com/chicago-2015
Spark Summit 2015

SF, Jun 15-17

spark-summit.org
Spark Summit EU

(to be announced)
books+videos:
Fast Data Processing 

with Spark

Holden Karau

Packt (2013)

shop.oreilly.com/
product/
9781782167068.do
Spark in Action

Chris Fregly

Manning (2015)

sparkinaction.com/
Learning Spark

Holden Karau, 

Andy Konwinski,

Parick Wendell, 

Matei Zaharia

O’Reilly (2015)

shop.oreilly.com/
product/
0636920028512.do
Intro to Apache Spark

Paco Nathan

O’Reilly (2015)

shop.oreilly.com/
product/
0636920036807.do
Advanced Analytics
with Spark

Sandy Ryza, 

Uri Laserson,

Sean Owen, 

Josh Wills

O’Reilly (2014)

shop.oreilly.com/
product/
0636920035091.do
presenter:
Just Enough Math
O’Reilly, 2014
justenoughmath.com

preview: youtu.be/TQ58cWgdCpA
monthly newsletter for updates, 

events, conf summaries, etc.:
liber118.com/pxn/
Enterprise Data Workflows
with Cascading
O’Reilly, 2013
shop.oreilly.com/product/
0636920028536.do
Moitas graciñas

¿Preguntas?

Weitere ähnliche Inhalte

Was ist angesagt?

Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupPaco Nathan
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks DataWorks Summit/Hadoop Summit
 
Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future TensePaco Nathan
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with SparkKrishna Sankar
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesPaco Nathan
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingPaco Nathan
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXKrishna Sankar
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningPaco Nathan
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphXAndy Petrella
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Spark Summit
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™Databricks
 
Introduction to Spark Training
Introduction to Spark TrainingIntroduction to Spark Training
Introduction to Spark TrainingSpark Summit
 
New Directions for Spark in 2015 - Spark Summit East
New Directions for Spark in 2015 - Spark Summit EastNew Directions for Spark in 2015 - Spark Summit East
New Directions for Spark in 2015 - Spark Summit EastDatabricks
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RDatabricks
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 

Was ist angesagt? (20)

Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
 
Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future Tense
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About Data
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine Learning
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 
Introduction to Spark Training
Introduction to Spark TrainingIntroduction to Spark Training
Introduction to Spark Training
 
New Directions for Spark in 2015 - Spark Summit East
New Directions for Spark in 2015 - Spark Summit EastNew Directions for Spark in 2015 - Spark Summit East
New Directions for Spark in 2015 - Spark Summit East
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 

Andere mochten auch

Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?Paco Nathan
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on MesosPaco Nathan
 
OSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine LearningOSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine LearningPaco Nathan
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingPaco Nathan
 
Future of data science as a profession
Future of data science as a professionFuture of data science as a profession
Future of data science as a professionJose Quesada
 
Big data & data science challenges and opportunities
Big data & data science   challenges and opportunitiesBig data & data science   challenges and opportunities
Big data & data science challenges and opportunitiesJose Quesada
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
Datacenter Computing with Apache Mesos - BigData DC
Datacenter Computing with Apache Mesos - BigData DCDatacenter Computing with Apache Mesos - BigData DC
Datacenter Computing with Apache Mesos - BigData DCPaco Nathan
 

Andere mochten auch (10)

Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos
 
OSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine LearningOSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine Learning
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 
Future of data science as a profession
Future of data science as a professionFuture of data science as a profession
Future of data science as a profession
 
Big data & data science challenges and opportunities
Big data & data science   challenges and opportunitiesBig data & data science   challenges and opportunities
Big data & data science challenges and opportunities
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
Datacenter Computing with Apache Mesos - BigData DC
Datacenter Computing with Apache Mesos - BigData DCDatacenter Computing with Apache Mesos - BigData DC
Datacenter Computing with Apache Mesos - BigData DC
 

Ähnlich wie Apache Spark and the Emerging Technology Landscape for Big Data

Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill MapR Technologies
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Jason Dai
 
Distributed Processing of Stream Text Mining
Distributed Processing of Stream Text MiningDistributed Processing of Stream Text Mining
Distributed Processing of Stream Text MiningLi Miao
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...BigDataEverywhere
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceeRic Choo
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapDr. Mirko Kämpf
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsGeoffrey Fox
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsGeoffrey Fox
 
The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesAlice Zheng
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
Minimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationMinimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationDenodo
 
MAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine LearningMAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine LearningGianvito Siciliano
 
Hadoop and Beyond
Hadoop and BeyondHadoop and Beyond
Hadoop and BeyondPaco Nathan
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 

Ähnlich wie Apache Spark and the Emerging Technology Landscape for Big Data (20)

04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
 
Distributed Processing of Stream Text Mining
Distributed Processing of Stream Text MiningDistributed Processing of Stream Text Mining
Distributed Processing of Stream Text Mining
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road map
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 
The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the Masses
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Minimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationMinimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data Virtualization
 
MAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine LearningMAD skills for analysis and big data Machine Learning
MAD skills for analysis and big data Machine Learning
 
Hadoop and Beyond
Hadoop and BeyondHadoop and Beyond
Hadoop and Beyond
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 

Mehr von Paco Nathan

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with MLPaco Nathan
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLPaco Nathan
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLPaco Nathan
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIPaco Nathan
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryPaco Nathan
 
Computable Content
Computable ContentComputable Content
Computable ContentPaco Nathan
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons LearnedPaco Nathan
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEPaco Nathan
 

Mehr von Paco Nathan (8)

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with ML
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage ML
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage ML
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AI
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industry
 
Computable Content
Computable ContentComputable Content
Computable Content
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons Learned
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
 

Kürzlich hochgeladen

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 

Kürzlich hochgeladen (20)

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 

Apache Spark and the Emerging Technology Landscape for Big Data

  • 1. Apache Spark and the Emerging Technology Landscape for Big Data Universidade da Coruña
 2015-05-27" Paco Nathan @pacoid
 slides http://goo.gl/A0WL8y
  • 2. Big Data: Intentions Getting started working with Big Data may seem 
 like going after a big fish…
  • 3. Big Data: Realities However, learning a variety of complex Big Data frameworks feels more like the perspective of 
 a tuna caught in the labyrinth of La Almadraba
  • 4. MR doesn’t compose well for large applications, 
 and so specialized systems emerged as workarounds MapReduce General Batch Processing Specialized Systems: iterative, interactive, streaming, graph, etc. Pregel Giraph Dremel Drill Tez Impala GraphLab StormS4 F1 MillWheel Big Data: Many Specialized Systems
  • 5. Big Data: UnifiedWorkflows based on Apache Spark
  • 6. Big Data: What is Spark? • leverages current generation of commodity hardware • organizes data as Resilient Distributed Datasets (RDD) • provides fault tolerance and parallel processing at scale • lazy eval of DAG optimizes pipelining in cluster computing • functional programming simplifies SQL, Streaming, 
 ML, Graphs, etc. • unified engine removes need for many specialized systems
  • 7. WordCount in 3 lines of Spark WordCount in 50+ lines of Java MR Big Data: What is Spark?
  • 8. Big Data: What is Spark? – Logical Architecture A: stage 1 B: C: stage 2 D: stage 3 E: map() map() map() map() join() cached partition RDD
  • 9. Big Data: What is Spark? – Physical Architecture Cluster ManagerDriver Program SparkContext Worker Node 1 Executor 1 cache tasktask block 1 Worker Node 2 Executor 2 cache tasktask block 2
  • 12. Big Data: Personal Observation Apache Spark represents an opportunity for collaboration between industry and academia: More so than previous Big Data frameworks, Spark is pure open source and successful
 use cases at scale leverage algorithms and mathematics in a more fundamental way • academia needs real-world data from industry • industry needs experts in latest techniques 
 from academia
  • 13. Backstory: themes 
 and primary sources
  • 14. 1930s Lambda Calculus Combinators 1970s Graph Reduction Algebra for Applicative Systems query optimization functional programming parallel processing Backstory: 1930s to 1970s Early work in the theory of computation led to 
 foundations for functional programming, parallel processing, and query optimization methods…
  • 15. Backstory: 1930s to 1970s –Themes • abstraction layers provided a structured context 
 for defining functions • algebraic properties of functions allowed for 
 improved compiler optimizations • parallelism leveraged semigroup structure of 
 the data • lazy evaluation of a graph, used in functional programming • lacked the needed compute power Wikipedia
  • 16. Backstory: 1930s to 1970s – Primary Sources “Computability and λ-Definability”
 Alan Turing
 The Journal of Symbolic Logic 2 (4): (1937), 153-163
 10.2307/2268280 “Can Programming Be Liberated from the von Neumann
 Style? A Functional Style and Its Algebra of Programs”
 John Backus
 ACMTuring Award (1977)
 stanford.edu/class/cs242/readings/backus.pdf “A new implementation technique for applicative languages”
 David Turner
 Softw: Pract. Exper., 9: (1979), 31-49
 10.1002/spe.4380090105
  • 17. Backstory: 1990s to early 2000s Initial successes in e-commerce led to the origins 
 of algorithmic modeling, horizontal scale-out, the 
 Big Data “flywheel”, then subsequently MapReduce… 2001 functional programming 1990s GFS + MapReduce 2004 Origins of Big Data Algorithmic Modeling parallel processing
  • 18. Backstory: 1990s to early 2000s –Themes • AMZN, EBAY, GOOG,YHOO avoided paying $$ 
 to IBM, ORCL, etc., by scaling out clusters of 
 commodity hardware • Big Data “flywheel” pushed horizontal scale-out • culture of “Data Modeling” shifted to a culture of “Algorithmic Modeling” at scale • GOOG required fault-tolerance for large ML jobs • circa 2002 hardware drove MapReduce design Machine Data Algorithmic Modeling Ecommerce Use Cases Social Interaction History Aggregation at Scale Data Products
  • 19. Backstory: 1990s to early 2000s – Primary Sources “Social information filtering”
 Upendra Shardanand, Pattie Maes
 CHI (1995), 210-217
 dl.acm.org/citation.cfm?id=223931 “Early Amazon: Splitting the website”
 Greg Linden
 glinden.blogspot.com/2006/02/early-amazon-splitting-website.html “The eBay Architecture”
 Randy Shoup, Dan Pritchett
 addsimplicity.com/downloads/eBaySDForum2006-11-29.pdf “Inktomi’s Wild Ride”
 Erik Brewer (0:05:31 ff)
 youtu.be/E91oEn1bnXM “Statistical Modeling:The Two Cultures”
 Leo Breiman
 Statist. Sci. 16:3 (2001), 199-231
 10.1214/ss/1009213726 MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean, Sanjay Ghemawat
 OSDI (2004)
 research.google.com/archive/mapreduce.html
  • 20. Backstory: late 2000s Projects built on top of MapReduce, with Apache Hadoop gaining significant traction in industry… 2008 functional programming Hadoop @Yahoo! DryadLINQ query optimization 2004 GFS + MapReduce 2006
  • 21. Backstory: late 2000s –Themes • batch jobs using MapReduce on clusters of commodity hardware proved valuable in industry • specialized systems emerged because many use cases had requirements beyond batch (SQL, real-time, etc.) • abstraction layers emerged because it is difficult to hire enough engineers to “think” in MapReduce • functional programming reduced software engineering costs for Big Data apps Apache Hadoop
  • 22. Backstory: late 2000s – Primary Sources “Hadoop, a brief history”
 Doug Cutting
 Yahoo! (2006)
 research.yahoo.com/files/cutting.pdf “Improving MapReduce Performance in Heterogeneous Environments”
 Matei Zaharia, et al.
 OSDI: (2008), 29-42
 dl.acm.org/citation.cfm?id=1855744 “DryadLINQ:A System for General-Purpose Distributed 
 Data-Parallel Computing Using a High-Level Language”
 Yuan Yu, et al.
 OSDI: (2008) 1-4
 research.microsoft.com/en-us/projects/DryadLINQ/
  • 23. Backstory: early 2010s Apache Spark emerged from the Apache Mesos project, leveraging Scala, with the capability to subsume many of the specialized systems that 
 had become popular for Big Data… 2010 Hadoop @Yahoo! DryadLINQ Spark paper Scala Mesos 2000s
  • 24. Backstory: early 2010s –Themes • generalized patterns led to a unified engine for many 
 use cases • lazy evaluation of the lineage graph reduced wait states, improved pipelining (less synchronization barriers) • generational differences in hardware, e.g., off-heap use 
 of large memory spaces • functional programming improved ease of use, reduced costs to maintain large apps • lower overhead for starting jobs, less expensive shuffles
  • 25. Backstory: early 2010s – Primary Sources “The Origins of Scala”
 Bill Venners, Frank Sommers
 Scalazine (2009-05-04)
 artima.com/scalazine/articles/origins_of_scala.html “Spark: Cluster Computing with Working Sets”
 Matei Zaharia, et al.
 HotCloud: (2010)
 dl.acm.org/citation.cfm?id=1863103.1863113 “Mesos:A Platform for Fine-Grained Resource Sharing in the Data Center”
 Benjamin Hindman, et al.
 NSDI: (2011), 295-308
 dl.acm.org/citation.cfm?id=1972488 “Resilient Distributed Datasets:A Fault-Tolerant Abstraction for In-Memory Cluster Computing”
 Matei Zaharia, et al.
 NSDI: (2012)
 dl.acm.org/citation.cfm?id=2228301
  • 26. 2014 Spark Core Apache Spark release 1.0 Spark SQL 2015 DataFrames 2008 Pandas R 2013 Parquet Tungsten Backstory: mid 2010s Apache Spark became one of the most popular frameworks for Big Data…
  • 27. Backstory: mid 2010s –Themes • DataFrames provide an excepted metaphor, as a higher abstraction layer than RDDs, leveraging Catalyst optimizer • Parquet as a best practice: columnar store, excellent compression, preserves schema, pushdown predicates, etc. • Tungsten: application semantics help mitigate the overhead 
 of JVM object model and GC; cache-aware computation leverages memory hierarchy; code generation exploits modern compilers and CPUs
  • 28. Backstory: mid 2010s – Primary Sources Python for Data Analysis: DataWrangling with Pandas, NumPy, and IPython
 Wes McKinney
 O'Reilly Media (2012)
 http://shop.oreilly.com/product/0636920023784.do “Parquet: Columnar storage for the people”
 Julien Le Dem
 Strata + Hadoop World, NewYork (2013)
 parquet.apache.org/presentations/ “Spark SQL: Manipulating Structured Data Using Spark”
 Michael Armbrust, Reynold Xin
 databricks.com/blog/2014/03/26/Spark-SQL-manipulating- structured-data-using-Spark.html “Introducing DataFrames in Spark for Large Scale Data Science”
 Reynold Xin, Michael Armbrust, Davies Liu
 databricks.com/blog/2015/02/17/introducing-dataframes-in- spark-for-large-scale-data-science.html “Project Tungsten: Bringing Spark Closer to Bare Metal”
 Reynold Xin, Josh Rosen
 databricks.com/blog/2015/04/28/project-tungsten-bringing-spark- closer-to-bare-metal.html
  • 30. Three suggested areas for R&D, among the 
 most disruptive innovations based on Spark: 1. Streaming analytics, especially stateful apps that leverage approximation algorithms – with demand driven by IoT 2. Generalized ML workflows, especially large-scale matrix 
 factorization and convex optimization for industry uses 3. Cloud-based notebooks, building on containers, IPython, 
 and DataFrames, as tools for collaboration and teaching 
 – ultimately as disruptive as spreadsheets in the 1980s Disruptive Patterns:
  • 31. Three suggested areas for R&D, among the most disruptive innovations based on Spark: 1. Streaming analytics, especially stateful apps that leverage approximation algorithms – with demand driven by IoT 2. Generalized ML workflows, especially large-scale matrix factorization and convex optimization for industry uses 3. Cloud-based notebooks, building on containers, IPython, and DataFrames, as tools for collaboration and teaching – ultimately as disruptive as spreadsheets in the 1980s Disruptive Patterns: What is the story here?
  • 32. Run a streaming computation in Spark as: 
 a series of very small, deterministic batch jobs ! • Chop up the live stream into 
 batches of X seconds • Spark treats each batch of 
 data as RDDs and processes 
 them using RDD operations • Finally, the processed results 
 of the RDD operations are 
 returned in batches Spark Streaming: micro-batch approach
  • 33. import sys! from pyspark import SparkContext! from pyspark.streaming import StreamingContext! ! def updateFunc (new_values, last_sum):! return sum(new_values) + (last_sum or 0)! ! sc = SparkContext(appName="PyStreamNWC", master="local[*]")! ssc = StreamingContext(sc, 5)! ssc.checkpoint("checkpoint")! ! lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))! ! counts = lines.flatMap(lambda line: line.split(" ")) ! .map(lambda word: (word, 1)) ! .updateStateByKey(updateFunc) ! .transform(lambda x: x.sortByKey())! ! counts.pprint()! ! ssc.start()! ssc.awaitTermination() Spark Streaming: example – stateful streaming app
  • 34. Spark Streaming: Virdata tutorial – tuning Tuning Spark Streaming forThroughput Gerard Maas, 2014-12-22 virdata.com/tuning-spark/
  • 35. Spark Streaming: Netflix tutorial – resiliency Can Spark Streaming survive Chaos Monkey? Bharat Venkat, Prasanna Padmanabhan, 
 Antony Arokiasamy, Raju Uppalapati techblog.netflix.com/2015/03/can-spark-streaming- survive-chaos-monkey.html
  • 36. Spark Streaming: resiliency illustrated backpressure
 (flow control is a hard problem) reliable receiver in-memory replication
 write ahead log (data) driver restart
 checkpoint (metadata) multiple masters worker relaunch
 executor relaunch storage framework driver worker worker worker worker worker worker receiver receiver source sender source sender source sender receiver masters
  • 37. 21c. shift towards modeling based on probabilistic approximations: trade bounded errors for greatly reduced resource costs highlyscalable.wordpress.com/2012/05/01/ probabilistic-structures-web-analytics- data-mining/ A Big Picture…
  • 38. 21c. shift towards modeling based on probabil approximations: trade bounded errors for greatly reduced resource costs highlyscalable.wordpress.com/2012/05/01/ probabilistic-structures-web-analytics- data-mining/ A Big Picture… Twitter catch-phrase: “Hash, don’t sample”
  • 39. Streaming Algorithms: example usage algorithm use case example Count-Min Sketch frequency summaries code HyperLogLog set cardinality code Bloom Filter set membership MinHash set similarity DSQ streaming quantiles SkipList ordered sequence search
  • 40. This pattern of Kafka, Spark, Cassandra, and generally Mesos is sometimes called “Team Apache” – frameworks distinct from the Hadoop stack and its vendors, displacing them as industry demands real-time insights at scale, leading to an IoT “flywheel” effect… Streaming Algorithms:Themes 20111999 2012 MillWheel Streaming Algorithms Spark Streaming Spark Core 2000s Key/Value Stores Algebird
  • 41. Streaming Algorithms: Primary Sources “The Space Complexity of Approximating the Frequency Moments”
 Noga Alon, Yossi Matias, Mario Szegedy
 JCSS 58:1, (Feb 1999), 137-147
 10.1006/jcss.1997.1545 “Cassandra - A Decentralized Structured Storage System”
 Avinash Lakshman, Prashant Malik
 ACM SIGOPS 44:2 (Apr 2010), 35-40
 10.1145/1773912.1773922 Algebird
 Avi Bryant, Oscar Boykin, et al. 
 Twitter (2012)
 engineering.twitter.com/opensource/projects/algebird “Discretized Streams:A Fault-Tolerant Model for Scalable Stream Processing”
 Matei Zaharia, Tathagata Das, et al.
 Berkeley EECS (2012-12-14)
 www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf “MillWheel: Fault-Tolerant Stream Processing at Internet Scale”
 Tyler Akidau, et al.
 VLDB (2013)
 research.google.com/pubs/pub41378.html
  • 42. Add ALL theThings:
 Abstract Algebra Meets Analytics
 infoq.com/presentations/abstract-algebra-analytics
 Avi Bryant, Strange Loop (2013) • grouping doesn’t matter (associativity) • ordering doesn’t matter (commutativity) • zeros get ignored In other words, while partitioning data at scale is quite difficult, you can let the math allow your code to be flexible at scale Avi Bryant
 @avibryant Streaming Algorithms: performance at scale
  • 43. Spark Streaming: system integrator Stratio Streaming: a new approach to 
 Spark Streaming David Morales, Oscar Mendez 2014-06-30 spark-summit.org/2014/talk/stratio-streaming- a-new-approach-to-spark-streaming • Stratio Streaming is the union of a real-time messaging bus with a complex event processing engine using Spark Streaming • allows creation of streams and queries on the fly • paired with Siddhi CEP engine and Apache Kafka • added global features to the engine such as auditing and statistics • use cases: large banks, retail, travel, etc. • using Apache Mesos
  • 44. Spark Streaming: neuroscience research Analytics +Visualization for Neuroscience: Spark,Thunder, Lightning Jeremy Freeman
 2015-01-29 youtu.be/cBQm4LhHn9g?t=28m55s • neuroscience studies: zebrafish, rats, etc. • see http://codeneuro.org/ • real-time ML for laser control • 2 TB/hour per fish • 80 HPC nodes
  • 45. Spark Streaming: geospatial analytics Plot all the data – interactive visualization 
 of massive datasets Rob Harper, Nathan Kronenfeld
 2015-05-20 uncharted.software/spark-summit-east-2015-presentation/ • geospatial analytics – especially given sensor data, remote sensing from micro satellites, etc. • sophisticated interactive visualization • open source http://aperturetiles.com/ • see http://pantera.io/
  • 47. Spark Developer Certification
 • go.databricks.com/spark-certified-developer • defined by Spark experts @Databricks • assessed by O’Reilly Media • establishes the bar for Spark expertise
  • 48. community: spark.apache.org/community.html events worldwide: goo.gl/2YqJZK YouTube channel: goo.gl/N5Hx3h ! video+preso archives: spark-summit.org resources: databricks.com/spark/developer-resources workshops: databricks.com/spark/training
  • 49. MOOCs: Anthony Joseph
 UC Berkeley begins Jun 2015 edx.org/course/uc-berkeleyx/uc- berkeleyx-cs100-1x- introduction-big-6181 Ameet Talwalkar
 UCLA begins Jun 2015 edx.org/course/uc-berkeleyx/ uc-berkeleyx-cs190-1x- scalable-machine-6066
  • 50. confs: Strata EU
 London, May 5-7
 strataconf.com/big-data-conference-uk-2015 GOTO Chicago
 Chicago, May 11-14
 gotocon.com/chicago-2015 Spark Summit 2015
 SF, Jun 15-17
 spark-summit.org Spark Summit EU
 (to be announced)
  • 51. books+videos: Fast Data Processing 
 with Spark
 Holden Karau
 Packt (2013)
 shop.oreilly.com/ product/ 9781782167068.do Spark in Action
 Chris Fregly
 Manning (2015)
 sparkinaction.com/ Learning Spark
 Holden Karau, 
 Andy Konwinski,
 Parick Wendell, 
 Matei Zaharia
 O’Reilly (2015)
 shop.oreilly.com/ product/ 0636920028512.do Intro to Apache Spark
 Paco Nathan
 O’Reilly (2015)
 shop.oreilly.com/ product/ 0636920036807.do Advanced Analytics with Spark
 Sandy Ryza, 
 Uri Laserson,
 Sean Owen, 
 Josh Wills
 O’Reilly (2014)
 shop.oreilly.com/ product/ 0636920035091.do
  • 52. presenter: Just Enough Math O’Reilly, 2014 justenoughmath.com
 preview: youtu.be/TQ58cWgdCpA monthly newsletter for updates, 
 events, conf summaries, etc.: liber118.com/pxn/ Enterprise Data Workflows with Cascading O’Reilly, 2013 shop.oreilly.com/product/ 0636920028536.do