SlideShare ist ein Scribd-Unternehmen logo
1 von 66
Downloaden Sie, um offline zu lesen
Big Data Analytics Systems:
What Goes Around Comes Around
Reynold Xin, CS186 guest lecture @ Berkeley
Apr 9, 2015
Who am I?
Co-founder & architect @ Databricks
On-leave from PhD @ Berkeley AMPLab
Current world record holder in 100TB sorting (Daytona
GraySort Benchmark)
Transaction
Processing
(OLTP)
“User A bought item b”
Analytics
(OLAP)
“What is revenue each store
this year?”
Agenda
What is “Big Data” (BD)?
GFS, MapReduce, Hadoop, Spark
What’s different between BD and DB?
Assumption: you learned about parallel DB already.
What is “Big Data”?
Gartner’s Definition
“Big data” is high-volume, -velocity and -variety
information assets that demand cost-effective,
innovative forms of information processing for
enhanced insight and decision making.
3 Vs of Big Data
Volume: data size
Velocity: rate of data coming in
Variety (most important V): data sources, formats,
workloads
“Big Data” can also refer to the tech stack
Many were pioneered by Google
Why didn’t Google just use
database systems?
Challenges
Data size growing (volume & velocity)
–  Processing has to scale out over large clusters
Complexity of analysis increasing (variety)
–  Massive ETL (web crawling)
–  Machine learning, graph processing
Examples
Google web index: 10+ PB
Types of data: HTML pages, PDFs, images, videos, …
Cost of 1 TB of disk: $50
Time to read 1 TB from disk: 6 hours (50 MB/s)
The Big Data Problem
Semi-/Un-structured data doesn’t fit well with
databases
Single machine can no longer process or even store all
the data!
Only solution is to distribute general storage &
processing over clusters.
GFS Assumptions
“Component failures are the norm rather than the
exception”
“Files are huge by traditional standards”
“Most files are mutated by appending new data rather
than overwriting existing data”
- GFS paper
File Splits
17
Large	
  File	
  
1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001	
  
1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001	
  
1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001	
  
1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001	
  
1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001	
  
1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001	
  
1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001	
  
1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001	
  
…	
  
	
  
6440MB	
  
Block	
  	
  
1	
  
Block	
  	
  
2	
  
Block	
  	
  
3	
  
Block	
  	
  
4	
  
Block	
  	
  
5	
  
Block	
  	
  
6	
  
Block	
  	
  
100	
  
Block	
  	
  
101	
  
64MB	
   64MB	
   64MB	
   64MB	
   64MB	
   64MB	
  
…	
  
64MB	
   40MB	
  
Block	
  	
  
1	
  
Block	
  	
  
2	
  
Let’s color-code them
Block	
  	
  
3	
  
Block	
  	
  
4	
  
Block	
  	
  
5	
  
Block	
  	
  
6	
  
Block	
  	
  
100	
  
Block	
  	
  
101	
  
e.g., Block Size = 64MB
Files are composed of set of blocks
•  Typically 64MB in size
•  Each block is stored as a separate file in the
local file system (e.g. NTFS)
Default placement policy:
–  First copy is written to the node creating the file (write affinity)
–  Second copy is written to a data node within the same rack
(to minimize cross-rack network traffic)
–  Third copy is written to a data node in a different rack
(to tolerate switch failures)
Node	
  5	
  Node	
  4	
  Node	
  3	
  Node	
  2	
  Node	
  1	
  
Block Placement
18
Block	
  	
  
1	
  
Block	
  	
  
3	
  
Block	
  	
  
2	
  
Block	
  	
  
1	
  
Block	
  	
  
3	
  
Block	
  	
  
2	
  
Block	
  	
  
3	
  
Block	
  	
  
2	
  
Block	
  	
  
1	
  
e.g., Replication factor = 3
Objec<ves:	
  load	
  balancing,	
  fast	
  access,	
  fault	
  tolerance	
  
GFS Architecture
19
NameNode	
   BackupNode	
  
DataNode	
   DataNode	
   DataNode	
   DataNode	
   DataNode	
  
(heartbeat, balancing, replication, etc.)
namespace backups
Failure types:
q  Disk errors and failures
q  DataNode failures
q  Switch/Rack failures
q  NameNode failures
q  Datacenter failures
Failures, Failures, Failures
GFS paper: “Component failures are the norm rather
than the exception.”
20
NameNode	
  
DataNode	
  
GFS Summary
Store large, immutable (append-only) files
Scalability
Reliability
Availability
Google Datacenter
How do we program this thing?
Traditional Network Programming
Message-passing between nodes (MPI, RPC, etc)
Really hard to do at scale:
–  How to split problem across nodes?
•  Important to consider network and data locality
–  How to deal with failures?
•  If a typical server fails every 3 years, a 10,000-node cluster sees 10
faults/day!
–  Even without failures: stragglers (a node is slow)
Almost nobody does this!
Data-Parallel Models
Restrict the programming interface so that the system
can do more automatically
“Here’s an operation, run it on all of the data”
–  I don’t care where it runs (you schedule that)
–  In fact, feel free to run it twice on different nodes
Does this sound familiar?
MapReduce
First widely popular programming model for data-
intensive apps on clusters
Published by Google in 2004
–  Processes 20 PB of data / day
Popularized by open-source Hadoop project
MapReduce Programming Model
Data type: key-value records
Map function:
(Kin, Vin) -> list(Kinter, Vinter)
Reduce function:
(Kinter, list(Vinter)) -> list(Kout, Vout)
Hello World of Big Data: Word Count
the	
  quick	
  
brown	
  fox	
  
the	
  fox	
  ate	
  
the	
  mouse	
  
how	
  now	
  
brown	
  
cow	
  
Map	
  
Map	
  
Map	
  
Reduce	
  
Reduce	
  
brown,	
  2	
  
fox,	
  2	
  
how,	
  1	
  
now,	
  1	
  
the,	
  3	
  
ate,	
  1	
  
cow,	
  1	
  
mouse,	
  1	
  
quick,	
  1	
  
the,	
  1	
  
brown,	
  1	
  
fox,	
  1	
  
quick,	
  1	
  
the,	
  1	
  
fox,	
  1	
  
the,	
  1	
  
how,	
  1	
  
now,	
  1	
  
brown,	
  1	
  
ate,	
  1	
  
mouse,	
  1	
  
cow,	
  1	
  
Input	
   Map	
   Shuffle	
  &	
  Sort	
   Reduce	
   Output	
  
MapReduce Execution
Automatically split work into many small tasks
Send map tasks to nodes based on data locality
Load-balance dynamically as tasks finish
MapReduce Fault Recovery
If a task fails, re-run it and re-fetch its input
–  Requirement: input is immutable
If a node fails, re-run its map tasks on others
–  Requirement: task result is deterministic & side effect is
idempotent
If a task is slow, launch 2nd copy on other node
–  Requirement: same as above
MapReduce Summary
By providing a data-parallel model, MapReduce greatly
simplified cluster computing:
–  Automatic division of job into tasks
–  Locality-aware scheduling
–  Load balancing
–  Recovery from failures & stragglers
Also flexible enough to model a lot of workloads…
Hadoop
Open-sourced by Yahoo!
–  modeled after the two Google papers
Two components:
–  Storage: Hadoop Distributed File System (HDFS)
–  Compute: Hadoop MapReduce
Sometimes synonymous with Big Data
Why didn’t Google just use databases?
Cost
–  database vendors charge by $/TB or $/core
Scale
–  no database systems at the time had been demonstrated to work at that scale (#
machines or data size)
Data Model
–  A lot of semi-/un-structured data: web pages, images, videos
Compute Model
–  SQL not expressive (or “simple”) enough for many Google tasks (e.g. crawl the web,
build inverted index, log analysis on unstructured data)
Not-invented-here
MapReduce Programmability
Most real applications require multiple MR steps
–  Google indexing pipeline: 21 steps
–  Analytics queries (e.g. count clicks & top K): 2 – 5 steps
–  Iterative algorithms (e.g. PageRank): 10’s of steps
Multi-step jobs create spaghetti code
–  21 MR steps -> 21 mapper and reducer classes
–  Lots of boilerplate code per step
Higher Level Frameworks
SELECT count(*) FROM users
A = load 'foo';
B = group A all;
C = foreach B generate COUNT(A);
In reality, 90+% of MR jobs are generated by Hive SQL
SQL on Hadoop (Hive)
Meta	
  
store	
  
HDFS	
  
	
  
	
  	
  Client	
  
Driver	
  
SQL	
  
Parser	
  
Query	
  
Op<mizer	
  
Physical	
  Plan	
  
Execu<on	
  
CLI	
   JDBC	
  
MapReduce	
  
Problems with MapReduce
1.  Programmability
–  We covered this earlier …
2.  Performance
–  Each MR job writes all output to disk
–  Lack of more primitives such as data broadcast
Spark
Started in Berkeley AMPLab in 2010; addresses MR problems.
Programmability: DSL in Scala / Java / Python
–  Functional transformations on collections
–  5 – 10X less code than MR
–  Interactive use from Scala / Python REPL
–  You can unit test Spark programs!
Performance:
–  General DAG of tasks (i.e. multi-stage MR)
–  Richer primitives: in-memory cache, torrent broadcast, etc
–  Can run 10 – 100X faster than MR
Programmability
#include "mapreduce/mapreduce.h"
// User’s map function
class SplitWords: public Mapper {
public:
virtual void Map(const MapInput& input)
{
const string& text = input.value();
const int n = text.size();
for (int i = 0; i < n; ) {
// Skip past leading whitespace
while (i < n && isspace(text[i]))
i++;
// Find word end
int start = i;
while (i < n && !isspace(text[i]))
i++;
if (start < i)
Emit(text.substr(
start,i-start),"1");
}
}
};
REGISTER_MAPPER(SplitWords);
// User’s reduce function
class Sum: public Reducer {
public:
virtual void Reduce(ReduceInput* input)
{
// Iterate over all entries with the
// same key and add the values
int64 value = 0;
while (!input->done()) {
value += StringToInt(
input->value());
input->NextValue();
}
// Emit sum for input->key()
Emit(IntToString(value));
}
};
REGISTER_REDUCER(Sum);
int main(int argc, char** argv) {
ParseCommandLineFlags(argc, argv);
MapReduceSpecification spec;
for (int i = 1; i < argc; i++) {
MapReduceInput* in= spec.add_input();
in->set_format("text");
in->set_filepattern(argv[i]);
in->set_mapper_class("SplitWords");
}
// Specify the output files
MapReduceOutput* out = spec.output();
out->set_filebase("/gfs/test/freq");
out->set_num_tasks(100);
out->set_format("text");
out->set_reducer_class("Sum");
// Do partial sums within map
out->set_combiner_class("Sum");
// Tuning parameters
spec.set_machines(2000);
spec.set_map_megabytes(100);
spec.set_reduce_megabytes(100);
// Now run it
MapReduceResult result;
if (!MapReduce(spec, &result)) abort();
return 0;
}
Full Google WordCount:
Programmability
Spark WordCount:
val file = spark.textFile(“hdfs://...”)
val counts = file.flatMap(line => line.split(“ ”))

.map(word => (word, 1))

.reduceByKey(_ + _)



counts.save(“out.txt”)
Performance
0.96
110
0 25 50 75 100 125
Logistic Regression
4.1
155
0 30 60 90 120 150 180
K-Means Clustering
Hadoop MR
Spark
Time per Iteration (s)
43
Performance
Time to sort 100TB
2100 machines2013 Record:
Hadoop
2014 Record:
Spark
Source: Daytona GraySort benchmark, sortbenchmark.org
72 minutes
207 machines
23 minutes
Also sorted 1PB in 4 hours
Spark Summary
Spark generalizes MapReduce to provide:
–  High performance
–  Better programmability
–  (consequently) a unified engine
The most active open source data project
Note: not a scientific comparison.
Beyond Hadoop Users
47
Spark early adopters
Data Engineers
Data Scientists
Statisticians
R users
PyData …
Users
Understands
MapReduce
& functional APIs
48
DataFrames in Spark
Distributed collection of data grouped into named
columns (i.e. RDD with schema)
DSL designed for common tasks
–  Metadata
–  Sampling
–  Project, filter, aggregation, join, …
–  UDFs
Available in Python, Scala, Java, and R (via SparkR)
49
Plan Optimization & Execution
50
DataFrames and SQL share the same optimization/execution pipeline
Maximize code reuse & share optimization efforts
SQL	
  AST	
  
DataFrame	
  
Unresolved	
  
Logical	
  Plan	
  
Logical	
  Plan	
  
Op<mized	
  
Logical	
  Plan	
  
Physical	
  Plans	
  Physical	
  Plans	
   RDDs	
  
Selected	
  
Physical	
  Plan	
  
Analysis	
  
Logical	
  
Op<miza<on	
  
Physical	
  
Planning	
  
Cost	
  Model	
  
Physical	
  Plans	
  
Code	
  
Genera<on	
  
Catalog	
  
Our Experience So Far
SQL is wildly popular and important
–  100% of Databricks customers use some SQL
Schema is very useful
–  Most data pipelines, even the ones that start with unstructured
data, end up having some implicit structure
–  Key-value too limited
–  That said, semi-/un-structured support is paramount
Separation of logical vs physical plan
–  Important for performance optimizations (e.g. join selection)
Return of SQL
Why SQL?
Almost everybody knows SQL
Easier to write than MR (even Spark) for analytic queries
Lingua franca for data analysis tools (business
intelligence, etc)
Schema is useful (key-value is limited)
What’s really different?
SQL on BD (Hadoop/Spark) vs SQL in DB?
Two perspectives:
1.  Flexibility in data and compute model
2.  Fault-tolerance
Traditional Database Systems
(Monolithic)
Physical Execution Engine (Dataflow)
SQL
Applications
One way (SQL) in/out and data must be structured
Data-Parallel Engine (Spark, MR)
SQL DataFrame M.L.
Big Data Ecosystems
(Layered)
Decoupled storage, low vs high level compute
Structured, semi-structured, unstructured data
Schema on read, schema on write
Evolution of Database Systems
Decouple Storage from Compute
Physical Execution Engine (Dataflow)
SQL
Applications
Physical Execution Engine (Dataflow)
SQL
Applications
Traditional 2014 - 2015
IBM Big Insight
Oracle
EMC Greenplum
…
support for nested data (e.g. JSON)
Perspective 2: Fault Tolerance
Database systems: coarse-grained fault tolerance
–  If fault happens, fail the query (or rerun from the beginning)
MapReduce: fine-grained fault tolerance
–  Rerun failed tasks, not the entire query
We were writing it to 48,000 hard drives (we did not use the full capacity of these
disks, though), and every time we ran our sort, at least one of our disks managed
to break (this is not surprising at all given the duration of the test, the number of
disks involved, and the expected lifetime of hard disks).
MapReduce
Checkpointing-based Fault Tolerance
Checkpoint all intermediate output
–  Replicate them to multiple nodes
–  Upon failure, recover from checkpoints
–  High cost of fault-tolerance (disk and network I/O)
Necessary for PBs of data on thousands of machines
What if I have 20 nodes and my query takes only 1 min?
Spark
Unified Checkpointing and Rerun
Simple idea: remember the lineage to create an RDD,
and recompute from last checkpoint.
When fault happens, query still continues.
When faults are rare, no need to checkpoint, i.e. cost of
fault-tolerance is low.
What’s Really Different?
Monolithic vs layered storage & compute
–  DB becoming more layered
–  Although “Big Data” still far more flexible than DB
Fault-tolerance
–  DB mostly coarse-grained fault-tolerance, assuming faults
are rare
–  Big Data mostly fine-grained fault-tolerance, with new
strategies in Spark to mitigate faults at low cost
Convergence
DB evolving towards BD
–  Decouple storage from compute
–  Provide alternative programming models
–  Semi-structured data (JSON, XML, etc)
BD evolving towards DB
–  Schema beyond key-value
–  Separation of logical vs physical plan
–  Query optimization
–  More optimized storage formats
Thanks & Questions?
Reynold Xin
rxin@databricks.com
@rxin
Acknowledgement
Some slides taken from:
Zaharia. Processing Big Data with Small Programs
Franklin. SQL, NoSQL, NewSQL? CS186 2013

Weitere ähnliche Inhalte

Was ist angesagt?

End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Spark Summit
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Spark streaming
Spark streamingSpark streaming
Spark streamingWhiteklay
 
Drill architecture 20120913
Drill architecture 20120913Drill architecture 20120913
Drill architecture 20120913jasonfrantz
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...Spark Summit
 
Samza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshopSamza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshopTao Feng
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkDatabricks
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Apex
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallSpark Summit
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisHelena Edelson
 
Will it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing ApplicationsWill it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing ApplicationsNavina Ramesh
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data Omid Vahdaty
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark StreamingGerard Maas
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringTaro L. Saito
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsKeeyong Han
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 

Was ist angesagt? (20)

End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Spark Streaming into context
Spark Streaming into contextSpark Streaming into context
Spark Streaming into context
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Drill architecture 20120913
Drill architecture 20120913Drill architecture 20120913
Drill architecture 20120913
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
 
Samza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshopSamza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshop
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
Will it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing ApplicationsWill it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing Applications
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoring
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data Analytics
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 

Andere mochten auch

Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkReynold Xin
 
Big data analytics
Big data analyticsBig data analytics
Big data analyticsRavi Teja
 
4th EDUCON ASIA CONFERENCE
4th EDUCON ASIA CONFERENCE4th EDUCON ASIA CONFERENCE
4th EDUCON ASIA CONFERENCEKahren Navarro
 
ConnectingUp Keynote: Leading Change
ConnectingUp Keynote: Leading ChangeConnectingUp Keynote: Leading Change
ConnectingUp Keynote: Leading ChangeHolly Ross
 
The New Data Imperative
The New Data ImperativeThe New Data Imperative
The New Data ImperativeHolly Ross
 
The fourth paradigm in safety
The fourth paradigm in safetyThe fourth paradigm in safety
The fourth paradigm in safetyJohan Roels
 
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksBig Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksData Con LA
 
Spark - The beginnings
Spark -  The beginningsSpark -  The beginnings
Spark - The beginningsDaniel Leon
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...Chris Fregly
 
The Effectiveness of a Proposed Training Program is based on Collaborative Cl...
The Effectiveness of a Proposed Training Program is based on Collaborative Cl...The Effectiveness of a Proposed Training Program is based on Collaborative Cl...
The Effectiveness of a Proposed Training Program is based on Collaborative Cl...Taghreed Alrehaili
 
Real-Time-Analytics mit Spark und Cassandra
Real-Time-Analytics mit Spark und CassandraReal-Time-Analytics mit Spark und Cassandra
Real-Time-Analytics mit Spark und CassandraThomas Mann
 

Andere mochten auch (20)

Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache Spark
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
4th EDUCON ASIA CONFERENCE
4th EDUCON ASIA CONFERENCE4th EDUCON ASIA CONFERENCE
4th EDUCON ASIA CONFERENCE
 
Epsilon
EpsilonEpsilon
Epsilon
 
Big data
Big dataBig data
Big data
 
ConnectingUp Keynote: Leading Change
ConnectingUp Keynote: Leading ChangeConnectingUp Keynote: Leading Change
ConnectingUp Keynote: Leading Change
 
The New Data Imperative
The New Data ImperativeThe New Data Imperative
The New Data Imperative
 
Data Science and the Fourth Paradigm by Torben Bach Pedersen
Data Science and the Fourth Paradigm by Torben Bach PedersenData Science and the Fourth Paradigm by Torben Bach Pedersen
Data Science and the Fourth Paradigm by Torben Bach Pedersen
 
The fourth paradigm in safety
The fourth paradigm in safetyThe fourth paradigm in safety
The fourth paradigm in safety
 
The Fourth Paradigm Book
The Fourth Paradigm BookThe Fourth Paradigm Book
The Fourth Paradigm Book
 
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksBig Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
 
Spark - The beginnings
Spark -  The beginningsSpark -  The beginnings
Spark - The beginnings
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
The Effectiveness of a Proposed Training Program is based on Collaborative Cl...
The Effectiveness of a Proposed Training Program is based on Collaborative Cl...The Effectiveness of a Proposed Training Program is based on Collaborative Cl...
The Effectiveness of a Proposed Training Program is based on Collaborative Cl...
 
Real-Time-Analytics mit Spark und Cassandra
Real-Time-Analytics mit Spark und CassandraReal-Time-Analytics mit Spark und Cassandra
Real-Time-Analytics mit Spark und Cassandra
 
Analytics for startups
Analytics for startupsAnalytics for startups
Analytics for startups
 
Big Data mit Apache Hadoop
Big Data mit Apache HadoopBig Data mit Apache Hadoop
Big Data mit Apache Hadoop
 
Meetup Systemd vs sysvinit
Meetup Systemd vs sysvinitMeetup Systemd vs sysvinit
Meetup Systemd vs sysvinit
 

Ähnlich wie (Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around

Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and DeploymentCisco Canada
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keownCisco Canada
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with SparkArjen de Vries
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabadsreehari orienit
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingCollin Bennett
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalabilityWANdisco Plc
 
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukCloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukAndrii Vozniuk
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overviewharithakannan
 

Ähnlich wie (Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around (20)

Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keown
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with Spark
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive Computing
 
Spark
SparkSpark
Spark
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Hadoop - Introduction to HDFS
Hadoop - Introduction to HDFSHadoop - Introduction to HDFS
Hadoop - Introduction to HDFS
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
 
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukCloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
 

Kürzlich hochgeladen

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 

Kürzlich hochgeladen (20)

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 

(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around

  • 1. Big Data Analytics Systems: What Goes Around Comes Around Reynold Xin, CS186 guest lecture @ Berkeley Apr 9, 2015
  • 2. Who am I? Co-founder & architect @ Databricks On-leave from PhD @ Berkeley AMPLab Current world record holder in 100TB sorting (Daytona GraySort Benchmark)
  • 3. Transaction Processing (OLTP) “User A bought item b” Analytics (OLAP) “What is revenue each store this year?”
  • 4. Agenda What is “Big Data” (BD)? GFS, MapReduce, Hadoop, Spark What’s different between BD and DB? Assumption: you learned about parallel DB already.
  • 5.
  • 6.
  • 7. What is “Big Data”?
  • 8. Gartner’s Definition “Big data” is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
  • 9. 3 Vs of Big Data Volume: data size Velocity: rate of data coming in Variety (most important V): data sources, formats, workloads
  • 10. “Big Data” can also refer to the tech stack Many were pioneered by Google
  • 11. Why didn’t Google just use database systems?
  • 12. Challenges Data size growing (volume & velocity) –  Processing has to scale out over large clusters Complexity of analysis increasing (variety) –  Massive ETL (web crawling) –  Machine learning, graph processing
  • 13. Examples Google web index: 10+ PB Types of data: HTML pages, PDFs, images, videos, … Cost of 1 TB of disk: $50 Time to read 1 TB from disk: 6 hours (50 MB/s)
  • 14. The Big Data Problem Semi-/Un-structured data doesn’t fit well with databases Single machine can no longer process or even store all the data! Only solution is to distribute general storage & processing over clusters.
  • 15.
  • 16. GFS Assumptions “Component failures are the norm rather than the exception” “Files are huge by traditional standards” “Most files are mutated by appending new data rather than overwriting existing data” - GFS paper
  • 17. File Splits 17 Large  File   1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001   1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001   1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001   1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001   1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001   1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001   1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001   1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001   …     6440MB   Block     1   Block     2   Block     3   Block     4   Block     5   Block     6   Block     100   Block     101   64MB   64MB   64MB   64MB   64MB   64MB   …   64MB   40MB   Block     1   Block     2   Let’s color-code them Block     3   Block     4   Block     5   Block     6   Block     100   Block     101   e.g., Block Size = 64MB Files are composed of set of blocks •  Typically 64MB in size •  Each block is stored as a separate file in the local file system (e.g. NTFS)
  • 18. Default placement policy: –  First copy is written to the node creating the file (write affinity) –  Second copy is written to a data node within the same rack (to minimize cross-rack network traffic) –  Third copy is written to a data node in a different rack (to tolerate switch failures) Node  5  Node  4  Node  3  Node  2  Node  1   Block Placement 18 Block     1   Block     3   Block     2   Block     1   Block     3   Block     2   Block     3   Block     2   Block     1   e.g., Replication factor = 3 Objec<ves:  load  balancing,  fast  access,  fault  tolerance  
  • 19. GFS Architecture 19 NameNode   BackupNode   DataNode   DataNode   DataNode   DataNode   DataNode   (heartbeat, balancing, replication, etc.) namespace backups
  • 20. Failure types: q  Disk errors and failures q  DataNode failures q  Switch/Rack failures q  NameNode failures q  Datacenter failures Failures, Failures, Failures GFS paper: “Component failures are the norm rather than the exception.” 20 NameNode   DataNode  
  • 21. GFS Summary Store large, immutable (append-only) files Scalability Reliability Availability
  • 22. Google Datacenter How do we program this thing?
  • 23. Traditional Network Programming Message-passing between nodes (MPI, RPC, etc) Really hard to do at scale: –  How to split problem across nodes? •  Important to consider network and data locality –  How to deal with failures? •  If a typical server fails every 3 years, a 10,000-node cluster sees 10 faults/day! –  Even without failures: stragglers (a node is slow) Almost nobody does this!
  • 24. Data-Parallel Models Restrict the programming interface so that the system can do more automatically “Here’s an operation, run it on all of the data” –  I don’t care where it runs (you schedule that) –  In fact, feel free to run it twice on different nodes Does this sound familiar?
  • 25.
  • 26. MapReduce First widely popular programming model for data- intensive apps on clusters Published by Google in 2004 –  Processes 20 PB of data / day Popularized by open-source Hadoop project
  • 27. MapReduce Programming Model Data type: key-value records Map function: (Kin, Vin) -> list(Kinter, Vinter) Reduce function: (Kinter, list(Vinter)) -> list(Kout, Vout)
  • 28. Hello World of Big Data: Word Count the  quick   brown  fox   the  fox  ate   the  mouse   how  now   brown   cow   Map   Map   Map   Reduce   Reduce   brown,  2   fox,  2   how,  1   now,  1   the,  3   ate,  1   cow,  1   mouse,  1   quick,  1   the,  1   brown,  1   fox,  1   quick,  1   the,  1   fox,  1   the,  1   how,  1   now,  1   brown,  1   ate,  1   mouse,  1   cow,  1   Input   Map   Shuffle  &  Sort   Reduce   Output  
  • 29. MapReduce Execution Automatically split work into many small tasks Send map tasks to nodes based on data locality Load-balance dynamically as tasks finish
  • 30. MapReduce Fault Recovery If a task fails, re-run it and re-fetch its input –  Requirement: input is immutable If a node fails, re-run its map tasks on others –  Requirement: task result is deterministic & side effect is idempotent If a task is slow, launch 2nd copy on other node –  Requirement: same as above
  • 31. MapReduce Summary By providing a data-parallel model, MapReduce greatly simplified cluster computing: –  Automatic division of job into tasks –  Locality-aware scheduling –  Load balancing –  Recovery from failures & stragglers Also flexible enough to model a lot of workloads…
  • 32. Hadoop Open-sourced by Yahoo! –  modeled after the two Google papers Two components: –  Storage: Hadoop Distributed File System (HDFS) –  Compute: Hadoop MapReduce Sometimes synonymous with Big Data
  • 33.
  • 34. Why didn’t Google just use databases? Cost –  database vendors charge by $/TB or $/core Scale –  no database systems at the time had been demonstrated to work at that scale (# machines or data size) Data Model –  A lot of semi-/un-structured data: web pages, images, videos Compute Model –  SQL not expressive (or “simple”) enough for many Google tasks (e.g. crawl the web, build inverted index, log analysis on unstructured data) Not-invented-here
  • 35. MapReduce Programmability Most real applications require multiple MR steps –  Google indexing pipeline: 21 steps –  Analytics queries (e.g. count clicks & top K): 2 – 5 steps –  Iterative algorithms (e.g. PageRank): 10’s of steps Multi-step jobs create spaghetti code –  21 MR steps -> 21 mapper and reducer classes –  Lots of boilerplate code per step
  • 36. Higher Level Frameworks SELECT count(*) FROM users A = load 'foo'; B = group A all; C = foreach B generate COUNT(A); In reality, 90+% of MR jobs are generated by Hive SQL
  • 37. SQL on Hadoop (Hive) Meta   store   HDFS        Client   Driver   SQL   Parser   Query   Op<mizer   Physical  Plan   Execu<on   CLI   JDBC   MapReduce  
  • 38. Problems with MapReduce 1.  Programmability –  We covered this earlier … 2.  Performance –  Each MR job writes all output to disk –  Lack of more primitives such as data broadcast
  • 39. Spark Started in Berkeley AMPLab in 2010; addresses MR problems. Programmability: DSL in Scala / Java / Python –  Functional transformations on collections –  5 – 10X less code than MR –  Interactive use from Scala / Python REPL –  You can unit test Spark programs! Performance: –  General DAG of tasks (i.e. multi-stage MR) –  Richer primitives: in-memory cache, torrent broadcast, etc –  Can run 10 – 100X faster than MR
  • 40. Programmability #include "mapreduce/mapreduce.h" // User’s map function class SplitWords: public Mapper { public: virtual void Map(const MapInput& input) { const string& text = input.value(); const int n = text.size(); for (int i = 0; i < n; ) { // Skip past leading whitespace while (i < n && isspace(text[i])) i++; // Find word end int start = i; while (i < n && !isspace(text[i])) i++; if (start < i) Emit(text.substr( start,i-start),"1"); } } }; REGISTER_MAPPER(SplitWords); // User’s reduce function class Sum: public Reducer { public: virtual void Reduce(ReduceInput* input) { // Iterate over all entries with the // same key and add the values int64 value = 0; while (!input->done()) { value += StringToInt( input->value()); input->NextValue(); } // Emit sum for input->key() Emit(IntToString(value)); } }; REGISTER_REDUCER(Sum); int main(int argc, char** argv) { ParseCommandLineFlags(argc, argv); MapReduceSpecification spec; for (int i = 1; i < argc; i++) { MapReduceInput* in= spec.add_input(); in->set_format("text"); in->set_filepattern(argv[i]); in->set_mapper_class("SplitWords"); } // Specify the output files MapReduceOutput* out = spec.output(); out->set_filebase("/gfs/test/freq"); out->set_num_tasks(100); out->set_format("text"); out->set_reducer_class("Sum"); // Do partial sums within map out->set_combiner_class("Sum"); // Tuning parameters spec.set_machines(2000); spec.set_map_megabytes(100); spec.set_reduce_megabytes(100); // Now run it MapReduceResult result; if (!MapReduce(spec, &result)) abort(); return 0; } Full Google WordCount:
  • 41. Programmability Spark WordCount: val file = spark.textFile(“hdfs://...”) val counts = file.flatMap(line => line.split(“ ”))
 .map(word => (word, 1))
 .reduceByKey(_ + _)
 
 counts.save(“out.txt”)
  • 42. Performance 0.96 110 0 25 50 75 100 125 Logistic Regression 4.1 155 0 30 60 90 120 150 180 K-Means Clustering Hadoop MR Spark Time per Iteration (s)
  • 43. 43 Performance Time to sort 100TB 2100 machines2013 Record: Hadoop 2014 Record: Spark Source: Daytona GraySort benchmark, sortbenchmark.org 72 minutes 207 machines 23 minutes Also sorted 1PB in 4 hours
  • 44.
  • 45. Spark Summary Spark generalizes MapReduce to provide: –  High performance –  Better programmability –  (consequently) a unified engine The most active open source data project
  • 46. Note: not a scientific comparison.
  • 47. Beyond Hadoop Users 47 Spark early adopters Data Engineers Data Scientists Statisticians R users PyData … Users Understands MapReduce & functional APIs
  • 48. 48
  • 49. DataFrames in Spark Distributed collection of data grouped into named columns (i.e. RDD with schema) DSL designed for common tasks –  Metadata –  Sampling –  Project, filter, aggregation, join, … –  UDFs Available in Python, Scala, Java, and R (via SparkR) 49
  • 50. Plan Optimization & Execution 50 DataFrames and SQL share the same optimization/execution pipeline Maximize code reuse & share optimization efforts SQL  AST   DataFrame   Unresolved   Logical  Plan   Logical  Plan   Op<mized   Logical  Plan   Physical  Plans  Physical  Plans   RDDs   Selected   Physical  Plan   Analysis   Logical   Op<miza<on   Physical   Planning   Cost  Model   Physical  Plans   Code   Genera<on   Catalog  
  • 51. Our Experience So Far SQL is wildly popular and important –  100% of Databricks customers use some SQL Schema is very useful –  Most data pipelines, even the ones that start with unstructured data, end up having some implicit structure –  Key-value too limited –  That said, semi-/un-structured support is paramount Separation of logical vs physical plan –  Important for performance optimizations (e.g. join selection)
  • 53.
  • 54. Why SQL? Almost everybody knows SQL Easier to write than MR (even Spark) for analytic queries Lingua franca for data analysis tools (business intelligence, etc) Schema is useful (key-value is limited)
  • 55. What’s really different? SQL on BD (Hadoop/Spark) vs SQL in DB? Two perspectives: 1.  Flexibility in data and compute model 2.  Fault-tolerance
  • 56. Traditional Database Systems (Monolithic) Physical Execution Engine (Dataflow) SQL Applications One way (SQL) in/out and data must be structured
  • 57. Data-Parallel Engine (Spark, MR) SQL DataFrame M.L. Big Data Ecosystems (Layered) Decoupled storage, low vs high level compute Structured, semi-structured, unstructured data Schema on read, schema on write
  • 58. Evolution of Database Systems Decouple Storage from Compute Physical Execution Engine (Dataflow) SQL Applications Physical Execution Engine (Dataflow) SQL Applications Traditional 2014 - 2015 IBM Big Insight Oracle EMC Greenplum … support for nested data (e.g. JSON)
  • 59. Perspective 2: Fault Tolerance Database systems: coarse-grained fault tolerance –  If fault happens, fail the query (or rerun from the beginning) MapReduce: fine-grained fault tolerance –  Rerun failed tasks, not the entire query
  • 60. We were writing it to 48,000 hard drives (we did not use the full capacity of these disks, though), and every time we ran our sort, at least one of our disks managed to break (this is not surprising at all given the duration of the test, the number of disks involved, and the expected lifetime of hard disks).
  • 61. MapReduce Checkpointing-based Fault Tolerance Checkpoint all intermediate output –  Replicate them to multiple nodes –  Upon failure, recover from checkpoints –  High cost of fault-tolerance (disk and network I/O) Necessary for PBs of data on thousands of machines What if I have 20 nodes and my query takes only 1 min?
  • 62. Spark Unified Checkpointing and Rerun Simple idea: remember the lineage to create an RDD, and recompute from last checkpoint. When fault happens, query still continues. When faults are rare, no need to checkpoint, i.e. cost of fault-tolerance is low.
  • 63. What’s Really Different? Monolithic vs layered storage & compute –  DB becoming more layered –  Although “Big Data” still far more flexible than DB Fault-tolerance –  DB mostly coarse-grained fault-tolerance, assuming faults are rare –  Big Data mostly fine-grained fault-tolerance, with new strategies in Spark to mitigate faults at low cost
  • 64. Convergence DB evolving towards BD –  Decouple storage from compute –  Provide alternative programming models –  Semi-structured data (JSON, XML, etc) BD evolving towards DB –  Schema beyond key-value –  Separation of logical vs physical plan –  Query optimization –  More optimized storage formats
  • 65. Thanks & Questions? Reynold Xin rxin@databricks.com @rxin
  • 66. Acknowledgement Some slides taken from: Zaharia. Processing Big Data with Small Programs Franklin. SQL, NoSQL, NewSQL? CS186 2013