SlideShare ist ein Scribd-Unternehmen logo
1 von 39
MLlib: Scalable Machine Learning on Spark
Xiangrui Meng
1
Collaborators: Ameet Talwalkar, Evan Sparks, Virginia Smith, Xinghao
Pan, Shivaram Venkataraman, Matei Zaharia, Rean Griffith, John Duchi,
Joseph Gonzalez, Michael Franklin, Michael I. Jordan, Tim Kraska, etc.
What is MLlib?
2
What is MLlib?
MLlib is a Spark subproject providing machine learning
primitives:
• initial contribution from AMPLab, UC Berkeley
• shipped with Spark since version 0.8
• 33 contributors (March 2014)
3
What is MLlib?
Algorithms:
• classification: logistic regression, linear support vector machine
(SVM), naive Bayes
• regression: generalized linear regression (GLM)
• collaborative filtering: alternating least squares (ALS)
• clustering: k-means
• decomposition: singular value decomposition (SVD), principal
component analysis (PCA)
4
Why MLlib?
5
scikit-learn?
Algorithms:
• classification: SVM, nearest neighbors, random forest, …
• regression: support vector regression (SVR), ridge regression,
Lasso, logistic regression, …
• clustering: k-means, spectral clustering, …
• decomposition: PCA, non-negative matrix factorization (NMF),
independent component analysis (ICA), …
6
Mahout?
Algorithms:
• classification: logistic regression, naive Bayes, random forest, …
• collaborative filtering: ALS, …
• clustering: k-means, fuzzy k-means, …
• decomposition: SVD, randomized SVD, …
7
Vowpal Wabbit?H2O?
R?
MATLAB?
Mahout?
Weka?
scikit-learn?
LIBLINEAR?
8
GraphLab?
Why MLlib?
9
• It is built on Apache Spark, a fast and general
engine for large-scale data processing.
• Run programs up to 100x faster than
Hadoop MapReduce in memory, or
10x faster on disk.
• Write applications quickly in Java, Scala, or Python.
Why MLlib?
10
Word count
val counts = sc.textFile("hdfs://...")
.flatMap(line => line.split(" "))
.map(word => (word, 1L))
.reduceByKey(_ + _)
Gradient descent
val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.zeros(d)
for (i <- 1 to numIterations) {
val gradient = points.map { p =>
(1 / (1 + exp(-p.y * w.dot(p.x)) - 1) * p.y * p.x
).reduce(_ + _)
w -= alpha * gradient
}
12
// Load and parse the data.
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(_.split(‘ ').map(_.toDouble)).cache()
// Cluster the data into two classes using KMeans.
val clusters = KMeans.train(parsedData, 2, numIterations = 20)
// Compute the sum of squared errors.
val cost = clusters.computeCost(parsedData)
println("Sum of squared errors = " + cost)
k-means (scala)
13
k-means (python)
# Load and parse the data
data = sc.textFile("kmeans_data.txt")
parsedData = data.map(lambda line:
array([float(x) for x in line.split(' ‘)])).cache()
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations = 10,
runs = 1, initialization_mode = "kmeans||")
# Evaluate clustering by computing the sum of squared errors
def error(point):
center = clusters.centers[clusters.predict(point)]
return sqrt(sum([x**2 for x in (point - center)]))
cost = parsedData.map(lambda point: error(point))
.reduce(lambda x, y: x + y)
print("Sum of squared error = " + str(cost))
14
Dimension reduction
+ k-means
// compute principal components
val points: RDD[Vector] = ...
val mat = RowMatrix(points)
val pc = mat.computePrincipalComponents(20)
// project points to a low-dimensional space
val projected = mat.multiply(pc).rows
// train a k-means model on the projected data
val model = KMeans.train(projected, 10)
Why MLlib?
• It ships with Spark as
a standard component.
16
Spark community
One of the largest open source projects in big data:
• 170+ developers contributing
• 30+ companies contributing
• 400+ discussions per month on the mailing list
Out for dinner?
• Search for a restaurant and make a reservation.
• Start navigation.
• Food looks good? Take a photo and share.
18
Why smartphone?
Out for dinner?
• Search for a restaurant and make a reservation. (Yellow Pages?)
• Start navigation. (GPS?)
• Food looks good? Take a photo and share. (Camera?)
19
Why MLlib?
A special-purpose device may be better at one aspect
than a general-purpose device. But the cost of context
switching is high:
• different languages or APIs
• different data formats
• different tuning tricks
20
Spark SQL + MLlib
// Data can easily be extracted from existing sources,
// such as Apache Hive.
val trainingTable = sql("""
SELECT e.action,
u.age,
u.latitude,
u.longitude
FROM Users u
JOIN Events e
ON u.userId = e.userId""")
// Since `sql` returns an RDD, the results of the above
// query can be easily used in MLlib.
val training = trainingTable.map { row =>
val features = Vectors.dense(row(1), row(2), row(3))
LabeledPoint(row(0), features)
}
val model = SVMWithSGD.train(training)
Streaming + MLlib
// collect tweets using streaming
// train a k-means model
val model: KMmeansModel = ...
// apply model to filter tweets
val tweets = TwitterUtils.createStream(ssc, Some(authorizations(0)))
val statuses = tweets.map(_.getText)
val filteredTweets =
statuses.filter(t => model.predict(featurize(t)) == clusterNumber)
// print tweets within this particular cluster
filteredTweets.print()
GraphX + MLlib
// assemble link graph
val graph = Graph(pages, links)
val pageRank: RDD[(Long, Double)] = graph.staticPageRank(10).vertices
// load page labels (spam or not) and content features
val labelAndFeatures: RDD[(Long, (Double, Seq((Int, Double)))] = ...
val training: RDD[LabeledPoint] =
labelAndFeatures.join(pageRank).map {
case (id, ((label, features), pageRank)) =>
LabeledPoint(label, Vectors.sparse(features ++ (1000, pageRank))
}
// train a spam detector using logistic regression
val model = LogisticRegressionWithSGD.train(training)
Why MLlib?
• Built on Spark’s lightweight yet powerful APIs.
• Spark’s open source community.
• Seamless integration with Spark’s other components.
24
• Comparable to or even better than other libraries
specialized in large-scale machine learning.
Logistic regression
25
Logistic regression - weak scaling
• Full dataset: 200K images, 160K dense features.
• Similar weak scaling.
• MLlib within a factor of 2 of VW’s wall-clock time.
MLlib
MLlib
26
Logistic regression - strong scaling
• Fixed Dataset: 50K images, 160K dense features.
• MLlib exhibits better scaling properties.
• MLlib is faster than VW with 16 and 32 machines.
MLlib
MLlib
27
Collaborative filtering
28
Collaborative filtering
• Recover a rating matrix from a
subset of its entries.?
?
?
?
?
29
30
Alternating least squares
ALS - wall-clock time
• Dataset: scaled version of Netflix data (9X in size).
• Cluster: 9 machines.
• MLlib is an order of magnitude faster than Mahout.
• MLlib is within factor of 2 of GraphLab.
System Wall-clock time (seconds)
MATLAB 15443
Mahout 4206
GraphLab 291
MLlib 481
31
Implementation of ALS
• broadcast everything
• data parallel
• fully parallel
32
Broadcast everything
• Master loads (small)
data file and initializes
models.
• Master broadcasts
data and initial models.
• At each iteration,
updated models are
broadcast again.
• Works OK for small
data.
• Lots of communication
overhead - doesn’t
scale well.
• Ships with Spark
ExamplesMaster
Workers
Ratings
Movie
Factors
User
Factors
Data parallel
• Workers load data
• Master broadcasts
initial models
• At each iteration,
updated models are
broadcast again
• Much better scaling
• Works on large
datasets
• Works well for smaller
models. (low K)
Master
Workers
Ratings
Movie
Factors
User
Factors
Ratings
Ratings
Ratings
Fully parallel
• Workers load data
• Models are
instantiated at workers.
• At each iteration,
models are shared via
join between workers.
• Much better scalability.
• Works on large
datasets
• Works on big models
(higher K)
Master
Workers
RatingsMovie
FactorsUser
Factors
RatingsMovie
FactorsUser
Factors
RatingsMovie
FactorsUser
Factors
RatingsMovie
FactorsUser
Factors
Implementation of ALS
• broadcast everything
• data parallel
• fully parallel
• block-wise parallel
• Users/products are partitioned into blocks and join is
based on blocks instead of individual user/product.
36
New features in v1.0
• Sparse data support
• Classification and regression tree (CART)
• Tall-and-skinny SVD and PCA
• L-BFGS
• Binary classification evaluation
37
Contributors
Ameet Talwalkar, Andrew Tulloch, Chen Chao, Nan Zhu, DB Tsai, Evan
Sparks, Frank Dai, Ginger Smith, Henry Saputra, Holden Karau, Hossein
Falaki, Jey Kottalam, Cheng Lian, Marek Kolodziej, Mark Hamstra,
Martin Jaggi, Martin Weindel, Matei Zaharia, Nick Pentreath, Patrick
Wendell, Prashant Sharma, Reynold Xin, Reza Zadeh, Sandy Ryza,
Sean Owen, Shivaram Venkataraman, Tor Myklebust, Xiangrui Meng,
Xinghao Pan, Xusen Yin, Jerry Shao, Ryan LeCompte
38
Interested?
• Website: http://spark.apache.org
• Tutorials: http://ampcamp.berkeley.edu
• Spark Summit: http://spark-summit.org
• Github: https://github.com/apache/spark
• Mailing lists: user@spark.apache.org
dev@spark.apache.org
39

Weitere ähnliche Inhalte

Was ist angesagt?

Unsupervised Learning with Apache Spark
Unsupervised Learning with Apache SparkUnsupervised Learning with Apache Spark
Unsupervised Learning with Apache SparkDB Tsai
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with SparkMd. Mahedi Kaysar
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...Spark Summit
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013MLconf
 
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...CloudxLab
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsJen Aman
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLFlink Forward
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkDatabricks
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16MLconf
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerSpark Summit
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Spark Summit
 
Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Databricks
 
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellDatabricks
 
Introduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Josh Patterson MLconf slides
Josh Patterson MLconf slidesJosh Patterson MLconf slides
Josh Patterson MLconf slidesMLconf
 
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...Spark Summit
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David SzakallasDatabricks
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Libraryjeykottalam
 

Was ist angesagt? (20)

Unsupervised Learning with Apache Spark
Unsupervised Learning with Apache SparkUnsupervised Learning with Apache Spark
Unsupervised Learning with Apache Spark
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013
 
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C...
 
CS267_Graph_Lab
CS267_Graph_LabCS267_Graph_Lab
CS267_Graph_Lab
 
Om nom nom nom
Om nom nom nomOm nom nom nom
Om nom nom nom
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable Statistics
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache Spark
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
 
Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...
 
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
 
Introduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Scala | Big Data Hadoop Spark Tutorial | CloudxLab
 
Josh Patterson MLconf slides
Josh Patterson MLconf slidesJosh Patterson MLconf slides
Josh Patterson MLconf slides
 
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Library
 

Andere mochten auch

Presentation2 2
Presentation2 2Presentation2 2
Presentation2 2bobird
 
Cloud Based Digital Signage Framework
Cloud Based Digital Signage FrameworkCloud Based Digital Signage Framework
Cloud Based Digital Signage FrameworktruDigital Signage
 
How to book_transfer_quick_engl
How to book_transfer_quick_englHow to book_transfer_quick_engl
How to book_transfer_quick_englIlya Balakhnichev
 
2016.12.01 CP LaPrimaire.org créé 1er parti politique indépendant et éphémère
2016.12.01 CP LaPrimaire.org créé 1er parti politique indépendant et éphémère2016.12.01 CP LaPrimaire.org créé 1er parti politique indépendant et éphémère
2016.12.01 CP LaPrimaire.org créé 1er parti politique indépendant et éphémèreLaPrimaire.org
 
MLconf NYC Justin Basilico
MLconf NYC Justin BasilicoMLconf NYC Justin Basilico
MLconf NYC Justin BasilicoMLconf
 
Aplicações financeiras em períodos turbulentos
Aplicações financeiras em períodos turbulentosAplicações financeiras em períodos turbulentos
Aplicações financeiras em períodos turbulentosFundação Dom Cabral - FDC
 
Bifiform in News & Social Media - Analytics Research
Bifiform in News & Social Media - Analytics ResearchBifiform in News & Social Media - Analytics Research
Bifiform in News & Social Media - Analytics ResearchSemanticForce
 
Inovação e excelência operacional características da empresa ambidestra
Inovação e excelência operacional características da empresa ambidestraInovação e excelência operacional características da empresa ambidestra
Inovação e excelência operacional características da empresa ambidestraFundação Dom Cabral - FDC
 
BigML Fall 2016 Release
BigML Fall 2016 ReleaseBigML Fall 2016 Release
BigML Fall 2016 ReleaseBigML, Inc
 
Gett, Uber & Yandex Taxi в социальных медиа
Gett, Uber & Yandex Taxi в социальных медиаGett, Uber & Yandex Taxi в социальных медиа
Gett, Uber & Yandex Taxi в социальных медиаSemanticForce
 
Why User Experience?
Why User Experience?Why User Experience?
Why User Experience?Interakt
 
Vasona Networks @ Telco Vision 2013
Vasona Networks @ Telco Vision 2013Vasona Networks @ Telco Vision 2013
Vasona Networks @ Telco Vision 2013vasonanetworks
 
From Quantified Self to Quantified Surgery
From Quantified Self to Quantified SurgeryFrom Quantified Self to Quantified Surgery
From Quantified Self to Quantified SurgeryLarry Smarr
 
GeoVantage Agriculture Imagery
GeoVantage Agriculture ImageryGeoVantage Agriculture Imagery
GeoVantage Agriculture ImageryMatthew Herring
 
Inside3DPrinting_johnhornick
Inside3DPrinting_johnhornickInside3DPrinting_johnhornick
Inside3DPrinting_johnhornickMediabistro
 
JSONModel Lightning Talk
JSONModel Lightning TalkJSONModel Lightning Talk
JSONModel Lightning TalkMarin Todorov
 
How Ecosystem Economics™ Predicts the Winners in the Digital Age
How Ecosystem Economics™ Predicts the Winners in the Digital AgeHow Ecosystem Economics™ Predicts the Winners in the Digital Age
How Ecosystem Economics™ Predicts the Winners in the Digital AgeJulie Meyer
 

Andere mochten auch (20)

Presentation2 2
Presentation2 2Presentation2 2
Presentation2 2
 
Cloud Based Digital Signage Framework
Cloud Based Digital Signage FrameworkCloud Based Digital Signage Framework
Cloud Based Digital Signage Framework
 
How to book_transfer_quick_engl
How to book_transfer_quick_englHow to book_transfer_quick_engl
How to book_transfer_quick_engl
 
2016.12.01 CP LaPrimaire.org créé 1er parti politique indépendant et éphémère
2016.12.01 CP LaPrimaire.org créé 1er parti politique indépendant et éphémère2016.12.01 CP LaPrimaire.org créé 1er parti politique indépendant et éphémère
2016.12.01 CP LaPrimaire.org créé 1er parti politique indépendant et éphémère
 
MLconf NYC Justin Basilico
MLconf NYC Justin BasilicoMLconf NYC Justin Basilico
MLconf NYC Justin Basilico
 
Aplicações financeiras em períodos turbulentos
Aplicações financeiras em períodos turbulentosAplicações financeiras em períodos turbulentos
Aplicações financeiras em períodos turbulentos
 
Bifiform in News & Social Media - Analytics Research
Bifiform in News & Social Media - Analytics ResearchBifiform in News & Social Media - Analytics Research
Bifiform in News & Social Media - Analytics Research
 
Inovação e excelência operacional características da empresa ambidestra
Inovação e excelência operacional características da empresa ambidestraInovação e excelência operacional características da empresa ambidestra
Inovação e excelência operacional características da empresa ambidestra
 
BigML Fall 2016 Release
BigML Fall 2016 ReleaseBigML Fall 2016 Release
BigML Fall 2016 Release
 
6 dicas de livros para uma nova liderança
6 dicas de livros para uma nova liderança6 dicas de livros para uma nova liderança
6 dicas de livros para uma nova liderança
 
Relatorio Global de Competitividade 2016
Relatorio Global de Competitividade 2016Relatorio Global de Competitividade 2016
Relatorio Global de Competitividade 2016
 
Gett, Uber & Yandex Taxi в социальных медиа
Gett, Uber & Yandex Taxi в социальных медиаGett, Uber & Yandex Taxi в социальных медиа
Gett, Uber & Yandex Taxi в социальных медиа
 
Why User Experience?
Why User Experience?Why User Experience?
Why User Experience?
 
Cyber safety 101
Cyber safety 101Cyber safety 101
Cyber safety 101
 
Vasona Networks @ Telco Vision 2013
Vasona Networks @ Telco Vision 2013Vasona Networks @ Telco Vision 2013
Vasona Networks @ Telco Vision 2013
 
From Quantified Self to Quantified Surgery
From Quantified Self to Quantified SurgeryFrom Quantified Self to Quantified Surgery
From Quantified Self to Quantified Surgery
 
GeoVantage Agriculture Imagery
GeoVantage Agriculture ImageryGeoVantage Agriculture Imagery
GeoVantage Agriculture Imagery
 
Inside3DPrinting_johnhornick
Inside3DPrinting_johnhornickInside3DPrinting_johnhornick
Inside3DPrinting_johnhornick
 
JSONModel Lightning Talk
JSONModel Lightning TalkJSONModel Lightning Talk
JSONModel Lightning Talk
 
How Ecosystem Economics™ Predicts the Winners in the Digital Age
How Ecosystem Economics™ Predicts the Winners in the Digital AgeHow Ecosystem Economics™ Predicts the Winners in the Digital Age
How Ecosystem Economics™ Predicts the Winners in the Digital Age
 

Ähnlich wie MLconf NYC Xiangrui Meng

Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondDataWorks Summit
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondXiangrui Meng
 
MLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedMLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedChao Chen
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache SparkCloudera, Inc.
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Spark Summit
 
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)Alexander Ulanov
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streamingAdam Doyle
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalArvind Surve
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalArvind Surve
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaGoDataDriven
 
Device status anomaly detection
Device status anomaly detectionDevice status anomaly detection
Device status anomaly detectionDavid Tung
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Yao Yao
 
2014.06.24.what is ubix
2014.06.24.what is ubix2014.06.24.what is ubix
2014.06.24.what is ubixJim Cooley
 
maxbox starter60 machine learning
maxbox starter60 machine learningmaxbox starter60 machine learning
maxbox starter60 machine learningMax Kleiner
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLMLconf
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 

Ähnlich wie MLconf NYC Xiangrui Meng (20)

Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
 
MLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedMLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reduced
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
 
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streaming
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
 
Device status anomaly detection
Device status anomaly detectionDevice status anomaly detection
Device status anomaly detection
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
 
Big data analytics_beyond_hadoop_public_18_july_2013
Big data analytics_beyond_hadoop_public_18_july_2013Big data analytics_beyond_hadoop_public_18_july_2013
Big data analytics_beyond_hadoop_public_18_july_2013
 
2014.06.24.what is ubix
2014.06.24.what is ubix2014.06.24.what is ubix
2014.06.24.what is ubix
 
maxbox starter60 machine learning
maxbox starter60 machine learningmaxbox starter60 machine learning
maxbox starter60 machine learning
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 

Mehr von MLconf

Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...MLconf
 
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingMLconf
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...MLconf
 
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushIgor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushMLconf
 
Josh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceJosh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceMLconf
 
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...MLconf
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...MLconf
 
Meghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMeghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMLconf
 
Noam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data CollectionNoam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data CollectionMLconf
 
June Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLJune Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLMLconf
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksMLconf
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...MLconf
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldMLconf
 
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...MLconf
 
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...MLconf
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...MLconf
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeMLconf
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...MLconf
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareMLconf
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesMLconf
 

Mehr von MLconf (20)

Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
 
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
 
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushIgor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
 
Josh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceJosh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious Experience
 
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
 
Meghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMeghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the Cheap
 
Noam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data CollectionNoam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data Collection
 
June Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLJune Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of ML
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI World
 
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
 
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to code
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better Software
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime Changes
 

Kürzlich hochgeladen

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Kürzlich hochgeladen (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

MLconf NYC Xiangrui Meng

  • 1. MLlib: Scalable Machine Learning on Spark Xiangrui Meng 1 Collaborators: Ameet Talwalkar, Evan Sparks, Virginia Smith, Xinghao Pan, Shivaram Venkataraman, Matei Zaharia, Rean Griffith, John Duchi, Joseph Gonzalez, Michael Franklin, Michael I. Jordan, Tim Kraska, etc.
  • 3. What is MLlib? MLlib is a Spark subproject providing machine learning primitives: • initial contribution from AMPLab, UC Berkeley • shipped with Spark since version 0.8 • 33 contributors (March 2014) 3
  • 4. What is MLlib? Algorithms: • classification: logistic regression, linear support vector machine (SVM), naive Bayes • regression: generalized linear regression (GLM) • collaborative filtering: alternating least squares (ALS) • clustering: k-means • decomposition: singular value decomposition (SVD), principal component analysis (PCA) 4
  • 6. scikit-learn? Algorithms: • classification: SVM, nearest neighbors, random forest, … • regression: support vector regression (SVR), ridge regression, Lasso, logistic regression, … • clustering: k-means, spectral clustering, … • decomposition: PCA, non-negative matrix factorization (NMF), independent component analysis (ICA), … 6
  • 7. Mahout? Algorithms: • classification: logistic regression, naive Bayes, random forest, … • collaborative filtering: ALS, … • clustering: k-means, fuzzy k-means, … • decomposition: SVD, randomized SVD, … 7
  • 10. • It is built on Apache Spark, a fast and general engine for large-scale data processing. • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. • Write applications quickly in Java, Scala, or Python. Why MLlib? 10
  • 11. Word count val counts = sc.textFile("hdfs://...") .flatMap(line => line.split(" ")) .map(word => (word, 1L)) .reduceByKey(_ + _)
  • 12. Gradient descent val points = spark.textFile(...).map(parsePoint).cache() var w = Vector.zeros(d) for (i <- 1 to numIterations) { val gradient = points.map { p => (1 / (1 + exp(-p.y * w.dot(p.x)) - 1) * p.y * p.x ).reduce(_ + _) w -= alpha * gradient } 12
  • 13. // Load and parse the data. val data = sc.textFile("kmeans_data.txt") val parsedData = data.map(_.split(‘ ').map(_.toDouble)).cache() // Cluster the data into two classes using KMeans. val clusters = KMeans.train(parsedData, 2, numIterations = 20) // Compute the sum of squared errors. val cost = clusters.computeCost(parsedData) println("Sum of squared errors = " + cost) k-means (scala) 13
  • 14. k-means (python) # Load and parse the data data = sc.textFile("kmeans_data.txt") parsedData = data.map(lambda line: array([float(x) for x in line.split(' ‘)])).cache() # Build the model (cluster the data) clusters = KMeans.train(parsedData, 2, maxIterations = 10, runs = 1, initialization_mode = "kmeans||") # Evaluate clustering by computing the sum of squared errors def error(point): center = clusters.centers[clusters.predict(point)] return sqrt(sum([x**2 for x in (point - center)])) cost = parsedData.map(lambda point: error(point)) .reduce(lambda x, y: x + y) print("Sum of squared error = " + str(cost)) 14
  • 15. Dimension reduction + k-means // compute principal components val points: RDD[Vector] = ... val mat = RowMatrix(points) val pc = mat.computePrincipalComponents(20) // project points to a low-dimensional space val projected = mat.multiply(pc).rows // train a k-means model on the projected data val model = KMeans.train(projected, 10)
  • 16. Why MLlib? • It ships with Spark as a standard component. 16
  • 17. Spark community One of the largest open source projects in big data: • 170+ developers contributing • 30+ companies contributing • 400+ discussions per month on the mailing list
  • 18. Out for dinner? • Search for a restaurant and make a reservation. • Start navigation. • Food looks good? Take a photo and share. 18
  • 19. Why smartphone? Out for dinner? • Search for a restaurant and make a reservation. (Yellow Pages?) • Start navigation. (GPS?) • Food looks good? Take a photo and share. (Camera?) 19
  • 20. Why MLlib? A special-purpose device may be better at one aspect than a general-purpose device. But the cost of context switching is high: • different languages or APIs • different data formats • different tuning tricks 20
  • 21. Spark SQL + MLlib // Data can easily be extracted from existing sources, // such as Apache Hive. val trainingTable = sql(""" SELECT e.action, u.age, u.latitude, u.longitude FROM Users u JOIN Events e ON u.userId = e.userId""") // Since `sql` returns an RDD, the results of the above // query can be easily used in MLlib. val training = trainingTable.map { row => val features = Vectors.dense(row(1), row(2), row(3)) LabeledPoint(row(0), features) } val model = SVMWithSGD.train(training)
  • 22. Streaming + MLlib // collect tweets using streaming // train a k-means model val model: KMmeansModel = ... // apply model to filter tweets val tweets = TwitterUtils.createStream(ssc, Some(authorizations(0))) val statuses = tweets.map(_.getText) val filteredTweets = statuses.filter(t => model.predict(featurize(t)) == clusterNumber) // print tweets within this particular cluster filteredTweets.print()
  • 23. GraphX + MLlib // assemble link graph val graph = Graph(pages, links) val pageRank: RDD[(Long, Double)] = graph.staticPageRank(10).vertices // load page labels (spam or not) and content features val labelAndFeatures: RDD[(Long, (Double, Seq((Int, Double)))] = ... val training: RDD[LabeledPoint] = labelAndFeatures.join(pageRank).map { case (id, ((label, features), pageRank)) => LabeledPoint(label, Vectors.sparse(features ++ (1000, pageRank)) } // train a spam detector using logistic regression val model = LogisticRegressionWithSGD.train(training)
  • 24. Why MLlib? • Built on Spark’s lightweight yet powerful APIs. • Spark’s open source community. • Seamless integration with Spark’s other components. 24 • Comparable to or even better than other libraries specialized in large-scale machine learning.
  • 26. Logistic regression - weak scaling • Full dataset: 200K images, 160K dense features. • Similar weak scaling. • MLlib within a factor of 2 of VW’s wall-clock time. MLlib MLlib 26
  • 27. Logistic regression - strong scaling • Fixed Dataset: 50K images, 160K dense features. • MLlib exhibits better scaling properties. • MLlib is faster than VW with 16 and 32 machines. MLlib MLlib 27
  • 29. Collaborative filtering • Recover a rating matrix from a subset of its entries.? ? ? ? ? 29
  • 31. ALS - wall-clock time • Dataset: scaled version of Netflix data (9X in size). • Cluster: 9 machines. • MLlib is an order of magnitude faster than Mahout. • MLlib is within factor of 2 of GraphLab. System Wall-clock time (seconds) MATLAB 15443 Mahout 4206 GraphLab 291 MLlib 481 31
  • 32. Implementation of ALS • broadcast everything • data parallel • fully parallel 32
  • 33. Broadcast everything • Master loads (small) data file and initializes models. • Master broadcasts data and initial models. • At each iteration, updated models are broadcast again. • Works OK for small data. • Lots of communication overhead - doesn’t scale well. • Ships with Spark ExamplesMaster Workers Ratings Movie Factors User Factors
  • 34. Data parallel • Workers load data • Master broadcasts initial models • At each iteration, updated models are broadcast again • Much better scaling • Works on large datasets • Works well for smaller models. (low K) Master Workers Ratings Movie Factors User Factors Ratings Ratings Ratings
  • 35. Fully parallel • Workers load data • Models are instantiated at workers. • At each iteration, models are shared via join between workers. • Much better scalability. • Works on large datasets • Works on big models (higher K) Master Workers RatingsMovie FactorsUser Factors RatingsMovie FactorsUser Factors RatingsMovie FactorsUser Factors RatingsMovie FactorsUser Factors
  • 36. Implementation of ALS • broadcast everything • data parallel • fully parallel • block-wise parallel • Users/products are partitioned into blocks and join is based on blocks instead of individual user/product. 36
  • 37. New features in v1.0 • Sparse data support • Classification and regression tree (CART) • Tall-and-skinny SVD and PCA • L-BFGS • Binary classification evaluation 37
  • 38. Contributors Ameet Talwalkar, Andrew Tulloch, Chen Chao, Nan Zhu, DB Tsai, Evan Sparks, Frank Dai, Ginger Smith, Henry Saputra, Holden Karau, Hossein Falaki, Jey Kottalam, Cheng Lian, Marek Kolodziej, Mark Hamstra, Martin Jaggi, Martin Weindel, Matei Zaharia, Nick Pentreath, Patrick Wendell, Prashant Sharma, Reynold Xin, Reza Zadeh, Sandy Ryza, Sean Owen, Shivaram Venkataraman, Tor Myklebust, Xiangrui Meng, Xinghao Pan, Xusen Yin, Jerry Shao, Ryan LeCompte 38
  • 39. Interested? • Website: http://spark.apache.org • Tutorials: http://ampcamp.berkeley.edu • Spark Summit: http://spark-summit.org • Github: https://github.com/apache/spark • Mailing lists: user@spark.apache.org dev@spark.apache.org 39