SlideShare ist ein Scribd-Unternehmen logo
1 von 36
Downloaden Sie, um offline zu lesen
Building Machine
Learning Applications
with Sparkling Water
NYC Big Data Science Meetup
Michal Malohlava and Alex Tellez and H2O.ai
Who am I?
Background
PhD in CS from Charles University in Prague, 2012
1 year PostDoc at Purdue University experimenting with
algos for large-scale computation
2 years at H2O.ai helping to develop H2O engine for big
data computation and analysis
Experience with domain-specific languages,
distributed system, software engineering, and big
data.
TBD
Head of Sales
Distributed
Systems
Engineers
Making

ML Scale!
Team@H2O.ai
Scalable 

Machine Learning
For Smarter
Applications
Smarter Applications
Scalable Applications
Distributed
Easy to experiment
Able to process huge data from different
sources
Powerful machine learning engine inside
BUT
how to build
them?
Build an application
with …
?
…with Spark and H2O
Open-source distributed execution platform
User-friendly API for data transformation based on RDD
Platform components - SQL, MLLib, text mining
Multitenancy
Large and active community
Open-source scalable machine
learning platform
Tuned for efficient computation
and memory use
Mature machine learning
algorithms
R, Python, Java, Scala APIs
Interactive UI
Ensembles
Deep Neural Networks
• Generalized Linear Models : Binomial, Gaussian,
Gamma, Poisson and Tweedie
• Cox Proportional Hazards Models
• Naïve Bayes
• Distributed Random Forest : Classification or
regression models
• Gradient Boosting Machine : Produces an
ensemble of decision trees with increasing refined
approximations
• Deep learning : Create multi-layer feed forward
neural networks starting with an input layer
followed by multiple layers of nonlinear
transformations
Statistical Analysis
Dimensionality
Reduction
Anomaly
Detection
• K-means : Partitions observations into k
clusters/groups of the same spatial size
• Principal Component Analysis : Linearly
transforms correlated variables to
independent components
• Autoencoders: Find outliers using a
nonlinear dimensionality reduction using
deep learning
Clustering
Supervised
Learning
Unsupervised
Learning
Sparkling Water
Provides
Transparent integration of H2O with Spark ecosystem
Transparent use of H2O data structures and
algorithms with Spark API
Platform to build Smarter Applications
Excels in existing Spark workflows requiring
advanced Machine Learning algorithms
Sparkling Water Design
spark-submit
Spark
Master
JVM
Spark
Worker
JVM
Spark
Worker
JVM
Spark
Worker
JVM
Sparkling Water Cluster
Spark
Executor
JVM
H2O
Spark
Executor
JVM
H2O
Spark
Executor
JVM
H2O
Sparkling
App
implements
?
Contains application
and Sparkling Water
classes
Data Distribution
H2O
H2O
H2O
Sparkling Water Cluster
Spark Executor JVM
Data
Source
(e.g.
HDFS)
H2O
RDD
Spark Executor JVM
Spark Executor JVM
Spark
RDD
RDDs and DataFrames
share same memory
space
Development Internals
Sparkling Water Assembly
H2O
Core
H2O
Algos
H2O
Scala
API
H2O
Flow
Sparkling Water Core
Spark Platform
Spark
Core
Spark
SQL
Application
Code+
Assembly is deployed
to Spark cluster as regular
Spark application
Lets build
an application !
OR
Detect spam text messages
Data example
case class SMS(target: String, fv: Vector)
ML Workflow
1. Extract data
2. Transform, tokenize messages
3. Build Tf-IDF
4. Create and evaluate 

Deep Learning model
5. Use the model
Goal: For a given text message identify if
it is spam or not
Application
environment
Lego #1: Data load
// Data load

def load(dataFile: String): RDD[Array[String]] = {

sc.textFile(dataFile).map(l => l.split(“t"))
.filter(r => !r(0).isEmpty)

}
Lego #2: Ad-hoc
Tokenization
def tokenize(data: RDD[String]): RDD[Seq[String]] = {

val ignoredWords = Seq("the", “a", …)

val ignoredChars = Seq(',', ‘:’, …)



val texts = data.map( r => {

var smsText = r.toLowerCase

for( c <- ignoredChars) {

smsText = smsText.replace(c, ' ')

}



val words =smsText.split(" ").filter(w =>
!ignoredWords.contains(w) && w.length>2).distinct

words.toSeq

})

texts

}
Lego #3: Tf-IDF
def buildIDFModel(tokens: RDD[Seq[String]],

minDocFreq:Int = 4,

hashSpaceSize:Int = 1 << 10):
(HashingTF, IDFModel, RDD[Vector]) = {

// Hash strings into the given space

val hashingTF = new HashingTF(hashSpaceSize)

val tf = hashingTF.transform(tokens)

// Build term frequency-inverse document frequency

val idfModel = new IDF(minDocFreq=minDocFreq).fit(tf)

val expandedText = idfModel.transform(tf)

(hashingTF, idfModel, expandedText)

}
Hash words

into large 

space
Term freq scale
“Thank for the order…”
[…,0,3.5,0,1,0,0.3,0,1.3,0,0,…]
Thank Order
Lego #4: Build a model
def buildDLModel(train: Frame, valid: Frame,

epochs: Int = 10, l1: Double = 0.001, l2: Double = 0.0,

hidden: Array[Int] = Array[Int](200, 200))

(implicit h2oContext: H2OContext): DeepLearningModel = {

import h2oContext._

// Build a model

val dlParams = new DeepLearningParameters()

dlParams._destination_key = Key.make("dlModel.hex").asInstanceOf[Key[Frame]]

dlParams._train = train

dlParams._valid = valid

dlParams._response_column = 'target

dlParams._epochs = epochs

dlParams._l1 = l1

dlParams._hidden = hidden



// Create a job

val dl = new DeepLearning(dlParams)

val dlModel = dl.trainModel.get



// Compute metrics on both datasets

dlModel.score(train).delete()

dlModel.score(valid).delete()



dlModel

}
Deep Learning: Create
multi-layer feed forward
neural networks starting
with an input layer
followed by multiple
l a y e r s o f n o n l i n e a r
transformations
Assembly application
// Data load

val data = load(DATAFILE)

// Extract response spam or ham

val hamSpam = data.map( r => r(0))

val message = data.map( r => r(1))

// Tokenize message content

val tokens = tokenize(message)



// Build IDF model

var (hashingTF, idfModel, tfidf) = buildIDFModel(tokens)



// Merge response with extracted vectors

val resultRDD: SchemaRDD = hamSpam.zip(tfidf).map(v => SMS(v._1, v._2))

val table:DataFrame = resultRDD



// Split table

val keys = Array[String]("train.hex", "valid.hex")

val ratios = Array[Double](0.8)

val frs = split(table, keys, ratios)

val (train, valid) = (frs(0), frs(1))

table.delete()



// Build a model

val dlModel = buildDLModel(train, valid)
Split dataset
Build model
Data munging
Data exploration
Model evaluation
val trainMetrics = binomialMM(dlModel, train)

val validMetrics = binomialMM(dlModel, valid)
Collect model 

metrics
Spam predictor
def isSpam(msg: String,

dlModel: DeepLearningModel,

hashingTF: HashingTF,

idfModel: IDFModel,

hamThreshold: Double = 0.5):Boolean = {

val msgRdd = sc.parallelize(Seq(msg))

val msgVector: SchemaRDD = idfModel.transform(

hashingTF.transform (

tokenize (msgRdd)))
.map(v => SMS("?", v))

val msgTable: DataFrame = msgVector

msgTable.remove(0) // remove first column

val prediction = dlModel.score(msgTable)

prediction.vecs()(1).at(0) < hamThreshold

}
Prepared models
Default decision
threshold
Scoring
Predict spam
isSpam("Michal, beer
tonight in MV?")
isSpam("We tried to contact
you re your reply
to our offer of a Video
Handset? 750
anytime any networks mins?
UNLIMITED TEXT?")
Interactions with
application from R
Where is the code?
https://github.com/h2oai/sparkling-water/
blob/master/examples/scripts/
Sparkling Water Download
http://h2o.ai/download/
http://h2o-release.s3.amazonaws.com/
sparkling-water/master/91/index.html
Checkout H2O.ai Training Books
http://learn.h2o.ai/

Checkout H2O.ai Blog
http://h2o.ai/blog/

Checkout H2O.ai Youtube Channel
https://www.youtube.com/user/0xdata

Checkout GitHub
https://github.com/h2oai/sparkling-water
Meetups
https://meetup.com/
More info
Learn more at h2o.ai
Follow us at @h2oai
Thank you!
Sparkling Water is
open-source

ML application platform
combining

power of Spark and H2O

Weitere ähnliche Inhalte

Was ist angesagt?

End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
Frustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFramesFrustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFramesIlya Ganelin
 
Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014Sri Ambati
 
Distributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupDistributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupSri Ambati
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in RAnqi Fu
 
Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Haoyuan Li
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsDatabricks
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Spark Summit
 
H2O Big Data Environments
H2O Big Data EnvironmentsH2O Big Data Environments
H2O Big Data EnvironmentsSri Ambati
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on HadoopMapR Technologies
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerSpark Summit
 
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyData
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Akhil Das
 

Was ist angesagt? (20)

End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
Frustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFramesFrustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFrames
 
Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014
 
Distributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupDistributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta Meetup
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
 
Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
H2O Big Data Environments
H2O Big Data EnvironmentsH2O Big Data Environments
H2O Big Data Environments
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Spark and shark
Spark and sharkSpark and shark
Spark and shark
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)
 

Andere mochten auch

Sparkling Water 2.0 - Michal Malohlava
Sparkling Water 2.0 - Michal MalohlavaSparkling Water 2.0 - Michal Malohlava
Sparkling Water 2.0 - Michal MalohlavaSri Ambati
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Sri Ambati
 
Transform your Business with AI, Deep Learning and Machine Learning
Transform your Business with AI, Deep Learning and Machine LearningTransform your Business with AI, Deep Learning and Machine Learning
Transform your Business with AI, Deep Learning and Machine LearningSri Ambati
 
H2O Machine Learning and Kalman Filters for Machine Prognostics
H2O Machine Learning and Kalman Filters for Machine PrognosticsH2O Machine Learning and Kalman Filters for Machine Prognostics
H2O Machine Learning and Kalman Filters for Machine PrognosticsSri Ambati
 
Intro to H2O in Python - Data Science LA
Intro to H2O in Python - Data Science LAIntro to H2O in Python - Data Science LA
Intro to H2O in Python - Data Science LASri Ambati
 
Hadoop cluster os_tuning_v1.0_20170106_mobile
Hadoop cluster os_tuning_v1.0_20170106_mobileHadoop cluster os_tuning_v1.0_20170106_mobile
Hadoop cluster os_tuning_v1.0_20170106_mobile상연 최
 
Scalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2OScalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2Oodsc
 
Sparkling Water
Sparkling WaterSparkling Water
Sparkling Waterh2oworld
 
Introduction to Sparkling Water - Spark Summit East 2016
Introduction to Sparkling Water - Spark Summit East 2016Introduction to Sparkling Water - Spark Summit East 2016
Introduction to Sparkling Water - Spark Summit East 2016Sri Ambati
 
H2O World - ML Could Solve NLP Challenges: Ontology Management - Erik Huddleston
H2O World - ML Could Solve NLP Challenges: Ontology Management - Erik HuddlestonH2O World - ML Could Solve NLP Challenges: Ontology Management - Erik Huddleston
H2O World - ML Could Solve NLP Challenges: Ontology Management - Erik HuddlestonSri Ambati
 
H2O World - What Do Companies Need to do to Stay Ahead - Michael Marks
H2O World - What Do Companies Need to do to Stay Ahead - Michael MarksH2O World - What Do Companies Need to do to Stay Ahead - Michael Marks
H2O World - What Do Companies Need to do to Stay Ahead - Michael MarksSri Ambati
 
H2O World - PySparkling Water - Nidhi Mehta
H2O World - PySparkling Water - Nidhi MehtaH2O World - PySparkling Water - Nidhi Mehta
H2O World - PySparkling Water - Nidhi MehtaSri Ambati
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...Sri Ambati
 
Data Science, Machine Learning, and H2O
Data Science, Machine Learning, and H2OData Science, Machine Learning, and H2O
Data Science, Machine Learning, and H2OSri Ambati
 
H2O World - Collaborative, Reproducible Research with H2O - Nick Elprin
H2O World - Collaborative, Reproducible Research with H2O - Nick ElprinH2O World - Collaborative, Reproducible Research with H2O - Nick Elprin
H2O World - Collaborative, Reproducible Research with H2O - Nick ElprinSri Ambati
 
H2O World - Migrating from Proprietary Analytics Software - Fonda Ingram
H2O World - Migrating from Proprietary Analytics Software - Fonda IngramH2O World - Migrating from Proprietary Analytics Software - Fonda Ingram
H2O World - Migrating from Proprietary Analytics Software - Fonda IngramSri Ambati
 
Applied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopApplied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopAvkash Chauhan
 
MLconf - Distributed Deep Learning for Classification and Regression Problems...
MLconf - Distributed Deep Learning for Classification and Regression Problems...MLconf - Distributed Deep Learning for Classification and Regression Problems...
MLconf - Distributed Deep Learning for Classification and Regression Problems...Sri Ambati
 
Data & Data Alliances - Scott Mclellan
Data & Data Alliances - Scott MclellanData & Data Alliances - Scott Mclellan
Data & Data Alliances - Scott MclellanSri Ambati
 
2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_onSri Ambati
 

Andere mochten auch (20)

Sparkling Water 2.0 - Michal Malohlava
Sparkling Water 2.0 - Michal MalohlavaSparkling Water 2.0 - Michal Malohlava
Sparkling Water 2.0 - Michal Malohlava
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015
 
Transform your Business with AI, Deep Learning and Machine Learning
Transform your Business with AI, Deep Learning and Machine LearningTransform your Business with AI, Deep Learning and Machine Learning
Transform your Business with AI, Deep Learning and Machine Learning
 
H2O Machine Learning and Kalman Filters for Machine Prognostics
H2O Machine Learning and Kalman Filters for Machine PrognosticsH2O Machine Learning and Kalman Filters for Machine Prognostics
H2O Machine Learning and Kalman Filters for Machine Prognostics
 
Intro to H2O in Python - Data Science LA
Intro to H2O in Python - Data Science LAIntro to H2O in Python - Data Science LA
Intro to H2O in Python - Data Science LA
 
Hadoop cluster os_tuning_v1.0_20170106_mobile
Hadoop cluster os_tuning_v1.0_20170106_mobileHadoop cluster os_tuning_v1.0_20170106_mobile
Hadoop cluster os_tuning_v1.0_20170106_mobile
 
Scalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2OScalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2O
 
Sparkling Water
Sparkling WaterSparkling Water
Sparkling Water
 
Introduction to Sparkling Water - Spark Summit East 2016
Introduction to Sparkling Water - Spark Summit East 2016Introduction to Sparkling Water - Spark Summit East 2016
Introduction to Sparkling Water - Spark Summit East 2016
 
H2O World - ML Could Solve NLP Challenges: Ontology Management - Erik Huddleston
H2O World - ML Could Solve NLP Challenges: Ontology Management - Erik HuddlestonH2O World - ML Could Solve NLP Challenges: Ontology Management - Erik Huddleston
H2O World - ML Could Solve NLP Challenges: Ontology Management - Erik Huddleston
 
H2O World - What Do Companies Need to do to Stay Ahead - Michael Marks
H2O World - What Do Companies Need to do to Stay Ahead - Michael MarksH2O World - What Do Companies Need to do to Stay Ahead - Michael Marks
H2O World - What Do Companies Need to do to Stay Ahead - Michael Marks
 
H2O World - PySparkling Water - Nidhi Mehta
H2O World - PySparkling Water - Nidhi MehtaH2O World - PySparkling Water - Nidhi Mehta
H2O World - PySparkling Water - Nidhi Mehta
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
 
Data Science, Machine Learning, and H2O
Data Science, Machine Learning, and H2OData Science, Machine Learning, and H2O
Data Science, Machine Learning, and H2O
 
H2O World - Collaborative, Reproducible Research with H2O - Nick Elprin
H2O World - Collaborative, Reproducible Research with H2O - Nick ElprinH2O World - Collaborative, Reproducible Research with H2O - Nick Elprin
H2O World - Collaborative, Reproducible Research with H2O - Nick Elprin
 
H2O World - Migrating from Proprietary Analytics Software - Fonda Ingram
H2O World - Migrating from Proprietary Analytics Software - Fonda IngramH2O World - Migrating from Proprietary Analytics Software - Fonda Ingram
H2O World - Migrating from Proprietary Analytics Software - Fonda Ingram
 
Applied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopApplied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R Workshop
 
MLconf - Distributed Deep Learning for Classification and Regression Problems...
MLconf - Distributed Deep Learning for Classification and Regression Problems...MLconf - Distributed Deep Learning for Classification and Regression Problems...
MLconf - Distributed Deep Learning for Classification and Regression Problems...
 
Data & Data Alliances - Scott Mclellan
Data & Data Alliances - Scott MclellanData & Data Alliances - Scott Mclellan
Data & Data Alliances - Scott Mclellan
 
2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on
 

Ähnlich wie Building Machine Learning Applications with Sparkling Water

Sparkling Water Meetup 4.15.15
Sparkling Water Meetup 4.15.15Sparkling Water Meetup 4.15.15
Sparkling Water Meetup 4.15.15Sri Ambati
 
H2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal MalohlavaH2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal MalohlavaSri Ambati
 
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and CascadingBoulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and CascadingPaco Nathan
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Updatevithakur
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
2015 03 27_ml_conf
2015 03 27_ml_conf2015 03 27_ml_conf
2015 03 27_ml_confSri Ambati
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 Andrey Vykhodtsev
 
Nikhil Kaja Fair
Nikhil Kaja FairNikhil Kaja Fair
Nikhil Kaja Fairnikhilkaja
 
H2O Rains with Databricks Cloud - NY 02.16.16
H2O Rains with Databricks Cloud - NY 02.16.16H2O Rains with Databricks Cloud - NY 02.16.16
H2O Rains with Databricks Cloud - NY 02.16.16Sri Ambati
 
Clogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overviewClogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overviewMadhur Nawandar
 
H2O Rains with Databricks Cloud - Parisoma SF
H2O Rains with Databricks Cloud - Parisoma SFH2O Rains with Databricks Cloud - Parisoma SF
H2O Rains with Databricks Cloud - Parisoma SFSri Ambati
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkReynold Xin
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapePaco Nathan
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLAdam Muise
 
Sanath pabba hadoop resume 1.0
Sanath pabba hadoop resume 1.0Sanath pabba hadoop resume 1.0
Sanath pabba hadoop resume 1.0Pabba Gupta
 

Ähnlich wie Building Machine Learning Applications with Sparkling Water (20)

Sparkling Water Meetup 4.15.15
Sparkling Water Meetup 4.15.15Sparkling Water Meetup 4.15.15
Sparkling Water Meetup 4.15.15
 
H2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal MalohlavaH2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal Malohlava
 
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and CascadingBoulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
2015 03 27_ml_conf
2015 03 27_ml_conf2015 03 27_ml_conf
2015 03 27_ml_conf
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
VenkateshAvula
VenkateshAvulaVenkateshAvula
VenkateshAvula
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3
 
Nikhil Kaja Fair
Nikhil Kaja FairNikhil Kaja Fair
Nikhil Kaja Fair
 
H2O Rains with Databricks Cloud - NY 02.16.16
H2O Rains with Databricks Cloud - NY 02.16.16H2O Rains with Databricks Cloud - NY 02.16.16
H2O Rains with Databricks Cloud - NY 02.16.16
 
Clogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overviewClogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overview
 
H2O Rains with Databricks Cloud - Parisoma SF
H2O Rains with Databricks Cloud - Parisoma SFH2O Rains with Databricks Cloud - Parisoma SF
H2O Rains with Databricks Cloud - Parisoma SF
 
sudipto_resume
sudipto_resumesudipto_resume
sudipto_resume
 
Yu's resume
Yu's resumeYu's resume
Yu's resume
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache Spark
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscape
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETL
 
Sanath pabba hadoop resume 1.0
Sanath pabba hadoop resume 1.0Sanath pabba hadoop resume 1.0
Sanath pabba hadoop resume 1.0
 

Mehr von Sri Ambati

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxSri Ambati
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek Sri Ambati
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thSri Ambati
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionSri Ambati
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Sri Ambati
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMsSri Ambati
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the WaySri Ambati
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OSri Ambati
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Sri Ambati
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersSri Ambati
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Sri Ambati
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Sri Ambati
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...Sri Ambati
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability Sri Ambati
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email AgainSri Ambati
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Sri Ambati
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...Sri Ambati
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...Sri Ambati
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneySri Ambati
 

Mehr von Sri Ambati (20)

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptx
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5th
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the Way
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation Journey
 

Kürzlich hochgeladen

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 

Kürzlich hochgeladen (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 

Building Machine Learning Applications with Sparkling Water

  • 1. Building Machine Learning Applications with Sparkling Water NYC Big Data Science Meetup Michal Malohlava and Alex Tellez and H2O.ai
  • 2. Who am I? Background PhD in CS from Charles University in Prague, 2012 1 year PostDoc at Purdue University experimenting with algos for large-scale computation 2 years at H2O.ai helping to develop H2O engine for big data computation and analysis Experience with domain-specific languages, distributed system, software engineering, and big data.
  • 4. Scalable 
 Machine Learning For Smarter Applications
  • 6. Scalable Applications Distributed Easy to experiment Able to process huge data from different sources Powerful machine learning engine inside
  • 10. Open-source distributed execution platform User-friendly API for data transformation based on RDD Platform components - SQL, MLLib, text mining Multitenancy Large and active community
  • 11. Open-source scalable machine learning platform Tuned for efficient computation and memory use Mature machine learning algorithms R, Python, Java, Scala APIs Interactive UI
  • 12. Ensembles Deep Neural Networks • Generalized Linear Models : Binomial, Gaussian, Gamma, Poisson and Tweedie • Cox Proportional Hazards Models • Naïve Bayes • Distributed Random Forest : Classification or regression models • Gradient Boosting Machine : Produces an ensemble of decision trees with increasing refined approximations • Deep learning : Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations Statistical Analysis Dimensionality Reduction Anomaly Detection • K-means : Partitions observations into k clusters/groups of the same spatial size • Principal Component Analysis : Linearly transforms correlated variables to independent components • Autoencoders: Find outliers using a nonlinear dimensionality reduction using deep learning Clustering Supervised Learning Unsupervised Learning
  • 13.
  • 14. Sparkling Water Provides Transparent integration of H2O with Spark ecosystem Transparent use of H2O data structures and algorithms with Spark API Platform to build Smarter Applications Excels in existing Spark workflows requiring advanced Machine Learning algorithms
  • 15. Sparkling Water Design spark-submit Spark Master JVM Spark Worker JVM Spark Worker JVM Spark Worker JVM Sparkling Water Cluster Spark Executor JVM H2O Spark Executor JVM H2O Spark Executor JVM H2O Sparkling App implements ? Contains application and Sparkling Water classes
  • 16. Data Distribution H2O H2O H2O Sparkling Water Cluster Spark Executor JVM Data Source (e.g. HDFS) H2O RDD Spark Executor JVM Spark Executor JVM Spark RDD RDDs and DataFrames share same memory space
  • 17. Development Internals Sparkling Water Assembly H2O Core H2O Algos H2O Scala API H2O Flow Sparkling Water Core Spark Platform Spark Core Spark SQL Application Code+ Assembly is deployed to Spark cluster as regular Spark application
  • 20. Data example case class SMS(target: String, fv: Vector)
  • 21. ML Workflow 1. Extract data 2. Transform, tokenize messages 3. Build Tf-IDF 4. Create and evaluate 
 Deep Learning model 5. Use the model Goal: For a given text message identify if it is spam or not
  • 23. Lego #1: Data load // Data load
 def load(dataFile: String): RDD[Array[String]] = {
 sc.textFile(dataFile).map(l => l.split(“t")) .filter(r => !r(0).isEmpty)
 }
  • 24. Lego #2: Ad-hoc Tokenization def tokenize(data: RDD[String]): RDD[Seq[String]] = {
 val ignoredWords = Seq("the", “a", …)
 val ignoredChars = Seq(',', ‘:’, …)
 
 val texts = data.map( r => {
 var smsText = r.toLowerCase
 for( c <- ignoredChars) {
 smsText = smsText.replace(c, ' ')
 }
 
 val words =smsText.split(" ").filter(w => !ignoredWords.contains(w) && w.length>2).distinct
 words.toSeq
 })
 texts
 }
  • 25. Lego #3: Tf-IDF def buildIDFModel(tokens: RDD[Seq[String]],
 minDocFreq:Int = 4,
 hashSpaceSize:Int = 1 << 10): (HashingTF, IDFModel, RDD[Vector]) = {
 // Hash strings into the given space
 val hashingTF = new HashingTF(hashSpaceSize)
 val tf = hashingTF.transform(tokens)
 // Build term frequency-inverse document frequency
 val idfModel = new IDF(minDocFreq=minDocFreq).fit(tf)
 val expandedText = idfModel.transform(tf)
 (hashingTF, idfModel, expandedText)
 } Hash words
 into large 
 space Term freq scale “Thank for the order…” […,0,3.5,0,1,0,0.3,0,1.3,0,0,…] Thank Order
  • 26. Lego #4: Build a model def buildDLModel(train: Frame, valid: Frame,
 epochs: Int = 10, l1: Double = 0.001, l2: Double = 0.0,
 hidden: Array[Int] = Array[Int](200, 200))
 (implicit h2oContext: H2OContext): DeepLearningModel = {
 import h2oContext._
 // Build a model
 val dlParams = new DeepLearningParameters()
 dlParams._destination_key = Key.make("dlModel.hex").asInstanceOf[Key[Frame]]
 dlParams._train = train
 dlParams._valid = valid
 dlParams._response_column = 'target
 dlParams._epochs = epochs
 dlParams._l1 = l1
 dlParams._hidden = hidden
 
 // Create a job
 val dl = new DeepLearning(dlParams)
 val dlModel = dl.trainModel.get
 
 // Compute metrics on both datasets
 dlModel.score(train).delete()
 dlModel.score(valid).delete()
 
 dlModel
 } Deep Learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple l a y e r s o f n o n l i n e a r transformations
  • 27. Assembly application // Data load
 val data = load(DATAFILE)
 // Extract response spam or ham
 val hamSpam = data.map( r => r(0))
 val message = data.map( r => r(1))
 // Tokenize message content
 val tokens = tokenize(message)
 
 // Build IDF model
 var (hashingTF, idfModel, tfidf) = buildIDFModel(tokens)
 
 // Merge response with extracted vectors
 val resultRDD: SchemaRDD = hamSpam.zip(tfidf).map(v => SMS(v._1, v._2))
 val table:DataFrame = resultRDD
 
 // Split table
 val keys = Array[String]("train.hex", "valid.hex")
 val ratios = Array[Double](0.8)
 val frs = split(table, keys, ratios)
 val (train, valid) = (frs(0), frs(1))
 table.delete()
 
 // Build a model
 val dlModel = buildDLModel(train, valid) Split dataset Build model Data munging
  • 29. Model evaluation val trainMetrics = binomialMM(dlModel, train)
 val validMetrics = binomialMM(dlModel, valid) Collect model 
 metrics
  • 30. Spam predictor def isSpam(msg: String,
 dlModel: DeepLearningModel,
 hashingTF: HashingTF,
 idfModel: IDFModel,
 hamThreshold: Double = 0.5):Boolean = {
 val msgRdd = sc.parallelize(Seq(msg))
 val msgVector: SchemaRDD = idfModel.transform(
 hashingTF.transform (
 tokenize (msgRdd))) .map(v => SMS("?", v))
 val msgTable: DataFrame = msgVector
 msgTable.remove(0) // remove first column
 val prediction = dlModel.score(msgTable)
 prediction.vecs()(1).at(0) < hamThreshold
 } Prepared models Default decision threshold Scoring
  • 31. Predict spam isSpam("Michal, beer tonight in MV?") isSpam("We tried to contact you re your reply to our offer of a Video Handset? 750 anytime any networks mins? UNLIMITED TEXT?")
  • 33. Where is the code? https://github.com/h2oai/sparkling-water/ blob/master/examples/scripts/
  • 35. Checkout H2O.ai Training Books http://learn.h2o.ai/
 Checkout H2O.ai Blog http://h2o.ai/blog/
 Checkout H2O.ai Youtube Channel https://www.youtube.com/user/0xdata
 Checkout GitHub https://github.com/h2oai/sparkling-water Meetups https://meetup.com/ More info
  • 36. Learn more at h2o.ai Follow us at @h2oai Thank you! Sparkling Water is open-source
 ML application platform combining
 power of Spark and H2O