SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Downloaden Sie, um offline zu lesen
The what, the why and the how
Speaker: Robin Anil, Apache Mahout PMC Member

          SDEC, Seoul, Korea, June 2011
about:me
● Apache Mahout PMC member
● A ML Believer
● Author of Mahout in Action
● Software Engineer @ Google

 *in no particular order


● Previous Life: Google Summer of Code student for 2 years.
about:agenda
● Introducing Mahout
● Why Mahout?
● Birds eye view of Mahout
● Classic Machine learning problems
● Short overview of Mahout clustering
about:mission




To build a scalable machine learning library
Scale!
● Scale to large datasets
   ○ Hadoop MapReduce implementations that scales linearly
     with data.
   ○ Fast sequential algorithms whose runtime doesn’t depend
     on the size of the data
   ○ Goal: To be as fast as possible for any algorithm
● Scalable to support your business case
   ○ Apache Software License 2
● Scalable community
   ○ Vibrant, responsive and diverse
   ○ Come to the mailing list and find out more
about:why
● Lack community
● Lack scalability
● Lack documentations and examples
● Lack Apache licensing
● Are not well tested
● Are Research oriented
● Not built over existing production quality libraries
Birds eye view of Mahout
● If you want to:
   ○ Encode
   ○ Analyze
   ○ Predict
   ○ Get top best
about:encode
● Process data and convert to vectors
● Dictionary based v/s Randomizer based
● Get best signals for generating vectors
   ○ Collocation information (ngrams)
   ○ Lp Normalization



  Data      Engineering Camp

  1:1.0     2:1.0            3:1.0
about:analyze
● Cluster and group data to
● Cluster data
   ○ K-Means
   ○ Fuzzy K-Means
   ○ Canopy
   ○ Mean Shift
   ○ Dirichlet process clustering
   ○ Spectral Clustering
● Co-cluster features / dimensionality reduction
   ○ Latent Dirichlet Allocation (LDA)
   ○ Singular Value Decomposition
about:clustering
News clusters
about:lda
● Grouping similar or co-occurring features into a topic
   ○ Topic “Lol Cat”:
       ■ Cat
       ■ Meow
       ■ Purr
       ■ Haz
       ■ Cheeseburger
       ■ Lol
about:predict
● Classification and Recommendation


● Classification:
   ○ Use features learn model
   ○ Apply model on unknown
● Recommendation
   ○ Use pairwise(user-item) information to learn model
   ○ For a given user return highly likely items
about:classify
● Predicting the type of a new object based on its features
● The types are predetermined




                      Dog           Cat
about:classify
● Plenty of algorithms
   ○ Naïve Bayes
   ○ Complementary Naïve Bayes
   ○ Random Forests
    ○ Stocastic Gradient Descent(regression)
● Learn a model from a manually classified data
● Predict the class of a new object based on its
  features and the learned model
about:recommend
● Predict what the user likes based on
   ○ His/Her historical behavior
   ○ Aggregate behavior of people similar to him
about:recommend
● Different types of recommenders
   ○ User based
   ○ Item based
    ○ Co-occurrence based
● Full framework for storage, online
  online and offline computation of recommendations
● Like clustering, there is a notion of similarity in users or items
   ○ Cosine, Tanimoto, Pearson and LLR
about:top-n
● Frequent Pattern Mining:
   ○ Identify top-K patterns
about:frequent-pattern-mining
● Find interesting groups of items based on how they co-occur in
  a dataset
about:parallel-fp-growth
● Identify the most commonly
  occurring patterns from
   ○ Sales Transactions
     buy “Milk, eggs and bread”
   ○ Query Logs
        ipad -> apple, tablet, iphone
   ○ Spam Detection
        Yahoo! http://www.slideshare.net/hadoopusergroup/mail-
        antispam
summary:in-short
● Plenty of overlap. There is no one algorithm to fit all problems.
● Analyze and iterate fast
● MapReduce implementations makes these Fly!
Did you know?
● Apache Mahout uses Colt high-performance collections
   ○ Open HashMaps instead of Chained HashMaps
   ○ Arrays of Primitive types
   ○ Available as Mahout Math library
● Mahout Vector uses integer encoding techniques to reduce
  space.
● Fastest classifier in Mahout doesn’t use MapReduce!
   ○ And it learns online
   ○ And It doesn’t look at all the data.
How to use mahout
● Command line launcher bin/mahout
● See the list of tools and algorithms by running bin/mahout
● Run any algorithm by its shortname:
   ○ bin/mahout kmeans –help
● By default runs locally
● export HADOOP_HOME = /pathto/hadoop-0.20.2/
   ○ Runs on the cluster configured as per the conf files in the
     hadoop directory
● Use driver classes to launch jobs:
   ○ KMeansDriver.runjob(Path input, Path output …)
Clustering Walkthrough (tiny example)
● Input: set of text files in a directory
● Download Mahout and unzip
   ○ mvn install
   ○ bin/mahout seqdirectory –i <input> –o <seq-output>
   ○ bin/mahout seq2sparse –i seq-output –o <vector-output>
   ○ bin/mahout kmeans –i<vector-output>
     -c <cluster-temp> -o <cluster-output> -k 10 –cd 0.01 –x 20
Clustering Walkthrough (a bit more)

● Use bigrams: -ng 2
● Prune low frequency: –s 10
● Normalize: -n 2


● Use a distance measure : -dm org.apache.mahout.common.
  distance.CosineDistanceMeasure
Clustering Walkthrough (viewing results)
 ● bin/mahout clusterdump
   –s cluster-output/clusters-9/part-00000
   -d vector-output/dictionary.file-*
   -dt sequencefile -n 5 -b 100
 ● Top terms in a typical cluster
comic => 9.793121272867376
comics => 6.115341078151356
con => 5.015090566692931
sdcc => 3.927590843402978
webcomics => 2.916910980686997
Road to Mahout v1.0
● Guiding Principles
   ○ Use the stable Hadoop API
   ○ Make vector the de-factor input format for all parts of code
   ○ Provide stable API for developers
Get Started
● http://mahout.apache.org
● dev@mahout.apache.org - Developer mailing list
● user@mahout.apache.org - User mailing list
● Check out the documentations and wiki for quickstart
● http://svn.apache.org/repos/asf/mahout/trunk/ Browse Code
Resources
● “Mahout in Action” Owen, Anil, Dunning, Friedman
  http://www.manning.com/owen


● “Taming Text” Ingersoll, Morton, Farris
  http://www.manning.com/ingersoll


● “Introducing Apache Mahout”
  http://www.ibm.com/developerworks/java/library/j-mahout/
Thanks to
● Apache Foundation
● Mahout Committers
● Google Summer of Code Organizers
● And Students
● Open source!


● And NHN for hosting this at Seoul!
● and the wonderful engineers (present and future) in the room.
References
● news.google.com
● Cat http://www.flickr.com/photos/gattou/3178745634/
● Dog http://www.flickr.com/photos/30800139@N04/3879737638/
● Milk Eggs Bread http://www.flickr.
  com/photos/nauright/4792775946/
● Amazon Recommendations
● twitter

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用James Chen
 
Introduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache MahoutIntroduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache Mahoutsscdotopen
 
Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!OSCON Byrum
 
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkScalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkEvan Casey
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Cataldo Musto
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDCDrew Farris
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseLukas Vlcek
 
Apache Mahout
Apache MahoutApache Mahout
Apache MahoutAjit Koti
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Introduction to Apache Mahout
Introduction to Apache MahoutIntroduction to Apache Mahout
Introduction to Apache MahoutAman Adhikari
 
Apache Mahout Architecture Overview
Apache Mahout Architecture OverviewApache Mahout Architecture Overview
Apache Mahout Architecture OverviewStefano Dalla Palma
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenderssscdotopen
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using MahoutIMC Institute
 
Intro to Mahout
Intro to MahoutIntro to Mahout
Intro to MahoutUri Lavi
 

Was ist angesagt? (20)

Intro to Apache Mahout
Intro to Apache MahoutIntro to Apache Mahout
Intro to Apache Mahout
 
Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用
 
Apache mahout
Apache mahoutApache mahout
Apache mahout
 
Mahout part2
Mahout part2Mahout part2
Mahout part2
 
Introduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache MahoutIntroduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache Mahout
 
Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
 
Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!
 
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkScalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDC
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBase
 
Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
 
Mahout
MahoutMahout
Mahout
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
mahout introduction
mahout  introductionmahout  introduction
mahout introduction
 
Introduction to Apache Mahout
Introduction to Apache MahoutIntroduction to Apache Mahout
Introduction to Apache Mahout
 
Apache Mahout Architecture Overview
Apache Mahout Architecture OverviewApache Mahout Architecture Overview
Apache Mahout Architecture Overview
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenders
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using Mahout
 
Intro to Mahout
Intro to MahoutIntro to Mahout
Intro to Mahout
 

Ähnlich wie SDEC2011 Mahout - the what, the how and the why

Recommendation engines
Recommendation enginesRecommendation engines
Recommendation enginesGeorgian Micsa
 
Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15
Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15
Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15MLconf
 
10 more lessons learned from building Machine Learning systems - MLConf
10 more lessons learned from building Machine Learning systems - MLConf10 more lessons learned from building Machine Learning systems - MLConf
10 more lessons learned from building Machine Learning systems - MLConfXavier Amatriain
 
10 more lessons learned from building Machine Learning systems
10 more lessons learned from building Machine Learning systems10 more lessons learned from building Machine Learning systems
10 more lessons learned from building Machine Learning systemsXavier Amatriain
 
Buidling large scale recommendation engine
Buidling large scale recommendation engineBuidling large scale recommendation engine
Buidling large scale recommendation engineKeeyong Han
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroDaniel Marcous
 
Wattpad - Spark Stories
Wattpad - Spark StoriesWattpad - Spark Stories
Wattpad - Spark StoriesRylan Halteman
 
Apache Cassandra Lunch #50: Machine Learning with Spark + Cassandra
Apache Cassandra Lunch #50: Machine Learning with Spark + CassandraApache Cassandra Lunch #50: Machine Learning with Spark + Cassandra
Apache Cassandra Lunch #50: Machine Learning with Spark + CassandraAnant Corporation
 
Open Chemistry, JupyterLab and data: Reproducible quantum chemistry
Open Chemistry, JupyterLab and data: Reproducible quantum chemistryOpen Chemistry, JupyterLab and data: Reproducible quantum chemistry
Open Chemistry, JupyterLab and data: Reproducible quantum chemistryMarcus Hanwell
 
Deep Learning Applications (dadada2017)
Deep Learning Applications (dadada2017)Deep Learning Applications (dadada2017)
Deep Learning Applications (dadada2017)Abhishek Thakur
 
Scalable, good, cheap
Scalable, good, cheapScalable, good, cheap
Scalable, good, cheapMarc Cluet
 
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabScalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabSri Ambati
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriDemi Ben-Ari
 
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Anant Corporation
 
Scaling Recommendations at Quora (RecSys talk 9/16/2016)
Scaling Recommendations at Quora (RecSys talk 9/16/2016)Scaling Recommendations at Quora (RecSys talk 9/16/2016)
Scaling Recommendations at Quora (RecSys talk 9/16/2016)Nikhil Dandekar
 
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop Neo4j
 
Drools5 Community Training Module 5 Drools BLIP Architectural Overview + Demos
Drools5 Community Training Module 5 Drools BLIP Architectural Overview + DemosDrools5 Community Training Module 5 Drools BLIP Architectural Overview + Demos
Drools5 Community Training Module 5 Drools BLIP Architectural Overview + DemosMauricio (Salaboy) Salatino
 
Serverless Clojure and ML prototyping: an experience report
Serverless Clojure and ML prototyping: an experience reportServerless Clojure and ML prototyping: an experience report
Serverless Clojure and ML prototyping: an experience reportMetosin Oy
 

Ähnlich wie SDEC2011 Mahout - the what, the how and the why (20)

Recommendation engines
Recommendation enginesRecommendation engines
Recommendation engines
 
Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15
Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15
Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15
 
10 more lessons learned from building Machine Learning systems - MLConf
10 more lessons learned from building Machine Learning systems - MLConf10 more lessons learned from building Machine Learning systems - MLConf
10 more lessons learned from building Machine Learning systems - MLConf
 
10 more lessons learned from building Machine Learning systems
10 more lessons learned from building Machine Learning systems10 more lessons learned from building Machine Learning systems
10 more lessons learned from building Machine Learning systems
 
Buidling large scale recommendation engine
Buidling large scale recommendation engineBuidling large scale recommendation engine
Buidling large scale recommendation engine
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
 
Wattpad - Spark Stories
Wattpad - Spark StoriesWattpad - Spark Stories
Wattpad - Spark Stories
 
Apache Cassandra Lunch #50: Machine Learning with Spark + Cassandra
Apache Cassandra Lunch #50: Machine Learning with Spark + CassandraApache Cassandra Lunch #50: Machine Learning with Spark + Cassandra
Apache Cassandra Lunch #50: Machine Learning with Spark + Cassandra
 
Open Chemistry, JupyterLab and data: Reproducible quantum chemistry
Open Chemistry, JupyterLab and data: Reproducible quantum chemistryOpen Chemistry, JupyterLab and data: Reproducible quantum chemistry
Open Chemistry, JupyterLab and data: Reproducible quantum chemistry
 
Deep Learning Applications (dadada2017)
Deep Learning Applications (dadada2017)Deep Learning Applications (dadada2017)
Deep Learning Applications (dadada2017)
 
Scalable, good, cheap
Scalable, good, cheapScalable, good, cheap
Scalable, good, cheap
 
Mahout
MahoutMahout
Mahout
 
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science LabScalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
Scalable Ensemble Machine Learning @ Harvard Health Policy Data Science Lab
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
 
Scaling Recommendations at Quora (RecSys talk 9/16/2016)
Scaling Recommendations at Quora (RecSys talk 9/16/2016)Scaling Recommendations at Quora (RecSys talk 9/16/2016)
Scaling Recommendations at Quora (RecSys talk 9/16/2016)
 
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
Visual, scalable, and manageable data loading to and from Neo4j with Apache Hop
 
Drools5 Community Training Module 5 Drools BLIP Architectural Overview + Demos
Drools5 Community Training Module 5 Drools BLIP Architectural Overview + DemosDrools5 Community Training Module 5 Drools BLIP Architectural Overview + Demos
Drools5 Community Training Module 5 Drools BLIP Architectural Overview + Demos
 
Evolve with laravel
Evolve with laravelEvolve with laravel
Evolve with laravel
 
Serverless Clojure and ML prototyping: an experience report
Serverless Clojure and ML prototyping: an experience reportServerless Clojure and ML prototyping: an experience report
Serverless Clojure and ML prototyping: an experience report
 

Mehr von Korea Sdec

SDEC2011 Big engineer vs small entreprenuer
SDEC2011 Big engineer vs small entreprenuerSDEC2011 Big engineer vs small entreprenuer
SDEC2011 Big engineer vs small entreprenuerKorea Sdec
 
SDEC2011 Implementing me2day friend suggestion
SDEC2011 Implementing me2day friend suggestionSDEC2011 Implementing me2day friend suggestion
SDEC2011 Implementing me2day friend suggestionKorea Sdec
 
SDEC2011 Introducing Hadoop
SDEC2011 Introducing HadoopSDEC2011 Introducing Hadoop
SDEC2011 Introducing HadoopKorea Sdec
 
Sdec2011 shashank-introducing hadoop
Sdec2011 shashank-introducing hadoopSdec2011 shashank-introducing hadoop
Sdec2011 shashank-introducing hadoopKorea Sdec
 
SDEC2011 NoSQL Data modelling
SDEC2011 NoSQL Data modellingSDEC2011 NoSQL Data modelling
SDEC2011 NoSQL Data modellingKorea Sdec
 
SDEC2011 Essentials of Pig
SDEC2011 Essentials of PigSDEC2011 Essentials of Pig
SDEC2011 Essentials of PigKorea Sdec
 
SDEC2011 Essentials of Mahout
SDEC2011 Essentials of MahoutSDEC2011 Essentials of Mahout
SDEC2011 Essentials of MahoutKorea Sdec
 
SDEC2011 Essentials of Hive
SDEC2011 Essentials of HiveSDEC2011 Essentials of Hive
SDEC2011 Essentials of HiveKorea Sdec
 
SDEC2011 NoSQL concepts and models
SDEC2011 NoSQL concepts and modelsSDEC2011 NoSQL concepts and models
SDEC2011 NoSQL concepts and modelsKorea Sdec
 
Sdec2011 Introducing Hadoop
Sdec2011 Introducing HadoopSdec2011 Introducing Hadoop
Sdec2011 Introducing HadoopKorea Sdec
 
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and HiveSDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and HiveKorea Sdec
 
SDEC2011 Rapidant
SDEC2011 RapidantSDEC2011 Rapidant
SDEC2011 RapidantKorea Sdec
 
SDEC2011 Going by TACC
SDEC2011 Going by TACCSDEC2011 Going by TACC
SDEC2011 Going by TACCKorea Sdec
 
SDEC2011 Glory-FS development & Experiences
SDEC2011 Glory-FS development & ExperiencesSDEC2011 Glory-FS development & Experiences
SDEC2011 Glory-FS development & ExperiencesKorea Sdec
 
SDEC2011 Using Couchbase for social game scaling and speed
SDEC2011 Using Couchbase for social game scaling and speedSDEC2011 Using Couchbase for social game scaling and speed
SDEC2011 Using Couchbase for social game scaling and speedKorea Sdec
 
SDEC2011 Arcus NHN memcached cloud
SDEC2011 Arcus NHN memcached cloudSDEC2011 Arcus NHN memcached cloud
SDEC2011 Arcus NHN memcached cloudKorea Sdec
 

Mehr von Korea Sdec (16)

SDEC2011 Big engineer vs small entreprenuer
SDEC2011 Big engineer vs small entreprenuerSDEC2011 Big engineer vs small entreprenuer
SDEC2011 Big engineer vs small entreprenuer
 
SDEC2011 Implementing me2day friend suggestion
SDEC2011 Implementing me2day friend suggestionSDEC2011 Implementing me2day friend suggestion
SDEC2011 Implementing me2day friend suggestion
 
SDEC2011 Introducing Hadoop
SDEC2011 Introducing HadoopSDEC2011 Introducing Hadoop
SDEC2011 Introducing Hadoop
 
Sdec2011 shashank-introducing hadoop
Sdec2011 shashank-introducing hadoopSdec2011 shashank-introducing hadoop
Sdec2011 shashank-introducing hadoop
 
SDEC2011 NoSQL Data modelling
SDEC2011 NoSQL Data modellingSDEC2011 NoSQL Data modelling
SDEC2011 NoSQL Data modelling
 
SDEC2011 Essentials of Pig
SDEC2011 Essentials of PigSDEC2011 Essentials of Pig
SDEC2011 Essentials of Pig
 
SDEC2011 Essentials of Mahout
SDEC2011 Essentials of MahoutSDEC2011 Essentials of Mahout
SDEC2011 Essentials of Mahout
 
SDEC2011 Essentials of Hive
SDEC2011 Essentials of HiveSDEC2011 Essentials of Hive
SDEC2011 Essentials of Hive
 
SDEC2011 NoSQL concepts and models
SDEC2011 NoSQL concepts and modelsSDEC2011 NoSQL concepts and models
SDEC2011 NoSQL concepts and models
 
Sdec2011 Introducing Hadoop
Sdec2011 Introducing HadoopSdec2011 Introducing Hadoop
Sdec2011 Introducing Hadoop
 
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and HiveSDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
 
SDEC2011 Rapidant
SDEC2011 RapidantSDEC2011 Rapidant
SDEC2011 Rapidant
 
SDEC2011 Going by TACC
SDEC2011 Going by TACCSDEC2011 Going by TACC
SDEC2011 Going by TACC
 
SDEC2011 Glory-FS development & Experiences
SDEC2011 Glory-FS development & ExperiencesSDEC2011 Glory-FS development & Experiences
SDEC2011 Glory-FS development & Experiences
 
SDEC2011 Using Couchbase for social game scaling and speed
SDEC2011 Using Couchbase for social game scaling and speedSDEC2011 Using Couchbase for social game scaling and speed
SDEC2011 Using Couchbase for social game scaling and speed
 
SDEC2011 Arcus NHN memcached cloud
SDEC2011 Arcus NHN memcached cloudSDEC2011 Arcus NHN memcached cloud
SDEC2011 Arcus NHN memcached cloud
 

Kürzlich hochgeladen

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 

Kürzlich hochgeladen (20)

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 

SDEC2011 Mahout - the what, the how and the why

  • 1. The what, the why and the how Speaker: Robin Anil, Apache Mahout PMC Member SDEC, Seoul, Korea, June 2011
  • 2. about:me ● Apache Mahout PMC member ● A ML Believer ● Author of Mahout in Action ● Software Engineer @ Google *in no particular order ● Previous Life: Google Summer of Code student for 2 years.
  • 3. about:agenda ● Introducing Mahout ● Why Mahout? ● Birds eye view of Mahout ● Classic Machine learning problems ● Short overview of Mahout clustering
  • 4. about:mission To build a scalable machine learning library
  • 5. Scale! ● Scale to large datasets ○ Hadoop MapReduce implementations that scales linearly with data. ○ Fast sequential algorithms whose runtime doesn’t depend on the size of the data ○ Goal: To be as fast as possible for any algorithm ● Scalable to support your business case ○ Apache Software License 2 ● Scalable community ○ Vibrant, responsive and diverse ○ Come to the mailing list and find out more
  • 6. about:why ● Lack community ● Lack scalability ● Lack documentations and examples ● Lack Apache licensing ● Are not well tested ● Are Research oriented ● Not built over existing production quality libraries
  • 7. Birds eye view of Mahout ● If you want to: ○ Encode ○ Analyze ○ Predict ○ Get top best
  • 8. about:encode ● Process data and convert to vectors ● Dictionary based v/s Randomizer based ● Get best signals for generating vectors ○ Collocation information (ngrams) ○ Lp Normalization Data Engineering Camp 1:1.0 2:1.0 3:1.0
  • 9. about:analyze ● Cluster and group data to ● Cluster data ○ K-Means ○ Fuzzy K-Means ○ Canopy ○ Mean Shift ○ Dirichlet process clustering ○ Spectral Clustering ● Co-cluster features / dimensionality reduction ○ Latent Dirichlet Allocation (LDA) ○ Singular Value Decomposition
  • 11. about:lda ● Grouping similar or co-occurring features into a topic ○ Topic “Lol Cat”: ■ Cat ■ Meow ■ Purr ■ Haz ■ Cheeseburger ■ Lol
  • 12. about:predict ● Classification and Recommendation ● Classification: ○ Use features learn model ○ Apply model on unknown ● Recommendation ○ Use pairwise(user-item) information to learn model ○ For a given user return highly likely items
  • 13. about:classify ● Predicting the type of a new object based on its features ● The types are predetermined Dog Cat
  • 14. about:classify ● Plenty of algorithms ○ Naïve Bayes ○ Complementary Naïve Bayes ○ Random Forests ○ Stocastic Gradient Descent(regression) ● Learn a model from a manually classified data ● Predict the class of a new object based on its features and the learned model
  • 15. about:recommend ● Predict what the user likes based on ○ His/Her historical behavior ○ Aggregate behavior of people similar to him
  • 16. about:recommend ● Different types of recommenders ○ User based ○ Item based ○ Co-occurrence based ● Full framework for storage, online online and offline computation of recommendations ● Like clustering, there is a notion of similarity in users or items ○ Cosine, Tanimoto, Pearson and LLR
  • 17. about:top-n ● Frequent Pattern Mining: ○ Identify top-K patterns
  • 18. about:frequent-pattern-mining ● Find interesting groups of items based on how they co-occur in a dataset
  • 19. about:parallel-fp-growth ● Identify the most commonly occurring patterns from ○ Sales Transactions buy “Milk, eggs and bread” ○ Query Logs ipad -> apple, tablet, iphone ○ Spam Detection Yahoo! http://www.slideshare.net/hadoopusergroup/mail- antispam
  • 20. summary:in-short ● Plenty of overlap. There is no one algorithm to fit all problems. ● Analyze and iterate fast ● MapReduce implementations makes these Fly!
  • 21. Did you know? ● Apache Mahout uses Colt high-performance collections ○ Open HashMaps instead of Chained HashMaps ○ Arrays of Primitive types ○ Available as Mahout Math library ● Mahout Vector uses integer encoding techniques to reduce space. ● Fastest classifier in Mahout doesn’t use MapReduce! ○ And it learns online ○ And It doesn’t look at all the data.
  • 22. How to use mahout ● Command line launcher bin/mahout ● See the list of tools and algorithms by running bin/mahout ● Run any algorithm by its shortname: ○ bin/mahout kmeans –help ● By default runs locally ● export HADOOP_HOME = /pathto/hadoop-0.20.2/ ○ Runs on the cluster configured as per the conf files in the hadoop directory ● Use driver classes to launch jobs: ○ KMeansDriver.runjob(Path input, Path output …)
  • 23. Clustering Walkthrough (tiny example) ● Input: set of text files in a directory ● Download Mahout and unzip ○ mvn install ○ bin/mahout seqdirectory –i <input> –o <seq-output> ○ bin/mahout seq2sparse –i seq-output –o <vector-output> ○ bin/mahout kmeans –i<vector-output> -c <cluster-temp> -o <cluster-output> -k 10 –cd 0.01 –x 20
  • 24. Clustering Walkthrough (a bit more) ● Use bigrams: -ng 2 ● Prune low frequency: –s 10 ● Normalize: -n 2 ● Use a distance measure : -dm org.apache.mahout.common. distance.CosineDistanceMeasure
  • 25. Clustering Walkthrough (viewing results) ● bin/mahout clusterdump –s cluster-output/clusters-9/part-00000 -d vector-output/dictionary.file-* -dt sequencefile -n 5 -b 100 ● Top terms in a typical cluster comic => 9.793121272867376 comics => 6.115341078151356 con => 5.015090566692931 sdcc => 3.927590843402978 webcomics => 2.916910980686997
  • 26. Road to Mahout v1.0 ● Guiding Principles ○ Use the stable Hadoop API ○ Make vector the de-factor input format for all parts of code ○ Provide stable API for developers
  • 27. Get Started ● http://mahout.apache.org ● dev@mahout.apache.org - Developer mailing list ● user@mahout.apache.org - User mailing list ● Check out the documentations and wiki for quickstart ● http://svn.apache.org/repos/asf/mahout/trunk/ Browse Code
  • 28. Resources ● “Mahout in Action” Owen, Anil, Dunning, Friedman http://www.manning.com/owen ● “Taming Text” Ingersoll, Morton, Farris http://www.manning.com/ingersoll ● “Introducing Apache Mahout” http://www.ibm.com/developerworks/java/library/j-mahout/
  • 29. Thanks to ● Apache Foundation ● Mahout Committers ● Google Summer of Code Organizers ● And Students ● Open source! ● And NHN for hosting this at Seoul! ● and the wonderful engineers (present and future) in the room.
  • 30. References ● news.google.com ● Cat http://www.flickr.com/photos/gattou/3178745634/ ● Dog http://www.flickr.com/photos/30800139@N04/3879737638/ ● Milk Eggs Bread http://www.flickr. com/photos/nauright/4792775946/ ● Amazon Recommendations ● twitter