1) Mahout is an Apache project that builds a scalable machine learning library.
2) It aims to support a variety of machine learning tasks such as clustering, classification, and recommendation.
3) Mahout algorithms are implemented using MapReduce to scale linearly with large datasets.
Potential of AI (Generative AI) in Business: Learnings and Insights
SDEC2011 Mahout - the what, the how and the why
1. The what, the why and the how
Speaker: Robin Anil, Apache Mahout PMC Member
SDEC, Seoul, Korea, June 2011
2. about:me
● Apache Mahout PMC member
● A ML Believer
● Author of Mahout in Action
● Software Engineer @ Google
*in no particular order
● Previous Life: Google Summer of Code student for 2 years.
3. about:agenda
● Introducing Mahout
● Why Mahout?
● Birds eye view of Mahout
● Classic Machine learning problems
● Short overview of Mahout clustering
5. Scale!
● Scale to large datasets
○ Hadoop MapReduce implementations that scales linearly
with data.
○ Fast sequential algorithms whose runtime doesn’t depend
on the size of the data
○ Goal: To be as fast as possible for any algorithm
● Scalable to support your business case
○ Apache Software License 2
● Scalable community
○ Vibrant, responsive and diverse
○ Come to the mailing list and find out more
6. about:why
● Lack community
● Lack scalability
● Lack documentations and examples
● Lack Apache licensing
● Are not well tested
● Are Research oriented
● Not built over existing production quality libraries
7. Birds eye view of Mahout
● If you want to:
○ Encode
○ Analyze
○ Predict
○ Get top best
8. about:encode
● Process data and convert to vectors
● Dictionary based v/s Randomizer based
● Get best signals for generating vectors
○ Collocation information (ngrams)
○ Lp Normalization
Data Engineering Camp
1:1.0 2:1.0 3:1.0
9. about:analyze
● Cluster and group data to
● Cluster data
○ K-Means
○ Fuzzy K-Means
○ Canopy
○ Mean Shift
○ Dirichlet process clustering
○ Spectral Clustering
● Co-cluster features / dimensionality reduction
○ Latent Dirichlet Allocation (LDA)
○ Singular Value Decomposition
11. about:lda
● Grouping similar or co-occurring features into a topic
○ Topic “Lol Cat”:
■ Cat
■ Meow
■ Purr
■ Haz
■ Cheeseburger
■ Lol
12. about:predict
● Classification and Recommendation
● Classification:
○ Use features learn model
○ Apply model on unknown
● Recommendation
○ Use pairwise(user-item) information to learn model
○ For a given user return highly likely items
14. about:classify
● Plenty of algorithms
○ Naïve Bayes
○ Complementary Naïve Bayes
○ Random Forests
○ Stocastic Gradient Descent(regression)
● Learn a model from a manually classified data
● Predict the class of a new object based on its
features and the learned model
15. about:recommend
● Predict what the user likes based on
○ His/Her historical behavior
○ Aggregate behavior of people similar to him
16. about:recommend
● Different types of recommenders
○ User based
○ Item based
○ Co-occurrence based
● Full framework for storage, online
online and offline computation of recommendations
● Like clustering, there is a notion of similarity in users or items
○ Cosine, Tanimoto, Pearson and LLR
19. about:parallel-fp-growth
● Identify the most commonly
occurring patterns from
○ Sales Transactions
buy “Milk, eggs and bread”
○ Query Logs
ipad -> apple, tablet, iphone
○ Spam Detection
Yahoo! http://www.slideshare.net/hadoopusergroup/mail-
antispam
20. summary:in-short
● Plenty of overlap. There is no one algorithm to fit all problems.
● Analyze and iterate fast
● MapReduce implementations makes these Fly!
21. Did you know?
● Apache Mahout uses Colt high-performance collections
○ Open HashMaps instead of Chained HashMaps
○ Arrays of Primitive types
○ Available as Mahout Math library
● Mahout Vector uses integer encoding techniques to reduce
space.
● Fastest classifier in Mahout doesn’t use MapReduce!
○ And it learns online
○ And It doesn’t look at all the data.
22. How to use mahout
● Command line launcher bin/mahout
● See the list of tools and algorithms by running bin/mahout
● Run any algorithm by its shortname:
○ bin/mahout kmeans –help
● By default runs locally
● export HADOOP_HOME = /pathto/hadoop-0.20.2/
○ Runs on the cluster configured as per the conf files in the
hadoop directory
● Use driver classes to launch jobs:
○ KMeansDriver.runjob(Path input, Path output …)
23. Clustering Walkthrough (tiny example)
● Input: set of text files in a directory
● Download Mahout and unzip
○ mvn install
○ bin/mahout seqdirectory –i <input> –o <seq-output>
○ bin/mahout seq2sparse –i seq-output –o <vector-output>
○ bin/mahout kmeans –i<vector-output>
-c <cluster-temp> -o <cluster-output> -k 10 –cd 0.01 –x 20
24. Clustering Walkthrough (a bit more)
● Use bigrams: -ng 2
● Prune low frequency: –s 10
● Normalize: -n 2
● Use a distance measure : -dm org.apache.mahout.common.
distance.CosineDistanceMeasure
25. Clustering Walkthrough (viewing results)
● bin/mahout clusterdump
–s cluster-output/clusters-9/part-00000
-d vector-output/dictionary.file-*
-dt sequencefile -n 5 -b 100
● Top terms in a typical cluster
comic => 9.793121272867376
comics => 6.115341078151356
con => 5.015090566692931
sdcc => 3.927590843402978
webcomics => 2.916910980686997
26. Road to Mahout v1.0
● Guiding Principles
○ Use the stable Hadoop API
○ Make vector the de-factor input format for all parts of code
○ Provide stable API for developers
27. Get Started
● http://mahout.apache.org
● dev@mahout.apache.org - Developer mailing list
● user@mahout.apache.org - User mailing list
● Check out the documentations and wiki for quickstart
● http://svn.apache.org/repos/asf/mahout/trunk/ Browse Code
29. Thanks to
● Apache Foundation
● Mahout Committers
● Google Summer of Code Organizers
● And Students
● Open source!
● And NHN for hosting this at Seoul!
● and the wonderful engineers (present and future) in the room.