An introductory presentation on Machine Learning and Apache Mahout. I presented it at the first meetup of the BigData Meetup - Pune Chapter (http://www.meetup.com/Big-Data-Meetup-Pune-Chapter/).
2.
Who Am I
Orzota, Inc.
Making BigData Easy
Designing a Cloud-based platform for ETL, Analytics
Past Work Experience
Persistent Systems Ltd.
Recommendation Engines and User Behavior Analytics.
Area of Interest
Machine Learning
Distributed Systems
Recommendation Engines
5.
Introduction
"Machine Learning is Programming Computers to optimize a Performance Criterion using Example Data or Past Experience"
Term coined by Arthur Samuel:
"Field of study that gives computers the ability to learn without being explicitly programmed."
Branch of Artificial Intelligence and Statistics
Focuses on prediction based on known properties
Used as a sub-process in Data Mining.
Data Mining focuses on discovering new, unknown properties.
6.
Learning Algorithms
Supervised Learning
Labelled input data.
Creating classifiers to predict unseen inputs.
Unsupervised Learning
Unlabelled input data.
Creating a function to predict the relation and output.
Semi-Supervised Learning
Combines Supervised and Unsupervised Learning methodology.
Reinforcement Learning
Reward-Punishment based agent.
7.
Supervised Learning
Introduction
Learn from the Data
Data is already labelled
Expert, Crowd-sourced or case-based labelling of data.
Applications
Handwriting Recognition
Spam Detection
Information Retrieval
Personalisation based on ranks
Speech Recognition
11.
Supervised Learning
Example: Naive Bayes Classifier
Running a test on the Classifier
Input: "Order a trial Adobe chicken daily EABList new summer savings, welcome!"
Classifier output: Spam Bin
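A minimal sketch of the test above, assuming a multinomial Naive Bayes with add-one smoothing. The training messages, their labels and all function names are invented for illustration and are not from the talk. Words never seen in training are ignored.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def train(labelled_docs):
    """Count words per label and documents per label."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in labelled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in tokenize(text):
            if word in vocab:  # skip out-of-vocabulary words
                # add-one (Laplace) smoothing
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set (assumption, not real data).
TRAINING = [
    ("order a trial now", "spam"),
    ("new summer savings welcome", "spam"),
    ("daily savings order now", "spam"),
    ("meeting notes for the project", "ham"),
    ("lunch at noon today", "ham"),
    ("project review meeting today", "ham"),
]
MODEL = train(TRAINING)
SLIDE_MESSAGE = "Order a trial Adobe chicken daily EABList new summer savings, welcome!"

if __name__ == "__main__":
    print(classify(SLIDE_MESSAGE, MODEL))
```

With a real corpus the same structure applies, only with far more labelled messages to train on.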
12.
Unsupervised Learning
Introduction
Finding hidden structure in data
Unlabelled Data
Subject-matter experts (SMEs) are needed after processing to verify, validate and use the output
Used in exploratory analysis rather than predictive analytics
Applications
Pattern Recognition
Groupings based on a distance measure
Group of People, Objects, ...
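Grouping by a distance measure can be sketched with k-means, one common clustering method (the talk does not name a specific algorithm here). The sample points, the Euclidean distance and the choice to seed centroids with the first k points are assumptions made to keep the example short and deterministic.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=10):
    """Assign each point to the nearest centroid, then move the centroids."""
    centroids = [points[i] for i in range(k)]  # assumption: seed with first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical unlabelled data: two loose groups in 2-D.
POINTS = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]

if __name__ == "__main__":
    labels, centroids = kmeans(POINTS, k=2)
    print(labels)
    print(centroids)
```

The cluster numbers carry no meaning by themselves, which is why the output still needs someone to interpret and validate the groups.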
15.
Learning Problem
Cat and Dog Problem
Humans can easily classify which is a cat and which is a dog.
But how can a computer do that?
Some attempts used clustering mechanisms to solve it: Co-occurrence Clustering, Deep Learning
18.
Need
Ever-growing data.
Yesterday's methods to process tomorrow's data
Cheap Storage
Scalable from Ground Up
Should be built on top of any existing Distributed Systems framework
Should contain distributed versions of ML algorithms
Size, Classification and Tools
Lines (Sample Data): Analysis and Visualisation. Tools: Whiteboard, Bash, ...
KBs – low MBs (Prototype Data): Analysis and Visualisation. Tools: Matlab, Octave, R, Processing, Bash, ...
MBs – low GBs (Online Data): Storage: MySQL (DBs), ... Analysis: NumPy, SciPy, Pandas, Weka, ...
GBs – TBs – PBs (BigData): Visualisation: Flare, AmCharts, Raphael. Storage: HDFS, HBase, Cassandra, ... Analysis: Hive, Giraph, Hama, Mahout
21.
Recommender Systems
Introduction
Types of Recommender Systems
Content Based Recommendations
Collaborative Filtering Recommendations
User-User Recommendations
Item-Item Recommendations
Dimensionality Reduction (SVD) Recommendations
Applications
Products you would like to buy
People you might want to connect with
Potential Life-Partners
Recommending Songs you might like
...
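As an illustration of Item-Item recommendations, the sketch below scores unseen items by their cosine similarity to the items a user has already rated. The ratings, user names and function names are hypothetical. The score is a plain similarity-weighted sum, useful for ranking only and not a predicted rating.

```python
import math
from collections import defaultdict


def item_vectors(ratings):
    """Invert user -> {item: rating} into item -> {user: rating}."""
    vectors = defaultdict(dict)
    for user, items in ratings.items():
        for item, rating in items.items():
            vectors[item][user] = rating
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def recommend_items(user, ratings, top_n=3):
    """Rank items the user has not rated by similarity to the rated ones."""
    vectors = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate in vectors:
        if candidate in seen:
            continue
        score = sum(
            cosine(vectors[candidate], vectors[rated]) * rating
            for rated, rating in seen.items()
        )
        if score > 0:
            scores[candidate] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical ratings on a 1-5 scale.
RATINGS = {
    "alice": {"A": 5, "B": 4},
    "bob": {"A": 4, "B": 5, "C": 1},
    "carol": {"A": 1, "C": 5, "D": 4},
    "dave": {"C": 4, "D": 5},
}

if __name__ == "__main__":
    print(recommend_items("alice", RATINGS))
```

Item similarities depend only on the rating matrix, so in practice they can be precomputed once and reused for every user.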
25.
Apache Mahout
Recommender System
Architecture
Two Modes
Stand-alone, non-distributed ("Taste")
Scalable, distributed algorithmic version for Collaborative Filtering
Top-level Packages
Data Model
User Similarity
Item Similarity
User Neighbourhood
Recommender
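How these packages fit together can be sketched as a small pipeline: a data model feeds a user similarity, which feeds a neighbourhood, which feeds the recommender. The Python below is a stand-in written for this transcript and is not Mahout's Java API. The class and function names, the Pearson correlation and the sample preferences are all assumptions.

```python
import math


class DataModel:
    """Holds preferences as user -> {item: rating}."""

    def __init__(self, preferences):
        self.preferences = preferences

    def users(self):
        return list(self.preferences)

    def ratings_of(self, user):
        return self.preferences[user]


def pearson_similarity(model, u, v):
    """User similarity: Pearson correlation over co-rated items."""
    a, b = model.ratings_of(u), model.ratings_of(v)
    common = [item for item in a if item in b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = math.sqrt(sum((a[i] - mean_a) ** 2 for i in common)) * math.sqrt(
        sum((b[i] - mean_b) ** 2 for i in common)
    )
    return num / den if den else 0.0


def nearest_neighbours(model, user, n):
    """User neighbourhood: the n most similar users with positive similarity."""
    scored = [
        (pearson_similarity(model, user, other), other)
        for other in model.users()
        if other != user
    ]
    positive = [pair for pair in scored if pair[0] > 0]
    positive.sort(reverse=True)
    return positive[:n]


def recommend(model, user, n_neighbours=2, how_many=3):
    """Recommender: similarity-weighted average of the neighbours' ratings."""
    seen = model.ratings_of(user)
    totals, weights = {}, {}
    for similarity, other in nearest_neighbours(model, user, n_neighbours):
        for item, rating in model.ratings_of(other).items():
            if item not in seen:
                totals[item] = totals.get(item, 0.0) + similarity * rating
                weights[item] = weights.get(item, 0.0) + similarity
    estimates = {item: totals[item] / weights[item] for item in totals}
    return sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)[:how_many]


# Hypothetical preferences on a 1-5 scale.
MODEL = DataModel({
    "u1": {"i1": 5, "i2": 3, "i3": 4},
    "u2": {"i1": 4, "i2": 2, "i3": 3, "i4": 5},
    "u3": {"i1": 1, "i2": 5, "i3": 2, "i5": 4},
    "u4": {"i1": 5, "i2": 2, "i3": 4, "i4": 4, "i5": 2},
})

if __name__ == "__main__":
    print(recommend(MODEL, "u1"))
```

Keeping the similarity and the neighbourhood as separate steps means either can be swapped without touching the recommender, which mirrors the package split listed above.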