An Introduction to Collaborative Filtering
             with Apache Mahout



                         Sebastian Schelter
                   Recommender Systems Challenge
                        at ACM RecSys 2012




         Database Systems and Information Management Group (DIMA)
                        Technische Universität Berlin

13.09.2012
                       http://www.dima.tu-berlin.de/
                               DIMA – TU Berlin                 1
Overview


■ Apache Mahout: Apache-licensed library
  that aims to provide highly scalable
  data mining and machine learning

■ its collaborative filtering module is based on the Taste
  framework of Sean Owen

■ mostly aimed at production scenarios, with a focus on
    □ processing efficiency
    □ ease of integration with different datastores, web applications, Amazon EC2
    □ scalability: recommendations, item similarities and matrix decompositions
      can be computed via MapReduce on Apache Hadoop

■ not used much in recommender challenges
    □ not enough different algorithms implemented?
    □ not enough tooling for evaluation?

    → it's open source, so it's up to you to change that!


Preference & DataModel

■ Preference encapsulates a user-item interaction as a
  (user, item, value) triple
    □ only numeric userIDs and itemIDs allowed for memory efficiency
    □ PreferenceArray encapsulates a set of preferences


■ DataModel encapsulates a dataset
    □ lots of convenient accessor methods like getNumUsers(),
      getPreferencesForItem(itemID), ...
    □ allows adding temporal information to preferences
    □ lots of options to store the data (in-memory, file, database, key-value
      store)
    □ drawback: for a lot of use cases, all the data has to fit into memory to allow
      efficient recommendation

DataModel dataModel = new FileDataModel(new File("movielens.csv"));

PreferenceArray prefsOfUser1 = dataModel.getPreferencesFromUser(1);
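The memory argument for numeric IDs can be illustrated with a small plain-Java sketch. It is not Mahout's PreferenceArray: the class UserPrefsSketch and its methods are hypothetical, and only show the idea of keeping one user's (item, value) pairs in parallel primitive arrays instead of one object per preference.

```java
import java.util.Arrays;

/** Hypothetical illustration, not Mahout code: one user's preferences in parallel primitive arrays. */
final class UserPrefsSketch {
    private final long userID;
    private long[] itemIDs = new long[4];
    private float[] values = new float[4];
    private int size = 0;

    UserPrefsSketch(long userID) {
        this.userID = userID;
    }

    /** Appends an (item, value) pair, doubling the arrays when they are full. */
    void add(long itemID, float value) {
        if (size == itemIDs.length) {
            itemIDs = Arrays.copyOf(itemIDs, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        itemIDs[size] = itemID;
        values[size] = value;
        size++;
    }

    int length() { return size; }

    long getUserID() { return userID; }

    long getItemID(int i) { checkIndex(i); return itemIDs[i]; }

    float getValue(int i) { checkIndex(i); return values[i]; }

    private void checkIndex(int i) {
        if (i < 0 || i >= size) {
            throw new IndexOutOfBoundsException("index " + i);
        }
    }

    public static void main(String[] args) {
        UserPrefsSketch prefs = new UserPrefsSketch(1L);
        prefs.add(25L, 4.0f);
        prefs.add(42L, 2.5f);
        for (int i = 0; i < prefs.length(); i++) {
            System.out.println(prefs.getUserID() + "," + prefs.getItemID(i) + "," + prefs.getValue(i));
        }
    }
}
```

Each preference costs one long and one float here, with no per-preference object header, which is the kind of saving that numeric IDs make possible.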


Recommender

■ Recommender is the basic interface for all of Mahout's
  recommenders
    □ recommend n items for a particular user
    □ estimate the preference of a user towards an item


■ a CandidateItemsStrategy fetches all items that might be
  recommended for a particular user

■ a Rescorer allows post-processing of recommendations


List<RecommendedItem> topItems = recommender.recommend(1, 10);

float preference = recommender.estimatePreference(1, 25);
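How recommend() relates to estimatePreference() can be sketched in plain Java: fetch the candidate items, estimate a preference for each, keep the best n. This is not Mahout code. The class TopNSketch is hypothetical, the candidate rule (every item the user has not rated) is an assumption, and the item average stands in for a real estimator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration, not Mahout code: candidates -> estimates -> top n. */
final class TopNSketch {
    // ratings.get(userID).get(itemID) = preference value
    private final Map<Long, Map<Long, Float>> ratings = new HashMap<>();

    void setPreference(long userID, long itemID, float value) {
        ratings.computeIfAbsent(userID, k -> new HashMap<>()).put(itemID, value);
    }

    /** Stand-in estimator: average rating of the item over all users (NaN if unrated). */
    float estimatePreference(long userID, long itemID) {
        float sum = 0;
        int count = 0;
        for (Map<Long, Float> prefs : ratings.values()) {
            Float v = prefs.get(itemID);
            if (v != null) {
                sum += v;
                count++;
            }
        }
        return count == 0 ? Float.NaN : sum / count;
    }

    /** Candidate items: every rated item that this user has not rated yet. */
    List<Long> candidateItems(long userID) {
        Map<Long, Float> own = ratings.getOrDefault(userID, new HashMap<>());
        Set<Long> candidates = new LinkedHashSet<>();
        for (Map<Long, Float> prefs : ratings.values()) {
            for (Long itemID : prefs.keySet()) {
                if (!own.containsKey(itemID)) {
                    candidates.add(itemID);
                }
            }
        }
        return new ArrayList<>(candidates);
    }

    /** Returns up to howMany candidate item IDs, highest estimate first. */
    List<Long> recommend(long userID, int howMany) {
        List<Long> candidates = candidateItems(userID);
        candidates.sort(
            Comparator.comparingDouble((Long item) -> estimatePreference(userID, item)).reversed());
        return new ArrayList<>(candidates.subList(0, Math.min(howMany, candidates.size())));
    }

    public static void main(String[] args) {
        TopNSketch r = new TopNSketch();
        r.setPreference(1L, 10L, 5f);
        r.setPreference(2L, 10L, 4f);
        r.setPreference(2L, 20L, 2f);
        r.setPreference(2L, 30L, 5f);
        System.out.println(r.recommend(1L, 10));
    }
}
```

A rescoring step would fit between the estimate and the sort, for example to boost or filter items by business rules.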




Item-Based Collaborative Filtering

■ ItemBasedRecommender
    □ can also compute item similarities
    □ can provide preferences for items as justification for recommendations


■ lots of similarity measures available (Pearson correlation,
  Jaccard coefficient, ...)

■ also allows usage of precomputed item similarities stored in a
  file (via FileItemSimilarity)

ItemBasedRecommender recommender =
   new GenericItemBasedRecommender(dataModel,
   new PearsonCorrelationSimilarity(dataModel));

List<RecommendedItem> similarItems =
   recommender.mostSimilarItems(5, 10);
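The two similarity measures named above can be written down in a few lines of plain Java. This is a sketch, not Mahout's implementation: the class ItemSimilaritySketch is hypothetical, and each item is assumed to be represented as a map from userID to rating.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration, not Mahout code: items are maps from userID to rating. */
final class ItemSimilaritySketch {

    /** Pearson correlation over the users who rated both items; NaN if undefined. */
    static double pearson(Map<Long, Double> itemA, Map<Long, Double> itemB) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> e : itemA.entrySet()) {
            Double b = itemB.get(e.getKey());
            if (b == null) {
                continue; // user rated only one of the two items
            }
            double a = e.getValue();
            n++;
            sumA += a;
            sumB += b;
            sumAA += a * a;
            sumBB += b * b;
            sumAB += a * b;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double cov = sumAB - sumA * sumB / n;
        double varA = sumAA - sumA * sumA / n;
        double varB = sumBB - sumB * sumB / n;
        if (varA <= 0 || varB <= 0) {
            return Double.NaN; // constant ratings: correlation is undefined
        }
        return cov / Math.sqrt(varA * varB);
    }

    /** Jaccard coefficient of the two user sets; the rating values are ignored. */
    static double jaccard(Map<Long, Double> itemA, Map<Long, Double> itemB) {
        int intersection = 0;
        for (Long user : itemA.keySet()) {
            if (itemB.containsKey(user)) {
                intersection++;
            }
        }
        int union = itemA.size() + itemB.size() - intersection;
        return union == 0 ? Double.NaN : (double) intersection / union;
    }

    public static void main(String[] args) {
        Map<Long, Double> a = new HashMap<>();
        Map<Long, Double> b = new HashMap<>();
        a.put(1L, 1.0); a.put(2L, 2.0); a.put(3L, 3.0);
        b.put(1L, 2.0); b.put(2L, 4.0); b.put(3L, 6.0); b.put(4L, 1.0);
        System.out.println(pearson(a, b) + " " + jaccard(a, b));
    }
}
```

Pearson uses the rating values of co-rating users, while Jaccard only looks at who interacted, which makes it the more natural choice for implicit feedback.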



Latent factor models

■ SVDRecommender
    □ uses a decomposition of the user-item interaction matrix to compute
      recommendations


■ uses a Factorizer to compute a Factorization from a
  DataModel; several implementations are available

    □ Simon Funk's SGD
    □ Alternating Least Squares
    □ Weighted matrix factorization for implicit feedback data

Factorizer factorizer = new ALSWRFactorizer(dataModel, numFeatures,
   lambda, numIterations);

Recommender svdRecommender =
   new SVDRecommender(dataModel, factorizer);

List<RecommendedItem> topItems = svdRecommender.recommend(1, 10);
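What a factorizer computes can be sketched with a tiny SGD-based matrix factorization in plain Java. This is an illustration in the spirit of Funk's SGD, not Mahout's Factorizer: the class SgdFactorizerSketch, the dense 0-based user and item indices, the initialization and the hyperparameter values are all assumptions.

```java
import java.util.Random;

/** Hypothetical illustration, not Mahout code: rating ~ dot(userFactors[u], itemFactors[i]). */
final class SgdFactorizerSketch {
    final double[][] userFactors;
    final double[][] itemFactors;

    SgdFactorizerSketch(int numUsers, int numItems, int numFeatures, long seed) {
        Random rnd = new Random(seed);
        userFactors = randomMatrix(numUsers, numFeatures, rnd);
        itemFactors = randomMatrix(numItems, numFeatures, rnd);
    }

    private static double[][] randomMatrix(int rows, int cols, Random rnd) {
        double[][] m = new double[rows][cols];
        for (double[] row : m) {
            for (int f = 0; f < cols; f++) {
                row[f] = 0.5 * rnd.nextDouble(); // small positive start values
            }
        }
        return m;
    }

    /** Predicted preference: dot product of the user and item factor vectors. */
    double predict(int user, int item) {
        double dot = 0;
        for (int f = 0; f < userFactors[user].length; f++) {
            dot += userFactors[user][f] * itemFactors[item][f];
        }
        return dot;
    }

    /** One SGD pass over all (user, item, value) triples with L2 regularization lambda. */
    void trainEpoch(int[] users, int[] items, double[] values, double learningRate, double lambda) {
        for (int k = 0; k < values.length; k++) {
            double[] p = userFactors[users[k]];
            double[] q = itemFactors[items[k]];
            double err = values[k] - predict(users[k], items[k]);
            for (int f = 0; f < p.length; f++) {
                double pf = p[f]; // keep the old value for the item update
                p[f] += learningRate * (err * q[f] - lambda * pf);
                q[f] += learningRate * (err * pf - lambda * q[f]);
            }
        }
    }

    /** Root mean squared error on the given triples. */
    double rmse(int[] users, int[] items, double[] values) {
        double sum = 0;
        for (int k = 0; k < values.length; k++) {
            double e = values[k] - predict(users[k], items[k]);
            sum += e * e;
        }
        return Math.sqrt(sum / values.length);
    }

    public static void main(String[] args) {
        int[] users = {0, 0, 1, 1, 2, 2};
        int[] items = {0, 1, 0, 2, 1, 2};
        double[] values = {5, 3, 4, 1, 2, 1};
        SgdFactorizerSketch mf = new SgdFactorizerSketch(3, 3, 2, 42L);
        System.out.println("RMSE before: " + mf.rmse(users, items, values));
        for (int epoch = 0; epoch < 200; epoch++) {
            mf.trainEpoch(users, items, values, 0.01, 0.02);
        }
        System.out.println("RMSE after:  " + mf.rmse(users, items, values));
    }
}
```

Alternating Least Squares optimizes the same kind of objective, but solves for all user factors with the item factors fixed and vice versa, instead of taking one gradient step per rating.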



Evaluating recommenders

■ RecommenderEvaluator, RecommenderIRStatsEvaluator
    □ measure the prediction quality of a recommender using a
      random split of the dataset
    □ support for MAE, RMSE, Precision, Recall, ...
    □ need a DataModel, a RecommenderBuilder and a DataModelBuilder for the
      training data



RecommenderEvaluator maeEvaluator = new
   AverageAbsoluteDifferenceRecommenderEvaluator();

// evaluate() returns the error as a double, here the mean absolute error
double mae = maeEvaluator.evaluate(
   new BiasedRecommenderBuilder(lambda2, lambda3, numIterations),
   new InteractionCutDataModelBuilder(maxPrefsPerUser),
   dataModel, trainingPercentage, 1 - trainingPercentage);
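For reference, the two error metrics named above are simple to compute once actual and estimated preferences for the held-out test set are available. The class ErrorMetricsSketch below is a hypothetical plain-Java sketch, not one of Mahout's evaluator classes.

```java
/** Hypothetical illustration, not Mahout code: error metrics over held-out preferences. */
final class ErrorMetricsSketch {

    /** Mean absolute error: average of |actual - estimated|. */
    static double mae(double[] actual, double[] estimated) {
        checkInput(actual, estimated);
        double sum = 0;
        for (int k = 0; k < actual.length; k++) {
            sum += Math.abs(actual[k] - estimated[k]);
        }
        return sum / actual.length;
    }

    /** Root mean squared error: penalizes large errors more strongly than MAE. */
    static double rmse(double[] actual, double[] estimated) {
        checkInput(actual, estimated);
        double sum = 0;
        for (int k = 0; k < actual.length; k++) {
            double diff = actual[k] - estimated[k];
            sum += diff * diff;
        }
        return Math.sqrt(sum / actual.length);
    }

    private static void checkInput(double[] actual, double[] estimated) {
        if (actual.length == 0 || actual.length != estimated.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
    }

    public static void main(String[] args) {
        double[] actual = {4, 3, 5};
        double[] estimated = {3, 3, 3};
        System.out.println("MAE:  " + mae(actual, estimated));
        System.out.println("RMSE: " + rmse(actual, estimated));
    }
}
```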




Starting to work on Mahout

■ Prerequisites
    □ Java 6
    □ Maven
    □ svn client


■ check out the source code from
  http://svn.apache.org/repos/asf/mahout/trunk

■ import it as a Maven project into your favorite IDE




Project: novel item similarity measure

■ in the Million Song DataSet Challenge, a novel item
  similarity measure was used in the winning solution

■ it would be great to see this measure featured in Mahout as well

■ Task
    □ implement the novel item similarity measure as a subclass of Mahout’s
      ItemSimilarity


■ Future Work
    □ this novel similarity measure is asymmetric, ensure that it is correctly
      applied in all scenarios
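The slide does not spell out the measure. Assuming it refers to the asymmetric cosine similarity from Aiolli's Million Song Dataset Challenge work, sim(i, j) = |U(i) ∩ U(j)| / (|U(i)|^α · |U(j)|^(1−α)) with U(x) the set of users who interacted with item x, a plain-Java sketch could look as follows. The class AsymmetricCosineSketch is hypothetical and does not implement Mahout's ItemSimilarity interface.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration under the assumption stated above; not Mahout code. */
final class AsymmetricCosineSketch {

    /**
     * |U(i) ∩ U(j)| / (|U(i)|^alpha * |U(j)|^(1 - alpha)); 0 if either set is empty.
     * alpha = 0.5 gives the ordinary (symmetric) cosine on binary data.
     */
    static double similarity(Set<Long> usersOfI, Set<Long> usersOfJ, double alpha) {
        if (usersOfI.isEmpty() || usersOfJ.isEmpty()) {
            return 0.0;
        }
        int common = 0;
        for (Long user : usersOfI) {
            if (usersOfJ.contains(user)) {
                common++;
            }
        }
        double norm = Math.pow(usersOfI.size(), alpha) * Math.pow(usersOfJ.size(), 1.0 - alpha);
        return common / norm;
    }

    public static void main(String[] args) {
        Set<Long> popular = new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L));
        Set<Long> niche = new HashSet<>(Arrays.asList(1L));
        // The argument order matters whenever alpha != 0.5.
        System.out.println(similarity(popular, niche, 1.0));
        System.out.println(similarity(niche, popular, 1.0));
    }
}
```

Because sim(i, j) and sim(j, i) differ for α ≠ 0.5, any code path that caches or looks up similarities by unordered item pair would need to be checked, which is the concern raised under Future Work.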




Project: temporal split evaluator

■ currently Mahout's standard RecommenderEvaluator
  randomly splits the data into training and test set

■ for datasets with timestamps it would be much more
  interesting to use this temporal information to split the data
  into training and test set

■ Task
    □ create a TemporalSplitRecommenderEvaluator similar to the existing
      AbstractDifferenceRecommenderEvaluator


■ Future Work
    □ factor out the logic for splitting datasets into training and test set
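The splitting logic itself is small. A minimal plain-Java sketch, assuming a global cut in time order (the oldest preferences train, the newest test): the names TemporalSplitSketch and TimedPref are hypothetical and not part of Mahout, and a per-user cut would be an equally plausible design.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration, not Mahout code: split preferences by time instead of randomly. */
final class TemporalSplitSketch {

    /** A preference with a timestamp (for example milliseconds since the epoch). */
    static final class TimedPref {
        final long userID;
        final long itemID;
        final float value;
        final long timestamp;

        TimedPref(long userID, long itemID, float value, long timestamp) {
            this.userID = userID;
            this.itemID = itemID;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /**
     * Sorts by timestamp and returns [training, test]: the oldest trainingPercentage
     * of all preferences go to training, the rest to test.
     */
    static List<List<TimedPref>> split(List<TimedPref> prefs, double trainingPercentage) {
        if (trainingPercentage < 0.0 || trainingPercentage > 1.0) {
            throw new IllegalArgumentException("trainingPercentage must be in [0, 1]");
        }
        List<TimedPref> sorted = new ArrayList<>(prefs);
        sorted.sort(Comparator.comparingLong((TimedPref p) -> p.timestamp));
        int cut = (int) Math.round(sorted.size() * trainingPercentage);
        List<List<TimedPref>> result = new ArrayList<>();
        result.add(new ArrayList<>(sorted.subList(0, cut)));
        result.add(new ArrayList<>(sorted.subList(cut, sorted.size())));
        return result;
    }

    public static void main(String[] args) {
        List<TimedPref> prefs = new ArrayList<>();
        long[] times = {50, 10, 40, 20, 30};
        for (int k = 0; k < times.length; k++) {
            prefs.add(new TimedPref(1L, k, 3.0f, times[k]));
        }
        List<List<TimedPref>> parts = split(prefs, 0.8);
        System.out.println("training: " + parts.get(0).size() + ", test: " + parts.get(1).size());
    }
}
```

With such a split, the recommender is only ever evaluated on preferences that are newer than everything it was trained on.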




Project: baseline method for rating prediction

■ port MyMediaLite’s UserItemBaseline to Mahout
  (preliminary port already available)

■ user-item-baseline estimation is a simple approach that
  estimates the global tendency of a user or an item to
  deviate from the average rating
  (described in Y. Koren: Factor in the Neighbors: Scalable
  and Accurate Collaborative Filtering, TKDD 2009)

■ Task
    □ polish the code
    □ make it work with Mahout’s DataModel


■ Future Work
    □ create an ItemBasedRecommender that makes use of the estimated
      biases
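The baseline estimate can be sketched in plain Java as b_ui = μ + b_u + b_i, assuming the regularized bias estimates from the cited Koren paper, refined over a few alternating passes. This is neither the MyMediaLite code nor the preliminary port. The class UserItemBaselineSketch and the dense 0-based indices are hypothetical; the parameter names lambda2, lambda3 and numIterations are borrowed from the evaluation example.

```java
/** Hypothetical illustration: estimate = globalMean + userBias + itemBias. */
final class UserItemBaselineSketch {
    final double globalMean;
    final double[] userBias;
    final double[] itemBias;

    UserItemBaselineSketch(int numUsers, int numItems, int[] users, int[] items, double[] values,
                           double lambda2, double lambda3, int numIterations) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        globalMean = total / values.length;
        userBias = new double[numUsers];
        itemBias = new double[numItems];

        int[] userCount = new int[numUsers];
        int[] itemCount = new int[numItems];
        for (int k = 0; k < values.length; k++) {
            userCount[users[k]]++;
            itemCount[items[k]]++;
        }

        for (int iteration = 0; iteration < numIterations; iteration++) {
            // item biases: regularized mean residual, given the current user biases
            double[] itemSum = new double[numItems];
            for (int k = 0; k < values.length; k++) {
                itemSum[items[k]] += values[k] - globalMean - userBias[users[k]];
            }
            for (int i = 0; i < numItems; i++) {
                double denom = lambda2 + itemCount[i];
                itemBias[i] = denom == 0 ? 0 : itemSum[i] / denom;
            }
            // user biases: regularized mean residual, given the new item biases
            double[] userSum = new double[numUsers];
            for (int k = 0; k < values.length; k++) {
                userSum[users[k]] += values[k] - globalMean - itemBias[items[k]];
            }
            for (int u = 0; u < numUsers; u++) {
                double denom = lambda3 + userCount[u];
                userBias[u] = denom == 0 ? 0 : userSum[u] / denom;
            }
        }
    }

    double estimate(int user, int item) {
        return globalMean + userBias[user] + itemBias[item];
    }

    public static void main(String[] args) {
        int[] users = {0, 0, 1, 1};
        int[] items = {0, 1, 0, 1};
        double[] values = {5, 3, 3, 1};
        UserItemBaselineSketch baseline =
            new UserItemBaselineSketch(2, 2, users, items, values, 2.0, 2.0, 1);
        System.out.println(baseline.estimate(0, 0));
    }
}
```

Larger lambda values shrink the biases of users and items with few ratings towards zero, so their estimates stay close to the global mean.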



Thank you.




                 Questions?




Sebastian Schelter
Database Systems and Information Management Group (DIMA)
Technische Universität Berlin
    13.09.2012                    DIMA – TU Berlin         13

Weitere ähnliche Inhalte

Was ist angesagt?

Machine Learning with Apache Mahout
Machine Learning with Apache MahoutMachine Learning with Apache Mahout
Machine Learning with Apache MahoutDaniel Glauser
 
Apache Mahout
Apache MahoutApache Mahout
Apache MahoutAjit Koti
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenderssscdotopen
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Cataldo Musto
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whyKorea Sdec
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutTed Dunning
 
Introduction to Apache Mahout
Introduction to Apache MahoutIntroduction to Apache Mahout
Introduction to Apache MahoutAman Adhikari
 
Movie recommendation system using Apache Mahout and Facebook APIs
Movie recommendation system using Apache Mahout and Facebook APIsMovie recommendation system using Apache Mahout and Facebook APIs
Movie recommendation system using Apache Mahout and Facebook APIsSmitha Mysore Lokesh
 
Apache Mahout Architecture Overview
Apache Mahout Architecture OverviewApache Mahout Architecture Overview
Apache Mahout Architecture OverviewStefano Dalla Palma
 
Intro to Mahout
Intro to MahoutIntro to Mahout
Intro to MahoutUri Lavi
 
Mahout classification presentation
Mahout classification presentationMahout classification presentation
Mahout classification presentationNaoki Nakatani
 
Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!OSCON Byrum
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDCDrew Farris
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Varad Meru
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahoutaneeshabakharia
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using MahoutIMC Institute
 

Was ist angesagt? (20)

Machine Learning with Apache Mahout
Machine Learning with Apache MahoutMachine Learning with Apache Mahout
Machine Learning with Apache Mahout
 
Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
 
Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenders
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the why
 
Mahout part2
Mahout part2Mahout part2
Mahout part2
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache Mahout
 
Mahout
MahoutMahout
Mahout
 
Introduction to Apache Mahout
Introduction to Apache MahoutIntroduction to Apache Mahout
Introduction to Apache Mahout
 
Movie recommendation system using Apache Mahout and Facebook APIs
Movie recommendation system using Apache Mahout and Facebook APIsMovie recommendation system using Apache Mahout and Facebook APIs
Movie recommendation system using Apache Mahout and Facebook APIs
 
mahout introduction
mahout  introductionmahout  introduction
mahout introduction
 
Apache Mahout Architecture Overview
Apache Mahout Architecture OverviewApache Mahout Architecture Overview
Apache Mahout Architecture Overview
 
Intro to Mahout
Intro to MahoutIntro to Mahout
Intro to Mahout
 
Mahout classification presentation
Mahout classification presentationMahout classification presentation
Mahout classification presentation
 
Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDC
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahout
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using Mahout
 

Ähnlich wie Introduction to Collaborative Filtering with Apache Mahout

Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data AnalyticsOsman Ali
 
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5Robert Grossman
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroDaniel Marcous
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with sparkModern Data Stack France
 
AzureML TechTalk
AzureML TechTalkAzureML TechTalk
AzureML TechTalkUdaya Kumar
 
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdfSlides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdfvitm11
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 
Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...RINUSATHYAN
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
Lecture2 big data life cycle
Lecture2 big data life cycleLecture2 big data life cycle
Lecture2 big data life cyclehktripathy
 
Knowledge discovery claudiad amato
Knowledge discovery claudiad amatoKnowledge discovery claudiad amato
Knowledge discovery claudiad amatoSSSW
 
A survey on Machine Learning In Production (July 2018)
A survey on Machine Learning In Production (July 2018)A survey on Machine Learning In Production (July 2018)
A survey on Machine Learning In Production (July 2018)Arnab Biswas
 
Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21Gülden Bilgütay
 
Hudup - A Framework of E-commercial Recommendation Algorithms
Hudup - A Framework of E-commercial Recommendation AlgorithmsHudup - A Framework of E-commercial Recommendation Algorithms
Hudup - A Framework of E-commercial Recommendation AlgorithmsLoc Nguyen
 
Solving the Issue of Mysterious Database Benchmarking Results
Solving the Issue of Mysterious Database Benchmarking ResultsSolving the Issue of Mysterious Database Benchmarking Results
Solving the Issue of Mysterious Database Benchmarking ResultsScyllaDB
 
( Big ) Data Management - Data Mining and Machine Learning - Global concepts ...
( Big ) Data Management - Data Mining and Machine Learning - Global concepts ...( Big ) Data Management - Data Mining and Machine Learning - Global concepts ...
( Big ) Data Management - Data Mining and Machine Learning - Global concepts ...Nicolas Sarramagna
 
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...Ali Alkan
 
Performance Evaluation of Open Source Data Mining Tools
Performance Evaluation of Open Source Data Mining ToolsPerformance Evaluation of Open Source Data Mining Tools
Performance Evaluation of Open Source Data Mining Toolsijsrd.com
 
Tutorial Knowledge Discovery
Tutorial Knowledge DiscoveryTutorial Knowledge Discovery
Tutorial Knowledge DiscoverySSSW
 

Ähnlich wie Introduction to Collaborative Filtering with Apache Mahout (20)

Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
 
MyMediaLite
MyMediaLiteMyMediaLite
MyMediaLite
 
AzureML TechTalk
AzureML TechTalkAzureML TechTalk
AzureML TechTalk
 
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdfSlides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Lecture2 big data life cycle
Lecture2 big data life cycleLecture2 big data life cycle
Lecture2 big data life cycle
 
Knowledge discovery claudiad amato
Knowledge discovery claudiad amatoKnowledge discovery claudiad amato
Knowledge discovery claudiad amato
 
A survey on Machine Learning In Production (July 2018)
A survey on Machine Learning In Production (July 2018)A survey on Machine Learning In Production (July 2018)
A survey on Machine Learning In Production (July 2018)
 
Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21
 
Hudup - A Framework of E-commercial Recommendation Algorithms
Hudup - A Framework of E-commercial Recommendation AlgorithmsHudup - A Framework of E-commercial Recommendation Algorithms
Hudup - A Framework of E-commercial Recommendation Algorithms
 
Solving the Issue of Mysterious Database Benchmarking Results
Solving the Issue of Mysterious Database Benchmarking ResultsSolving the Issue of Mysterious Database Benchmarking Results
Solving the Issue of Mysterious Database Benchmarking Results
 
( Big ) Data Management - Data Mining and Machine Learning - Global concepts ...
( Big ) Data Management - Data Mining and Machine Learning - Global concepts ...( Big ) Data Management - Data Mining and Machine Learning - Global concepts ...
( Big ) Data Management - Data Mining and Machine Learning - Global concepts ...
 
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
 
Performance Evaluation of Open Source Data Mining Tools
Performance Evaluation of Open Source Data Mining ToolsPerformance Evaluation of Open Source Data Mining Tools
Performance Evaluation of Open Source Data Mining Tools
 
Tutorial Knowledge Discovery
Tutorial Knowledge DiscoveryTutorial Knowledge Discovery
Tutorial Knowledge Discovery
 

Mehr von sscdotopen

Co-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and SparkCo-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and Sparksscdotopen
 
Bringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to MahoutBringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to Mahoutsscdotopen
 
New Directions in Mahout's Recommenders
New Directions in Mahout's RecommendersNew Directions in Mahout's Recommenders
New Directions in Mahout's Recommenderssscdotopen
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReducesscdotopen
 
Latent factor models for Collaborative Filtering
Latent factor models for Collaborative FilteringLatent factor models for Collaborative Filtering
Latent factor models for Collaborative Filteringsscdotopen
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraphsscdotopen
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processingsscdotopen
 

Mehr von sscdotopen (8)

Co-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and SparkCo-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and Spark
 
Bringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to MahoutBringing Algebraic Semantics to Mahout
Bringing Algebraic Semantics to Mahout
 
New Directions in Mahout's Recommenders
New Directions in Mahout's RecommendersNew Directions in Mahout's Recommenders
New Directions in Mahout's Recommenders
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduce
 
Latent factor models for Collaborative Filtering
Latent factor models for Collaborative FilteringLatent factor models for Collaborative Filtering
Latent factor models for Collaborative Filtering
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
 
mahout-cf
mahout-cfmahout-cf
mahout-cf
 

Kürzlich hochgeladen

ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxDIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxMichelleTuguinay1
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationdeepaannamalai16
 
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQuiz Club NITW
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWQuiz Club NITW
 
Man or Manufactured_ Redefining Humanity Through Biopunk Narratives.pptx
Man or Manufactured_ Redefining Humanity Through Biopunk Narratives.pptxMan or Manufactured_ Redefining Humanity Through Biopunk Narratives.pptx
Man or Manufactured_ Redefining Humanity Through Biopunk Narratives.pptxDhatriParmar
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptxDhatriParmar
 
How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17Celine George
 
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQuiz Club NITW
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfPrerana Jadhav
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptxmary850239
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Developmentchesterberbo7
 

Kürzlich hochgeladen (20)

ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxDIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentation
 
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITW
 
Man or Manufactured_ Redefining Humanity Through Biopunk Narratives.pptx
Man or Manufactured_ Redefining Humanity Through Biopunk Narratives.pptxMan or Manufactured_ Redefining Humanity Through Biopunk Narratives.pptx
Man or Manufactured_ Redefining Humanity Through Biopunk Narratives.pptx
 
Introduction to Collaborative Filtering with Apache Mahout

  • 1. An Introduction to Collaborative Filtering with Apache Mahout
      Sebastian Schelter
      Recommender Systems Challenge at ACM RecSys 2012
      Database Systems and Information Management Group (DIMA), Technische Universität Berlin
      13.09.2012
      http://www.dima.tu-berlin.de/
      DIMA – TU Berlin
  • 2. Overview
      ■ Apache Mahout: an Apache-licensed library that aims to provide highly scalable data mining and machine learning
      ■ its collaborative filtering module is based on Sean Owen's Taste framework
      ■ mostly aimed at production scenarios, with a focus on
          □ processing efficiency
          □ ease of integration with different datastores, web applications, and Amazon EC2
          □ scalability: recommendations, item similarities, and matrix decompositions can be computed via MapReduce on Apache Hadoop
      ■ not used much in recommender challenges
          □ not enough different algorithms implemented?
          □ not enough tooling for evaluation?
      → it's open source, so it's up to you to change that!
  • 3. Preference & DataModel
      ■ Preference encapsulates a user-item interaction as a (user, item, value) triple
          □ only numeric userIDs and itemIDs are allowed, for memory efficiency
          □ PreferenceArray encapsulates a set of preferences
      ■ DataModel encapsulates a dataset
          □ many convenient accessor methods such as getNumUsers(), getPreferencesForItem(itemID), ...
          □ allows adding temporal information to preferences
          □ many options for storing the data (in-memory, file, database, key-value store)
          □ drawback: for many use cases, all the data has to fit into memory to allow efficient recommendation

      DataModel dataModel = new FileDataModel(new File("movielens.csv"));
      PreferenceArray prefsOfUser1 = dataModel.getPreferencesFromUser(1);
  • 4. Recommender
      ■ Recommender is the basic interface for all of Mahout's recommenders
          □ recommend n items for a particular user
          □ estimate the preference of a user towards an item
      ■ a CandidateItemsStrategy fetches all items that might be recommended for a particular user
      ■ a Rescorer allows postprocessing recommendations

      List<RecommendedItem> topItems = recommender.recommend(1, 10);
      float preference = recommender.estimatePreference(1, 25);
  • 5. Item-Based Collaborative Filtering
      ■ ItemBasedRecommender
          □ can also compute item similarities
          □ can provide preferences for items as justification for recommendations
      ■ many similarity measures available (Pearson correlation, Jaccard coefficient, ...)
      ■ also allows usage of precomputed item similarities stored in a file (via FileItemSimilarity)

      ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, new PearsonCorrelationSimilarity(dataModel));
      List<RecommendedItem> similarItems = recommender.mostSimilarItems(5, 10);
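
To make the similarity step concrete, here is a minimal plain-Java sketch of the Pearson correlation between two items, computed over the users who rated both. It is an illustration only and not Mahout's PearsonCorrelationSimilarity; the class name PearsonItemSimilaritySketch and the map-based input format are assumptions made for this example.

```java
import java.util.Map;

/** Plain-Java sketch of Pearson correlation between two items (hypothetical helper, not Mahout code). */
final class PearsonItemSimilaritySketch {

    /**
     * Each map goes from userID to that user's rating of the item.
     * Only users present in both maps contribute.
     * Returns NaN if fewer than two users overlap or if either item has zero variance.
     */
    static double pearson(Map<Long, Double> ratingsA, Map<Long, Double> ratingsB) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> entry : ratingsA.entrySet()) {
            Double other = ratingsB.get(entry.getKey());
            if (other == null) {
                continue; // this user did not rate the second item
            }
            double a = entry.getValue();
            double b = other;
            n++;
            sumA += a;
            sumB += b;
            sumAA += a * a;
            sumBB += b * b;
            sumAB += a * b;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        if (varianceA <= 0 || varianceB <= 0) {
            return Double.NaN;
        }
        return covariance / Math.sqrt(varianceA * varianceB);
    }
}
```

Two items whose co-ratings rise and fall together get a value close to 1, while items with opposite rating patterns get a value close to -1.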
  • 6. Latent factor models
      ■ SVDRecommender
          □ uses a decomposition of the user-item interaction matrix to compute recommendations
      ■ uses a Factorizer to compute a Factorization from a DataModel; several different implementations are available
          □ Simon Funk's SGD
          □ Alternating Least Squares
          □ Weighted matrix factorization for implicit feedback data

      Factorizer factorizer = new ALSWRFactorizer(dataModel, numFeatures, lambda, numIterations);
      Recommender svdRecommender = new SVDRecommender(dataModel, factorizer);
      List<RecommendedItem> topItems = svdRecommender.recommend(1, 10);
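
The idea behind the SGD-based factorizers can be sketched in a few lines of plain Java: every user and item gets a small feature vector, and each observed rating nudges both vectors so that their dot product moves towards the rating. This is a simplified illustration in the spirit of Simon Funk's approach, not Mahout's Factorizer implementation; the class name SgdFactorizationSketch, the dense integer indices, and the parameter names learningRate and lambda are assumptions made for this example.

```java
import java.util.Random;

/** Plain-Java sketch of matrix factorization trained with SGD (hypothetical, not Mahout code). */
final class SgdFactorizationSketch {
    final double[][] userFactors;
    final double[][] itemFactors;

    SgdFactorizationSketch(int numUsers, int numItems, int numFeatures, long seed) {
        Random random = new Random(seed);
        userFactors = randomMatrix(numUsers, numFeatures, random);
        itemFactors = randomMatrix(numItems, numFeatures, random);
    }

    private static double[][] randomMatrix(int rows, int cols, Random random) {
        double[][] matrix = new double[rows][cols];
        for (double[] row : matrix) {
            for (int f = 0; f < cols; f++) {
                row[f] = 0.1 * random.nextDouble(); // small random start values
            }
        }
        return matrix;
    }

    /** Predicted rating: dot product of the user's and the item's feature vectors. */
    double predict(int user, int item) {
        double sum = 0;
        for (int f = 0; f < userFactors[user].length; f++) {
            sum += userFactors[user][f] * itemFactors[item][f];
        }
        return sum;
    }

    /** Runs numIterations passes over the ratings; lambda is the L2 regularization weight. */
    void train(int[] users, int[] items, double[] values,
               int numIterations, double learningRate, double lambda) {
        for (int iteration = 0; iteration < numIterations; iteration++) {
            for (int k = 0; k < values.length; k++) {
                int u = users[k];
                int i = items[k];
                double error = values[k] - predict(u, i);
                for (int f = 0; f < userFactors[u].length; f++) {
                    double pu = userFactors[u][f];
                    double qi = itemFactors[i][f];
                    userFactors[u][f] += learningRate * (error * qi - lambda * pu);
                    itemFactors[i][f] += learningRate * (error * pu - lambda * qi);
                }
            }
        }
    }

    /** Root mean squared error of the current model on the given ratings. */
    double rmse(int[] users, int[] items, double[] values) {
        double sumSquared = 0;
        for (int k = 0; k < values.length; k++) {
            double error = values[k] - predict(users[k], items[k]);
            sumSquared += error * error;
        }
        return Math.sqrt(sumSquared / values.length);
    }
}
```

Calling train(...) on a handful of ratings and comparing rmse(...) before and after gives a quick sanity check that the factors are actually being fitted.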
  • 7. Evaluating recommenders
      ■ RecommenderEvaluator, RecommenderIRStatsEvaluator
          □ measure the prediction quality of a recommender by using a random split of the dataset
          □ support for MAE, RMSE, Precision, Recall, ...
          □ need a DataModel, a RecommenderBuilder, and a DataModelBuilder for the training data

      RecommenderEvaluator maeEvaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
      maeEvaluator.evaluate(
          new BiasedRecommenderBuilder(lambda2, lambda3, numIterations),
          new InteractionCutDataModelBuilder(maxPrefsPerUser),
          dataModel, trainingPercentage, 1 - trainingPercentage);
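
For reference, the two rating-prediction metrics named above reduce to a few lines each. The sketch below computes MAE and RMSE from parallel arrays of predicted and actual ratings; the class name ErrorMetricsSketch is an assumption made for this example and the code is independent of Mahout's evaluator classes.

```java
/** Plain-Java sketch of MAE and RMSE over parallel arrays (hypothetical helper, not Mahout code). */
final class ErrorMetricsSketch {

    /** Mean absolute error: average of |predicted - actual|. */
    static double mae(double[] predicted, double[] actual) {
        checkLengths(predicted, actual);
        double sum = 0;
        for (int k = 0; k < predicted.length; k++) {
            sum += Math.abs(predicted[k] - actual[k]);
        }
        return sum / predicted.length;
    }

    /** Root mean squared error: square root of the average squared difference. */
    static double rmse(double[] predicted, double[] actual) {
        checkLengths(predicted, actual);
        double sumSquared = 0;
        for (int k = 0; k < predicted.length; k++) {
            double diff = predicted[k] - actual[k];
            sumSquared += diff * diff;
        }
        return Math.sqrt(sumSquared / predicted.length);
    }

    private static void checkLengths(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
    }
}
```

RMSE penalizes large individual errors more strongly than MAE, which is why the two metrics can rank recommenders differently.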
  • 9. Starting to work on Mahout
      ■ Prerequisites
          □ Java 6
          □ Maven
          □ svn client
      ■ check out the source code from http://svn.apache.org/repos/asf/mahout/trunk
      ■ import it as a Maven project into your favorite IDE
  • 10. Project: novel item similarity measure
      ■ the winning solution of the Million Song Dataset Challenge used a novel item similarity measure
      ■ it would be great to see this measure featured in Mahout as well
      ■ Task
          □ implement the novel item similarity measure as a subclass of Mahout's ItemSimilarity
      ■ Future Work
          □ this novel similarity measure is asymmetric; ensure that it is correctly applied in all scenarios
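
The slide does not spell out the measure itself. As a hedged illustration of what an asymmetric item similarity looks like, the sketch below uses an asymmetric cosine over the sets of users who interacted with each item: |U(i) ∩ U(j)| / (|U(i)|^alpha · |U(j)|^(1 - alpha)). The formula choice, the class name AsymmetricCosineSketch, and the parameter alpha are assumptions made for this example; they are not taken from the slides and are not part of Mahout's API.

```java
import java.util.HashSet;
import java.util.Set;

/** Plain-Java sketch of an asymmetric cosine item similarity (assumed formula, not Mahout code). */
final class AsymmetricCosineSketch {

    /**
     * usersOfI and usersOfJ hold the IDs of users who interacted with item i and item j.
     * alpha = 0.5 gives the ordinary (symmetric) cosine on binary data;
     * any other alpha makes similarity(i, j) differ from similarity(j, i)
     * whenever the two sets have different sizes.
     */
    static double similarity(Set<Long> usersOfI, Set<Long> usersOfJ, double alpha) {
        if (usersOfI.isEmpty() || usersOfJ.isEmpty()) {
            return 0.0;
        }
        Set<Long> common = new HashSet<>(usersOfI);
        common.retainAll(usersOfJ); // users who interacted with both items
        double denominator =
            Math.pow(usersOfI.size(), alpha) * Math.pow(usersOfJ.size(), 1.0 - alpha);
        return common.size() / denominator;
    }
}
```

Because the result depends on argument order, any code that caches or looks up similarities by an unordered item pair would silently return the wrong direction, which is the kind of scenario the Future Work item asks to check.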
  • 11. Project: temporal split evaluator
      ■ currently, Mahout's standard RecommenderEvaluator randomly splits the data into training and test sets
      ■ for datasets with timestamps, it would be much more interesting to use this temporal information to split the data into training and test sets
      ■ Task
          □ create a TemporalSplitRecommenderEvaluator similar to the existing AbstractDifferenceRecommenderEvaluator
      ■ Future Work
          □ factor out the logic for splitting datasets into training and test sets
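
The core of a temporal split is small: order the preferences by timestamp, train on the older part, and test on the newer part. The following plain-Java sketch shows that step in isolation under the assumption of one global cut point. The class TemporalSplitSketch and its nested TimedPreference type are hypothetical names for this example; a real TemporalSplitRecommenderEvaluator would have to work against Mahout's DataModel instead.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Plain-Java sketch of a timestamp-based train/test split (hypothetical, not Mahout code). */
final class TemporalSplitSketch {

    /** A (user, item, value) triple plus the time at which it was recorded. */
    static final class TimedPreference {
        final long userID;
        final long itemID;
        final float value;
        final long timestamp;

        TimedPreference(long userID, long itemID, float value, long timestamp) {
            this.userID = userID;
            this.itemID = itemID;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /**
     * Returns two lists: index 0 holds the oldest trainingPercentage share of the
     * preferences (training set), index 1 holds the remaining newer ones (test set).
     * The input list is left unmodified.
     */
    static List<List<TimedPreference>> split(List<TimedPreference> preferences,
                                             double trainingPercentage) {
        List<TimedPreference> sorted = new ArrayList<>(preferences);
        sorted.sort(Comparator.comparingLong((TimedPreference p) -> p.timestamp));
        int cut = (int) Math.round(sorted.size() * trainingPercentage);
        List<List<TimedPreference>> result = new ArrayList<>();
        result.add(new ArrayList<>(sorted.subList(0, cut)));
        result.add(new ArrayList<>(sorted.subList(cut, sorted.size())));
        return result;
    }
}
```

A global cut keeps the evaluation free of information from the future; a per-user cut (the newest preferences of each user go to the test set) is an alternative design that guarantees every user appears in training.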
  • 12. Project: baseline method for rating prediction
      ■ port MyMediaLite's UserItemBaseline to Mahout (preliminary port already available)
      ■ user-item-baseline estimation is a simple approach that estimates the global tendency of a user or an item to deviate from the average rating (described in Y. Koren: Factor in the Neighbors: Scalable and Accurate Collaborative Filtering, TKDD 2009)
      ■ Task
          □ polish the code
          □ make it work with Mahout's DataModel
      ■ Future Work
          □ create an ItemBasedRecommender that makes use of the estimated biases
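
A baseline predictor of this kind estimates a rating as the global mean plus a user bias plus an item bias. The sketch below estimates the biases with regularized averages, alternating between items and users for a fixed number of iterations. It is a simplified plain-Java illustration, not the MyMediaLite or Mahout code; the class name UserItemBaselineSketch, the dense integer indices, and the use of lambda2 for items and lambda3 for users are assumptions made for this example.

```java
/** Plain-Java sketch of user-item baseline estimation (hypothetical, not MyMediaLite or Mahout code). */
final class UserItemBaselineSketch {
    final double globalMean;
    final double[] userBias;
    final double[] itemBias;

    /**
     * users[k], items[k], ratings[k] describe the k-th rating.
     * lambda2 regularizes the item biases, lambda3 the user biases.
     */
    UserItemBaselineSketch(int[] users, int[] items, double[] ratings,
                           int numUsers, int numItems,
                           double lambda2, double lambda3, int numIterations) {
        double sum = 0;
        for (double rating : ratings) {
            sum += rating;
        }
        globalMean = sum / ratings.length;
        userBias = new double[numUsers];
        itemBias = new double[numItems];

        for (int iteration = 0; iteration < numIterations; iteration++) {
            // item biases: regularized average of what mean and user bias leave unexplained
            double[] itemSum = new double[numItems];
            int[] itemCount = new int[numItems];
            for (int k = 0; k < ratings.length; k++) {
                itemSum[items[k]] += ratings[k] - globalMean - userBias[users[k]];
                itemCount[items[k]]++;
            }
            for (int i = 0; i < numItems; i++) {
                double denominator = lambda2 + itemCount[i];
                itemBias[i] = denominator > 0 ? itemSum[i] / denominator : 0;
            }
            // user biases: regularized average of what mean and item bias leave unexplained
            double[] userSum = new double[numUsers];
            int[] userCount = new int[numUsers];
            for (int k = 0; k < ratings.length; k++) {
                userSum[users[k]] += ratings[k] - globalMean - itemBias[items[k]];
                userCount[users[k]]++;
            }
            for (int u = 0; u < numUsers; u++) {
                double denominator = lambda3 + userCount[u];
                userBias[u] = denominator > 0 ? userSum[u] / denominator : 0;
            }
        }
    }

    /** Baseline estimate: global mean plus user bias plus item bias. */
    double predict(int user, int item) {
        return globalMean + userBias[user] + itemBias[item];
    }
}
```

Larger regularization values shrink the biases of users and items with few ratings towards zero, so their predictions stay close to the global mean.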
  • 13. Thank you. Questions?
      Sebastian Schelter
      Database Systems and Information Management Group (DIMA), Technische Universität Berlin