SlideShare ist ein Scribd-Unternehmen logo
1 von 19
Downloaden Sie, um offline zu lesen
10 R Packages to Win
Kaggle Competitions
Xavier Conort
Data Scientist
Previously... … now!
Competitions that boosted my R learning curve
The Machine seems much smarter than I am at capturing complexity in
the data even for simple datasets!
Humans can help the Machine too! But don’t oversimplify and discard
any data.
Don’t be impatient. My best GBM had 24,500 trees with learning rate =
0.01!
SVM and feature selection matter too!
Word n-grams and character n-grams can make a big difference
Parallel processing and big servers can help with complex feature
engineering!
Still many awesome tools in R that I don’t know!
Glmnet can do a great job!
Competitions that boosted my R learning curve
10 R Packages:
Allow the Machine to Capture Complexity
1. gbm
2. randomForest
3. e1071
Take Advantage of High-Cardinality Categorical or Text Data
4. glmnet
5. tau
Make Your Code More Efficient
6. Matrix
7. SOAR
8. forEach
9. doMC
10. data.table
Capture Complexity Automatically
1. gbm
Gradient Boosting Machine (Freud & Schapiro)
Greg Ridgeway / Harry Southworth
Key Trick:
Use gbm.more to write your own early-stopping procedure
2. randomForest
Random Forests (Breiman & Cutler)
Authors: Breiman and Cutler
Maintainer: Andy Liaw
Key Trick:
Importance=True for permutation importance
Tune the sampsize parameter for faster computation and
handling unbalanced classes
3. e1071
3. e1071:Support Vector Machines
Maintainer: David Meyer
Key Tricks:
Use kernlab (Karatzoglou, Smola and Hornik) to get
heuristic
Write own pattern search
Take Advantage of High-Cardinality
Categorical or Text Features
4. glmnet
Authors: Friedman, Hastie, Simon, Tibshirani
L1 / Elasticnet / L2
Key Tricks:
- Try interactions of 2 or more categorical variables
- Test your code on the Kaggle: “Amazon Employ Access
Challenge”
5. tau
Maintainer: Kurt Hornik
Used for automating text-mining
Key Trick:
Try character n-grams. They work surprisingly well!
Make Your Code More Efficient
6. Matrix
Authors / Maintainers: Douglas Bates and Martin Maechler
Key Trick:
Use sparse.model.matrix for one-hot encoding
7. SOAR
Author / Maintainer: Bill Venables
Used to store large R objects in the cache and release
memory
Key Trick:
Once I found out about it, it made my R Experience great!
(Just remember to empty your cache … )
8. forEach and 9. doMC
Authors: Revolution Analytics
Key Trick:
Use for parallel-processing to speed up computation
10. data.table
Authors: M Dowle, T Short and others
Maintainer: Matt Dowle
Key Trick:
Essential for doing fast data aggregation operations at
scale
Don’t Forget ..
Use your intuition to help the machine!
● Always compute differences / ratios of features
o This can help the Machine a lot!
● Always consider discarding features that are “too good”
o They can make the Machine lazy!
o An example: GE Flight Quest
Thank you!

Weitere ähnliche Inhalte

Was ist angesagt?

General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsMark Peng
 
Brief introduction to Machine Learning
Brief introduction to Machine LearningBrief introduction to Machine Learning
Brief introduction to Machine LearningCodeForFrankfurt
 
Europython - Machine Learning for dummies with Python
Europython - Machine Learning for dummies with PythonEuropython - Machine Learning for dummies with Python
Europython - Machine Learning for dummies with PythonJavier Arias Losada
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...Sebastian Raschka
 
Pybcn machine learning for dummies with python
Pybcn machine learning for dummies with pythonPybcn machine learning for dummies with python
Pybcn machine learning for dummies with pythonJavier Arias Losada
 
Deep learning at nmc devin jones
Deep learning at nmc devin jones Deep learning at nmc devin jones
Deep learning at nmc devin jones Ido Shilon
 
Artificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep LearningArtificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep LearningSujit Pal
 
Understanding Basics of Machine Learning
Understanding Basics of Machine LearningUnderstanding Basics of Machine Learning
Understanding Basics of Machine LearningPranav Ainavolu
 
An introduction to Machine Learning (and a little bit of Deep Learning)
An introduction to Machine Learning (and a little bit of Deep Learning)An introduction to Machine Learning (and a little bit of Deep Learning)
An introduction to Machine Learning (and a little bit of Deep Learning)Thomas da Silva Paula
 
A Folksonomy of styles, aka: other stylists also said and Subjective Influenc...
A Folksonomy of styles, aka: other stylists also said and Subjective Influenc...A Folksonomy of styles, aka: other stylists also said and Subjective Influenc...
A Folksonomy of styles, aka: other stylists also said and Subjective Influenc...Natalia Díaz Rodríguez
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningNandita Naik
 
A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013Philip Zheng
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018HJ van Veen
 
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksBICA Labs
 
Introduction to-machine-learning
Introduction to-machine-learningIntroduction to-machine-learning
Introduction to-machine-learningBabu Priyavrat
 
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...Aseda Owusua Addai-Deseh
 
Semi-Supervised Learning
Semi-Supervised LearningSemi-Supervised Learning
Semi-Supervised LearningLukas Tencer
 
PyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine LearningPyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine LearningRebecca Bilbro
 
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learningbutest
 

Was ist angesagt? (20)

General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle Competitions
 
Brief introduction to Machine Learning
Brief introduction to Machine LearningBrief introduction to Machine Learning
Brief introduction to Machine Learning
 
Europython - Machine Learning for dummies with Python
Europython - Machine Learning for dummies with PythonEuropython - Machine Learning for dummies with Python
Europython - Machine Learning for dummies with Python
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
 
Pybcn machine learning for dummies with python
Pybcn machine learning for dummies with pythonPybcn machine learning for dummies with python
Pybcn machine learning for dummies with python
 
kaggle_meet_up
kaggle_meet_upkaggle_meet_up
kaggle_meet_up
 
Deep learning at nmc devin jones
Deep learning at nmc devin jones Deep learning at nmc devin jones
Deep learning at nmc devin jones
 
Artificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep LearningArtificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep Learning
 
Understanding Basics of Machine Learning
Understanding Basics of Machine LearningUnderstanding Basics of Machine Learning
Understanding Basics of Machine Learning
 
An introduction to Machine Learning (and a little bit of Deep Learning)
An introduction to Machine Learning (and a little bit of Deep Learning)An introduction to Machine Learning (and a little bit of Deep Learning)
An introduction to Machine Learning (and a little bit of Deep Learning)
 
A Folksonomy of styles, aka: other stylists also said and Subjective Influenc...
A Folksonomy of styles, aka: other stylists also said and Subjective Influenc...A Folksonomy of styles, aka: other stylists also said and Subjective Influenc...
A Folksonomy of styles, aka: other stylists also said and Subjective Influenc...
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
 
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural Networks
 
Introduction to-machine-learning
Introduction to-machine-learningIntroduction to-machine-learning
Introduction to-machine-learning
 
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
 
Semi-Supervised Learning
Semi-Supervised LearningSemi-Supervised Learning
Semi-Supervised Learning
 
PyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine LearningPyData Global: Thrifty Machine Learning
PyData Global: Thrifty Machine Learning
 
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learning
 

Andere mochten auch

Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesPier Luca Lanzi
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningVarad Meru
 
Hadoop and Machine Learning
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learningjoshwills
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)Prof. Dr. Diego Kuonen
 
Data By The People, For The People
Data By The People, For The PeopleData By The People, For The People
Data By The People, For The PeopleDaniel Tunkelang
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data ScientistDaniel Tunkelang
 
Hands-on Deep Learning in Python
Hands-on Deep Learning in PythonHands-on Deep Learning in Python
Hands-on Deep Learning in PythonImry Kissos
 
10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning SystemsXavier Amatriain
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientistryanorban
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDevashish Shanker
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsDavid Pittman
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networksSi Haem
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningLars Marius Garshol
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural networkDEEPASHRI HK
 
Artificial Intelligence Presentation
Artificial Intelligence PresentationArtificial Intelligence Presentation
Artificial Intelligence Presentationlpaviglianiti
 
Rstudio上でのパッケージインストールを便利にするaddin4githubinstall
Rstudio上でのパッケージインストールを便利にするaddin4githubinstallRstudio上でのパッケージインストールを便利にするaddin4githubinstall
Rstudio上でのパッケージインストールを便利にするaddin4githubinstallAtsushi Hayakawa
 
~knitr+pandocではじめる~『R MarkdownでReproducible Research』
~knitr+pandocではじめる~『R MarkdownでReproducible Research』~knitr+pandocではじめる~『R MarkdownでReproducible Research』
~knitr+pandocではじめる~『R MarkdownでReproducible Research』Nagi Teramo
 
How i became a data scientist
How i became a data scientistHow i became a data scientist
How i became a data scientistOwen Zhang
 

Andere mochten auch (20)

Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification Rules
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine Learning
 
Hadoop and Machine Learning
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learning
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)
 
Data By The People, For The People
Data By The People, For The PeopleData By The People, For The People
Data By The People, For The People
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data Scientist
 
Hands-on Deep Learning in Python
Hands-on Deep Learning in PythonHands-on Deep Learning in Python
Hands-on Deep Learning in Python
 
10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientist
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data Scientists
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
Artificial Intelligence Presentation
Artificial Intelligence PresentationArtificial Intelligence Presentation
Artificial Intelligence Presentation
 
Id3,c4.5 algorithim
Id3,c4.5 algorithimId3,c4.5 algorithim
Id3,c4.5 algorithim
 
Rstudio上でのパッケージインストールを便利にするaddin4githubinstall
Rstudio上でのパッケージインストールを便利にするaddin4githubinstallRstudio上でのパッケージインストールを便利にするaddin4githubinstall
Rstudio上でのパッケージインストールを便利にするaddin4githubinstall
 
~knitr+pandocではじめる~『R MarkdownでReproducible Research』
~knitr+pandocではじめる~『R MarkdownでReproducible Research』~knitr+pandocではじめる~『R MarkdownでReproducible Research』
~knitr+pandocではじめる~『R MarkdownでReproducible Research』
 
How i became a data scientist
How i became a data scientistHow i became a data scientist
How i became a data scientist
 

Ähnlich wie 10 R Packages to Win Kaggle Competitions

Data Science Accelerator Program
Data Science Accelerator ProgramData Science Accelerator Program
Data Science Accelerator ProgramGoDataDriven
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdDatabricks
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...Andy Petrella
 
The Art of Evolutionary Algorithms Programming
The Art of Evolutionary Algorithms ProgrammingThe Art of Evolutionary Algorithms Programming
The Art of Evolutionary Algorithms ProgrammingJuan J. Merelo
 
BSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 SessionsBSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 SessionsBigML, Inc
 
VSSML18. Deepnets and Time Series
VSSML18. Deepnets and Time SeriesVSSML18. Deepnets and Time Series
VSSML18. Deepnets and Time SeriesBigML, Inc
 
2014 khmer protocols
2014 khmer protocols2014 khmer protocols
2014 khmer protocolsc.titus.brown
 
DN 2017 | Connecting the Enterprise with Machine Learning and Neo4j | Tim War...
DN 2017 | Connecting the Enterprise with Machine Learning and Neo4j | Tim War...DN 2017 | Connecting the Enterprise with Machine Learning and Neo4j | Tim War...
DN 2017 | Connecting the Enterprise with Machine Learning and Neo4j | Tim War...Dataconomy Media
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniquesPoonam Kshirsagar
 
Multi Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth LoganMulti Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth LoganSpark Summit
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupDoug Needham
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science ChallengeMark Nichols, P.E.
 
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify Dataconomy Media
 
The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18DataconomyGmbH
 
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...Mihai Criveti
 
Version Control in Machine Learning + AI (Stanford)
Version Control in Machine Learning + AI (Stanford)Version Control in Machine Learning + AI (Stanford)
Version Control in Machine Learning + AI (Stanford)Anand Sampat
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...inside-BigData.com
 
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]Pramati Technologies
 

Ähnlich wie 10 R Packages to Win Kaggle Competitions (20)

Data Science Accelerator Program
Data Science Accelerator ProgramData Science Accelerator Program
Data Science Accelerator Program
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with Hummingbird
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...
 
The Art of Evolutionary Algorithms Programming
The Art of Evolutionary Algorithms ProgrammingThe Art of Evolutionary Algorithms Programming
The Art of Evolutionary Algorithms Programming
 
2014 pycon-talk
2014 pycon-talk2014 pycon-talk
2014 pycon-talk
 
BSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 SessionsBSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 Sessions
 
VSSML18. Deepnets and Time Series
VSSML18. Deepnets and Time SeriesVSSML18. Deepnets and Time Series
VSSML18. Deepnets and Time Series
 
2014 khmer protocols
2014 khmer protocols2014 khmer protocols
2014 khmer protocols
 
DN 2017 | Connecting the Enterprise with Machine Learning and Neo4j | Tim War...
DN 2017 | Connecting the Enterprise with Machine Learning and Neo4j | Tim War...DN 2017 | Connecting the Enterprise with Machine Learning and Neo4j | Tim War...
DN 2017 | Connecting the Enterprise with Machine Learning and Neo4j | Tim War...
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniques
 
Multi Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth LoganMulti Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth Logan
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup Group
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science Challenge
 
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
DN18 | The Data Janitor Returns | Daniel Molnar | Oberlo/Shopify
 
The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18The Data Janitor Returns | Daniel Molnar | DN18
The Data Janitor Returns | Daniel Molnar | DN18
 
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
 
Version Control in Machine Learning + AI (Stanford)
Version Control in Machine Learning + AI (Stanford)Version Control in Machine Learning + AI (Stanford)
Version Control in Machine Learning + AI (Stanford)
 
Matlab programming
Matlab programmingMatlab programming
Matlab programming
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
 
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
 

Kürzlich hochgeladen

Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...ThinkInnovation
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...ThinkInnovation
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are successPratikSingh115843
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 

Kürzlich hochgeladen (16)

Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are success
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 

10 R Packages to Win Kaggle Competitions

  • 1. 10 R Packages to Win Kaggle Competitions Xavier Conort Data Scientist
  • 3. Competitions that boosted my R learning curve The Machine seems much smarter than I am at capturing complexity in the data even for simple datasets! Humans can help the Machine too! But don’t oversimplify and discard any data. Don’t be impatient. My best GBM had 24,500 trees with learning rate = 0.01! SVM and feature selection matter too!
  • 4. Word n-grams and character n-grams can make a big difference Parallel processing and big servers can help with complex feature engineering! Still many awesome tools in R that I don’t know! Glmnet can do a great job! Competitions that boosted my R learning curve
  • 5. 10 R Packages: Allow the Machine to Capture Complexity 1. gbm 2. randomForest 3. e1071 Take Advantage of High-Cardinality Categorical or Text Data 4. glmnet 5. tau Make Your Code More Efficient 6. Matrix 7. SOAR 8. forEach 9. doMC 10. data.table
  • 7. 1. gbm Gradient Boosting Machine (Freud & Schapiro) Greg Ridgeway / Harry Southworth Key Trick: Use gbm.more to write your own early-stopping procedure
  • 8. 2. randomForest Random Forests (Breiman & Cutler) Authors: Breiman and Cutler Maintainer: Andy Liaw Key Trick: Importance=True for permutation importance Tune the sampsize parameter for faster computation and handling unbalanced classes
  • 9. 3. e1071 3. e1071:Support Vector Machines Maintainer: David Meyer Key Tricks: Use kernlab (Karatzoglou, Smola and Hornik) to get heuristic Write own pattern search
  • 10. Take Advantage of High-Cardinality Categorical or Text Features
  • 11. 4. glmnet Authors: Friedman, Hastie, Simon, Tibshirani L1 / Elasticnet / L2 Key Tricks: - Try interactions of 2 or more categorical variables - Test your code on the Kaggle: “Amazon Employ Access Challenge”
  • 12. 5. tau Maintainer: Kurt Hornik Used for automating text-mining Key Trick: Try character n-grams. They work surprisingly well!
  • 13. Make Your Code More Efficient
  • 14. 6. Matrix Authors / Maintainers: Douglas Bates and Martin Maechler Key Trick: Use sparse.model.matrix for one-hot encoding
  • 15. 7. SOAR Author / Maintainer: Bill Venables Used to store large R objects in the cache and release memory Key Trick: Once I found out about it, it made my R Experience great! (Just remember to empty your cache … )
  • 16. 8. forEach and 9. doMC Authors: Revolution Analytics Key Trick: Use for parallel-processing to speed up computation
  • 17. 10. data.table Authors: M Dowle, T Short and others Maintainer: Matt Dowle Key Trick: Essential for doing fast data aggregation operations at scale
  • 18. Don’t Forget .. Use your intuition to help the machine! ● Always compute differences / ratios of features o This can help the Machine a lot! ● Always consider discarding features that are “too good” o They can make the Machine lazy! o An example: GE Flight Quest