SlideShare ist ein Scribd-Unternehmen logo
1 von 50
Downloaden Sie, um offline zu lesen
Putting the Magic in Data Science 
11/04/2014 
Sean J. Taylor 
QCon SF
https://research.facebook.com/datascience 
Core Data Science 
@ Facebook
http://en.wikipedia.org/wiki/Pasteur's_quadrant 
“The mission of CDS is to provide research and innovation that 
fundamentally increase the magnitude of Facebook’s success.”
It’s not a trick, it’s an illusion.
Any sufficiently advanced 
technology is indistinguishable 
from magic. 
— Arthur C. Clarke
1. create technology: 
people who are not experts can 
use it easily with little difficulty 
and trust the output 
! 
2. make it “sufficiently advanced”
The Data Science Venn Diagram 
Drew Conway
Basic 
Research 
Applied 
Research 
Working 
Prototype 
Quality 
Code 
Tool or 
Service 
Maybe someday, someone can use this. 
I might be able to use this. 
I can use this (sometimes). 
Software engineers can use this. 
People can use this.
People can use it → People want to use it 
Data Science Impact = 
Value * (Num People) * (Frequency of Use) 
Very difficult to demand that people use new tech — must 
make a compelling value proposition for people and 
educate them.
What can data do? 
Data can’t do anything. 
People do things with data. 
(usually they make decisions)
The Last Mile Problem 
! 
It works for you. Can you get people to use it? 
Without considering this last step, all subsequent steps are 
useless.
Existence of 
Data 
Creation of 
Technologies 
Counterfactuals and Causal Inference 
Morgan and Winship 
Decision 
Existence of Quality 
Capabilities
Magical + Effective Data Science Tools 
• Planout: language for expressing / deploying experimental 
designs 
• Deltoid: analyzing the results of experiments 
• ClustR: generic document clustering 
• Prophet: completely automatic forecasting procedure 
• Crystal Ball: large scale, interpretable regression models 
• Hive / Presto / Scuba: SQL engines for different problems
Outline 
! 
1. Sources of Magic 
2. Solving the Last Mile
Tricks: 
Sources of 
Data Science 
Magic
http://premise.com
http://premise.com
Of Data 
Trick 1: invest in data collection 
! 
Novel sources of data are magic.
Making your own quality data is 
better than being a data alchemist. 
>
Let’s say you have a billion users… 
and you want to listen to them all
Trick 2: Dimensionality Reduction 
Increasingly individual observations can be very high 
dimensional: text documents, images, audio. 
! 
! 
! 
! 
Clustering and classification techniques can find/extract a 
smaller dimensional representation that retains meaning.
Deep Learning is just (very) 
fancy dimensionality reduction
Problem: 
Estimate the probability 
of rare events or events 
pertaining to new 
objects. 
E.g. click, like, comment, 
share
Trick 3: Be a (Practical) Bayesian 
! 
• If you have rare or new things you’d like to learn about, it’s 
often hard to say much. 
• But it’s sometimes easy to think of cases which are similar 
to the one you are trying to predict. 
• James-Stein estimators demonstrate that weighted 
averages including related observations will help improve 
predictions.
0 
14 14 
billion 
Philadelphia Facebook Eagles Revenue 
Wins
Trick 4: Bootstrap all the statistics 
! 
• The bootstrap allows you to get a sampling distribution 
over almost any statistic you can compute from your data. 
• Embarrassingly parallelizable / computable online. 
confidence intervals
Bootstrapping in Practice 
R1 
All Your 
Data 
R2 
… 
R500 
Generate random 
sub-samples 
s1 
s2 
… 
s500 
Compute statistics 
or estimate model 
parameters 
} 
7.5 
5.0 
2.5 
0.0 
-2 -1 0 1 2 
Statistic 
Count 
Get a distribution 
over statistic of interest 
(usually the prediction) 
- take mean 
- CIs == 95% quantiles 
- SEs == standard deviation
Grab bag of tricks 
• Everything is linear if you use enough features. 
• Matrix factorizations: NMF, SVD. 
• Probabilistic data structures: LSH, min-hash. 
• Exploit distributed, online algorithms as much as possible. 
• “A little bit of ridge never hurts.” — Trevor Hastie 
• Label propagation: use data about network neighbors. 
• Data reduction: create bins & analyze weighted bin stats.
Last Mile of 
Data Science 
Magic
Principle 1: Reliability 
! 
“60% of the time, it works every time”
Test-driven data science 
Learn how to build reliable data science systems from 
software engineers. 
1. Write test fixtures with simulated or case-study data 
sets. 
2. Write automated tests that check that your system 
works on fixtures, and add new ones when it doesn’t. 
3. (Bonus) Test input data to ensure it meets all 
assumptions.
Principle 2: Latency + Interactivity 
! 
“how many hypotheses per second 
are you testing/generating?”
Answer more questions 
People have good intuitions and tend to search effectively 
given understandable tools. 
First order effect of speed: more answers per second. 
Second order effect of speed: more questions asked. 
Deltoid: effortless experimentation 
Scuba: in-memory, distributed, sampled database. 
Presto: aggressive caching, distributed SQL query engine
Principle 3: Simplicity + Modularity
Choose one thing to do very well 
! 
• It makes it easier to optimize your technology. 
! 
• It makes it easier for people to understand what it does. 
! 
• It makes people more likely to build around it.
Principle 4: Unexpectedness
Show people the most interesting things
Tricks Explained 
• Planout: simplicity + modularity 
• Deltoid: effortless experiment analysis + bootstrap 
• ClustR: dimensionality reduction + interactivity 
• Prophet: everything’s linear + basis expansion + new data 
• Crystal Ball: everything’s linear + regularization + speed 
• Hive / Presto / Scuba: reliability/latency tradeoffs
1. Learn as many tricks as you can 
2. Combine them in novel ways 
3. Consider the last mile
sjt@fb.com 
http://seanjtaylor.com 
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceNiko Vuokko
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceLivePerson
 
Life of a data scientist (pub)
Life of a data scientist (pub)Life of a data scientist (pub)
Life of a data scientist (pub)Buhwan Jeong
 
How to become a Data Scientist?
How to become a Data Scientist? How to become a Data Scientist?
How to become a Data Scientist? HackerEarth
 
Intro to Machine Learning
Intro to Machine LearningIntro to Machine Learning
Intro to Machine LearningCorey Chivers
 
Claudia Gold: Learning Data Science Online
Claudia Gold: Learning Data Science OnlineClaudia Gold: Learning Data Science Online
Claudia Gold: Learning Data Science Onlinesfdatascience
 
Data Science: A Mindset for Productivity
Data Science: A Mindset for ProductivityData Science: A Mindset for Productivity
Data Science: A Mindset for ProductivityDaniel Tunkelang
 
Clare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science OnlineClare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science Onlinesfdatascience
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsDomino Data Lab
 
Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)Domino Data Lab
 
How to become a data scientist
How to become a data scientistHow to become a data scientist
How to become a data scientistDeZyre
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with AnacondaTravis Oliphant
 
How to Identify, Train or Become a Data Scientist
How to Identify, Train or Become a Data ScientistHow to Identify, Train or Become a Data Scientist
How to Identify, Train or Become a Data ScientistInside Analysis
 
Applied Machine Learning for the IoT - Data Science Pop-up Seattle
Applied Machine Learning for the IoT - Data Science Pop-up SeattleApplied Machine Learning for the IoT - Data Science Pop-up Seattle
Applied Machine Learning for the IoT - Data Science Pop-up SeattleDomino Data Lab
 
CRISP-DM - Agile Approach To Data Mining Projects
CRISP-DM - Agile Approach To Data Mining ProjectsCRISP-DM - Agile Approach To Data Mining Projects
CRISP-DM - Agile Approach To Data Mining ProjectsMichał Łopuszyński
 
My Three Ex’s: A Data Science Approach for Applied Machine Learning
My Three Ex’s: A Data Science Approach for Applied Machine LearningMy Three Ex’s: A Data Science Approach for Applied Machine Learning
My Three Ex’s: A Data Science Approach for Applied Machine LearningDaniel Tunkelang
 
Data Science presentation for elementary school students
Data Science presentation for elementary school studentsData Science presentation for elementary school students
Data Science presentation for elementary school studentsMelanie Manning, CFA
 
Information Visualization for Knowledge Discovery: An Introduction
Information Visualization for Knowledge Discovery: An IntroductionInformation Visualization for Knowledge Discovery: An Introduction
Information Visualization for Knowledge Discovery: An IntroductionKrist Wongsuphasawat
 
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat...
Three Laws of Trusted Data Sharing:(Building a Better Business Case for Dat...Three Laws of Trusted Data Sharing:(Building a Better Business Case for Dat...
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat...CS, NcState
 

Was ist angesagt? (20)

Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Life of a data scientist (pub)
Life of a data scientist (pub)Life of a data scientist (pub)
Life of a data scientist (pub)
 
How to become a Data Scientist?
How to become a Data Scientist? How to become a Data Scientist?
How to become a Data Scientist?
 
Intro to Machine Learning
Intro to Machine LearningIntro to Machine Learning
Intro to Machine Learning
 
Claudia Gold: Learning Data Science Online
Claudia Gold: Learning Data Science OnlineClaudia Gold: Learning Data Science Online
Claudia Gold: Learning Data Science Online
 
Data Science: A Mindset for Productivity
Data Science: A Mindset for ProductivityData Science: A Mindset for Productivity
Data Science: A Mindset for Productivity
 
Clare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science OnlineClare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science Online
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science Tools
 
Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)
 
How to become a data scientist
How to become a data scientistHow to become a data scientist
How to become a data scientist
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with Anaconda
 
How to Identify, Train or Become a Data Scientist
How to Identify, Train or Become a Data ScientistHow to Identify, Train or Become a Data Scientist
How to Identify, Train or Become a Data Scientist
 
Applied Machine Learning for the IoT - Data Science Pop-up Seattle
Applied Machine Learning for the IoT - Data Science Pop-up SeattleApplied Machine Learning for the IoT - Data Science Pop-up Seattle
Applied Machine Learning for the IoT - Data Science Pop-up Seattle
 
CRISP-DM - Agile Approach To Data Mining Projects
CRISP-DM - Agile Approach To Data Mining ProjectsCRISP-DM - Agile Approach To Data Mining Projects
CRISP-DM - Agile Approach To Data Mining Projects
 
My Three Ex’s: A Data Science Approach for Applied Machine Learning
My Three Ex’s: A Data Science Approach for Applied Machine LearningMy Three Ex’s: A Data Science Approach for Applied Machine Learning
My Three Ex’s: A Data Science Approach for Applied Machine Learning
 
Data Science presentation for elementary school students
Data Science presentation for elementary school studentsData Science presentation for elementary school students
Data Science presentation for elementary school students
 
Information Visualization for Knowledge Discovery: An Introduction
Information Visualization for Knowledge Discovery: An IntroductionInformation Visualization for Knowledge Discovery: An Introduction
Information Visualization for Knowledge Discovery: An Introduction
 
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat...
Three Laws of Trusted Data Sharing:(Building a Better Business Case for Dat...Three Laws of Trusted Data Sharing:(Building a Better Business Case for Dat...
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat...
 

Ähnlich wie Putting the Magic in Data Science

Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera, Inc.
 
An Introduction to Deep Learning (May 2018)
An Introduction to Deep Learning (May 2018)An Introduction to Deep Learning (May 2018)
An Introduction to Deep Learning (May 2018)Julien SIMON
 
An Introduction to Deep Learning I AWS Dev Day 2018
An Introduction to Deep Learning I AWS Dev Day 2018An Introduction to Deep Learning I AWS Dev Day 2018
An Introduction to Deep Learning I AWS Dev Day 2018AWS Germany
 
An Introduction to Deep Learning (April 2018)
An Introduction to Deep Learning (April 2018)An Introduction to Deep Learning (April 2018)
An Introduction to Deep Learning (April 2018)Julien SIMON
 
Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 201...
Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 201...Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 201...
Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 201...Andrew Gardner
 
Machine Learning Summary for Caltech2
Machine Learning Summary for Caltech2Machine Learning Summary for Caltech2
Machine Learning Summary for Caltech2Lukas Mandrake
 
Deep learning tutorial 9/2019
Deep learning tutorial 9/2019Deep learning tutorial 9/2019
Deep learning tutorial 9/2019Amr Rashed
 
Deep Learning Tutorial
Deep Learning TutorialDeep Learning Tutorial
Deep Learning TutorialAmr Rashed
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
An introduction to deep learning concepts
An introduction to deep learning conceptsAn introduction to deep learning concepts
An introduction to deep learning conceptsAmazon Web Services
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsRussell Jurney
 
Agile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsAgile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsRussell Jurney
 
Python as the Zen of Data Science
Python as the Zen of Data SciencePython as the Zen of Data Science
Python as the Zen of Data ScienceTravis Oliphant
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsDataWorks Summit
 
CRISP-DM Agile Approach to Data Mining Projects
CRISP-DM Agile Approach to Data Mining ProjectsCRISP-DM Agile Approach to Data Mining Projects
CRISP-DM Agile Approach to Data Mining ProjectsData Science Warsaw
 
Deep Learning: concepts and use cases (October 2018)
Deep Learning: concepts and use cases (October 2018)Deep Learning: concepts and use cases (October 2018)
Deep Learning: concepts and use cases (October 2018)Julien SIMON
 
Dm sei-tutorial-v7
Dm sei-tutorial-v7Dm sei-tutorial-v7
Dm sei-tutorial-v7CS, NcState
 
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...Srinath Perera
 

Ähnlich wie Putting the Magic in Data Science (20)

Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
 
An Introduction to Deep Learning (May 2018)
An Introduction to Deep Learning (May 2018)An Introduction to Deep Learning (May 2018)
An Introduction to Deep Learning (May 2018)
 
An Introduction to Deep Learning I AWS Dev Day 2018
An Introduction to Deep Learning I AWS Dev Day 2018An Introduction to Deep Learning I AWS Dev Day 2018
An Introduction to Deep Learning I AWS Dev Day 2018
 
An Introduction to Deep Learning (April 2018)
An Introduction to Deep Learning (April 2018)An Introduction to Deep Learning (April 2018)
An Introduction to Deep Learning (April 2018)
 
Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 201...
Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 201...Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 201...
Deep Learning for Data Scientists - Data Science ATL Meetup Presentation, 201...
 
Machine Learning Summary for Caltech2
Machine Learning Summary for Caltech2Machine Learning Summary for Caltech2
Machine Learning Summary for Caltech2
 
Deep learning tutorial 9/2019
Deep learning tutorial 9/2019Deep learning tutorial 9/2019
Deep learning tutorial 9/2019
 
Deep Learning Tutorial
Deep Learning TutorialDeep Learning Tutorial
Deep Learning Tutorial
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Zarneger "Supporting AI: Best Practices for Content Delivery Platforms"
Zarneger "Supporting AI: Best Practices for Content Delivery Platforms"Zarneger "Supporting AI: Best Practices for Content Delivery Platforms"
Zarneger "Supporting AI: Best Practices for Content Delivery Platforms"
 
An introduction to deep learning concepts
An introduction to deep learning conceptsAn introduction to deep learning concepts
An introduction to deep learning concepts
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics Applications
 
Agile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsAgile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics Applications
 
Python as the Zen of Data Science
Python as the Zen of Data SciencePython as the Zen of Data Science
Python as the Zen of Data Science
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics Applications
 
How to crack down big data?
How to crack down big data? How to crack down big data?
How to crack down big data?
 
CRISP-DM Agile Approach to Data Mining Projects
CRISP-DM Agile Approach to Data Mining ProjectsCRISP-DM Agile Approach to Data Mining Projects
CRISP-DM Agile Approach to Data Mining Projects
 
Deep Learning: concepts and use cases (October 2018)
Deep Learning: concepts and use cases (October 2018)Deep Learning: concepts and use cases (October 2018)
Deep Learning: concepts and use cases (October 2018)
 
Dm sei-tutorial-v7
Dm sei-tutorial-v7Dm sei-tutorial-v7
Dm sei-tutorial-v7
 
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
 

Kürzlich hochgeladen

Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一F sss
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Business Analytics using Microsoft Excel
Business Analytics using Microsoft ExcelBusiness Analytics using Microsoft Excel
Business Analytics using Microsoft Excelysmaelreyes
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 

Kürzlich hochgeladen (20)

Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Business Analytics using Microsoft Excel
Business Analytics using Microsoft ExcelBusiness Analytics using Microsoft Excel
Business Analytics using Microsoft Excel
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 

Putting the Magic in Data Science

  • 1. Putting the Magic in Data Science 11/04/2014 Sean J. Taylor QCon SF
  • 3. http://en.wikipedia.org/wiki/Pasteur's_quadrant “The mission of CDS is to provide research and innovation that fundamentally increase the magnitude of Facebook’s success.”
  • 4. It’s not a trick, it’s an illusion.
  • 5.
  • 6. Any sufficiently advanced technology is indistinguishable from magic. — Arthur C. Clarke
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12. 1. create technology: people who are not experts can use it easily with little difficulty and trust the output ! 2. make it “sufficiently advanced”
  • 13. The Data Science Venn Diagram Drew Conway
  • 14. Basic Research Applied Research Working Prototype Quality Code Tool or Service Maybe someday, someone can use this. I might be able to use this. I can use this (sometimes). Software engineers can use this. People can use this.
  • 15. People can use it → People want to use it Data Science Impact = Value * (Num People) * (Frequency of Use) Very difficult to demand that people use new tech — must make a compelling value proposition for people and educate them.
  • 16. What can data do? Data can’t do anything. People do things with data. (usually they make decisions)
  • 17. The Last Mile Problem ! It works for you. Can you get people to use it? Without considering this last step, all subsequent steps are useless.
  • 18. Existence of Data Creation of Technologies Counterfactuals and Causal Inference Morgan and Winship Decision Existence of Quality Capabilities
  • 19. Magical + Effective Data Science Tools • Planout: language for expressing / deploying experimental designs • Deltoid: analyzing the results of experiments • ClustR: generic document clustering • Prophet: completely automatic forecasting procedure • Crystal Ball: large scale, interpretable regression models • Hive / Presto / Scuba: SQL engines for different problems
  • 20. Outline ! 1. Sources of Magic 2. Solving the Last Mile
  • 21. Tricks: Sources of Data Science Magic
  • 22.
  • 25. Of Data Trick 1: invest in data collection ! Novel sources of data are magic.
  • 26. Making your own quality data is better than being a data alchemist. >
  • 27. Let’s say you have a billion users… and you want to listen to them all
  • 28.
  • 29. Trick 2: Dimensionality Reduction Increasingly individual observations can be very high dimensional: text documents, images, audio. ! ! ! ! Clustering and classification techniques can find/extract a smaller dimensional representation that retains meaning.
  • 30. Deep Learning is just (very) fancy dimensionality reduction
  • 31. Problem: Estimate the probability of rare events or events pertaining to new objects. E.g. click, like, comment, share
  • 32.
  • 33. Trick 3: Be a (Practical) Bayesian ! • If you have rare or new things you’d like to learn about, it’s often hard to say much. • But it’s sometimes easy to think of cases which are similar to the one you are trying to predict. • James-Stein estimators demonstrate that weighted averages including related observations will help improve predictions.
  • 34. 0 14 14 billion Philadelphia Facebook Eagles Revenue Wins
  • 35.
  • 36. Trick 4: Bootstrap all the statistics ! • The bootstrap allows you to get a sampling distribution over almost any statistic you can compute from your data. • Embarrassingly parallelizable / computable online. confidence intervals
  • 37. Bootstrapping in Practice R1 All Your Data R2 … R500 Generate random sub-samples s1 s2 … s500 Compute statistics or estimate model parameters } 7.5 5.0 2.5 0.0 -2 -1 0 1 2 Statistic Count Get a distribution over statistic of interest (usually the prediction) - take mean - CIs == 95% quantiles - SEs == standard deviation
  • 38. Grab bag of tricks • Everything is linear if you use enough features. • Matrix factorizations: NMF, SVD. • Probabilistic data structures: LSH, min-hash. • Exploit distributed, online algorithms as much as possible. • “A little bit of ridge never hurts.” — Trevor Hastie • Label propagation: use data about network neighbors. • Data reduction: create bins & analyze weighted bin stats.
  • 39. Last Mile of Data Science Magic
  • 40. Principle 1: Reliability ! “60% of the time, it works every time”
  • 41. Test-driven data science Learn how to build reliable data science systems from software engineers. 1. Write test fixtures with simulated or case-study data sets. 2. Write automated tests that check that your system works on fixtures, and add new ones when it doesn’t. 3. (Bonus) Test input data to ensure it meets all assumptions.
  • 42. Principle 2: Latency + Interactivity ! “how many hypotheses per second are you testing/generating?”
  • 43. Answer more questions People have good intuitions and tend to search effectively given understandable tools. First order effect of speed: more answers per second. Second order effect of speed: more questions asked. Deltoid: effortless experimentation Scuba: in-memory, distributed, sampled database. Presto: aggressive caching, distributed SQL query engine
  • 44. Principle 3: Simplicity + Modularity
  • 45. Choose one thing to do very well ! • It makes it easier to optimize your technology. ! • It makes it easier for people to understand what it does. ! • It makes people more likely to build around it.
  • 47. Show people the most interesting things
  • 48. Tricks Explained • Planout: simplicity + modularity • Deltoid: effortless experiment analysis + bootstrap • ClustR: dimensionality reduction + interactivity • Prophet: everything’s linear + basis expansion + new data • Crystal Ball: everything’s linear + regularization + speed • Hive / Presto / Scuba: reliability/latency tradeoffs
  • 49. 1. Learn as many tricks as you can 2. Combine them in novel ways 3. Consider the last mile
  • 50. sjt@fb.com http://seanjtaylor.com (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0