Common Limitations with Big Data
in the Field
8/7/2014
Sean J. Taylor
Joint Statistical Meetings
Source: “big data” from Google
Mo’ Data,
Mo’ Problems
Why are the data so big?
“To consult the
statistician after an
experiment is finished is
often merely to ask him
to conduct a post-
mortem examination.
He can perhaps say what
the experiment died of.”
— Ronald Fisher
Data Alchemy?
Common Big Data Limitations:
1. Measuring the wrong thing
2. Biased samples
3. Dependence
4. Uncommon support
1. Measuring the wrong thing
The data you have a lot of doesn’t
measure what you want.
Common Pattern
▪ High volume of cheap, easy to measure “surrogate”
(e.g. steps, clicks)
▪ Surrogate is correlated with true measurement of interest
(e.g. overall health, purchase intention)
▪ key question: sign and magnitude of “interpretation bias”
▪ obvious strategy: design better measurement
▪ opportunity: joint, causal modeling of high-resolution
surrogate measurement with low-resolution “true”
measurement.
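The "interpretation bias" question above can be made concrete with a small simulation. This is only a sketch under invented assumptions: the effect sizes (0.2 on the true outcome, an extra 0.8 on the surrogate), the noise levels, and the names `simulate` and `effect` are all hypothetical and do not come from the talk.

```python
import random
from statistics import mean

def simulate(n, seed=0):
    """Simulate (treated, truth, surrogate) rows under made-up assumptions."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        treated = rng.random() < 0.5
        # Hypothetical "true" outcome (e.g. purchase intention); treatment adds 0.2.
        truth = rng.gauss(0.0, 1.0) + (0.2 if treated else 0.0)
        # Hypothetical surrogate (e.g. clicks): tracks the truth with noise,
        # but the treatment also moves it directly by 0.8.
        surrogate = truth + rng.gauss(0.0, 1.0) + (0.8 if treated else 0.0)
        rows.append((treated, truth, surrogate))
    return rows

def effect(rows, col):
    """Difference in means between treated and untreated rows for one column."""
    treated = [r[col] for r in rows if r[0]]
    control = [r[col] for r in rows if not r[0]]
    return mean(treated) - mean(control)

rows = simulate(20000)
true_effect = effect(rows, 1)        # what we want to know
surrogate_effect = effect(rows, 2)   # what the cheap data tells us
interpretation_bias = surrogate_effect - true_effect
print(true_effect, surrogate_effect, interpretation_bias)
```

The surrogate is strongly correlated with the truth here, yet reading the surrogate effect as the true effect overstates it several times over. In practice the sign and size of that gap are unknown, which is why the slide calls it the key question.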
2. Biased samples
Size doesn’t guarantee your data is
representative of the population.
Bias?
World Cup Arrivals by Country
[Diagram]
What we want: the relationship between “Has Daughter” and “Is Liberal”.
What we get: the relationship between “Has Daughter on Facebook” and “Is Liberal on Facebook”.
Common Pattern
▪ Your sample conditions on your own convenience, user
activity level, and/or technological sophistication of
participants.
▪ You’d like to extrapolate your statistic to a more general
population of interest.
▪ key question: (again) sign and magnitude of bias
▪ obvious strategy: weighting or regression
▪ opportunity: understand effect of selection on latent
variables on bias
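The "obvious strategy" of weighting can be sketched as post-stratification: average within strata, then reweight by known population shares. The strata names, shares, and outcomes below are invented for illustration, and `poststratified_mean` is a hypothetical helper, not something from the talk.

```python
from collections import defaultdict

def poststratified_mean(sample, population_share):
    """Weight each stratum's sample mean by its share of the target population.

    sample: iterable of (stratum, outcome) pairs.
    population_share: dict mapping stratum -> population proportion (sums to 1).
    Raises ZeroDivisionError if a population stratum has no sampled units.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for stratum, y in sample:
        totals[stratum] += y
        counts[stratum] += 1
    return sum(share * totals[s] / counts[s]
               for s, share in population_share.items())

# Hypothetical convenience sample: heavy users are 80% of the sample
# but (by assumption) only 20% of the population of interest.
sample = ([("heavy", 1)] * 480 + [("heavy", 0)] * 320
          + [("light", 1)] * 60 + [("light", 0)] * 140)
population_share = {"heavy": 0.2, "light": 0.8}

naive = sum(y for _, y in sample) / len(sample)
weighted = poststratified_mean(sample, population_share)
print(naive, weighted)
```

Weighting only corrects for selection on the strata you observe. If selection also depends on latent variables, the weighted estimate can still be biased, which is the research opportunity named above.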
3. Dependence
Effective size of data is small due
to repeated measurements of
units.
Bakshy et al. (2012)
“Social Influence in Social Advertising”
One-way Dependency

Person   D  Y
Evan     1  1
Evan     0  0
Ashley   0  1
Ashley   1  1
Ashley   0  1
Greg     1  0
Leena    1  0
Leena    1  1
Ema      0  0
Seamus   1  1

$Y_{ij} = D_{ij} + \alpha_i + \epsilon_{ij}$

Two-way Dependency

Person   Ad      D  Y
Evan     Sprite  1  1
Evan     Coke    0  0
Ashley   Pepsi   0  1
Ashley   Pepsi   1  1
Ashley   Coke    0  1
Greg     Coke    1  0
Leena    Pepsi   1  0
Leena    Coke    1  1
Ema      Sprite  0  0
Seamus   Sprite  1  1

$Y_{ij} = D_{ij} + \alpha_i + \beta_j + \epsilon_{ij}$
Bakshy and Eckles (2013)
“Uncertainty in Online Experiments with Dependent Data”
Common Pattern
▪ Your sample is large because it includes many repeated
measurements of the most active units.
▪ An inferential procedure that assumes independence will
be anti-conservative.
▪ key question: how to efficiently produce confidence
intervals with the right coverage.
▪ obvious strategy: multi-way bootstrap (Owen and Eckles)
▪ opportunity: scalable inference for crossed random effects
models
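A rough sketch of a multi-way bootstrap on the toy Person/Ad/D/Y rows from the two-way dependency slide: resample persons and ads independently with replacement, and weight each observation by the product of its person count and its ad count. This is my reading of the product-weight idea, written with hypothetical helper names; it is not claimed to be the exact procedure from the cited work.

```python
import random
from statistics import stdev

# Toy rows from the two-way dependency slide: (person, ad, D, Y).
rows = [
    ("Evan", "Sprite", 1, 1), ("Evan", "Coke", 0, 0),
    ("Ashley", "Pepsi", 0, 1), ("Ashley", "Pepsi", 1, 1),
    ("Ashley", "Coke", 0, 1), ("Greg", "Coke", 1, 0),
    ("Leena", "Pepsi", 1, 0), ("Leena", "Coke", 1, 1),
    ("Ema", "Sprite", 0, 0), ("Seamus", "Sprite", 1, 1),
]

def weighted_diff(rows, w_person, w_ad):
    """Weighted mean of Y for D=1 minus D=0; None if either arm has zero weight."""
    num = {0: 0.0, 1: 0.0}
    den = {0: 0.0, 1: 0.0}
    for person, ad, d, y in rows:
        w = w_person[person] * w_ad[ad]  # product of the two factor weights
        num[d] += w * y
        den[d] += w
    if den[0] == 0 or den[1] == 0:
        return None
    return num[1] / den[1] - num[0] / den[0]

def resample_counts(levels, rng):
    """Draw len(levels) levels with replacement; return how often each was drawn."""
    counts = {level: 0 for level in levels}
    for _ in levels:
        counts[rng.choice(levels)] += 1
    return counts

def multiway_bootstrap(rows, n_reps=2000, seed=0):
    rng = random.Random(seed)
    persons = sorted({r[0] for r in rows})
    ads = sorted({r[1] for r in rows})
    reps = []
    for _ in range(n_reps):
        est = weighted_diff(rows, resample_counts(persons, rng),
                            resample_counts(ads, rng))
        if est is not None:  # skip replicates where one arm vanished
            reps.append(est)
    return reps

reps = multiway_bootstrap(rows)
bootstrap_se = stdev(reps)
print(bootstrap_se)
```

Because whole persons and whole ads drop in and out together, the spread of the replicates reflects the repeated measurements, unlike a standard error that treats all ten rows as independent. With only ten rows the numbers themselves mean little; the point is the resampling structure.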
4. Uncommon support
Your data are sparse in part of the
distribution you care about.
[Figure: Positive/Negative Emotion Words (y-axis) vs. Time Relative to Kickoff in hours (x-axis), by outcome (loss, win)]
“The Emotional Highs and Lows of the NFL Season”
Facebook Data Science Blog
Which check-ins are useful?
▪ We can only study people who post enough text before
and after their check-ins to yield reliable measurement.
What is the control check-in?
▪ Ideally we’d do a within-subjects design. If so we’re stuck
with only other check-ins (with enough status updates)
from the same people.
▪ National Park visit is likely a vacation, so we need to find
other vacation check-ins. Similar day, time, and distance.



Only a small fraction of the “treatment” cases yield
reliable measurement plus control cases.
Common Pattern
▪ Your sample is large but observations satisfying some set
of qualifying criteria (usually matching) are rare.
▪ Even with strong assumptions, hard to measure precise
differences due to lack of similar cases.
▪ key question: how to use as many of the observations as
possible without introducing bias.
▪ obvious strategy: randomized experiments
▪ opportunity: more efficient causal inference techniques
that work at scale (high dimensional regression/matching)
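How quickly support thins out under matching can be seen in a toy exact-matching sketch. The check-in records, the field names (`weekday`, `distance`, `season`), and the helper `matched_treated` are all invented for illustration.

```python
# Hypothetical check-ins; every field name and value here is invented.
checkins = [
    {"treated": True,  "weekday": "Sat", "distance": "far",  "season": "summer"},
    {"treated": True,  "weekday": "Sat", "distance": "far",  "season": "winter"},
    {"treated": True,  "weekday": "Sun", "distance": "far",  "season": "summer"},
    {"treated": True,  "weekday": "Tue", "distance": "near", "season": "summer"},
    {"treated": False, "weekday": "Sat", "distance": "far",  "season": "summer"},
    {"treated": False, "weekday": "Sun", "distance": "near", "season": "summer"},
    {"treated": False, "weekday": "Tue", "distance": "near", "season": "winter"},
]

def matched_treated(units, keys):
    """Return treated units with at least one control identical on all keys."""
    control_cells = {tuple(u[k] for k in keys)
                     for u in units if not u["treated"]}
    return [u for u in units
            if u["treated"] and tuple(u[k] for k in keys) in control_cells]

# Each extra matching criterion shrinks the usable set of treated cases.
for keys in (["weekday"],
             ["weekday", "distance"],
             ["weekday", "distance", "season"]):
    print(keys, len(matched_treated(checkins, keys)))
```

Stricter criteria make the comparison more credible but leave fewer cases to compare, which is the trade-off behind the key question above.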
Conclusion 1:
Making your own quality data is
better than being a data alchemist.
Conclusion 2:
Great research opportunities in
addressing limitations of
“convenience” big data.
(or how I stopped worrying and
learned to be a data alchemist)
Sustaining versus Disruptive Innovations
Sustaining innovations
▪ improve product performance
Disruptive innovations
▪ Result in worse performance (short-term)
▪ Eventually surpass sustaining technologies in satisfying
market demand with lower costs
▪ cheaper, simpler, smaller, and frequently more
convenient to use
Innovator’s Dilemma
sjt@fb.com
http://seanjtaylor.com

Jsm big-data

  • 1. Common Limitations with Big Data in the Field 8/7/2014 Sean J. Taylor Joint Statistical Meetings
  • 7. Why are the data so big?
  • 10. “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of.” (Ronald Fisher)
  • 12. Common Big Data Limitations:
      1. Measuring the wrong thing
      2. Biased samples
      3. Dependence
      4. Uncommon support
  • 13. 1. Measuring the wrong thing The data you have a lot of doesn’t measure what you want.
  • 16. Common Pattern
      ▪ High volume of a cheap, easy-to-measure “surrogate” (e.g. steps, clicks)
      ▪ Surrogate is correlated with the true measurement of interest (e.g. overall health, purchase intention)
      ▪ key question: sign and magnitude of “interpretation bias”
      ▪ obvious strategy: design better measurement
      ▪ opportunity: joint, causal modeling of high-resolution surrogate measurement with low-resolution “true” measurement
  • 17. 2. Biased samples Size doesn’t guarantee your data is representative of the population.
  • 22. Common Pattern
      ▪ Your sample conditions on your own convenience, user activity level, and/or technological sophistication of participants.
      ▪ You’d like to extrapolate your statistic to a more general population of interest.
      ▪ key question: (again) sign and magnitude of bias
      ▪ obvious strategy: weighting or regression
      ▪ opportunity: understand how selection on latent variables affects bias
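The "weighting" strategy on this slide can be illustrated with post-stratification. What follows is a minimal sketch, not the speaker's method. It assumes the population share of each stratum is known, and the strata names, shares, and data are hypothetical. The function names `naive_mean` and `poststratified_mean` are introduced here for illustration only.

```python
from collections import defaultdict


def naive_mean(sample):
    """Unweighted mean of the outcome over (stratum, value) pairs."""
    return sum(value for _, value in sample) / len(sample)


def poststratified_mean(sample, population_shares):
    """Reweight per-stratum sample means by known population shares.

    sample: list of (stratum, value) pairs.
    population_shares: dict mapping stratum -> share of the target
        population (assumed known and summing to 1).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for stratum, value in sample:
        totals[stratum] += value
        counts[stratum] += 1
    missing = set(population_shares) - set(counts)
    if missing:
        # No overlap for these strata: weighting cannot fix that.
        raise ValueError(f"no sample observations in strata: {sorted(missing)}")
    return sum(
        share * totals[stratum] / counts[stratum]
        for stratum, share in population_shares.items()
    )


# Hypothetical data: heavy users are 80% of the sample but only 20% of
# the population, and they have a higher outcome rate than light users.
sample = (
    [("heavy", 1.0)] * 60 + [("heavy", 0.0)] * 20
    + [("light", 1.0)] * 5 + [("light", 0.0)] * 15
)
population_shares = {"heavy": 0.2, "light": 0.8}

print("naive mean:          ", naive_mean(sample))
print("post-stratified mean:", poststratified_mean(sample, population_shares))
```

The gap between the two printed values is the bias from over-sampling active users. The correction only works for selection on the observed stratum, which is why the slide lists selection on latent variables as an open problem.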
  • 23. 3. Dependence Effective size of data is small due to repeated measurements of units.
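One standard way to quantify "effective size" is the Kish design effect for clustered data. The sketch below is an illustration and does not come from the slides. It assumes every unit contributes the same number of observations and that observations within a unit share a common intraclass correlation. The function name and the numbers are hypothetical.

```python
def effective_sample_size(n_obs, obs_per_unit, icc):
    """Approximate effective sample size under equal-sized clusters.

    n_obs: total number of observations.
    obs_per_unit: observations contributed by each unit (assumed equal).
    icc: intraclass correlation between observations of the same unit,
        restricted here to the range [0, 1].
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must be between 0 and 1")
    design_effect = 1.0 + (obs_per_unit - 1) * icc
    return n_obs / design_effect


# Hypothetical example: one million observations from 1,000 users.
for icc in (0.0, 0.01, 0.1, 0.5):
    n_eff = effective_sample_size(1_000_000, 1_000, icc)
    print(f"icc={icc:<5} effective n ~ {n_eff:,.0f}")
```

Under these assumptions, even a small within-unit correlation shrinks a very large data set to a much smaller effective size.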
  • 24. Bakshy et al. (2012) “Social Influence in Social Advertising”
  • 25. One-way Dependency: $Y_{ij} = D_{ij} + \alpha_i + \epsilon_{ij}$

      Person   D  Y
      Evan     1  1
      Evan     0  0
      Ashley   0  1
      Ashley   1  1
      Ashley   0  1
      Greg     1  0
      Leena    1  0
      Leena    1  1
      Ema      0  0
      Seamus   1  1

    Two-way Dependency: $Y_{ij} = D_{ij} + \alpha_i + \beta_j + \epsilon_{ij}$

      Person   Ad      D  Y
      Evan     Sprite  1  1
      Evan     Coke    0  0
      Ashley   Pepsi   0  1
      Ashley   Pepsi   1  1
      Ashley   Coke    0  1
      Greg     Coke    1  0
      Leena    Pepsi   1  0
      Leena    Coke    1  1
      Ema      Sprite  0  0
      Seamus   Sprite  1  1
  • 26. Bakshy and Eckles (2013) “Uncertainty in Online Experiments with Dependent Data”
  • 27. Common Pattern
      ▪ Your sample is large because it includes many repeated measurements of the most active units.
      ▪ An inferential procedure that assumes independence will be anti-conservative.
      ▪ key question: how to efficiently produce confidence intervals with the right coverage
      ▪ obvious strategy: multi-way bootstrap (Owen and Eckles)
      ▪ opportunity: scalable inference for crossed random effects models
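The multi-way bootstrap idea can be sketched on the two-way toy table from slide 25. This is a minimal illustration under stated assumptions, not necessarily the exact procedure of Owen and Eckles. Each person and each ad independently draws a weight of 0 or 2 (mean 1, variance 1), and an observation's weight is the product of its person and ad weights. The weight distribution, the function names, and the choice of statistic (difference in mean Y between D=1 and D=0) are assumptions made here.

```python
import random
import statistics

# (person, ad, D, Y) rows copied from the two-way table on slide 25.
ROWS = [
    ("Evan", "Sprite", 1, 1), ("Evan", "Coke", 0, 0),
    ("Ashley", "Pepsi", 0, 1), ("Ashley", "Pepsi", 1, 1),
    ("Ashley", "Coke", 0, 1), ("Greg", "Coke", 1, 0),
    ("Leena", "Pepsi", 1, 0), ("Leena", "Coke", 1, 1),
    ("Ema", "Sprite", 0, 0), ("Seamus", "Sprite", 1, 1),
]


def weighted_diff_in_means(rows, weights):
    """Weighted mean of Y for D=1 minus weighted mean of Y for D=0.

    Returns None when either group has zero total weight.
    """
    sums = {0: 0.0, 1: 0.0}
    totals = {0: 0.0, 1: 0.0}
    for (_, _, d, y), w in zip(rows, weights):
        sums[d] += w * y
        totals[d] += w
    if totals[0] == 0 or totals[1] == 0:
        return None
    return sums[1] / totals[1] - sums[0] / totals[0]


def multiway_bootstrap(rows, n_reps=2000, seed=0):
    """Reweight persons and ads independently; return replicate estimates."""
    rng = random.Random(seed)
    people = sorted({row[0] for row in rows})
    ads = sorted({row[1] for row in rows})
    estimates = []
    for _ in range(n_reps):
        w_person = {p: rng.choice((0, 2)) for p in people}
        w_ad = {a: rng.choice((0, 2)) for a in ads}
        weights = [w_person[p] * w_ad[a] for p, a, _, _ in rows]
        estimate = weighted_diff_in_means(rows, weights)
        if estimate is not None:  # skip replicates with an empty group
            estimates.append(estimate)
    return estimates


point = weighted_diff_in_means(ROWS, [1] * len(ROWS))
replicates = multiway_bootstrap(ROWS)
print("point estimate:", point)
print("usable replicates:", len(replicates))
print("bootstrap standard error:", statistics.stdev(replicates))
```

With only ten rows many replicates are undefined and get skipped, so the numbers here are purely illustrative. The point of the sketch is that resampling persons and ads, rather than individual rows, respects both sources of dependence.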
  • 28. 4. Uncommon support Your data are sparse in part of the distribution you care about.
  • 30. [Figure: positive/negative emotion words (y-axis) versus time relative to kickoff in hours (x-axis), by outcome (loss, win). Source: “The Emotional Highs and Lows of the NFL Season”, Facebook Data Science Blog]
  • 31. Which check-ins are useful?
      ▪ We can only study people who post enough text before and after their check-ins to yield reliable measurement.
    What is the control check-in?
      ▪ Ideally we’d do a within-subjects design. If so we’re stuck with only other check-ins (with enough status updates) from the same people.
      ▪ National Park visit is likely a vacation, so we need to find other vacation check-ins. Similar day, time, and distance.
    Only a small fraction of the “treatment” cases yield reliable measurement plus control cases.
  • 32. Common Pattern
      ▪ Your sample is large but observations satisfying some set of qualifying criteria (usually matching) are rare.
      ▪ Even with strong assumptions, hard to measure precise differences due to lack of similar cases.
      ▪ key question: how to use as many of the observations as possible without introducing bias
      ▪ obvious strategy: randomized experiments
      ▪ opportunity: more efficient causal inference techniques that work at scale (high dimensional regression/matching)
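A small sketch of exact matching shows how quickly qualifying criteria shrink the usable sample. The records, field names, and function names below are hypothetical and are not data from the talk. "Treated" rows stand in for the check-ins of interest and "control" rows for candidate comparison check-ins.

```python
from collections import defaultdict


def exact_match(treated, controls, keys):
    """Map each treated id to the control ids that agree on every key."""
    index = defaultdict(list)
    for control in controls:
        index[tuple(control[k] for k in keys)].append(control["id"])
    return {
        t["id"]: index.get(tuple(t[k] for k in keys), [])
        for t in treated
    }


def support_share(matches):
    """Fraction of treated cases with at least one matched control."""
    return sum(1 for ids in matches.values() if ids) / len(matches)


# Hypothetical check-in records.
treated = [
    {"id": "t1", "person": "A", "weekend": True},
    {"id": "t2", "person": "B", "weekend": True},
    {"id": "t3", "person": "C", "weekend": False},
    {"id": "t4", "person": "D", "weekend": True},
]
controls = [
    {"id": "c1", "person": "A", "weekend": True},
    {"id": "c2", "person": "A", "weekend": False},
    {"id": "c3", "person": "B", "weekend": False},
    {"id": "c4", "person": "E", "weekend": True},
]

within_person = exact_match(treated, controls, keys=("person",))
person_and_day = exact_match(treated, controls, keys=("person", "weekend"))

print("matched on person:          ", support_share(within_person))
print("matched on person + weekend:", support_share(person_and_day))
```

Each added matching criterion makes the comparison more credible but leaves fewer treated cases with any control. That is the trade-off behind the slide's key question.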
  • 33. Conclusion 1: Making your own quality data is better than being a data alchemist.
  • 34. Conclusion 2: Great research opportunities in addressing limitations of “convenience” big data.
    (or how I stopped worrying and learned to be a data alchemist)
  • 36. Sustaining versus Disruptive Innovations
    Sustaining innovations
      ▪ improve product performance
    Disruptive innovations
      ▪ Result in worse performance (short-term)
      ▪ Eventually surpass sustaining technologies in satisfying market demand with lower costs
      ▪ cheaper, simpler, smaller, and frequently more convenient to use