Secondary data analysis with digital trace data


Andrea Wiggins


Secondary Data Analysis
• Uses existing data produced or collected by
someone else, usually for a different purpose
• Databases
• Repositories
• Surveys
• Emails
• Social networks

Digital Trace Data
• Records of activity (trace data) undertaken through
an online information system (thus digital)
• Increasingly common in studies of online
phenomena
• Large volumes of available data
• Can be complete: a census, not a sample
• May be more reliably recorded than other data


Characteristics

1. Found data (not produced for research)
2. Event-based data (not summary data)
3. Events occur over time, so it is longitudinal data


Requirements
• Understand the original data source
• How it was collected, potential problems
• Limitations of the sample
• What the data describe
• Match with appropriate analysis methods and measures
• New types of data may require new measures
• Theoretical coherence is very important

Advantages
• Data may be “complete”
• Usually no response bias (exception: cookies)
• May cover long periods of time and large groups
• Multiple different data types, but mostly textual
• Data are often easy to acquire
• APIs or scraping web pages (with caution)
• Databases, archives, or repositories of research data
• But remember: you usually get what you pay for!
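One way to read "with caution" for scraping is to check a site's robots.txt rules and pace requests before collecting anything. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt text, the `research-bot` user agent, and the example.org URLs are all hypothetical stand-ins, and a real script would download the site's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real script would fetch the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_txt):
    """Parse robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that the rules allow this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]

parser = build_parser(ROBOTS_TXT)
candidates = [
    "https://example.org/lists/dev/2006-01.html",   # hypothetical archive page
    "https://example.org/private/members.html",     # hypothetical disallowed page
]
permitted = allowed_urls(parser, "research-bot", candidates)

# Seconds to wait between requests; fall back to 1 when no delay is declared.
delay = parser.crawl_delay("research-bot") or 1
```

In a collection loop, `delay` would be passed to `time.sleep` between requests so the scraper does not overload the data source.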

Disadvantages
• Often difficult to know limitations of data
• Data may be poorly documented
• Original creator may not be available for comment
• Volume of data can be overwhelming
• Sampling strategies needed, e.g., temporal, random
• Substantial time required for data preparation: 90% of effort
• Exceptions are everywhere and will break analyses, but can
only be discovered through trial and error
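The two sampling strategies named above can be sketched in a few lines. This is an illustration only: the event records are invented, and the function names `temporal_sample` and `random_sample` are hypothetical.

```python
import random
from datetime import datetime

# Hypothetical event records: (timestamp, event id).
EVENTS = [
    (datetime(2006, month, day), f"msg-{month}-{day}")
    for month in (1, 2, 3)
    for day in (1, 10, 20)
]

def temporal_sample(events, start, end):
    """Temporal sampling: keep events with start <= timestamp < end."""
    return [event for event in events if start <= event[0] < end]

def random_sample(events, k, seed=42):
    """Random sampling without replacement; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(events, k)

february = temporal_sample(EVENTS, datetime(2006, 2, 1), datetime(2006, 3, 1))
draw_a = random_sample(EVENTS, 4, seed=1)
draw_b = random_sample(EVENTS, 4, seed=1)
```

Recording the seed and the window boundaries alongside the results lets the same subset be rebuilt later.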


Example: Email Networks
• Data source: email listservs for FLOSS projects
• Analysis approach: create social networks
• Within discussion threads, individuals are nodes, and links
are reply-to messages
• Some conceptual issues for interpretation, choice of
measures
• Technical challenges
• Temporal aggregation
• Identity resolution
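The network construction described above, together with the two technical challenges, might be sketched as follows. This is not the procedure used in the cited study: the messages, the alias table, and the choice of calendar months as the aggregation window are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical messages: (message id, sender address, id replied to or None, timestamp).
MESSAGES = [
    ("m1", "Alice@Example.org", None, datetime(2005, 1, 3)),
    ("m2", "bob@example.org", "m1", datetime(2005, 1, 4)),
    ("m3", "alice@example.org", "m2", datetime(2005, 1, 5)),
    ("m4", "carol@example.org", "m1", datetime(2005, 2, 1)),
    ("m5", "al@home.example", "m4", datetime(2005, 2, 2)),
]

# Hypothetical alias table mapping secondary addresses to one identity.
ALIASES = {"al@home.example": "alice@example.org"}

def resolve(address):
    """Identity resolution: normalise case, then map known aliases to one person."""
    addr = address.strip().lower()
    return ALIASES.get(addr, addr)

def reply_networks(messages):
    """Temporal aggregation: one weighted, directed reply network per calendar month."""
    sender_of = {mid: resolve(sender) for mid, sender, _, _ in messages}
    networks = defaultdict(Counter)
    for mid, sender, parent, when in messages:
        if parent is None or parent not in sender_of:
            continue  # thread starters and replies to unknown messages add no link
        src, dst = resolve(sender), sender_of[parent]
        if src != dst:  # ignore self-replies
            networks[(when.year, when.month)][(src, dst)] += 1
    return dict(networks)

NETWORKS = reply_networks(MESSAGES)
```

Each reply is assigned to the month in which it was sent; a different window length would yield different networks from the same messages, which is why the aggregation choice needs to be reported.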

Temporal Aggregation
Figures from Howison et al., 2006

Network Results
• Different levels of correlation
between venues, suggesting different
types of interactions
• User venues more decentralized than
developer venues, reflecting greater
number of participants
• Overall trend toward decentralization
could be the result of different influences

• Observed anomalous patterns in trackers for
both projects: periodic centralization spikes
Cleaning up before shutting down
• A single user makes batch bug closings
(up to 279!)
– Fire’s (feature request) tracker housekeeping
appears to be preparation for project
closure
– Gaim’s tracker housekeeping was more
regular and repeated
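The slides do not say which centralization measure was used. As one common option, Freeman's degree centralization compares every node's degree with the highest degree in the network, so a window in which a single user handles nearly every item scores close to 1. The sketch below treats links as undirected, and the two example graphs are invented.

```python
def degree_centralization(edges):
    """Freeman degree centralization for an undirected simple graph.

    Returns a value from 0.0 (all degrees equal) to 1.0 (a star).
    Graphs with fewer than three nodes return 0.0.
    """
    neighbours = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    if n < 3:
        return 0.0
    degrees = [len(adjacent) for adjacent in neighbours.values()]
    top = max(degrees)
    # (n - 1) * (n - 2) is the largest possible sum of differences (a star graph).
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))

# Hypothetical windows: one user touching every item versus evenly spread activity.
star = [("hub", other) for other in ("a", "b", "c", "d")]
ring = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
star_score = degree_centralization(star)
ring_score = degree_centralization(ring)
```

Computing this score for each time window would make a batch-closing episode visible as a spike in an otherwise decentralized series.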

Example: Classification
• Replication of success-tragedy classification
• Classification criteria originally drawn from
interviews with community members
• Data extracted from repositories
• Technical challenges
• Merging data from two repositories
• Processing large volume of data in multiple steps
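Merging records from two repositories usually comes down to choosing a join key and checking what fails to match. The sketch below joins on a normalised project name; the records, field names, and the `merge_by_name` helper are hypothetical, and the slides do not state which key the study used.

```python
def normalise(name):
    """Normalise a project name so trivial differences do not block a match."""
    return name.strip().lower()

def merge_by_name(repo_a, repo_b):
    """Join two lists of project records on the normalised name.

    Returns the merged records plus the names found in only one source,
    so that unmatched projects can be inspected rather than silently dropped.
    """
    a = {normalise(rec["name"]): rec for rec in repo_a}
    b = {normalise(rec["name"]): rec for rec in repo_b}
    merged = {key: {**a[key], **b[key], "name": key} for key in a.keys() & b.keys()}
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    return merged, only_a, only_b

# Hypothetical records from two sources.
REPO_A = [{"name": "Alpha", "downloads": 1200}, {"name": "beta", "downloads": 15}]
REPO_B = [{"name": "alpha ", "founded": "2003-05-01"}, {"name": "gamma", "founded": "2004-02-10"}]

merged, only_a, only_b = merge_by_name(REPO_A, REPO_B)
```

Reporting `only_a` and `only_b` at every step is one way to catch the exceptions that otherwise surface only when a later analysis breaks.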

Variables
• Inputs: project names and 5 threshold values for
classification tests, e.g. number of downloads
• Project statistics retrieved from repositories
• Founding date
• Data collection date
• Dates for all releases
• Number of downloads
• URL
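The variables listed above could be held in a small record type from which the classification tests derive their measures. This is a sketch under assumptions: the field names, the derived measures, and the download threshold value are illustrative and are not the thresholds used in the study.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ProjectStats:
    """One project's statistics, mirroring the variables on the slide."""
    name: str
    founded: date
    collected: date
    releases: list = field(default_factory=list)  # dates of all releases
    downloads: int = 0
    url: str = ""

def derived_measures(project):
    """Measures that threshold tests could compare against (illustrative only)."""
    last = max(project.releases) if project.releases else None
    return {
        "age_days": (project.collected - project.founded).days,
        "release_count": len(project.releases),
        "days_since_last_release": (project.collected - last).days if last else None,
    }

def passes_download_test(project, min_downloads):
    """A single threshold test; min_downloads is a hypothetical input value."""
    return project.downloads >= min_downloads

# Hypothetical project record.
example = ProjectStats(
    name="alpha",
    founded=date(2004, 1, 1),
    collected=date(2006, 1, 1),
    releases=[date(2004, 6, 1), date(2005, 6, 1)],
    downloads=1200,
    url="https://example.org/projects/alpha",
)
measures = derived_measures(example)
```

Keeping the thresholds as explicit inputs, as the slide describes, means a replication can rerun the same tests with different values.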

Classification Results
Class            Original        Our results     Difference
unclassifiable   3 186           3 296           +110
II               13 342 (12%)    16 252 (14%)    +2 910 (+2%)
IG               10 711 (10%)    12 991 (11%)    +2 280 (+1%)
TI               37 320 (35%)    36 507 (31%)    -813 (-4%)
TG               30 592 (28%)    32 642 (28%)    +2 050 (0%)
SG               15 782 (15%)    16 045 (14%)    +263 (-1%)
other            8 422           0
Total            119 355         117 733
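The count columns in the results table can be checked with simple arithmetic. The sketch below recomputes the Difference column and both totals from the counts shown; it does not recompute the percentages, because the slide does not state which denominator they use.

```python
# Counts copied from the results table.
ORIGINAL = {
    "unclassifiable": 3186, "II": 13342, "IG": 10711, "TI": 37320,
    "TG": 30592, "SG": 15782, "other": 8422,
}
OURS = {
    "unclassifiable": 3296, "II": 16252, "IG": 12991, "TI": 36507,
    "TG": 32642, "SG": 16045, "other": 0,
}

# Difference = replication count minus original count, per class.
DIFFERENCE = {cls: OURS[cls] - ORIGINAL[cls] for cls in ORIGINAL}

TOTAL_ORIGINAL = sum(ORIGINAL.values())
TOTAL_OURS = sum(OURS.values())

for cls, delta in DIFFERENCE.items():
    print(f"{cls:15s} {ORIGINAL[cls]:>7d} {OURS[cls]:>7d} {delta:+d}")
```

The table leaves the Difference cell for "other" blank; the computed value there simply reflects that the replication assigned no projects to that class.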



1. Secondary data analysis with digital trace data: Examples from FLOSS research. Andrea Wiggins, 13 July 2011


10. Network Workflow


14. Classification workflow

16. Thanks! • Questions?