SlideShare ist ein Scribd-Unternehmen logo
1 von 16
Downloaden Sie, um offline zu lesen
Secondary data analysis
  with digital trace data

Examples from FLOSS research

         Andrea Wiggins
         13 Juillet, 2011
Secondary Data Analysis
•   Uses existing data produced or collected by
    someone else, usually for a different purpose
    •   Databases
    •   Repositories
    •   Surveys
    •   Emails
    •   Social networks
                           2
Digital Trace Data
•   Records of activity (trace data) undertaken through
    an online information system (thus digital)
•   Increasingly common in studies of online
    phenomena
    •   Large volumes of available data
    •   Can be complete: a census, not a sample
    •   May be more reliably recorded than other data

                             3
Characteristics


1. Found data (not produced for research)
2. Event-based data (not summary data)
3. Events occur over time, so it is longitudinal data




                          4
Requirements
•   Understand the original data source
    •   How it was collected, potential problems
    •   Limitations of the sample
    •   What the data describe
•   Match with appropriate analysis methods and measures
    •   New types of data may require new measures
    •   Theoretical coherence is very important
                              5
Advantages
•   Data may be “complete”
    •   Usually no response bias (exception: cookies)
    •   May cover long periods of time and large groups
    •   Multiple different data types, but mostly textual
•   Data are often easy to acquire
    •   APIs or scraping web pages (with caution)
    •   Databases, archives, or repositories of research data
•   But remember: you usually get what you pay for!
                                  6
Disadvantages
•   Often difficult to know limitations of data
    •   Data may be poorly documented
    •   Original creator may not be available for comment
•   Volume of data can be overwhelming
    •   Sampling strategies needed, e.g., temporal, random
    •   Substantial time required for data preparation: 90% of effort
    •   Exceptions are everywhere and will break analyses, but can
        only be discovered through trial and error

                                  7
Example: Email Networks
•   Data source: email listservs for FLOSS projects
•   Analysis approach: create social networks
    •   Within discussion threads, individuals are nodes, and links
        are reply-to messages
    •   Some conceptual issues for interpretation, choice of
        measures
•   Technical challenges
    •   Temporal aggregation
    •   Identity resolution
                                   8
Figures from Howison et al., 2006


Temporal Aggregation
                  9
Network Workflow
       10
Network Results
                                                     • Different levels of correlation
                                                       between venues, suggesting different
                                                       types of interactions
                                                     • User venues more decentralized than
                                                       developer venues, reflecting greater
                                                       number of participants
                                                     • Overall trend toward decentralization
                                                       could be result of different influences

• Observed anomalous patterns in trackers for
  both projects: periodic centralization spikes
                                                                Cleaning up before shutting down
• A single user makes batch bug closings
  (up to 279!)
   – Fire’s (feature request) tracker housekeeping
     appears to be preparation for project
     closure
   – Gaim’s tracker housekeeping was more
     regular and repeated
                                              11
Example: Classification
•   Replication of success-tragedy classification
    •   Classification criteria originally drawn from
        interviews with community members
    •   Data extracted from repositories
•   Technical challenges
    •   Merging data from two repositories
    •   Processing large volume of data in multiple steps
                             12
Variables
•   Inputs: project names and 5 threshold values for
    classification tests, e.g. number of downloads
•   Project statistics retrieved from repositories
    •   Founding date
    •   Data collection date
    •   Dates for all releases
    •   Number of downloads
    •   URL
                                 13
Classification workflow
          14
Classification Results
   Class        Original           Our results    Difference
unclassifiabl      3 186               3 296          +110
     e
     II        13 342 (12%)        16 252 (14%)   +2 910 (+2%)

    IG         10 711 (10%)        12 991 (11%)   +2 280 (+1%)

    TI         37 320 (35%)        36 507 (31%)    -813 (-4%)

    TG         30 592 (28%)        32 642 (28%)   +2 050 (0%)

    SG         15 782 (15%)        16 045 (14%)    +263 (-1%)

   other          8 422                 0

   Total         119 355             117 733

                              15
Thanks!



•   Questions?




                    16

Weitere ähnliche Inhalte

Was ist angesagt?

Software Ecosystems = Big Data
Software Ecosystems = Big DataSoftware Ecosystems = Big Data
Software Ecosystems = Big DataTom Mens
 
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William EnckHotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William EnckTao Xie
 
Software Analytics: Towards Software Mining that Matters
Software Analytics: Towards Software Mining that MattersSoftware Analytics: Towards Software Mining that Matters
Software Analytics: Towards Software Mining that MattersTao Xie
 
Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017Carole Goble
 
Micropublication WormBase Workshop International Worm Meeting 2015
Micropublication WormBase Workshop International Worm Meeting 2015Micropublication WormBase Workshop International Worm Meeting 2015
Micropublication WormBase Workshop International Worm Meeting 2015raymond91105
 
Scientific Software - what happens after the grant?
Scientific Software - what happens after the grant?Scientific Software - what happens after the grant?
Scientific Software - what happens after the grant?James Howison
 
Modern tools for sharing and synthesizing neuroimaging results
Modern tools for sharing and synthesizing neuroimaging resultsModern tools for sharing and synthesizing neuroimaging results
Modern tools for sharing and synthesizing neuroimaging resultsKrzysztof Gorgolewski
 
User Expectations in Mobile App Security
User Expectations in Mobile App SecurityUser Expectations in Mobile App Security
User Expectations in Mobile App SecurityTao Xie
 
Software Mining and Software Datasets
Software Mining and Software DatasetsSoftware Mining and Software Datasets
Software Mining and Software DatasetsTao Xie
 
Large Scale Studies: Malware Needles in a Haystack
Large Scale Studies: Malware Needles in a HaystackLarge Scale Studies: Malware Needles in a Haystack
Large Scale Studies: Malware Needles in a HaystackMarcus Botacin
 
Intro to Reproducible Research
Intro to Reproducible ResearchIntro to Reproducible Research
Intro to Reproducible ResearchC. Tobin Magle
 
Getting (and giving) credit for all that we do
Getting (and giving) credit for all that we doGetting (and giving) credit for all that we do
Getting (and giving) credit for all that we domhaendel
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer SchoolCarole Goble
 
Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...Ola Spjuth
 
Avoiding the tower of babel - The Role of Data Description Standards in Biome...
Avoiding the tower of babel - The Role of Data Description Standards in Biome...Avoiding the tower of babel - The Role of Data Description Standards in Biome...
Avoiding the tower of babel - The Role of Data Description Standards in Biome...Krzysztof Gorgolewski
 
Lab Notebooks as Data Management (SLA Winter Virtual Conference 2012)
Lab Notebooks as Data Management (SLA Winter Virtual Conference 2012)Lab Notebooks as Data Management (SLA Winter Virtual Conference 2012)
Lab Notebooks as Data Management (SLA Winter Virtual Conference 2012)Kristin Briney
 
Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...Sarah Anna Stewart
 
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...Carole Goble
 

Was ist angesagt? (20)

Software Ecosystems = Big Data
Software Ecosystems = Big DataSoftware Ecosystems = Big Data
Software Ecosystems = Big Data
 
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William EnckHotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
HotSoS16 Tutorial "Text Analytics for Security" by Tao Xie and William Enck
 
Software Analytics: Towards Software Mining that Matters
Software Analytics: Towards Software Mining that MattersSoftware Analytics: Towards Software Mining that Matters
Software Analytics: Towards Software Mining that Matters
 
20171003 lancaster data conversations Chue-Hong
20171003 lancaster data conversations Chue-Hong20171003 lancaster data conversations Chue-Hong
20171003 lancaster data conversations Chue-Hong
 
Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017
 
Micropublication WormBase Workshop International Worm Meeting 2015
Micropublication WormBase Workshop International Worm Meeting 2015Micropublication WormBase Workshop International Worm Meeting 2015
Micropublication WormBase Workshop International Worm Meeting 2015
 
Scientific Software - what happens after the grant?
Scientific Software - what happens after the grant?Scientific Software - what happens after the grant?
Scientific Software - what happens after the grant?
 
Modern tools for sharing and synthesizing neuroimaging results
Modern tools for sharing and synthesizing neuroimaging resultsModern tools for sharing and synthesizing neuroimaging results
Modern tools for sharing and synthesizing neuroimaging results
 
User Expectations in Mobile App Security
User Expectations in Mobile App SecurityUser Expectations in Mobile App Security
User Expectations in Mobile App Security
 
Software Mining and Software Datasets
Software Mining and Software DatasetsSoftware Mining and Software Datasets
Software Mining and Software Datasets
 
Large Scale Studies: Malware Needles in a Haystack
Large Scale Studies: Malware Needles in a HaystackLarge Scale Studies: Malware Needles in a Haystack
Large Scale Studies: Malware Needles in a Haystack
 
Intro to Reproducible Research
Intro to Reproducible ResearchIntro to Reproducible Research
Intro to Reproducible Research
 
Getting (and giving) credit for all that we do
Getting (and giving) credit for all that we doGetting (and giving) credit for all that we do
Getting (and giving) credit for all that we do
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
 
Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...
 
Avoiding the tower of babel - The Role of Data Description Standards in Biome...
Avoiding the tower of babel - The Role of Data Description Standards in Biome...Avoiding the tower of babel - The Role of Data Description Standards in Biome...
Avoiding the tower of babel - The Role of Data Description Standards in Biome...
 
ROHub
ROHubROHub
ROHub
 
Lab Notebooks as Data Management (SLA Winter Virtual Conference 2012)
Lab Notebooks as Data Management (SLA Winter Virtual Conference 2012)Lab Notebooks as Data Management (SLA Winter Virtual Conference 2012)
Lab Notebooks as Data Management (SLA Winter Virtual Conference 2012)
 
Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...
 
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
 

Andere mochten auch

With Great Data Comes Great Responsibility
With Great Data Comes Great ResponsibilityWith Great Data Comes Great Responsibility
With Great Data Comes Great ResponsibilityAndrea Wiggins
 
National Park System Property Designations
National Park System Property DesignationsNational Park System Property Designations
National Park System Property DesignationsAndrea Wiggins
 
secondary data analysis for MS advance research one Lecture eight
secondary data analysis for MS advance research one Lecture eightsecondary data analysis for MS advance research one Lecture eight
secondary data analysis for MS advance research one Lecture eightUniversity of Balochistan
 
Content Analysis vs secondary analysis
Content Analysis vs secondary analysisContent Analysis vs secondary analysis
Content Analysis vs secondary analysisDr. Cupid Lucid
 
Secondary data collection.mjm
Secondary data collection.mjmSecondary data collection.mjm
Secondary data collection.mjmmanjunath
 
Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...
Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...
Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...David Rozas
 
Ch11 Agency Records, Content Analysis, and Secondary Data
Ch11 Agency Records, Content Analysis, and Secondary DataCh11 Agency Records, Content Analysis, and Secondary Data
Ch11 Agency Records, Content Analysis, and Secondary Datayxl007
 
Secondary Data Analysis
Secondary Data AnalysisSecondary Data Analysis
Secondary Data AnalysisKeith Lyons
 
Harvard Housing.Marketing Research.Case Study
Harvard Housing.Marketing Research.Case StudyHarvard Housing.Marketing Research.Case Study
Harvard Housing.Marketing Research.Case StudySkalla Marketing
 
Business Research Methods. problem definition literature review and qualitati...
Business Research Methods. problem definition literature review and qualitati...Business Research Methods. problem definition literature review and qualitati...
Business Research Methods. problem definition literature review and qualitati...Ahsan Khan Eco (Superior College)
 
Primary & secondary data
Primary & secondary dataPrimary & secondary data
Primary & secondary datahezel3210
 

Andere mochten auch (13)

Birds
BirdsBirds
Birds
 
With Great Data Comes Great Responsibility
With Great Data Comes Great ResponsibilityWith Great Data Comes Great Responsibility
With Great Data Comes Great Responsibility
 
Moselle
MoselleMoselle
Moselle
 
National Park System Property Designations
National Park System Property DesignationsNational Park System Property Designations
National Park System Property Designations
 
secondary data analysis for MS advance research one Lecture eight
secondary data analysis for MS advance research one Lecture eightsecondary data analysis for MS advance research one Lecture eight
secondary data analysis for MS advance research one Lecture eight
 
Content Analysis vs secondary analysis
Content Analysis vs secondary analysisContent Analysis vs secondary analysis
Content Analysis vs secondary analysis
 
Secondary data collection.mjm
Secondary data collection.mjmSecondary data collection.mjm
Secondary data collection.mjm
 
Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...
Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...
Quantitative Methods II (#SOC2031). Seminar #11: Secondary analysis. Big data...
 
Ch11 Agency Records, Content Analysis, and Secondary Data
Ch11 Agency Records, Content Analysis, and Secondary DataCh11 Agency Records, Content Analysis, and Secondary Data
Ch11 Agency Records, Content Analysis, and Secondary Data
 
Secondary Data Analysis
Secondary Data AnalysisSecondary Data Analysis
Secondary Data Analysis
 
Harvard Housing.Marketing Research.Case Study
Harvard Housing.Marketing Research.Case StudyHarvard Housing.Marketing Research.Case Study
Harvard Housing.Marketing Research.Case Study
 
Business Research Methods. problem definition literature review and qualitati...
Business Research Methods. problem definition literature review and qualitati...Business Research Methods. problem definition literature review and qualitati...
Business Research Methods. problem definition literature review and qualitati...
 
Primary & secondary data
Primary & secondary dataPrimary & secondary data
Primary & secondary data
 

Ähnlich wie Secondary data analysis with digital trace data

Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013Neo4j
 
Data Description Registry Interoperability WG at Research Data Alliance Third...
Data Description Registry Interoperability WG at Research Data Alliance Third...Data Description Registry Interoperability WG at Research Data Alliance Third...
Data Description Registry Interoperability WG at Research Data Alliance Third...amiraryani
 
DATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docx
DATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docxDATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docx
DATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docxrandyburney60861
 
Web Scale Discovery Reality Check
Web Scale Discovery Reality CheckWeb Scale Discovery Reality Check
Web Scale Discovery Reality CheckJeff Wisniewski
 
Incentivising the uptake of reusable metadata in the survey production process
Incentivising the uptake of reusable metadata in the survey production processIncentivising the uptake of reusable metadata in the survey production process
Incentivising the uptake of reusable metadata in the survey production processLouise Corti
 
Industrial Data Science
Industrial Data ScienceIndustrial Data Science
Industrial Data ScienceNiko Vuokko
 
Electronic Lab Notebooks
Electronic Lab NotebooksElectronic Lab Notebooks
Electronic Lab NotebooksKristin Briney
 
Graham Pryor
Graham PryorGraham Pryor
Graham PryorEduserv
 
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Denodo
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionGrant Ingersoll
 
Hydra Project Management Survey
Hydra Project Management SurveyHydra Project Management Survey
Hydra Project Management SurveyMark Notess
 
Towards an Agile approach to building application profiles
Towards an Agile approach to building application profilesTowards an Agile approach to building application profiles
Towards an Agile approach to building application profilesPaul Walk
 
2016 Ocean Sciences Meeting tutorial
2016 Ocean Sciences Meeting tutorial2016 Ocean Sciences Meeting tutorial
2016 Ocean Sciences Meeting tutorialJosh Young
 
Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...
Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...
Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...benaam
 
FAIRDOM data management support for ERACoBioTech Proposals
FAIRDOM data management support for ERACoBioTech ProposalsFAIRDOM data management support for ERACoBioTech Proposals
FAIRDOM data management support for ERACoBioTech ProposalsFAIRDOM
 
Introduction to Digital Preservation
Introduction to Digital PreservationIntroduction to Digital Preservation
Introduction to Digital PreservationBill LeFurgy
 
Data Virtualization Reference Architectures: Correctly Architecting your Solu...
Data Virtualization Reference Architectures: Correctly Architecting your Solu...Data Virtualization Reference Architectures: Correctly Architecting your Solu...
Data Virtualization Reference Architectures: Correctly Architecting your Solu...Denodo
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflowsSSSW
 

Ähnlich wie Secondary data analysis with digital trace data (20)

Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
Neo4j Theory and Practice - Tareq Abedrabbo @ GraphConnect London 2013
 
Data Description Registry Interoperability WG at Research Data Alliance Third...
Data Description Registry Interoperability WG at Research Data Alliance Third...Data Description Registry Interoperability WG at Research Data Alliance Third...
Data Description Registry Interoperability WG at Research Data Alliance Third...
 
DATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docx
DATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docxDATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docx
DATA SCIENCE AND BIG DATA ANALYTICSCHAPTER 2 DATA ANA.docx
 
Web Scale Discovery Reality Check
Web Scale Discovery Reality CheckWeb Scale Discovery Reality Check
Web Scale Discovery Reality Check
 
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
 
Incentivising the uptake of reusable metadata in the survey production process
Incentivising the uptake of reusable metadata in the survey production processIncentivising the uptake of reusable metadata in the survey production process
Incentivising the uptake of reusable metadata in the survey production process
 
Industrial Data Science
Industrial Data ScienceIndustrial Data Science
Industrial Data Science
 
Data cycle health
Data cycle healthData cycle health
Data cycle health
 
Electronic Lab Notebooks
Electronic Lab NotebooksElectronic Lab Notebooks
Electronic Lab Notebooks
 
Graham Pryor
Graham PryorGraham Pryor
Graham Pryor
 
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in Action
 
Hydra Project Management Survey
Hydra Project Management SurveyHydra Project Management Survey
Hydra Project Management Survey
 
Towards an Agile approach to building application profiles
Towards an Agile approach to building application profilesTowards an Agile approach to building application profiles
Towards an Agile approach to building application profiles
 
2016 Ocean Sciences Meeting tutorial
2016 Ocean Sciences Meeting tutorial2016 Ocean Sciences Meeting tutorial
2016 Ocean Sciences Meeting tutorial
 
Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...
Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...
Early Lessons from Building Sensor.Network: An Open Data Exchange for the Web...
 
FAIRDOM data management support for ERACoBioTech Proposals
FAIRDOM data management support for ERACoBioTech ProposalsFAIRDOM data management support for ERACoBioTech Proposals
FAIRDOM data management support for ERACoBioTech Proposals
 
Introduction to Digital Preservation
Introduction to Digital PreservationIntroduction to Digital Preservation
Introduction to Digital Preservation
 
Data Virtualization Reference Architectures: Correctly Architecting your Solu...
Data Virtualization Reference Architectures: Correctly Architecting your Solu...Data Virtualization Reference Architectures: Correctly Architecting your Solu...
Data Virtualization Reference Architectures: Correctly Architecting your Solu...
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflows
 

Mehr von Andrea Wiggins

Online Communities in Citizen Science & BirdCams
Online Communities in Citizen Science & BirdCamsOnline Communities in Citizen Science & BirdCams
Online Communities in Citizen Science & BirdCamsAndrea Wiggins
 
Free as in Puppies: Compensating for ICT Constraints in Citizen Science
Free as in Puppies: Compensating for ICT Constraints in Citizen ScienceFree as in Puppies: Compensating for ICT Constraints in Citizen Science
Free as in Puppies: Compensating for ICT Constraints in Citizen ScienceAndrea Wiggins
 
Crowdsourcing Citizen Science Data Quality with a Human-Computer Learning Net...
Crowdsourcing Citizen Science Data Quality with a Human-Computer Learning Net...Crowdsourcing Citizen Science Data Quality with a Human-Computer Learning Net...
Crowdsourcing Citizen Science Data Quality with a Human-Computer Learning Net...Andrea Wiggins
 
Online Communities in Citizen Science
Online Communities in Citizen ScienceOnline Communities in Citizen Science
Online Communities in Citizen ScienceAndrea Wiggins
 
Citizen Science Phenotypes
Citizen Science PhenotypesCitizen Science Phenotypes
Citizen Science PhenotypesAndrea Wiggins
 
The Evolving Landscape of Citizen Science
The Evolving Landscape of Citizen ScienceThe Evolving Landscape of Citizen Science
The Evolving Landscape of Citizen ScienceAndrea Wiggins
 
Citizen Science 101: What Every Researcher Should Know About Crowdsourcing Sc...
Citizen Science 101: What Every Researcher Should Know About Crowdsourcing Sc...Citizen Science 101: What Every Researcher Should Know About Crowdsourcing Sc...
Citizen Science 101: What Every Researcher Should Know About Crowdsourcing Sc...Andrea Wiggins
 
Data Management for Citizen Science
Data Management for Citizen ScienceData Management for Citizen Science
Data Management for Citizen ScienceAndrea Wiggins
 
Crowdsourcing Scientific Work: A Comparative Study of Technologies, Processes...
Crowdsourcing Scientific Work: A Comparative Study of Technologies, Processes...Crowdsourcing Scientific Work: A Comparative Study of Technologies, Processes...
Crowdsourcing Scientific Work: A Comparative Study of Technologies, Processes...Andrea Wiggins
 
Mechanisms for Data Quality and Validation in Citizen Science
Mechanisms for Data Quality and Validation in Citizen ScienceMechanisms for Data Quality and Validation in Citizen Science
Mechanisms for Data Quality and Validation in Citizen ScienceAndrea Wiggins
 
Open Source & Citizen Science
Open Source & Citizen ScienceOpen Source & Citizen Science
Open Source & Citizen ScienceAndrea Wiggins
 
From Conservation to Crowdsourcing: A Typology of Citizen Science
From Conservation to Crowdsourcing: A Typology of Citizen ScienceFrom Conservation to Crowdsourcing: A Typology of Citizen Science
From Conservation to Crowdsourcing: A Typology of Citizen ScienceAndrea Wiggins
 
Motivation by Design: Technologies, Experiences, and Incentives
Motivation by Design: Technologies, Experiences, and IncentivesMotivation by Design: Technologies, Experiences, and Incentives
Motivation by Design: Technologies, Experiences, and IncentivesAndrea Wiggins
 
Data Intensive Collaboration in Science and Engineering: CSCW workshop themes
Data Intensive Collaboration in Science and Engineering: CSCW workshop themesData Intensive Collaboration in Science and Engineering: CSCW workshop themes
Data Intensive Collaboration in Science and Engineering: CSCW workshop themesAndrea Wiggins
 
Open Source, Open Science, & Citizen Science
Open Source, Open Science, & Citizen ScienceOpen Source, Open Science, & Citizen Science
Open Source, Open Science, & Citizen ScienceAndrea Wiggins
 
Reclassifying Success and Tragedy in FLOSS Projects
Reclassifying Success and Tragedy in FLOSS ProjectsReclassifying Success and Tragedy in FLOSS Projects
Reclassifying Success and Tragedy in FLOSS ProjectsAndrea Wiggins
 
Intellectual Diversity in the iSchools: Past, Present and Future
Intellectual Diversity in the iSchools: Past, Present and FutureIntellectual Diversity in the iSchools: Past, Present and Future
Intellectual Diversity in the iSchools: Past, Present and FutureAndrea Wiggins
 
Distributed Scientific Collaboration: Research Opportunities in Citizen Science
Distributed Scientific Collaboration: Research Opportunities in Citizen ScienceDistributed Scientific Collaboration: Research Opportunities in Citizen Science
Distributed Scientific Collaboration: Research Opportunities in Citizen ScienceAndrea Wiggins
 
Designing Virtual Organizations for Citizen Science
Designing Virtual Organizations for Citizen ScienceDesigning Virtual Organizations for Citizen Science
Designing Virtual Organizations for Citizen ScienceAndrea Wiggins
 

Mehr von Andrea Wiggins (20)

Online Communities in Citizen Science & BirdCams
Online Communities in Citizen Science & BirdCamsOnline Communities in Citizen Science & BirdCams
Online Communities in Citizen Science & BirdCams
 
Free as in Puppies: Compensating for ICT Constraints in Citizen Science
Free as in Puppies: Compensating for ICT Constraints in Citizen ScienceFree as in Puppies: Compensating for ICT Constraints in Citizen Science
Free as in Puppies: Compensating for ICT Constraints in Citizen Science
 
Crowdsourcing Citizen Science Data Quality with a Human-Computer Learning Net...
Crowdsourcing Citizen Science Data Quality with a Human-Computer Learning Net...Crowdsourcing Citizen Science Data Quality with a Human-Computer Learning Net...
Crowdsourcing Citizen Science Data Quality with a Human-Computer Learning Net...
 
Online Communities in Citizen Science
Online Communities in Citizen ScienceOnline Communities in Citizen Science
Online Communities in Citizen Science
 
Citizen Science Phenotypes
Citizen Science PhenotypesCitizen Science Phenotypes
Citizen Science Phenotypes
 
The Evolving Landscape of Citizen Science
The Evolving Landscape of Citizen ScienceThe Evolving Landscape of Citizen Science
The Evolving Landscape of Citizen Science
 
Citizen Science 101: What Every Researcher Should Know About Crowdsourcing Sc...
Citizen Science 101: What Every Researcher Should Know About Crowdsourcing Sc...Citizen Science 101: What Every Researcher Should Know About Crowdsourcing Sc...
Citizen Science 101: What Every Researcher Should Know About Crowdsourcing Sc...
 
Data Management for Citizen Science
Data Management for Citizen ScienceData Management for Citizen Science
Data Management for Citizen Science
 
Crowdsourcing Scientific Work: A Comparative Study of Technologies, Processes...
Crowdsourcing Scientific Work: A Comparative Study of Technologies, Processes...Crowdsourcing Scientific Work: A Comparative Study of Technologies, Processes...
Crowdsourcing Scientific Work: A Comparative Study of Technologies, Processes...
 
Mechanisms for Data Quality and Validation in Citizen Science
Mechanisms for Data Quality and Validation in Citizen ScienceMechanisms for Data Quality and Validation in Citizen Science
Mechanisms for Data Quality and Validation in Citizen Science
 
Open Source & Citizen Science
Open Source & Citizen ScienceOpen Source & Citizen Science
Open Source & Citizen Science
 
From Conservation to Crowdsourcing: A Typology of Citizen Science
From Conservation to Crowdsourcing: A Typology of Citizen ScienceFrom Conservation to Crowdsourcing: A Typology of Citizen Science
From Conservation to Crowdsourcing: A Typology of Citizen Science
 
Motivation by Design: Technologies, Experiences, and Incentives
Motivation by Design: Technologies, Experiences, and IncentivesMotivation by Design: Technologies, Experiences, and Incentives
Motivation by Design: Technologies, Experiences, and Incentives
 
Data Intensive Collaboration in Science and Engineering: CSCW workshop themes
Data Intensive Collaboration in Science and Engineering: CSCW workshop themesData Intensive Collaboration in Science and Engineering: CSCW workshop themes
Data Intensive Collaboration in Science and Engineering: CSCW workshop themes
 
Open Source, Open Science, & Citizen Science
Open Source, Open Science, & Citizen ScienceOpen Source, Open Science, & Citizen Science
Open Source, Open Science, & Citizen Science
 
Reclassifying Success and Tragedy in FLOSS Projects
Reclassifying Success and Tragedy in FLOSS ProjectsReclassifying Success and Tragedy in FLOSS Projects
Reclassifying Success and Tragedy in FLOSS Projects
 
Crowdsourcing Science
Crowdsourcing ScienceCrowdsourcing Science
Crowdsourcing Science
 
Intellectual Diversity in the iSchools: Past, Present and Future
Intellectual Diversity in the iSchools: Past, Present and FutureIntellectual Diversity in the iSchools: Past, Present and Future
Intellectual Diversity in the iSchools: Past, Present and Future
 
Distributed Scientific Collaboration: Research Opportunities in Citizen Science
Distributed Scientific Collaboration: Research Opportunities in Citizen ScienceDistributed Scientific Collaboration: Research Opportunities in Citizen Science
Distributed Scientific Collaboration: Research Opportunities in Citizen Science
 
Designing Virtual Organizations for Citizen Science
Designing Virtual Organizations for Citizen ScienceDesigning Virtual Organizations for Citizen Science
Designing Virtual Organizations for Citizen Science
 

Kürzlich hochgeladen

Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Kürzlich hochgeladen (20)

Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

Secondary data analysis with digital trace data

  • 1. Secondary data analysis with digital trace data Examples from FLOSS research Andrea Wiggins 13 Juillet, 2011
  • 2. Secondary Data Analysis • Uses existing data produced or collected by someone else, usually for a different purpose • Databases • Repositories • Surveys • Emails • Social networks 2
  • 3. Digital Trace Data • Records of activity (trace data) undertaken through an online information system (thus digital) • Increasingly common in studies of online phenomena • Large volumes of available data • Can be complete: a census, not a sample • May be more reliably recorded than other data 3
  • 4. Characteristics 1. Found data (not produced for research) 2. Event-based data (not summary data) 3. Events occur over time, so it is longitudinal data 4
  • 5. Requirements • Understand the original data source • How it was collected, potential problems • Limitations of the sample • What the data describe • Match with appropriate analysis methods and measures • New types of data may require new measures • Theoretical coherence is very important 5
  • 6. Advantages • Data may be “complete” • Usually no response bias (exception: cookies) • May cover long periods of time and large groups • Multiple different data types, but mostly textual • Data are often easy to acquire • APIs or scraping web pages (with caution) • Databases, archives, or repositories of research data • But remember: you usually get what you pay for! 6
  • 7. Disadvantages • Often difficult to know limitations of data • Data may be poorly documented • Original creator may not be available for comment • Volume of data can be overwhelming • Sampling strategies needed, e.g., temporal, random • Substantial time required for data preparation: 90% of effort • Exceptions are everywhere and will break analyses, but can only be discovered through trial and error 7
  • 8. Example: Email Networks • Data source: email listservs for FLOSS projects • Analysis approach: create social networks • Within discussion threads, individuals are nodes, and links are reply-to messages • Some conceptual issues for interpretation, choice of measures • Technical challenges • Temporal aggregation • Identity resolution 8
  • 9. Figures from Howison et al., 2006 Temporal Aggregation 9
  • 11. Network Results • Different levels of correlation between venues, suggesting different types of interactions • User venues more decentralized than developer venues, reflecting greater number of participants • Overall trend toward decentralization could be result of different influences • Observed anomalous patterns in trackers for both projects: periodic centralization spikes Cleaning up before shutting down • A single user makes batch bug closings (up to 279!) – Fire’s (feature request) tracker housekeeping appears to be preparation for project closure – Gaim’s tracker housekeeping was more regular and repeated 11
  • 12. Example: Classification • Replication of success-tragedy classification • Classification criteria originally drawn from interviews with community members • Data extracted from repositories • Technical challenges • Merging data from two repositories • Processing large volume of data in multiple steps 12
  • 13. Variables • Inputs: project names and 5 threshold values for classification tests, e.g. number of downloads • Project statistics retrieved from repositories • Founding date • Data collection date • Dates for all releases • Number of downloads • URL 13
  • 15. Classification Results Class Original Our results Difference unclassifiabl 3 186 3 296 +110 e II 13 342 (12%) 16 252 (14%) +2 910 (+2%) IG 10 711 (10%) 12 991 (11%) +2 280 (+1%) TI 37 320 (35%) 36 507 (31%) -813 (-4%) TG 30 592 (28%) 32 642 (28%) +2 050 (0%) SG 15 782 (15%) 16 045 (14%) +263 (-1%) other 8 422 0 Total 119 355 117 733 15
  • 16. Thanks! • Questions? 16