1. Secondary data analysis
with digital trace data
Examples from FLOSS research
Andrea Wiggins
13 July 2011
2. Secondary Data Analysis
• Uses existing data produced or collected by
someone else, usually for a different purpose
• Databases
• Repositories
• Surveys
• Emails
• Social networks
3. Digital Trace Data
• Records of activity (trace data) undertaken through
an online information system (thus digital)
• Increasingly common in studies of online
phenomena
• Large volumes of available data
• Can be complete: a census, not a sample
• May be more reliably recorded than other data
4. Characteristics
1. Found data (not produced for research)
2. Event-based data (not summary data)
3. Events occur over time, so it is longitudinal data
5. Requirements
• Understand the original data source
• How it was collected, potential problems
• Limitations of the sample
• What the data describe
• Match with appropriate analysis methods and measures
• New types of data may require new measures
• Theoretical coherence is very important
6. Advantages
• Data may be “complete”
• Usually no response bias (exception: cookies)
• May cover long periods of time and large groups
• Multiple different data types, but mostly textual
• Data are often easy to acquire
• APIs or scraping web pages (with caution)
• Databases, archives, or repositories of research data
• But remember: you usually get what you pay for!
7. Disadvantages
• Often difficult to know limitations of data
• Data may be poorly documented
• Original creator may not be available for comment
• Volume of data can be overwhelming
• Sampling strategies needed, e.g., temporal, random
• Substantial time required for data preparation: 90% of effort
• Exceptions are everywhere and will break analyses, but can
only be discovered through trial and error
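A minimal sketch of the two sampling strategies named above (temporal and random), assuming each event is a record with a timestamp. The record layout, function names, and dates are hypothetical and only for illustration.

```python
import random
from datetime import datetime


def temporal_sample(events, start, end):
    """Keep only events whose timestamp falls in the half-open window [start, end)."""
    return [e for e in events if start <= e["time"] < end]


def random_sample(events, k, seed=0):
    """Draw up to k events uniformly at random, without replacement.

    A fixed seed makes the draw repeatable, which helps when an analysis
    has to be rerun after a data-cleaning fix.
    """
    rng = random.Random(seed)
    return rng.sample(events, min(k, len(events)))


# Hypothetical event log: 120 events, ten per calendar month of 2010.
events = [{"id": i, "time": datetime(2010, 1 + i % 12, 1)} for i in range(120)]

first_quarter = temporal_sample(events, datetime(2010, 1, 1), datetime(2010, 4, 1))
subset = random_sample(first_quarter, 10)
```

Combining the two (a random draw inside a time window) keeps the volume manageable while preserving the event-level detail.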
8. Example: Email Networks
• Data source: email listservs for FLOSS projects
• Analysis approach: create social networks
• Within discussion threads, individuals are nodes, and links
are reply-to messages
• Some conceptual issues for interpretation, choice of
measures
• Technical challenges
• Temporal aggregation
• Identity resolution
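The approach above can be sketched as follows, assuming each message has already been parsed into a record with an id, a sender address, the id of the message it replies to, and a month label. The field names, the alias table, and the monthly window are assumptions for illustration, not the pipeline used in the study.

```python
from collections import Counter, defaultdict

# Hypothetical alias table for identity resolution: address -> canonical person.
ALIASES = {
    "ann@home.example": "ann",
    "ann@work.example": "ann",
    "bob@x.example": "bob",
    "cy@y.example": "cy",
}


def resolve(address):
    """Map an email address to one identity; unknown addresses map to themselves."""
    key = address.strip().lower()
    return ALIASES.get(key, key)


def reply_networks(messages):
    """Build one weighted, directed reply network per time period.

    Returns {period: Counter({(replier, original_author): reply_count})}.
    """
    author = {m["id"]: resolve(m["sender"]) for m in messages}
    nets = defaultdict(Counter)
    for m in messages:
        parent = m["reply_to"]
        if parent is None or parent not in author:
            continue  # thread starter, or parent message missing from the archive
        src, dst = author[m["id"]], author[parent]
        if src != dst:  # skip replies to oneself
            nets[m["month"]][(src, dst)] += 1  # temporal aggregation by month
    return dict(nets)


messages = [
    {"id": "m1", "sender": "ann@home.example", "reply_to": None, "month": "2005-01"},
    {"id": "m2", "sender": "bob@x.example", "reply_to": "m1", "month": "2005-01"},
    {"id": "m3", "sender": "Ann@work.example", "reply_to": "m2", "month": "2005-01"},
    {"id": "m4", "sender": "cy@y.example", "reply_to": "m1", "month": "2005-02"},
    {"id": "m5", "sender": "ann@work.example", "reply_to": "m1", "month": "2005-02"},
]
nets = reply_networks(messages)
```

Both technical challenges show up here: the choice of window (month) changes which links appear together, and without the alias table the two addresses for "ann" would be counted as separate nodes.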
11. Network Results
• Different levels of correlation
between venues, suggesting different
types of interactions
• User venues more decentralized than
developer venues, reflecting greater
number of participants
• Overall trend toward decentralization
could be result of different influences
• Observed anomalous patterns in trackers for
both projects: periodic centralization spikes
Cleaning up before shutting down
• A single user makes batch bug closings
(up to 279!)
– Fire’s (feature request) tracker housekeeping
appears to be preparation for project
closure
– Gaim’s tracker housekeeping was more
regular and repeated
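To illustrate what a centralization spike looks like numerically, here is a minimal sketch using Freeman's degree centralization on an undirected network. The slides shown here do not state which centralization measure was used, so this choice is an assumption; the function name and the toy networks are hypothetical.

```python
from collections import defaultdict


def degree_centralization(edges):
    """Freeman degree centralization of an undirected simple graph.

    edges: iterable of (a, b) pairs. Returns a value from 0.0 (all nodes
    equally connected) to 1.0 (a star: one node linked to everyone else).
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 3:
        return 0.0  # the measure is undefined below three nodes; treated as 0 here
    degrees = [len(v) for v in neighbors.values()]
    top = max(degrees)
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))


# A period in which one user touches every other participant's bug (a star)...
star = [("admin", u) for u in ("u1", "u2", "u3", "u4")]
# ...versus a period in which activity is evenly spread (a ring).
ring = [("u1", "u2"), ("u2", "u3"), ("u3", "u4"), ("u4", "u1")]

scores = {"housekeeping": degree_centralization(star),
          "ordinary": degree_centralization(ring)}
```

Computed per time period, a batch of closings by a single user pushes the score toward 1.0 for that period, which is the kind of spike described above.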
12. Example: Classification
• Replication of success-tragedy classification
• Classification criteria originally drawn from
interviews with community members
• Data extracted from repositories
• Technical challenges
• Merging data from two repositories
• Processing large volume of data in multiple steps
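Merging data from two repositories can be sketched as an outer join on a normalized project name. The record fields, the normalization rule, and the conflict policy (the first repository wins) are assumptions for illustration, not the procedure used in the replication.

```python
def normalize(name):
    """Reduce a project name to a join key; real data usually needs more rules."""
    return name.strip().lower()


def merge_repos(repo_a, repo_b):
    """Outer-join two lists of project records on the normalized 'name' field.

    Values from repo_a win when both sources report the same field.
    Each merged entry keeps a 'sources' set so provenance is not lost.
    """
    merged = {}
    # repo_b is applied first so that repo_a overwrites it on conflict.
    for label, records in (("b", repo_b), ("a", repo_a)):
        for rec in records:
            entry = merged.setdefault(normalize(rec["name"]), {"sources": set()})
            entry.update({k: v for k, v in rec.items() if k != "name"})
            entry["sources"].add(label)
    return merged


# Hypothetical records from two repositories describing overlapping projects.
repo_a = [{"name": "Alpha", "downloads": 100}, {"name": "Beta", "downloads": 50}]
repo_b = [{"name": "alpha ", "founded": "2001-05-01", "downloads": 90},
          {"name": "Gamma", "founded": "2003-02-01"}]

merged = merge_repos(repo_a, repo_b)
```

Keeping the `sources` set makes it easy to list projects found in only one repository, which is often where the exceptions that break later processing steps come from.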
13. Variables
• Inputs: project names and 5 threshold values for
classification tests, e.g. number of downloads
• Project statistics retrieved from repositories
• Founding date
• Data collection date
• Dates for all releases
• Number of downloads
• URL
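A threshold-based classifier over these variables might look like the sketch below. The five threshold names and values, the decision rules, and the three labels are hypothetical placeholders; the slides say only that five thresholds exist, and this is not the published success-tragedy criteria.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Project:
    name: str
    founded: date
    collected: date   # data collection date
    releases: list    # dates of all releases
    downloads: int
    url: str


# Hypothetical thresholds (assumed values, not taken from the study).
THRESHOLDS = {
    "min_releases": 3,
    "min_release_span_days": 180,
    "min_downloads": 10,
    "max_idle_days": 365,
    "min_project_age_days": 365,
}


def classify(p, t=THRESHOLDS):
    """Return 'success', 'tragedy', or 'indeterminate' under the assumed rules."""
    releases = sorted(p.releases)
    age = (p.collected - p.founded).days
    if not releases:
        # No release yet: abandoned only if the project is old enough to judge.
        return "tragedy" if age >= t["min_project_age_days"] else "indeterminate"
    span = (releases[-1] - releases[0]).days
    idle = (p.collected - releases[-1]).days
    if (len(releases) >= t["min_releases"]
            and span >= t["min_release_span_days"]
            and p.downloads >= t["min_downloads"]):
        return "success"
    if idle >= t["max_idle_days"]:
        return "tragedy"
    return "indeterminate"


active = Project("alpha", date(2004, 1, 1), date(2006, 6, 1),
                 [date(2005, 1, 1), date(2005, 6, 1), date(2006, 1, 1)],
                 500, "http://alpha.example")
stalled = Project("beta", date(2004, 1, 1), date(2006, 6, 1), [], 0, "http://beta.example")
young = Project("gamma", date(2006, 1, 1), date(2006, 6, 1), [], 0, "http://gamma.example")

labels = [classify(p) for p in (active, stalled, young)]
```

Passing the thresholds as an input, rather than hard-coding them, makes it straightforward to rerun the classification with different values and see how sensitive the results are.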