11. “The Tribune’s biggest magnet by far has been its more than three dozen interactive databases, which collectively have drawn three times as many page views as the site’s stories.”
http://bit.ly/dj2dmz
24. Advanced search by file type
“Performance figures” filetype:pdf
filetype:xls
filetype:doc
filetype:ppt
filetype:rdf OR filetype:xml
25. Advanced search by domain
“Disclosure logs” site:.gov.es
Database site:.org.cat OR site:.org
+Tables -chairs site:
Health, police, military domains
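The file type and domain searches on the two slides above can be composed programmatically. What follows is a minimal sketch, not a definitive tool: `build_query` and `search_url` are hypothetical helper names, and it assumes the search engine accepts `filetype:`, `site:`, `OR` and a leading hyphen for exclusion, with no space after the colon.

```python
from urllib.parse import urlencode


def build_query(terms="", filetypes=(), sites=(), exclude=()):
    """Compose a search string from terms plus filetype:/site: operators.

    Hypothetical helper for illustration only.
    """
    parts = []
    if terms:
        parts.append(terms)
    # A leading hyphen excludes a word from the results.
    parts.extend(f"-{word}" for word in exclude)
    # Operators take no space after the colon; alternatives are joined with OR.
    if filetypes:
        parts.append(" OR ".join(f"filetype:{ft}" for ft in filetypes))
    if sites:
        parts.append(" OR ".join(f"site:{s}" for s in sites))
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """Turn a query string into a URL that can be bookmarked or shared."""
    return f"{base}?{urlencode({'q': query})}"


print(build_query('"Performance figures"', filetypes=["pdf"]))
print(build_query("Database", sites=[".org.cat", ".org"]))
print(search_url(build_query("Tables", exclude=["chairs"], filetypes=["xls"])))
```

Keeping the query in one function makes it easy to rerun the same search across a list of domains or file types.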
26. Use overseas sources
• US medicine databases
• EU subsidy databases
• Swedish people data
• International police agency correspondence
27. Scraping
Scraping can automate and schedule the gathering process if there are multiple sources
Tools: OutWit Hub plugin, Yahoo! Pipes, ScraperWiki, Google Spreadsheets formulae
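As a rough illustration of what the tools above do, the sketch below pulls table rows out of several HTML pages and stacks them into one dataset. It uses only Python's standard library rather than any of the tools named on the slide. The page contents in `SAMPLE_PAGES` are made up; in practice the HTML would be fetched (for example with `urllib.request.urlopen`) and the script run on a schedule.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_tables(html):
    """Return all table rows found in an HTML string."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical pages standing in for multiple real sources.
SAMPLE_PAGES = {
    "source_a": "<table><tr><th>Area</th><th>Spend</th></tr>"
                "<tr><td>North</td><td>120</td></tr></table>",
    "source_b": "<table><tr><td>South</td><td> 95 </td></tr></table>",
}

all_rows = []
for name, html in SAMPLE_PAGES.items():
    for row in scrape_tables(html):
        all_rows.append([name] + row)  # keep track of where each row came from

print(all_rows)
```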
30. Different words for the same thing
Double spaces, punctuation
Wrong data type
Mistyped
Duplicate entries
Default entries (1/1/00)
...Saves time later
31. "Because we take the time to clean the data, we are able to do lobbying stories no other news organisation can do."
David Donald, Center for Public Integrity
32. Cleaning methods
Group by term then sort to see duplications
Find & replace double spaces, etc.
Select column/row & check data type
Sort to find unusually large/small values and neighbouring misspellings
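The cleaning methods above are usually done by hand in a spreadsheet, but the same checks can be sketched in code. The example below is one possible approach under stated assumptions: the rows are invented, the function names are hypothetical, and "same thing" is approximated by ignoring case, punctuation and extra spaces.

```python
import re
from collections import defaultdict

DEFAULT_DATES = {"1/1/00"}  # placeholder date that signals a default entry


def normalise(value):
    """Lower-case, strip punctuation and collapse repeated whitespace."""
    value = re.sub(r"[^\w\s]", "", value.lower())
    return re.sub(r"\s+", " ", value).strip()


def group_variants(values):
    """Group spellings that normalise to the same term; keep groups of 2+."""
    groups = defaultdict(set)
    for value in values:
        groups[normalise(value)].add(value)
    return {key: sorted(vs) for key, vs in groups.items() if len(vs) > 1}


def find_duplicates(rows):
    """Return lists of row indices whose normalised contents are identical."""
    seen = defaultdict(list)
    for index, row in enumerate(rows):
        seen[tuple(normalise(cell) for cell in row)].append(index)
    return [indices for indices in seen.values() if len(indices) > 1]


def wrong_type(values):
    """Return entries in a numeric column that cannot be read as numbers."""
    bad = []
    for value in values:
        try:
            float(value.replace(",", ""))
        except ValueError:
            bad.append(value)
    return bad


# Invented rows: (name, amount, date)
rows = [
    ("Acme Ltd.", "1,200", "3/4/09"),
    ("ACME  Ltd", "1,200", "3/4/09"),
    ("Bolt & Co", "n/a", "1/1/00"),
]

print(group_variants(r[0] for r in rows))        # different words, same thing
print(find_duplicates(rows))                     # duplicate entries
print(wrong_type([r[1] for r in rows]))          # wrong data type
print([r for r in rows if r[2] in DEFAULT_DATES])  # default entries
```

Each check only flags candidates; deciding which spelling or row is correct still needs a person.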
33. Never publish a name from data without running a background check
Check.
44. Combining data sources
Geocoded data with map
- Live data (e.g. Twitter API)
- Static data (e.g. Google Docs)
- Dynamic data (e.g. Google Form)
2 spreadsheets with common data
- Tools: MySQL, Access, etc.
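Combining two spreadsheets that share a common column is a database join. The sketch below uses Python's built-in `sqlite3` module as a stand-in for MySQL or Access; the table names, column names and figures are all invented for illustration.

```python
import sqlite3

# An in-memory database stands in for MySQL/Access in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE spending (area TEXT, spend INTEGER);
    CREATE TABLE population (area TEXT, people INTEGER);
""")

# Two "spreadsheets" with a common column: area.
conn.executemany(
    "INSERT INTO spending VALUES (?, ?)",
    [("North", 120), ("South", 95), ("East", 40)],
)
conn.executemany(
    "INSERT INTO population VALUES (?, ?)",
    [("North", 60), ("South", 19)],
)

# LEFT JOIN keeps every spending row, even where no population row matches,
# so gaps in the second source show up as None rather than vanishing.
query = """
    SELECT s.area, s.spend, p.people,
           ROUND(1.0 * s.spend / p.people, 2) AS spend_per_person
    FROM spending AS s
    LEFT JOIN population AS p ON p.area = s.area
    ORDER BY s.area
"""
combined = conn.execute(query).fetchall()
for row in combined:
    print(row)
```

The join only works if the common column is spelled identically in both sources, which is one reason cleaning comes first.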