1. Measuring progress
toward a cultural norm of
shared (and reused!)
biomedical research data
Heather Piwowar
Department of Biomedical Informatics
University of Pittsburgh
4. Sharing research data
PAST MEDICAL HISTORY:
Past medical history showed she had
superficial phlebitis times two in the past, had
non-insulin dependent diabetes mellitus for
four years.
She had been hypothyroid for three years.
HISTORY OF PRESENT ILLNESS:
The patient is a 58-year-old female, …
http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg
http://en.wikipedia.org/wiki/Image:Helices.png
http://en.wikipedia.org/wiki/Image:Heatmap.png
http://en.wikipedia.org/wiki/Image:Microarray2.gif
http://zellig.cpmc.columbia.edu/medlee/demo/
http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
9. But... costly for authors
Find
Organize
Document
Deidentify
Format
Decide
Ask
Submit
Answer questions
Worry about mistakes being found
Worry about data being misinterpreted
Worry about being scooped
Forgo money and IP and prestige???
11. ... on initiatives, requests,
requirements, and tools
NIH data sharing plan requirement
Journal requirements
Public databases
Data sharing grids like BIRN and caBIG
Data formatting standards
Editorials, letters to the editor, discussion....
16. research questions
1. Is there benefit for those who share?
2. Do journal policies increase rates of sharing?
3. What other factors are correlated with
sharing and withholding data?
17. microarray data
http://en.wikipedia.org/wiki/DNA_microarray
http://en.wikipedia.org/wiki/Image:Heatmap.png
http://commons.wikimedia.org/wiki/File:DNA_double_helix_vertikal.PNG
20. currency of value?
Citations.
$50!
Diamond, Arthur M. What is a Citation Worth?
The Journal of Human Resources (1986)
vol. 21 (2) pp. 200-215
21. Prior work focused on the citation
advantage of an open access
publishing model.
Our question: are articles that share
their raw research data cited more
than articles that don’t?
22. dataset
85 cancer microarray trials published in 1999-2003, as
identified by Ntzani and Ioannidis (2003)
citations
ISI Web of Science Citation index, citations from
2004-2005
data sharing locations
Publisher and lab websites, microarray databases, WayBack
Internet Archive, Oncomine
statistics
Multivariate linear regression
24. In multivariate regression, we found studies
that had made their data publicly available
received 69% more citations than similar
studies that did not share their data
(95% confidence interval: 18% to 143%)
Piwowar, Day and Fridsma (2007) Sharing Detailed Research
Data Is Associated with Increased Citation Rate.
PLoS ONE 2(3): e308
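A minimal sketch of how a figure such as "69% more citations" can be read off a regression coefficient. It assumes the outcome was modelled as the logarithm of the citation count, which the slide does not state; the function names are illustrative and not taken from the study.

```python
import math

def percent_more_citations(beta: float) -> float:
    """Convert a coefficient on a log(citations) scale into a percent difference."""
    return (math.exp(beta) - 1.0) * 100.0

def log_scale_coefficient(percent: float) -> float:
    """Inverse conversion: percent difference -> coefficient on the log scale."""
    return math.log(1.0 + percent / 100.0)

# Figures quoted on the slide, mapped to the (assumed) log scale.
beta_shared = log_scale_coefficient(69)                       # point estimate
ci_low, ci_high = log_scale_coefficient(18), log_scale_coefficient(143)

print(f"coefficient ~ {beta_shared:.3f}, 95% CI ~ ({ci_low:.3f}, {ci_high:.3f})")
print(f"back-converted: {percent_more_citations(beta_shared):.0f}% more citations")
```

On a log scale the confidence interval is roughly symmetric around the estimate, which is why the percent interval (18% to 143%) looks lopsided around 69%.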
25. future work
• collect a larger dataset for citation
analysis (stay tuned)
• investigate other datatypes
• examine citation context
27. “An inherent principle of
publication is that others
should be able to replicate and
build upon the authors'
published claims. Therefore, a
condition of publication
in a Nature journal is that
authors are required to make
materials, data and associated
protocols available in a
publicly accessible database
…”
http://www.nature.com/authors/editorial_policies/availability.html
http://www.nature.com/nature/journal/v453/n7197/index.html
28. Prior work examined data sharing
policies in biomedicine, but these
reviews are now dated,
consider a variety of resources,
and don’t correlate policy to
behaviour.
McCain. Science Communication, Vol. 16, No. 4. (1 June 1995), pp. 403-431
NAS. Sharing Publication-Related Data and Materials. (2003), p. 33
29. Our aim: look at data sharing policies
within the Instructions to Authors
statements of 70 journals, as they
apply to gene expression microarray
data.
30. content of data sharing policies
Very diverse policies in terms of:
• statements of policy motivation
• datatype-specific policies
• requested vs. required
• data location
• data format
• data completeness
• timeliness of sharing
• consequences for not sharing
• exceptions
31. strength of data sharing policies
No applicable policy (43%)
Weak policy (24%)
should, recommend, request
must, but without database accession number
Strong policy (33%)
must, required, condition of publication
requires database accession number
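The three strength categories above can be sketched as a simple keyword rule. This is only an illustration: the slide does not say how policies were coded (it may well have been done by hand), and the keyword stems and the function name classify_policy are assumptions.

```python
# Keyword stems taken from the wording on the slide; matching on stems is an assumption.
SOFT_TERMS = ("should", "recommend", "request")
HARD_TERMS = ("must", "require", "condition of publication")

def classify_policy(policy_text: str) -> str:
    """Return 'none', 'weak', or 'strong' for a journal's data sharing statement."""
    text = policy_text.lower()
    hard = any(term in text for term in HARD_TERMS)
    soft = any(term in text for term in SOFT_TERMS)
    needs_accession = "accession number" in text
    if hard and needs_accession:
        return "strong"   # obligation language plus a database accession number
    if hard or soft:
        return "weak"     # soft language, or "must" without an accession number
    return "none"         # no applicable policy found

examples = [
    "Authors must deposit microarray data and provide the database accession number.",
    "Authors should deposit data in a public database.",
    "Authors must make data available.",
    "",
]
for text in examples:
    print(classify_policy(text), "|", text)
```

A naive substring match like this would misread negations and unrelated uses of "must", so it is a starting point for manual review rather than a replacement for it.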
34. data sharing policies
associated with amount of sharing
For each of the 70 journals,
we measured the percent of articles that
were cited from within GEO and
ArrayExpress.
We considered this a proxy for percent of articles
with shared data.
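As a rough sketch, the proxy described above amounts to a per-journal percentage. The input layout (journal name plus a flag for whether a GEO or ArrayExpress record cites the article) is a hypothetical structure chosen for illustration.

```python
from collections import defaultdict

def percent_shared_by_journal(articles):
    """articles: iterable of (journal, cited_from_database) pairs.

    Returns {journal: percent of its articles cited from within GEO/ArrayExpress}.
    """
    totals = defaultdict(int)
    linked = defaultdict(int)
    for journal, cited_from_database in articles:
        totals[journal] += 1
        if cited_from_database:
            linked[journal] += 1
    return {journal: 100.0 * linked[journal] / totals[journal] for journal in totals}

# Made-up records, for illustration only.
sample = [("Journal A", True), ("Journal A", False), ("Journal B", True)]
result = percent_shared_by_journal(sample)
print(result)
```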
36. • our corpus of “gene expression microarray” articles
may have included some that reused data and did not
themselves produce primary data
• these results should be considered preliminary, pending
a more precise filter (stay tuned)
http://www.flickr.com/photos/vlastula/300102949/
37. future work on journal policies
• use a more precise filter to isolate
data producing articles and thereby
understand the absolute levels of data
sharing
• investigate other datatypes
• look at associations with reviewer
instructions and opinions
38. future work on funder policies
• are they effective? (stay tuned)
• what do people propose in data
sharing plans? Do they do what they
propose? Why not?
• quantify the perceived worth of data
sharing plans and accomplishments in
funding and promotion decisions
40. Prior work has focused on surveys and
studies of intention.
Our aim: measure associations between
observed data sharing behaviour and
environmental variables
Blumenthal et al. Acad Med. 2006
Campbell et al. JAMA. 2002
Kyzas et al. J Natl Cancer Inst. 2005
Vogeli et al. Acad Med. 2006
Reidpath et al. Bioethics 2001
41. pilot dataset
Ochsner et al. manually reviewed 20 journals for 2007:
400 studies
200 shared their microarray data
Ochsner et al. (2008). Much room for improvement in
deposition rates of expression microarray datasets. Nature
Methods, 5(12), 991.
42. pilot variables
Funder mandates
Journal mandates
Journal impact factor
Investigator “experience”
Is research data shared after publication?
43. funder mandates
NIH 2003 Data Sharing Requirement
Requires a data sharing plan
for studies funded after October 2003
that receive more than $500 000 in direct funding per year
44. funder mandates
Assumed data sharing requirement was applicable if:
the NIH grant numbers associated with PubMed entry had
$750 000 in total funding any year since 2004
plus
a NIH grant number with a leading “1” or “2” since 2004
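The applicability rule above could be sketched as follows. The Award record layout is hypothetical, and two readings are assumptions: that "since 2004" includes 2004, and that the funding threshold is checked per award record rather than summed across grants.

```python
from typing import Iterable, NamedTuple

class Award(NamedTuple):
    grant_number: str     # full NIH grant number, e.g. beginning with "1" or "2"
    fiscal_year: int
    total_funding: float  # total funding for that year, in dollars

def sharing_plan_assumed_applicable(awards: Iterable[Award],
                                    threshold: float = 750_000,
                                    first_year: int = 2004) -> bool:
    """Apply the two-part rule from the slide to the awards linked to one PubMed entry."""
    recent = [a for a in awards if a.fiscal_year >= first_year]
    large_enough = any(a.total_funding >= threshold for a in recent)
    # A leading "1" or "2" is read here as marking a new or renewed award.
    new_or_renewed = any(a.grant_number[:1] in ("1", "2") for a in recent)
    return large_enough and new_or_renewed

# Made-up award records, for illustration only.
awards = [
    Award("1R01XX000001", 2005, 800_000),
    Award("5R01XX000002", 2006, 900_000),
]
print(sharing_plan_assumed_applicable(awards))
```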
45. author experience
Publication history and impact proxy
First and last authors:
• years since first paper
• h-index (the largest number N such that
an author has N papers cited at least N
times)
• a-index
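A short sketch of the two indices, computed from a list of per-paper citation counts. The h-index follows the definition on the slide; the slide does not define the a-index, so the code assumes the common definition (mean citations of the papers that make up the h-core).

```python
def h_index(citations):
    """Largest N such that N papers have at least N citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

def a_index(citations):
    """Assumed definition: average citations of the h most-cited papers."""
    h = h_index(citations)
    if h == 0:
        return 0.0
    core = sorted(citations, reverse=True)[:h]
    return sum(core) / h

# Made-up citation counts for one author.
counts = [10, 8, 5, 4, 3]
print(h_index(counts), a_index(counts))
```

For the sample counts, four papers have at least four citations each, so the h-index is 4 and the a-index is the mean of 10, 8, 5, and 4.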
46. author experience
Derived h-index (pubmedi citation indices):
Author publication
history:
Author name Author-ity web service
Torvik & Smalheiser. (2009). Author Name
disambiguation: Disambiguation in MEDLINE. ACM Transactions on
Knowledge Discovery from Data, 3(3):11.
Citation
counts:
49. results of pilot
Legend: not statistically significant / statistically significant
Variables: funder mandates, journal mandates, journal impact factor, investigator “experience”
Outcome: Is research data shared after publication?
57. PhD dissertation
More samples,
more variables
http://www.flickr.com/photos/krcla/2069243613/
58. More samples:
Developed and evaluated automated
methods to:
• Identify studies that generate datasets that
could potentially be shared
• Determine which of these have in fact been
shared
59. To identify studies that generate datasets,
use a query on the full text of published articles:
("gene expression" AND microarray AND cell AND rna)
AND (rneasy OR trizol OR "real-time pcr")
NOT ("tissue microarray*" OR "cpg island*")
60. To determine which articles have shared data,
use a query on the full text of published articles:
pubmed_gds[filter] and query ArrayExpress
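One way the pubmed_gds[filter] check might be run is through the NCBI E-utilities esearch endpoint, restricting a set of PubMed IDs to those linked to GEO records. This sketch only builds the request URL and makes no network call; the slide does not say how the filter was applied, the PMIDs are placeholders, and the ArrayExpress query is not covered.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def geo_linked_query_url(pmids):
    """Build an esearch URL asking which of the given PMIDs pass pubmed_gds[filter]."""
    id_clause = " OR ".join(f"{pmid}[uid]" for pmid in pmids)
    term = f"({id_clause}) AND pubmed_gds[filter]"
    params = {"db": "pubmed", "term": term, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)

# Placeholder PMIDs, not real articles.
url = geo_linked_query_url([11111111, 22222222])
print(url)
```

PMIDs returned by such a search would be treated as articles with data in GEO; the rest would still need to be checked against ArrayExpress.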
62. Funder Journal Investigator Institution Study
Funder: funded by NIH?; size of grant; sharing plan req’d?; funded by non-NIH?
Journal: impact factor; strength of policy; open access?; number of microarray studies published
Investigator: years since first paper; h-index; a-index; previously shared?; previously reused?; gender
Institution: sector; size; impact rank; country
Study: humans?; mice?; plants?; cancer?; clinical trial?; number of authors; year
65. research questions
1. Is there benefit for those who share?
2. Do journal policies increase rates of sharing?
3. What other factors are correlated with
sharing and withholding data?
69. who reuses data?
why?
when?
who doesn’t?
which datasets are most likely
to be reused?
how many datasets could be
reused but aren’t?
why aren’t they?
what can we do
about it?
70. One possible reuse research agenda
1. Inventory reuse acknowledgement patterns
2. Build full-text and metadata filters to identify
instances of data reuse
3. Analyze patterns in data reuse choices
4. Survey data producers and data consumers
to augment with intentions and perspectives
71. Resources
• GEO list of reuse
articles (currently 618)
• Previous work in citation context
classification
Teufel et al. (2006) Automatic classification
of citation function. EMNLP.
• Amazon Mechanical Turk for annotation
• Experimental Philosophy for insight into
cultural norms
• ...
72. Stakeholders
• readers
• reusers
• authors
• editors
• reviewers
• funders
• database designers, maintainers, curators
• patients, subjects, or populations
For their perspectives, and also to design studies
that have actionable results for these groups
73. Data sharing plan
I post my data, code, and statistical scripts at
http://www.dbmi.pitt.edu/piwowar
Share yours too!
http://www.flickr.com/photos/myklroventine/892446624/
74. Dept of Biomedical Informatics at U of Pittsburgh
NLM for training grant funding
Open science online community and those who release their
articles, datasets and photos openly
Dr Wendy Chapman for her support and feedback
thank you
76. “Does anyone want your data?
That’s hard to predict[…]
After all, no one ever knocked on your door asking to buy
those figurines collecting dust in your cabinet before you listed
them on eBay.
Your data, too, may simply be awaiting an effective
matchmaker.”
Got data? Nature Neuroscience 10, 931 (2007)
80. Correlates with self‐reported data
withholding
industry involvement
perceived competitiveness of field
male
sharing discouraged in training
human participants
academic productivity
Blumenthal et al. Acad Med. 2006
81. Self‐reported reasons for data
withholding
sharing is too much effort
want student or jr faculty to publish more
they themselves want to publish more
cost
industrial sponsor
confidentiality
commercial value of results
Campbell et al. JAMA 2002.
82. Prevalence of data withholding
via surveys
self-reported denying a request in last 3 years
trainees self-reported denying a request
been denied access to data, materials, code
authors “not able to retrieve raw data”
not willing to release data
Campbell et al. JAMA. 2002.
Kyzas et al. J Natl Cancer Inst. 2005.
Vogeli et al. Acad Med. 2006.
Reidpath et al. Bioethics 2001.