Scott Edmunds slides for class 8 from the HKU Data Curation (module MLIM7350 from the Faculty of Education) course covering science data, medical data and ethics, and the FAIR data principles.
3. Reading/Reflection: Open Science (Open Access & Open Data) survey of Hong Kong
https://osf.io/cgpzb/
Most people mentioned training of librarians:
Tak Hei Lam: “Training should be provided to librarians so that they have adequate
knowledge about data curation and provide professional support and advice for the
researchers to sharing of data. Also, librarians can provide training and workshop to
change the mindset of the researcher not to rely on the impact factor but on other to
other comprehensive research metrics such as PlumX”
Lijia Yu: At the same time, in big data era, the research will be increasingly migrating
to the cloud, so this should be done in an organized manner.
Lots of talk on incentive systems & policy, but little on infrastructure other than:
NEED FOR A PLAN/LEADERSHIP
4. HKU Repeatability in HK Research Experiment (homework)
Feedback?
What have we found?
9. Interesting examples
Lots of data in Dryad, but 1 H7N9 example isn’t resolving
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0148506
10. Story so far
• HKU publishing a lot of survey based research in PLOS
• 3 examples from “Children of 1997” birth cohort. Access to data
involves emailing DAC
• External databases: 2 examples in Dryad data (one not working),
1 example in OSF, 1 example in scholarhub, lots in figshare
• So far 2 have data with broken URLs, 1/3 are controlled access,
1/4 have summary but not raw data
12. Research Data 1665?
Scholarly articles are merely advertisement of scholarship. The
actual scholarly artefacts, i.e. the data and computational
methods, which support the scholarship, remain largely
inaccessible --- Jon B. Buckheit and David L. Donoho, WaveLab
and reproducible research, 1995
13. The long tail of scientific data…
Esoteric formats, poorly structured
Tabular, often spreadsheet based
Issues the open data community is well used to (data cleaning, scraping, etc.)
14. Science Data Volumes
[Figure: data volumes by field. Scale labels: Petabytes, 100’s of Petabytes, Exabytes. Fields: Astrophysics (Square Kilometer Array), HE Physics (Large Hadron Collider), Biology (Sequencing, Mass Spec, Imaging).]
15. Big Data in Healthcare
http://dx.doi.org/10.1186/s13742-016-0117-6
16. Big Data in Healthcare: challenges
• 80% of health data unstructured (100’s of
forms/formats)
• Medical Imaging archives increasing 20-40% per year
• Genomics data will increase data volumes exponentially
• Patients expect extra privacy protection if they are
going to fully participate in data driven research
Source: https://www.healthcare.siemens.com/magazine/mso-big-data-and-healthcare-2.html
17. Open Data in Physics
1961 CERN pre-prints shelf: http://cerncourier.com/cws/article/cern/28654
1991-date arXiv: http://arxiv.org/
18. Open Data in Earth Sciences
https://pangaea.de/
Established 1987 (online since 1995)…
19. Open Data in Earth Sciences
#Climategate UEA (University of East Anglia) emails “scandal”
Is it possible to be too open?
25. Other Ome(s): mass spectrometry data
https://en.wikipedia.org/wiki/Mass_spectrometry
Nadina Wiórkiewicz
26. Rise of mass spectrometry data
https://doi.org/10.1093/nar/gkv1352
27. Challenges: Rise of big imaging data
http://www.nature.com/nmeth/journal/v12/n1/full/nmeth.3222.html
28. Challenges: Rise of big imaging data
https://openi.nlm.nih.gov/detailedresult.php?img=PMC3171117_JCB_201108095_RGB_Fig2&req=4
http://journals.sagepub.com/doi/10.1177/1087057114528537
HCS: High Content Screens, AKA High Throughput Screening: high volumes, growing uptake, TBs of data
New ways of sharing/publishing data with OMERO/JCB data viewer
32. Genomics Data Sharing Policies…
Bermuda Accords 1996/1997/1998:
1. Automatic release of sequence assemblies within 24 hours.
2. Immediate publication of finished annotated sequences.
3. Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society.
Fort Lauderdale Agreement, 2003:
1. Sequence traces from whole genome shotgun projects are to be deposited in a trace archive within one week of production.
2. Whole genome assemblies are to be deposited in a public nucleotide sequence database as soon as possible after the assembled sequence has met a set of quality evaluation criteria.
Toronto International data release workshop, 2009:
The goal was to reaffirm and refine, where needed, the policies related to the early release of genomic data, and to extend, if possible, similar data release policies to other types of large biological datasets, whether from proteomics, biobanking or metabolite research.
35. Scaling up of sharing: 1000 genomes
http://www.internationalgenome.org/
36. Three decades of sharing infrastructure: INSDC
http://www.insdc.org/
37. Sharing aids individuals
Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
Every 10 datasets collected contribute to at least 4 papers in the following 3 years.
Piwowar HA, Vision TJ, Whitlock MC (2011) Data archiving is a good investment. Nature 473(7347): 285. doi:10.1038/473285a
43. There are over 860 databases &
675 standards in the life sciences
Formats Terminologies Guidelines
44. Enablers: to better describe, share and query data
Guidelines = Minimum information reporting requirements, checklists
o Report the same core, essential information
o e.g. ARRIVE guidelines
Terminologies = Controlled vocabularies, taxonomies, thesauri, ontologies etc.
o Unambiguously refer to an entity
o e.g. Gene Ontology
Models/Formats = Conceptual model, conceptual schema, exchange formats
o Allow data to flow from one system to another
o e.g. FASTA
46. Exercise: Use Biosharing to answer the following
https://biosharing.org/
Potential collaborators would like to use your data.
To share your work, are there standards you should follow?
Are there specialized curated databases you can use?
A. You work in the area of functional MRI imaging and are producing 100’s of GBs of fMRI brain scan data.
B. You are an immunologist using flow cytometry to sort cells.
C. You are a chemist looking at the 3D crystal structure of proteins using NMR.
Sabban, Sari
55. Open Source v Open Data Licenses
https://opensource.org/licenses
Same ethos (open source begat open data), different contexts
• OSS designed for continuing development, OD for making
objects available
• IP issues. Software can be patented, data (generally) can’t
• More business models for software than data (so far…)
• Wider selection of OSS licenses, and more options to fine-tune access (Linking, Distribution, Modification, Sublicensing, Patents/Trademarks, etc.)
56. Questions to ask?
• Now that researchers are producing such large & heterogeneous datasets, what do you think the challenges are for producers and users?
• What are the legal implications of mixing data and software?
• What do you think the security issues of accessing these complex combined research objects are?
58. Research Data: Pop Quiz
What was #climategate?
What is the INSDC, and who are the three INSDC partners?
What is the estimated yearly growth of medical imaging data?
What are bioboxes?
How many databases are currently listed in biosharing?
Which of the reporting guidelines/checklists MIBBI, ARRIVE and Equator are for A) animals, B) biological science, and C) clinical research?
62. Ethics: need informed consent
HKU Guideline Notes - for Preparation of Subject Information Sheet & Informed Consent Form:
http://www.med.hku.hk/images/document/04research/institution/5QMH_IRB_GUIDANCE_NOTES_FOR_THE_PREPARATION_OF_PATIENT_CONSENT.pdf
WILL MY TAKING PART IN THIS STUDY BE KEPT CONFIDENTIAL?
You will need to obtain the patient’s permission to allow restricted access to their medical records and to the information collected about them in the course of the study. You should explain that all information collected about them will be kept strictly confidential. A suggested form of words is:
“All information which is collected about you during the course of the research will be kept strictly confidential. Any information about you which leaves the hospital/surgery will have your name and address removed so that you cannot be recognised from it.”
Where does data sharing fit into this?
63. Ethics: includes animal research
http://www.med.hku.hk/research/research-ethics/animal-ethics-culatr
66. Lots of tools available: anonymisation
https://www.ukdataservice.ac.uk/manage-data/tools-and-templates
67. Lots of tools available: encryption
https://www.brookes.ac.uk/Research/Research-ethics/Encrypting-files/
68. Lots of tools available: DAC & brokering
https://blog.repositive.io/getting-data-out-of-the-ega/
69. Lots of tools available: DAC & brokering
http://www.ckbiobank.org/site/
70. Lots of tools available: DAC & brokering
http://www.ckbiobank.org/site/
71. Kinds of identifying information
• Direct identifiers
– Names, addresses, postcode information,
telephone numbers or pictures
• Indirect identifiers
– Information that, in combination with other information, would
identify an individual, e.g. workplace, occupation
or exceptional values of characteristics like salary
or age
http://www.data-archive.ac.uk/create-manage/consent-ethics/anonymisation
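The two kinds of identifier above call for different handling: direct identifiers are usually removed, while indirect identifiers are coarsened so they no longer single anyone out. A minimal sketch of that idea follows. The field names, the choice of which fields count as direct identifiers, and the 10-year age band are all assumptions made for illustration, not a recommended anonymisation scheme.

```python
# Hypothetical field names; which columns count as direct identifiers is an
# assumption made for this illustration.
DIRECT_IDENTIFIERS = {"name", "address", "postcode", "telephone", "photo"}


def band_age(age, width=10):
    """Generalise an exact age into a band such as '40-49'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def anonymise(record):
    """Drop direct identifiers and coarsen one indirect identifier (age)."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "age" in cleaned:
        cleaned["age"] = band_age(cleaned["age"])
    return cleaned


# Made-up participant record, not real data.
participant = {
    "name": "Example Person",
    "telephone": "0000 0000",
    "age": 47,
    "occupation": "teacher",
    "response": "agree",
}
print(anonymise(participant))
```

Occupation is left untouched here, but in a small sample it could still identify someone in combination with the age band, which is exactly the indirect-identifier risk described above.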
73. Anonymising audio-visual data
• Anonymisation of audio-visual data, such as editing of digital
images or audio recordings, should be done sensitively. Bleeping
out real names or place names is acceptable, but disguising voices
by altering the pitch in a recording, or obscuring faces by pixellating
sections of a video image significantly reduces the usefulness of
data. These processes are also highly labour intensive and
expensive.
• If confidentiality of audio-visual data is an issue, it is better to
obtain the participant's consent to use and share the data
unaltered. Where anonymisation would result in too much loss of
data content, regulating access to data can be considered as a
better strategy.
• We urge researchers to consider and judge at an early stage the
implications of depositing materials containing confidential
information and to get in touch to consult on any potential issues.
https://www.ukdataservice.ac.uk/manage-data/legal-ethical/anonymisation/qualitative
74. Considerations for medical imaging
https://openfmri.org/de-identification/
https://sourceforge.net/projects/privacyguard/
MRI brain scans first undergo skull stripping
Automated Defacing Tools required beyond this
Need to also ensure DICOM (Digital Imaging and Communications in Medicine) metadata passes through a de-identification toolkit
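The metadata point can be sketched without any imaging library by treating a DICOM header as a plain dictionary. This is only an illustration: the attribute keywords listed are a small, incomplete selection, the header values are made up, and a real workflow would use a dedicated de-identification toolkit on the actual files.

```python
# Illustrative, incomplete list of DICOM attribute keywords that can identify
# a patient; a real de-identification profile covers many more attributes.
IDENTIFYING_KEYWORDS = ("PatientName", "PatientID", "PatientBirthDate", "InstitutionName")


def scrub_header(header):
    """Return a copy of a header dict with identifying attributes blanked."""
    return {k: ("" if k in IDENTIFYING_KEYWORDS else v) for k, v in header.items()}


def remaining_identifiers(header):
    """List identifying attributes that still carry a value."""
    return [k for k in IDENTIFYING_KEYWORDS if header.get(k)]


# Made-up header values standing in for a parsed DICOM file.
header = {"PatientName": "DOE^JANE", "PatientID": "12345", "Modality": "MR"}
clean = scrub_header(header)
print(remaining_identifiers(header))
print(remaining_identifiers(clean))
```

Running the check both before and after scrubbing gives a simple audit step. It says nothing about identifying features in the pixel data itself, which is why defacing is still needed.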
75. Considerations for medical images
https://bmcmedgenet.biomedcentral.com/articles/10.1186/1471-2350-15-21
https://bmcmedgenet.biomedcentral.com/articles/10.1186/1471-2350-11-26
• Sharing of clinical images crucial in understanding “phenotypes”
• Require “consent to publish”, but this is challenging with ill
people, children, the elderly, and the disadvantaged
• Further challenges in the era of social media, open access and Wikipedia
• Security issues protecting signed consent forms
76. Not just a metadata problem…
http://science.sciencemag.org/content/339/6117/321
77. Extra considerations for HK
Hospital Authority restrictions on data
• Have to apply to Hospital Authority to access public health data
• Only 14 data requests approved (as of May 2016)
• If approved, data recovery charges apply ($250,000 HKD a year
is collected from this)
• Can publish aggregate/summary data in journals, but not share
the underlying data
• Only approves academic use, not citizens or industry/pharma
Via FOI request: https://accessinfo.hk/en/request/request_for_statistics_on_data_c
78. Extra considerations for China
Human genetic data needs MOST approval
Article 2: The term "human genetic resources" in the Measures refers to the genetic materials such
as human organs, tissues, cells, blood specimens, preparations of any types or recombinant DNA
constructs, which contain human genome, genes or gene products as well as to the information
related to such materials.
Second, any international collaborative project involving Chinese human genetic resources, for
example international research cooperation and exporting human genetic resources or taking such
resources outside of the territory of China shall apply to MOST for examination and approval
prior to entering into an official contract. And Chinese collaborating party shall be responsible for
going through the due formalities of application for approval. (See Article 11)
80. Can this data be easily de-identified & shared?
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152381
“The individual in this manuscript has given written informed consent
(as outlined in PLOS consent form) to publish their images. Following
approval by the Institutional Review Board (IRB) of The University of
Hong Kong and Hospital Authority Hong Kong West Cluster (UW 14–
159); 20 individuals, 10 male and 10 female volunteers, were properly
instructed and gave consent to participate in this study by signing the
appropriate informed consent paperwork.”
81. FAIR or unfair? Principled publishing for data.
82. What is FAIR?
fair /fɛː/
Adjective
Treating people equally without favouritism or discrimination.
‘the group has achieved fair and equal representation for all its members’
‘a fairer distribution of wealth’
Adverb
Without cheating or trying to achieve unjust advantage.
‘no one could say he played fair’
83. Nature 475, 267 (2011)
http://www.nature.com/news/2011/110720/full/475267a.html
“Wide distribution of information is key to scientific progress, yet
traditionally, Chinese scientists have not systematically released
data or research findings, even after publication.”
“There have been widespread complaints from scientists inside and
outside China about this lack of transparency.”
“Usually incomplete and unsystematic, [what little supporting data
released] are of little value to researchers and there is evidence that
this drives down a paper's citation numbers.”
Is this FAIR?
84. FAIR questions to ask?
Is the raw data publicly available?
Are the reagents (plasmids, cells,
antibodies, etc.) available?
Are detailed protocols available?
Can I access the processed data &
results (supporting the figures)?
Was this all available BEFORE
publication to the peer reviewers?
Can I inspect the peer reviews?
Can I publish/link +/-ve replication
experiments to this?
86. Research Objects: a concept & model
http://www.researchobject.org/
• Supporting publication of more than just PDFs, making data, code, & other resources first class citizens
of scholarship.
• Recognizing that there is often a need to publish collections of these resources together as one
shareable, cite-able resource.
• Enriching these resources and collections with any & all additional information required to make
research reusable, & reproducible!
87. Importance of metadata: context (& discoverability)
https://library.stanford.edu/research/data-management-services/data-best-practices/best-practices-file-naming
https://twitter.com/AlisonMcNab/status/751375987624009728/photo/1
88. Novel tools/formats for data interoperability/handling: ISA
Importance of metadata: context (& discoverability)
89. Importance of granularity: where do you set it?
Experiment (e.g. International Cancer Genome Consortium): e.g. doi:10.5524/100001
Datasets (e.g. cancer type): e.g. doi:10.5524/100001-2
Sample (e.g. specimen xyz): e.g. doi:10.5524/100001-2000 or doi:10.5524/100001_xyz
Smaller still?
Papers
Data/Micropubs
Nanopubs
Facts/Assertions (~10^13 in literature)
94. A mnemonic to remember: FAIR
http://www.nature.com/articles/sdata201618
http://www.datafairport.org/
Findable
Accessible
Interoperable
Reusable
Lots of models/standards/guidelines
Where does that leave us?
96. A mnemonic to remember: FAIR
To be Findable:
F1. (meta)data are assigned a globally unique and persistent
identifier
F2. data are described with rich metadata (defined by R1 below)
F3. metadata clearly and explicitly include the identifier of the
data it describes
F4. (meta)data are registered or indexed in a searchable resource
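As a rough illustration, the four Findable points can be turned into a checklist over a dataset record. The record layout, the key names and the "at least three metadata fields" threshold for "rich" are all hypothetical choices made for this sketch. A check like this only tests that something is present. It cannot confirm that an identifier is truly persistent or globally unique.

```python
def findable_report(record):
    """Map each Findable sub-principle to a rough True/False check."""
    identifier = record.get("identifier")
    metadata = record.get("metadata", {})
    return {
        # F1: an identifier has been assigned (persistence is not verified here)
        "F1": bool(identifier),
        # F2: "rich" metadata, using an arbitrary threshold of three fields
        "F2": len(metadata) >= 3,
        # F3: the metadata explicitly names the identifier of the data
        "F3": bool(identifier) and metadata.get("identifier") == identifier,
        # F4: the record is listed in at least one searchable index
        "F4": bool(record.get("indexed_in")),
    }


# Hypothetical record; the title and index name are placeholders.
record = {
    "identifier": "doi:10.5524/100181",
    "metadata": {
        "identifier": "doi:10.5524/100181",
        "title": "Example title",
        "licence": "CC0",
    },
    "indexed_in": ["example-repository-index"],
}
print(findable_report(record))
```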
97. A mnemonic to remember: FAIR
To be Accessible:
A1. (meta)data are retrievable by their identifier using a
standardized communications protocol
A1.1 the protocol is open, free, and universally implementable
A1.2 the protocol allows for an authentication and authorization
procedure, where necessary
A2. metadata are accessible, even when the data are no longer
available
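For DOIs, A1 in practice means plain HTTPS: the identifier is appended to the https://doi.org/ resolver. The sketch below only builds the request and does not send it. The `Accept` media type is an assumption about what the registration agency behind a given DOI will return through content negotiation, so a real client should be ready for a different response.

```python
from urllib.request import Request


def doi_request(doi, media_type="application/vnd.citationstyles.csl+json"):
    """Build (but do not send) an HTTPS request that resolves a DOI.

    The default media_type asks for machine-readable citation metadata;
    whether a given DOI supports it is an assumption, not a guarantee.
    """
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    return Request(f"https://doi.org/{doi}", headers={"Accept": media_type})


req = doi_request("doi:10.5524/100181")
print(req.full_url)
print(req.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the lookup. Because metadata and data are requested separately, this pattern also fits A2, where metadata stays retrievable after the data itself has gone.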
98. A mnemonic to remember: FAIR
To be Interoperable:
I1. (meta)data use a formal, accessible, shared, and
broadly applicable language for knowledge
representation.
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other
(meta)data
99. A mnemonic to remember: FAIR
To be Reusable:
R1. (meta)data are richly described with a plurality of accurate and
relevant attributes
R1.1. (meta)data are released with a clear and accessible data
usage license
R1.2. (meta)data are associated with detailed provenance
R1.3. (meta)data meet domain-relevant community standards
101. Beyond a mnemonic: FAIR ecosystems
• A particular class of FAIR Data System to provide support for
data interoperability;
• Supports publication, search and access to FAIR data.
• Fosters an ecosystem of applications and services;
• Federated architecture: different FAIRports (and other FAIR Data
Systems) are interconnectable;
• Supports citations of datasets and data items;
• Provides metrics for data usage and citation;
A ‘FAIRpoint or FAIRport’ can be any specific data instance following FAIR data
principles.
http://www.datafairport.org/
103. Beyond a mnemonic: FAIR ecosystems
https://www.fair-access.net.au/fair-statement
“By 2020, Australian publicly funded researchers and research organisations
will have in place policies, standards and practices to make publicly funded
research outputs findable, accessible, interoperable and reusable.”
104. More FAIR mnemonics: “BYODs”
DTL/ELIXIR-NL “Bring Your Own Data Party” v GigaScience/BGI HK Metabolomics ISA-TAB-athon
105. FAIR Data in the wild
Taking a microscope to the
publication process
107. How FAIR can we get?
Data sets, Analyses
Open-Paper: DOI:10.1186/2047-217X-1-18, >50,000 accesses & 885 citations
Open-Review: 7 reviewers tested data in ftp server & named reports published
Open-Code: code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/, >40,000 downloads
Enabled code to be picked apart by bloggers in wiki: http://homolog.us/wiki/index.php?title=SOAPdenovo2
Open-Pipelines, Open-Workflows, Open-Data: DOI:10.5524/100044, DOI:10.5524/100038, 78GB CC0 data
109. The SOAPdenovo2 Case study
Subject to and test with 3 models:
Types of resources in an RO, and models to describe each resource type:
Data: ISA-TAB/ISA2OWL
Method/Experimental protocol: Wfdesc/ISA-TAB/ISA2OWL
Findings: Nanopublication
111. 1. While there are huge improvements to the quality of the resulting assemblies, other than in the tables it was not stressed in the text that SOAPdenovo2 can be slightly slower than SOAPdenovo v1.
2. In the testing and assessment section (page 3), based on the correct results in table 2, where we say the scaffold N50 metric is an order of magnitude longer from SOAPdenovo2 versus SOAPdenovo1, this was actually 45 times longer.
3. Also in the testing and assessment section, based on the correct results in table 2, where we say SOAPdenovo2 produced a contig N50 1.53 times longer than ALL-PATHS, this should be 2.18 times longer.
4. Finally in this section, where we say the correct assembly length produced by SOAPdenovo2 was 3-80 fold longer than SOAPdenovo1, this should be 3-64 fold longer.
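Corrections like these can be spotted because the N50 metric and the fold changes are simple to recompute once the underlying lengths are shared. A minimal sketch of the usual N50 definition follows: the length of the sequence at which the running total, counted from the longest sequence downwards, first reaches half of the assembly size. The lengths are made-up toy numbers, not values from the SOAPdenovo2 paper.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no sequence lengths given")


# Toy lengths for illustration only (hypothetical, not from table 2).
new_assembly = [90, 60, 30, 20]
old_assembly = [20, 15, 10, 5]
fold_change = n50(new_assembly) / n50(old_assembly)
print(n50(new_assembly), n50(old_assembly), fold_change)
```

Recomputing a ratio this way from the published table is the kind of check that turns "an order of magnitude longer" into a precise figure.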
112. Lessons Learned
• Most published research findings are false. Or at
least have errors
• With enough effort it is possible to push button(s) &
recreate a result from a paper with current tools
• Being FAIR can be COSTLY. How much are you willing
to spend? Who will build FAIR infrastructure?
• Much easier to make things FAIR before rather than
after publication. BYODs are a useful intermediate here
115. Levels of FAIRness: A-F of FAIR data
In class activity: How FAIR is this data?
1. Data from: Live poultry exposure and public response to influenza
A(H7N9) in urban and rural China during two epidemic waves in
2013-2014 http://hub.hku.hk/cris/dataset/dataset93128
2. Supporting data for “Genomic analyses reveal FAM84B and the
NOTCH pathway are associated with the progression of esophageal
squamous cell carcinoma” http://dx.doi.org/10.5524/100181
3. Linked Drug-Drug Interactions (LIDDI)
https://datahub.io/dataset/linked-drug-drug-interactions-liddi
http://content.iospress.com/articles/information-services-and-use/isu824
116. Reflection: how fair is FAIR?
Read the FAIR principles paper.
Do you think they are applicable and
feasible for HK? If it is feasible, what is
needed to implement them?
http://www.nature.com/articles/sdata201618
118. Final Project
• For the final project for this course, you can
choose from 3 assignment options.
• The assignment is due on the 15th May and it is
worth 40% of your grade.
• Time will be set aside for presenting on this
during the final class on the 24th April: covering
why you chose the option, what
discipline/dataset/topic you are covering, and
what work you've done so far (5 mins per student
including any group feedback)
119. Final Project: Option 1
Write an Annotated Bibliography about data curation practices in an academic
discipline of your choosing.
• Choose a discipline (sciences, social sciences, & humanities) OR choose the topic of
“open data.”
• Summarize data practices in your chosen discipline or topic. (5-7 sentences)
• Find 7-10 sources that relate that discipline or topic to data creation, management,
and/or curation.
• Provide a citation for the source in APA style.
• Write a short annotation that summarizes the content of the source. You may
include quotes from the source sparingly, but the annotations should be mostly, if
not entirely, in your own words. (3-5 sentences)
• Explain the relevance of the source with relation to the data practices of your
chosen discipline or topic. (1-2 sentences)
• Find a few example public datasets to demonstrate the above points. Cite the data
in the relevant places in the Bibliography according to the Data Citation Principles.
• Refer to this guide for more information about annotated bibliographies:
http://sites.umuc.edu/library/libhow/bibliography_tutorial.cfm. Your annotation
should be in the “Descriptive” style.
120. Final Project: Option 2
Using a relevant dataset (this can either be from the literature curation
exercise, a BYO dataset, or one given to you), write a report that includes a
description of the dataset, a Data Management Plan, and a guidelines
document for the researcher(s).
• Describe the dataset, explaining the form of the data and the academic discipline in which it
was created. This paragraph should provide context for the Data Management Plan. (3-5 sentences)
• 1-2 page Data Management Plan following the guidelines from HKU or a granting body such as NSF.
• 1 page guidelines document that could be presented to the researcher(s) that provides
guidelines for their data (extant and forthcoming):
– Preservation
– Appraisal
– Documentation
• For the DMP and the guidelines document, you can extrapolate from your dataset to
imagine additional details about the research practices that created the dataset and will create
more data in the future.
• Look for suitable data repositories that can host this data (institutional, general purpose, or
subject specific), and if there is one relevant then publish the data if you have permission, and
correctly cite the data in the relevant places in your report. [disclaimer: only if you have permission]
121. Final Project: Option 3
Prepare a 30 minute data curation workshop that you could teach to
researchers that would provide them the necessary details to understand why
data curation is relevant to them and best practices they should follow.
• Slide deck that introduces data curation for a researcher audience. (No
more than 40 slides.)
• Presenter outline that describes the important points for each slide.
• Topics that might be addressed in your workshop: the value of data
management, writing a data management plan, data repository options.
You can assume your audience is researchers at HKU.
• Make sure all of the content is copyright free, and share the final material
openly (e.g. figshare, scholarhub, OER commons, etc.), and with sufficient
metadata to make it discoverable.
122. Looking ahead…
• Submit 1 paragraph reflection on FAIR principles
through moodle forum
• Next class (22nd April) is hands-on curation
workshop with Dr Chris Hunter
– Bring laptops and any data you may have for a data
cleaning exercise
• Final project due 15th May
– Need to present preliminary version on 26th April to
get feedback before completion. Send me slides by
the 25th April so I can get them ready for the class