Research Data Management
1. Publishing your Research
Research Data Management
for PGRs
Sebastian Pałucha
Research Data Manager
sebastian.palucha@durham.ac.uk
2. Session outline
- Slides and demonstrations - http://bit.ly/durdm14
- What... is “Research Data”?
- Small group activity
- What... is “Research Data Management” ?
- Data life cycle
- Why... Is Research Data Management important?
- Drivers for change, Requirements on & benefits for researchers
- How... to manage and secure research data
- Data Management Planning. Document storage and back-up
- How... To share data
- Benefits of sharing data and tools available
5. Results from experiments
Historical Diaries
Physical objects
.pdf
Algorithms
Simulation software
.rtf
Lab & Field notebooks
Questionnaires
35mm
Interview transcripts
.gif
.docx
Database
Government records
.xls
IX240
Correspondence
.xml
Test answers
Specimens
Photographs
.spss
Models
.jpg
Methodologies & workflows
Film footage
6. Research data is …
• Diverse
• Multi-sourced
• Valuable & costly
• Sensitive
• Massive
• Complex
• Derivative
• Community-based
• Driven by information-intensive processes
8. Data is situational
Ship logbooks :
- historical record of events
- data to reconstruct weather
patterns
- data on naval personnel
(genealogical / demographic)
- extrapolation of data on
ration provisions etc.
9. Data is situational
Data can be used and re-used …
... for purposes you may not have
thought of…
... even after you have extracted
all the value you need from it.
10. Research Data
There is no single or simple
definition of what constitutes
‘research data’
- it is used to support the production or
validation of original research.
- it can be ‘born digital’, or it can be
analogue (and then digitised)
- it is situational...
CC nc nd JISC
12. Where is your data?
JISC RDM Survey
- Russell Group institutions average over
2PB of data
- significant data storage on hard drives
internal/external, flash drive, web etc.
- 23% of institutions had lost research data
- how would this impact you?
14. What is Research Data
Management?
“ Research data management concerns
the organisation of data, from its entry to
the research cycle through to the
dissemination and archiving of valuable
results.”
Whyte, A., Tedds, J. (2011). ‘Making the Case for Research Data
Management’. DCC Briefing Papers.
http://www.dcc.ac.uk/resources/briefing-papers/making-case-rdm
15. Data Life-cycle
UK Data Archive www.data-archive.ac.uk/create-manage/life-cycle
19. OECD Data Principles
• Promoting Access to Public
Research Data for Scientific,
Economic and
Social Development:
– Publicly-funded research data are a public good,
produced in the public interest.
– Publicly-funded research data should be openly
available to the maximum extent possible
20. “Organisations … assist researchers in the
accurate and efficient collection of data and its
storage in a secure and accessible form.
Researchers … how data will be gathered,
analysed and managed … what form relevant
data … made available to others … ”
21. ESRC Data Policy
• … advocates open access to all research
output … available to the scientific community
in a timely and responsible manner
• … use of existing standards for data
management and to make data available for
further re-use
http://www.esrc.ac.uk/about-esrc/information/data
policy.aspx
22. … it is a requirement
“Publicly funded research data are a public
good, produced in the public interest, which
should be made openly available with as few
restrictions as possible in a timely and
responsible manner that does not harm
intellectual property.”
RCUK Common Principles on Data Policy
http://www.rcuk.ac.uk/research/Pages/DataPolicy.aspx
23. Durham RDM Policy
• Storage and management
- Registered creation
- Adequate metadata
- Backed up
• Retention
- 3 years or 10 years
- Registered disposal
• Access
- responsible release of data
https://www.dur.ac.uk/research.office/research-outputs/research-data-management/policy/
26. … boosts your profile
“… 10,555 studies … we found that studies that
made data available in a public repository
received 9% more citations than
similar studies for which the data was not made
available …”
Piwowar H. & Vision T. (2013)
“Data reuse and the open data citation advantage”
PeerJ, http://peerj.com/articles/175/
27. … data can be re-used
“The best thing to do with your data will be thought
of by someone else”
Rufus Pollock, co-founder Open Knowledge Foundation
• Impact and innovation
• Collaboration and teaching
The Twitter Archive at the Library of Congress
Iris Flower Data Set, Fisher, R.A. 1936
1000 Genomes – A Deep Catalogue of Human Genetic Variation
• Research integrity
- identifying errors Debt and Growth - Reinhart-Rogoff revisited
- fraud in research Deception at Duke - Fraud in cancer care
28. Why manage research
data?
• You are increasingly likely to be required to
• It is good research practice
- to defend your research publications
- to secure against loss of data
• It boosts your citation potential
• Your data can be re-used and replicated
30. ESRC data management plan -
common themes
• Assessment of existing data
• Information on new data
(i.e. volume, type, format)
• Quality assurance of data
• Backup and security of data
(day to day practices)
• Expected difficulties in data sharing
(confidentiality, ethics and consent, the FOI and DP acts)
• Copyright/intellectual property right
(IPR, 3rd party copyright, embargoes)
• Responsibilities
• Preparation of data for sharing and archiving
(where, who and when will have access)
32. Who is involved?
• PI and CIs
(initiate discussion in timely manner)
• Research Office
(funders' requirements)
• CIS research support
(active and long-term storage, back up)
• Librarians / archivists
(OA, copyright, metadata standards, curation)
• Data Protection and FOI officers
(sensitive data)
• Technical and laboratory staff
(data provenance)
34. Organising your data
• plan a hierarchy of files and folders, organised
by…
- type of data (text, image, model, sound, video etc.)
- type of research activity (survey, interview etc.)
- type of material (documentation, publication, etc.)
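A hierarchy like this can be created in one step, so that every project starts from the same layout. The following is a minimal Python sketch only: the folder names in LAYOUT and the project name are illustrative assumptions, not a Durham standard.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: top-level folders by type of material,
# sub-folders by type of research activity or type of data.
LAYOUT = {
    "data": ["survey", "interview", "image"],
    "documentation": ["consent_forms", "codebooks"],
    "publication": ["drafts", "submitted"],
}


def create_hierarchy(root: Path, layout: dict[str, list[str]]) -> list[Path]:
    """Create the folders under root and return them in sorted order."""
    created = []
    for top, subfolders in layout.items():
        for sub in subfolders:
            folder = root / top / sub
            folder.mkdir(parents=True, exist_ok=True)  # safe to run again
            created.append(folder)
    return sorted(created)


if __name__ == "__main__":
    # Demonstrate in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in create_hierarchy(Path(tmp) / "my_project", LAYOUT):
            print(folder.relative_to(tmp))
```

Keeping the layout in one dictionary means a change to the structure is made in one place and then applied to every new project.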
35. Organising your data
• Be systematic and consistent with naming
conventions and housekeeping from the start…
- files should be sortable by name
- filenames should indicate the ‘version’
- filenames should be easily distinguishable
36. Thinking about filenames
• Names should be clear and descriptive
- to both you, and third parties
37. Thinking about filenames
• Consider including elements in filenames…
- Date 2014_01_14
- Project identifier PGCAP
- Content description RDM_presentation
- Version v1.2
2014_01_14-PGCAP-RDM_presentation-v1.2.pptx
CARD/2014_11_13-RDM_presentation-v1.4.pptx
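Filenames built from these elements can be generated rather than typed, which keeps them consistent. A minimal Python sketch, assuming the element order and separators of the first example filename on this slide; the function name build_filename is hypothetical.

```python
from datetime import date


def build_filename(day: date, project: str, description: str,
                   version: str, extension: str) -> str:
    """Join the elements with hyphens, date first so files sort by name."""
    stamp = day.strftime("%Y_%m_%d")  # zero-padded, e.g. 2014_01_14
    return f"{stamp}-{project}-{description}-v{version}.{extension}"


name = build_filename(date(2014, 1, 14), "PGCAP", "RDM_presentation", "1.2", "pptx")
print(name)  # 2014_01_14-PGCAP-RDM_presentation-v1.2.pptx
```

Because the date is written year first and zero-padded, sorting by name also sorts by date.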
38. Thinking about filenames
• Pitfalls to avoid
- Whitespace
- Unsupported characters in filenames
- Capitalisation
2014_01_14-PGCAP-RDM_presentation-v1_2.pptx
2014_01_14-pgcap-rdm_presentation-v1_2.pptx
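The pitfalls above can be checked automatically before files are shared or archived. This is a sketch under stated assumptions: the set of allowed characters (lowercase letters, digits, underscore, hyphen, and a single dot before the extension) is our reading of the two example filenames, not a formal rule, and filename_problems is a hypothetical helper.

```python
import re

# Assumption: only lowercase letters, digits, underscore and hyphen,
# followed by one dot and the extension.
SAFE_NAME = re.compile(r"[a-z0-9_-]+\.[a-z0-9]+")


def filename_problems(name: str) -> list[str]:
    """Return the pitfalls found in a filename (empty list if none)."""
    problems = []
    if any(ch.isspace() for ch in name):
        problems.append("whitespace")
    if name != name.lower():
        problems.append("capitalisation")
    # Judge the remaining characters after ignoring whitespace and case,
    # so each pitfall is reported only once.
    stripped = "".join(name.lower().split())
    if not SAFE_NAME.fullmatch(stripped):
        problems.append("unsupported characters")
    return problems


for candidate in ["2014_01_14-PGCAP-RDM_presentation-v1_2.pptx",
                  "2014_01_14-pgcap-rdm_presentation-v1_2.pptx",
                  "final version (new).pptx"]:
    print(candidate, "->", filename_problems(candidate) or "ok")
```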
41. Data about Data
• To keep track of data…
• … and to describe what data is available to a
secondary user
• Spreadsheet?
• Lab notebook?
- electronic / paper?
• Database?
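If a spreadsheet is the chosen option, a plain CSV inventory is one simple way to keep "data about data" in an open, human-readable format. A minimal Python sketch; the column names in FIELDS and the example record are illustrative assumptions, not a metadata standard.

```python
import csv
import io

# Hypothetical columns for a simple data inventory.
FIELDS = ["filename", "description", "created", "creator", "format"]

# Placeholder record for illustration only.
records = [
    {"filename": "2014_01_14-pgcap-interview_01-v1_0.txt",
     "description": "Transcript of interview 1",
     "created": "2014-01-14",
     "creator": "A. Researcher",
     "format": "plain text"},
]


def write_inventory(rows, handle):
    """Write one header row and one row per data file."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()  # stands in for an open file
write_inventory(records, buffer)
print(buffer.getvalue())
```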
47. Data formats
• Think about what format you are saving your data
in…
Prefer this… … over this
ASCII (human readable)
(.txt, .xml, .csv )
Binary formats
(.exe, .doc, )
Open standard
.odt
.ods
Proprietary
.docx
.xlsx
49. Data back-ups
• Are you just digitising / photocopying?
• Are you saving files in multiple locations
(pendrives, hard drive, external hard drive?)
• Tip for Durham Research Students:-
- (stevens)(j:) your Durham network drive
• Other tools available:
– SyncToy, Time Machine, Deja Dup
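A copy is only a back-up if it matches the original. The sketch below copies a file and compares SHA-256 checksums of both copies; it uses only the Python standard library, and the function names and paths are hypothetical. It is not a replacement for the tools listed above.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum of a file, read in 1 MiB chunks to cope with large data."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and check the copy is identical."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also keeps timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Backup of {source} does not match the original")
    return target


if __name__ == "__main__":
    # Demonstrate with a throwaway file in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "results.csv"
        original.write_text("id,value\n1,42\n")
        copy = backup_and_verify(original, Path(tmp) / "backup")
        print("verified copy at", copy.name)
```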
50. Data security
• Passwords
• Do you use passwords of 8+ characters?
– Password vaults use 128 – 256 bit keys
– Public Key Encryption (PKI) uses 1024+ bit keys
(GPG, GPG4Win, GPGTools)
• Virtual Encrypted Drive
– TrueCrypt, FileVault
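To compare an 8-character password with the figures above, we can estimate its strength in bits. This sketch reads "128 – 256" as key sizes in bits, which is an assumption, and it assumes the password is drawn uniformly at random from letters and digits (human-chosen passwords are weaker than this estimate). The helper names are hypothetical.

```python
import math
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 symbols


def entropy_bits(length: int, alphabet_size: int) -> float:
    """Bits of entropy of a uniformly random password."""
    return length * math.log2(alphabet_size)


def length_for_bits(bits: int, alphabet_size: int) -> int:
    """Shortest random password that reaches the requested strength."""
    return math.ceil(bits / math.log2(alphabet_size))


def random_password(length: int) -> str:
    """Generate a password with the cryptographically secure generator."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


print(round(entropy_bits(8, len(ALPHABET)), 1))   # about 47.6 bits
print(length_for_bits(128, len(ALPHABET)))        # 22 characters
print(random_password(length_for_bits(128, len(ALPHABET))))
```

A password this long is hard to remember, which is the practical argument for keeping it in a password vault.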
51. Data security
• Secure Internet Protocols
– WiFi: WPA2 but not WEP
– Browser: HTTPS
– Virtual Private Network
(VPN)
– Secure Shell
(SSH)
• How to access j: drive off campus
– VPN, DU MDS Anywhere
55. Sharing your data
• Who owns the copyright in the data?
• The University owns copyright by default
- Can be subject to funders' requirements
• 3rd party copyright
- Database software
• Data licensing
- Open Data Commons http://opendatacommons.org
- Creative Commons https://creativecommons.org
- Open Government Licence http://data.gov.uk/blog/new-open-government-license
57. How to share data responsibly
Information Commissioner’s Office published
guides on data security and sharing:
• Is the sharing justified?
– Benefits and risks
• Do you have power to share?
– Participant Consents
– Intellectual property Rights
• Is sharing conditional?
– How we will protect data
http://ico.org.uk/for_organisations/data_protection/topic_guides/data_sharing
http://ico.org.uk/for_organisations/data_protection/topic_guides/anonymisation
59. Publishing Data
Essential properties:
• Identifier; Creator; Title; Publisher; Publication year
Optional properties:
• Resource type; Version
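The essential and optional properties above are enough to build a data citation. The sketch below holds them in a small record and formats a citation string. The ordering and punctuation follow one commonly used pattern and should be treated as an assumption; all values in the example, including the DOI, are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRecord:
    # Essential properties
    identifier: str
    creator: str
    title: str
    publisher: str
    publication_year: int
    # Optional properties
    resource_type: Optional[str] = None
    version: Optional[str] = None

    def citation(self) -> str:
        """Creator (Year): Title. [Version.] Publisher. [Type.] Identifier"""
        parts = [f"{self.creator} ({self.publication_year}): {self.title}"]
        if self.version:
            parts.append(f"Version {self.version}")
        parts.append(self.publisher)
        if self.resource_type:
            parts.append(self.resource_type)
        return ". ".join(parts) + ". " + self.identifier


# Placeholder values for illustration only.
record = DatasetRecord(
    identifier="doi:10.1234/example",
    creator="Researcher, A.",
    title="Example interview transcripts",
    publisher="Example Data Repository",
    publication_year=2014,
    resource_type="Dataset",
    version="1.0",
)
print(record.citation())
```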
60. Summary
Research Data
Management
good practices
Reliable Data Storage
preservation
Academic Culture
61. Managing And Sharing Research
Data – A Guide to Good Practice
http://ukdataservice.ac.uk/manage-data/handbook.aspx
62. Inter-university Consortium for
Political and Social Research
http://www.icpsr.umich.edu/icpsrweb/content/deposit/guide/index.html
63. • Scientists need to be more open among themselves and with the public and media
• Greater recognition needs to be given to the value of data gathering, analysis and communication
• Common standards for sharing information are required to make it widely usable
• Publishing data in a reusable form to support findings must be mandatory
• More experts in managing and supporting the use of digital data are required
• New software tools need to be developed to analyse the growing amount of data being gathered
64. Open Science
“With the right formats, licensing and distribution mechanisms,
people can easily collaborate over data, enhance the analysis
and re-purpose for their own needs.” Cameron Neylon
http://cameronneylon.net/blog/fork-merge-and-crowd-sourcing-data-curation/
66. Image Credits
[3] Via Flickr Creative Commons, and by Eric Fischer: Available at
http://www.flickr.com/photos/24431382@N03/4671562937
[8] Via Flickr Creative Commons, and by L. Whittaker: Available at
http://www.flickr.com/photos/7577311@N06/1490557341
[13] Via Flickr Creative Commons, and by FutUndBeidl: Available at
http://www.flickr.com/photos/61423903@N06/7369580478
[16] Via Flickr Creative Commons, and by barks photo stream: Available
at http://www.flickr.com/photos/49503168860@N01/4257136773
67. Image Credits
[26] Via Peter Murray-Rust blog: Available at
http://blogs.ch.cam.ac.uk/pmr/2011/08/01/why-you-need-a-data-management-plan
[30] Via Flickr Creative Commons, and by Darwin Bell: Available at
http://www.flickr.com/photos/darwinbell/1454251440/
[57] Via Martin Missfeldt bildersuche.org :Available at
http://www.bildersuche.org/en/creative-commons-infographic.php
[57] Via foter.com : Available at
http://teachthought.com/technology/the-ultimate-visual-guide-to-creative-commons-licensing/
Editor's notes
Version 1.4 delivered 23/06/2014 – This version was targeted for 40 min delivery without practical guidance, mostly based on v 1.3 with some small additions: information intensive, RDM force field, example datasets
Version ESRC_Futher_Leaders delivered 23/06/2014 – This version was targeted for 20 min delivery. Based on v 1.4, rethought and graphically redone with many more requirements covered. Particularly tuned to EPSRC requirements, including the Open Science theme.
Version 1.5 (17/11/2014) based on v 1.4 – This version is targeted for 1:30 min delivery. Will include slides from ver. 1.3 and new practical examples.
Emphasis: - this is an introductory, awareness session. - it is not the aim that you will go away experts - we want you to leave thinking you need to know more and read more - further reading at end of slides - Many of the topics discussed are wider than just your research degree - But many of the principles are applicable to your research degree
- He and Paul Drummond in CIS will be looking at developing policy and systems support across the University over the next 3 years, in line with policy directions from the UK and Europe, and you may meet him over that time.
- He will also be looking at the need to develop and provide additional training and guidance.
5 minutes discussion in groups of 3-4 / yell out to front of class
Different communities do things differently: files, tools, processes
Colours represent various research/data activities. Clearly we can see three stages:
At the top of the graph, data are gathered
In the middle is the research process, where we analyse all data
At the bottom, more data are produced
All of those data are a subject for RDM.
Patterns of information use and exchange: case studies of researchers in the life sciences. A report by the Research Information Network and the British Library November 2009
There is no single definition, so let’s agree some basics...
Situational – e.g. CCTV footage (crime prevention, measure of footfall, demographic data) / ships’ logs (historical record of events, record of weather patterns, personnel lists) / photographs (historical record of objects or locations, a source of data on techniques or chemical processes of photo development)
Data are a core element of scientific research, thus the data lifecycle can’t be considered independently from the research lifecycle.
Ask students where they are storing their data: - are they backing it up? - what do they plan to do with it once completed? - what if they are asked for it in 7 years’ time? - if only on one device, what happens if that device is stolen/lost/damaged?
Or group the images,
pre project,
active project,
new activities preserving and giving access.
Click on image to go through to Data Archive
Creating Data: You need to plan ahead. What storage will you require? What formats will the data be in, and how will this be supported? What ethical and legal considerations do you need to take into account in both collecting and storing the data, and then how will this affect your ability to share the data.
Processing data: As you digitise, transcribe, translate, anonymise, check and clean the data created or collected, you need to start to put some of your planning into practice: storing data, describing data. Here you might be creating new sets of metadata which will be key to any future re-use: your notebooks, codebooks (if coding qualitative data), recording decisions and workflows applied in cleaning and checking data.
Analysing data: Here is the bulk of your actual research, but you will at this point be needing to think about how the data can and should be preserved. For example, when publishing your research you may need to either provide the underpinning data, or indicate if, how and where it can be accessed by a reader– so you may need to be providing access to data from this point
Preserving data: The best format for the data for you to use may not be the best format for the data to be preserved for future use. So here you will need to be working with colleagues to ensure the data is stored and backed up effectively. To aid retrieval, you will also need to ensure the metadata and documentation describing the data is robust. And finally, you will need to be thinking about how the preservation of the data will be ongoing.
Giving Access to data: This is how and where you provide access to the data, and make clear any copyright issues arising.
Re-using data: how is the data then re-used in further research… and the cycle begins again.
UKRIO Code of Practice for Research: Promoting good practice and preventing misconduct - http://www.ukrio.org/wp-content/uploads/UKRIO-Code-of-Practice-for-Research.pdf
3.12.5 Organisations should have in place procedures, resources (including physical space) and administrative support to assist researchers in the accurate and efficient collection of data and its storage in a secure and accessible form.
3.12.6 Researchers should consider how data will be gathered, analysed and managed, and how and in what form relevant data will eventually be made available to others, at an early stage of the design of the project.
Full text at: http://www.rcuk.ac.uk/research/Pages/DataPolicy.aspx
The data must be made available for re-use or archiving with the ESRC data service providers within three months of the end of the grant.
Funders are asking “why do you need to collect new data, it may already exist”
Data retention 3 years after publication to 10 years since last request
The Research Data Management Steering Committee on behalf of UEC is responsible for ensuring the provision of:
- training, support and advice for the management of research data
- services and procedures enabling the registration, deposit, storage, and curation of research data
Researchers are responsible for
- planning and documenting a clear research data management plan as part of their application which should include procedures for collection, storage, use, re-use, access, retention and eventual destruction of data
- planning for the curation and management of their data after the completion of their project
- ensuring that any requirements relating to research data management specified in a research funding contract are met by the research data management plan
- ensuring that a copy of the data management plan is registered with the University’s Research Administration System (RAS)
You also have requirements or moves to recognise the need to manage and share data from other organisations.
Mention also that journals (eg in biosciences) may require you to submit data alongside a published article as standard practice.
PLOS ONE from 1/02/2014
Authors must provide a Data Availability Statement describing compliance with PLOS’s policy, which will be published with paper.
Two key things to summarize about the policy are:
The policy does not aim to say anything new about what data types, forms and amounts should be shared.
The policy does aim to make transparent where the data can be found, and says that it shouldn’t be just on the authors’ own hard drive.
What would happen if you lost an external hard drive with a few years of research data for your PhD?
Image from Peter Murray-Rust blog CC-by - http://blogs.ch.cam.ac.uk/pmr/2011/08/01/why-you-need-a-data-management-plan/
http://www.bbc.co.uk/news/uk-scotland-glasgow-west-27556659
Also “… The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%.”
Although not normative, citation of data is an essential component of data publication, sharing and reuse.
You might be thinking, I don’t want people to find out if I have made a mistake….
Well, you may, and then you can own up and move on. But what you should be more worried about is being able to identify if others have made a mistake and how that might impact on your research.
http://www.economist.com/blogs/freeexchange/2013/04/debt-and-growth
@rufuspollock Rufus Pollock,
1000 Genomes - is a community resource project that aims to release data rapidly for the benefit of the scientific community. Use of project data for:
Global and large-scale analyses
Methods development
Disease studies
Population comparisons
Iris Flower – from Machine Learning repository
I’ll not give you detailed tips on how to manage data day to day. However, I’ll be talking about the new DMP requirements. I’m sure that some of you are already contributing to departmental peer review processes. Creating a DMP might be a useful way to share best practices in DM in your subject.
Section emphasis is only on the DMP; specific tips have been removed.
DCC. (2013). Checklist for a Data Management Plan. v.4.0. Edinburgh: Digital Curation Centre. Available online: http://www.dcc.ac.uk/resources/data-management-plans
http://www.dcc.ac.uk/sites/default/files/documents/resource/DMP_Checklist_2013.pdf
an explanation of the existing data sources that will be used by the research project with references
an analysis of the gaps identified between the currently available and required data for the research
- information on the data that will be produced by the research project, including:
data volume; data type, eg qualitative or quantitative data; data quality, formats, standards documentation and metadata; methodologies for data collection
- planned quality assurance and back-up procedures [security/storage]
- plans for management and archiving of collected data
- expected difficulties in data sharing, along with causes and possible measures to overcome these difficulties
- explicit mention of consent, confidentiality, anonymisation and other ethical considerations
- copyright and intellectual property ownership of the data
- responsibilities for data management and curation within research teams at all participating institutions.
The importance of data management planning https://www.youtube.com/watch?v=PXr14Urf268
One of the major challenges is communication between academics and other stakeholders
RO – RDM pages, a key hub for all RDM activities. Explore RDM resources library
You will have (RD) our (CIS) support … invite us to discussion with academics when you will talk on DMP aspects …
Sortable by name (so date first can be useful)
Where version control is important, should be clear in the name. Do not just move to a different folder or name as “draft” or “old”
Distinguishable: don’t have files with the same name in different folders, as this could end up causing confusion if files are copied elsewhere or re-used.
Organisation; helps facilitate future retrieval
Context; helps judge content without opening
Consistency; benefits processing growing number of files
Think of labels which help to retrieve a document later: I might only remember part of the name, but context will help me to judge if this is the file I’m looking for
Capitalisation – UNIX filenames are case-sensitive, so capitalisation alone might distinguish between two entirely different files
Searching for r will not find R
Public link to live Evernote https://www.evernote.com/l/AWc0uDrxCLpAdoXuif3f7g7vK75gh2jFhXY
Introduce Google search concept – keywords phrase
Important if sharing the data on a repository.
ODIN cover page - http://figshare.com/articles/D2_3_First_year_communication_report_including_results_from_first_year_event/843603
DDI – Data Documentation Initiative
METS - Metadata Encoding and Transmission Standard
TEI – Text Encoding Initiative
QDE – QuDEx – Qualitative Data Exchange
Examples – Microsoft Excel example to use? Older versions? Microsoft Works files in earlier versions of Word.
Comprehensive list of file formats http://en.wikipedia.org/wiki/List_of_file_formats
Example data file formats supported by 3TU.Datacentrum - http://datacentrum.3tu.nl/en/what-we-offer/data-archive/#formats
or recommended file formats from UK Data Archive - http://www.data-archive.ac.uk/create-manage/format/formats
Present keychain
Password generator
Default system password prompt
GPG
GPG getting started https://www.gnupg.org/gph/en/manual/c14.html
http://www.data-archive.ac.uk/create-manage/storage/encrypt
https://fedoraproject.org/wiki/Creating_GPG_Keys
Re3data: registry of research data repositories
Databib, another service for locating data repositories.
Datacite: service to provide unique DOIs to data sets for citation, but also include a register of data sets and repositories. Linked to
Data from Figure 7 from: Measurements of Higgs boson production and couplings in diboson final states with the ATLAS detector at the LHC
Be aware of copyright constraints. Applying a Creative Commons licence will not take away your IPR; people will gain the right to freely use and distribute your work, but they must properly attribute it.
Written consent includes an information sheet and a consent form signed by the participant.
Part of the EPSRC project to develop new research methods in real life - http://www.reallifemethods.ac.uk
Example consent forms for interviews and children's participation: http://www.data-archive.ac.uk/create-manage/consent-ethics/consent?index=3
An information sheet should cover topics such as:
Purpose of the research
Benefits and risk
Terms for withdrawal
Usage of data during project and after
Ethical use of data
Example landing page with citation information http://esds.ac.uk/doi/?sn=6614
Citation obtained from http://crosscite.org/citeproc/
Different influences -> different plans
Broader: country, body of foundation, outcome – commercial or public domain, whether the work is reproducible or not
Funder: desirable place for long-term curation, data in certain formats
Internal requirements: institutional repositories, self-imposed ethics, softer influences related to disciplinary difference or even personal preferences
Publisher: ownership of copyright signed over not compatible with institutional policies
Legal: the UK/EU legislation – such as recent Dropbox issue – safe harbour agreements etc
Legal: Example with paediatric research, legal requirements to seek consent once children are grown up
Cover image: The Spanish Cucumber E. coli. In May 2011, there was an outbreak of an unusual Shiga toxin-producing strain of E. coli, beginning in Hamburg in Germany. This has been dubbed the ‘Spanish cucumber’ outbreak because the bacteria were initially thought to have come from cucumbers produced in Spain. … this genome was analysed within weeks because of a global and open effort; data about the strain’s genome sequence were released freely over the internet as soon as they were produced.