Revolutionising the Journal through Big Data Computational Research


  1. Revolutionizing the Journal through Big Data Computational Research
     Amye Kenall, Journal Development Manager, Open Data
     DataCite Annual Conference, Inist-CNRS, Vandoeuvre-lès-Nancy, France
     26 August 2014
  2. Who are we?
     • Founded in 2000 (bought by Springer in 2008)
     • Publish over 260 open access journals
     • ~25,000 peer-reviewed research articles published annually
     • Genomics and computational biology make up a significant fraction, e.g. Genome Biology, BMC Genomics, BMC Bioinformatics
     • Other key fields include Public Health / Global Health / Infectious Disease, and Cancer
     • All research articles are CC-BY licensed for reuse
     • Since mid-2013, all data is covered by a CC0 rights waiver
  3. Data reuse @BioMedCentral
     • Authors of all journals are strongly encouraged to provide underlying datasets; this is required on a select number of journals (e.g. Genome Biology, Genome Medicine, GigaScience)
     • CC0 + CC-BY 4.0 by default
     In the works…
     • Interactive tabular data
     • DOIs for all additional files
     • Searchability of additional files
     • Data Citation clearly tagged in XML to aid harvesting, e.g. by the Data Citation Index
     • Availability of Data section and Data Citation
     • Encourage use of ISA-TAB (especially GigaScience and BMC Research Notes)
  5. Journal, data-platform and database for large-scale data
     In conjunction with
  7. Linking and Citation
  8. Publishing Reproducible Science: SOAPdenovo2, a case study
  14. Lessons Learned?
     • With enough work, results can be replicated with a push of a button.
     • But a lot of work costs a lot of money! No one would pay an APC (article processing charge) that reflects that cost.
     • The exercise teaches a huge amount about the study and provides a lot of information not present in the paper.
     • It needs to happen before publication.
  15. Reproducibility of computational research
     • Computational research should in principle be easier to replicate/reproduce than bench studies
     • However, practical issues get in the way
     • Even if source code is shared, reproducing the entire technical setup, porting software, gathering appropriate input data and rerunning the analysis take significant effort
     • This means readers and even reviewers don’t bother
     • We would like to reduce this ‘activation energy’
  16. Strong interest from potential partners
  17. Key technologies
  18. [Diagram labels: Article, Technologies, Partners, Journal]
  24. Flexible management/deployment of packaged data/analysis suites using VM infrastructure
  25. Complementary roles of publishers, academia, and cloud providers
     • Publishers have a role in enforcing community standards
     • Public/academic databases can provide credible long-term archiving for key data, with a focus on curation and metadata standards
     • Academic grid computing infrastructure can give researchers access to large-scale computing resources
     • Commercial cloud providers universalize/democratize access to large-scale computing. Even if you are not at an institution with its own facilities, you can carry out high-end computations. No bureaucracy/politics: simply pay per CPU-hour.
  26. Specific challenges with respect to data
     • To what extent can/should datasets be included in the VM/suite or pulled in externally?
     • How can we avoid the cost of moving data around as it gets bigger and bigger?
     • To what extent are cross-domain standards for referring to and pulling in underlying datasets feasible? Dataset DOIs typically point to metadata.
     • Multiple versions of datasets. To what extent is it practical, when dealing with evolving datasets/databases, to make them available as reproducible snapshots?
     • Culture of data sharing. How do we get authors to share their data?
  27. Conclusions
     • With big data and computational tools, research is becoming more “reproducible/reusable”
     • The infrastructure is out there; we need to do a better job of using it
     • What authors need to communicate their research is also changing, and as publishers we must respond
     • Clearly, publishers have a role, with other organisations, in setting some community standards
     • It took a few hundred years, but publishing is now getting exciting
  28. Questions?
     “One reason that the worldwide web worked was because people reused each other’s content in ways never imagined or achieved by those who created it. The same will be true of open data.”
     – Tim Berners-Lee and Nigel Shadbolt, The Times, New Year’s Eve 2011
     Amye Kenall, Journal Development Manager (Open Data), BioMed Central
     @AmyeKenall (also @OpenDataBMC)
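
Slide 26 asks whether evolving datasets can be made available as reproducible snapshots. One common way to approach this, not something the talk itself prescribes, is to record a checksum for every file in a dataset at publication time and compare against it later. The sketch below is a minimal illustration using only the Python standard library; the function names (`sha256_of`, `build_manifest`, `changed_files`) and the idea of a per-directory manifest are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict:
    """Map each file's relative path to its checksum (the hypothetical 'snapshot')."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(data_dir: Path, manifest: dict) -> list:
    """List files that were added, removed or altered since the manifest was built."""
    current = build_manifest(data_dir)
    return sorted(
        name
        for name in set(current) | set(manifest)
        if current.get(name) != manifest.get(name)
    )
```

A manifest built this way could be stored alongside the article or the VM image; an empty result from `changed_files` would then indicate that the data on disk still matches the published snapshot.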

