A talk given at the Geological Society of London, UK on 2016/03/09 as part of the Lyell meeting on Palaeoinformatics. http://www.geolsoc.org.uk/lyell16 #lyell16
2. About Me
Currently a Postdoc at: plantsci.cam.ac.uk
A Fellow of ('Class of 2016'): software.ac.uk/fellows
A researcher with: contentmine.org
3. About This Talk (A little warning!)
• Don't expect to see much biology in this talk
• I'm going to talk about informatics
• I will focus on context, background and methods more than on 'results' per se
• There will be more questions than answers :)
5.
• New
• Open Data
• Easy-to-use
• Quick
• Images
• Audio
• Interactive Maps
• Citable
• API access
• Open Source Infrastructure
It's not KE Emu :)
6. What I want to do:
link specimen records to their mentions in the literature
"Micro-computed tomography scan slice through four bat skulls, displaying the relative position of the three semicircular canals within the skull. Scans are from the following species: (A) Pteropus rodricensis (BMNH.76.3.15.14); ..."
NHM Data Portal Link (Stable, Unique Identifier)
http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
Article DOI (Stable, Unique Identifier)
http://dx.doi.org/10.1371/journal.pone.0061998
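One way to picture such a link is as a small record that pairs the specimen's portal identifier with the article DOI. This is only a sketch: the class name `SpecimenMention` and its fields are hypothetical and are not part of the NHM Data Portal or any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecimenMention:
    """One mention of a museum specimen in an article (hypothetical schema)."""
    specimen_code: str  # the code exactly as printed in the paper
    portal_url: str     # stable identifier on the NHM Data Portal
    article_doi: str    # stable identifier of the citing article

    def doi_url(self) -> str:
        # Build a resolvable link from the bare DOI via the doi.org resolver.
        return f"https://doi.org/{self.article_doi}"


mention = SpecimenMention(
    specimen_code="BMNH.76.3.15.14",
    portal_url="http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408",
    article_doi="10.1371/journal.pone.0061998",
)
print(mention.doi_url())
```

Because both ends are stable identifiers, the same record can be read in either direction: from specimen to papers, or from paper to specimens.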
7. 114,000,000
scholarly papers available online
36,000,000 of which are
"Biology" / "Environmental Studies" / "Geosciences" / "Multidisciplinary"
Khabsa, M. and Giles, C. L. 2014. The number of scholarly documents on the public web. PLoS ONE
8. Sadly, the vast majority of papers are only "available" online to paying subscribers, and no institution in the world has access to everything. Not even close to everything!
9. In 2016, libraries pay subscriptions, and individuals pay per-article fees, to access even out-of-copyright works??
http://outofcopyright.eu/rights-after-digitisation/
11. This is what a PDF looks like
PDF is NOT a good method of exchanging information
12. HTML is better, but lacks standardisation
+ italics & bold preserved, semantic links to figures & tables
- lacks standardisation
13. The industry standard format for scholarly articles is JATS XML
• The Journal Article Tag Suite (JATS), standardised as NISO Z39.96-2015, defines a set of XML elements and attributes for tagging journal articles
• Standardising the format of digital scholarly publications is HIGHLY desirable
e.g. for this project, knowing whether the string 'NHM' occurs in the Materials section rather than the Acknowledgements section is hugely helpful.
This is much harder to do with PDF/HTML.
Section-based search is already implemented in Europe PMC!
• Section level search functionality in Europe PMC. Kafkas et al (2015) J Biomed Semantics
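A minimal sketch of why tagged sections help: with JATS-style XML, a few lines of Python can report which section a string occurs in. The XML fragment below is invented for illustration, and the function name `sections_mentioning` is hypothetical; real articles are larger and nest sections more deeply.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written JATS-like fragment, invented for illustration only.
JATS_SAMPLE = """<article>
  <body>
    <sec sec-type="materials|methods">
      <title>Materials and Methods</title>
      <p>Specimens were borrowed from the NHM, London.</p>
    </sec>
    <sec sec-type="results">
      <title>Results</title>
      <p>All skulls were scanned successfully.</p>
    </sec>
  </body>
  <back>
    <ack><p>We thank the NHM library staff.</p></ack>
  </back>
</article>"""


def sections_mentioning(xml_text: str, term: str) -> list:
    """Return labels of the top-level sections whose text contains `term`."""
    root = ET.fromstring(xml_text)
    hits = []
    # Body sections: label each by its <title>, falling back to sec-type.
    for sec in root.findall("./body/sec"):
        label = sec.findtext("title") or sec.get("sec-type", "untitled")
        if term in "".join(sec.itertext()):
            hits.append(label)
    # In JATS, acknowledgements sit in <back><ack>, outside the body.
    for ack in root.findall("./back/ack"):
        if term in "".join(ack.itertext()):
            hits.append("Acknowledgements")
    return hits


print(sections_mentioning(JATS_SAMPLE, "NHM"))
```

A hit in Materials and Methods suggests a specimen was actually used; a hit only in Acknowledgements probably does not.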
14. A plea for full text XML
A minority of journals do not provide full text XML
Do: PLOS, eLife, PeerJ, Pensoft, Wiley, Elsevier, Springer, NPG, Ubiquity Press, Copernicus, Hindawi, MDPI
Do not: Geological Society of London Publications, Magnolia Press, a long tail of smaller publishers
16. Image credit: Ubiquity Press
http://ubiquitypress.tumblr.com/post/96012592921/the-right-to-read-is-the-right-to-mine
UK Copyright Law has changed recently, giving a specific copyright exemption for non-commercial text and data mining work
17. A complicated, fragmented landscape of relevant journals
Nature + Science + PNAS + Phytotaxa + Zootaxa
BioOne Journals (131)
Springer Journals (32)
Wiley Journals (22)
Taylor & Francis Journals (14)
Elsevier Journals (12)
Oxford University Press Journals (8)
SciELO Journals (7) [Open Access but not in PMC]
Ecological Society of America Journals (6)
Geological Society Journals (4)
CSIRO Journals (4)
Cambridge University Press Journals (3)
Royal Society Journals (2)
Journal-omics!
18. I discover 'new' journals every week
e.g. last week I 'found' Oryctos (published 1998-2010), still behind a paywall. Does anyone have access to this journal? Please let me know.
http://www.dinosauria.org/oryctos.php
How are we meant to achieve a comprehensive aggregation of research literature (to do rigorous science, inclusive of all the evidence) when it is so unhelpfully scattered and we don't even know where it all is?
20. I don't just find in-text mentions.
I'm trying to match them up to our NHM Data Portal records too!
Specimens in RED do not appear to be on the Data Portal ...yet
Blue globe represents a PLOS ONE paper
21. Searching ALL full texts is not enough!!!
A significant number of specimens are probably "hiding out" in supplementary data files of all sorts of formats.
Google Scholar does not index supplementary information (SI)
Web of Science doesn't either
Nor does Scopus
At scale, journal-held supplementary data files are the "darkest corners" of science
"Specimens were deposited in the collections of the California Academy of Sciences' Department of Herpetology (CAS), the British Museum of Natural History (BMNH) and of author GJM (Table S1)"
10.1371/journal.pone.0104628
http://rossmounce.co.uk/2015/06/20/deep-indexing-supplementary-data-files/
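As a rough illustration of what indexing supplementary files could involve, the sketch below scans a folder of downloaded supplements for BMNH-style codes. It is an assumption-laden toy: the function name `scan_supplements` and the pattern are hypothetical, and only plain-text formats are read, whereas real supplements also arrive as PDF, spreadsheets, archives and more.

```python
import re
from pathlib import Path

# Deliberately narrow, illustrative pattern: "BMNH" then a plain or dotted number.
SPECIMEN_RE = re.compile(r"\bBMNH[ .]?\d+(?:\.\d+)*")

# Only these suffixes are read as text; binary formats need their own extractors.
TEXT_SUFFIXES = {".txt", ".csv", ".tsv"}


def scan_supplements(folder):
    """Map each text-like file name in `folder` to the specimen codes found in it."""
    found = {}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix.lower() not in TEXT_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        codes = SPECIMEN_RE.findall(text)
        if codes:
            found[path.name] = codes
    return found
```

Calling `scan_supplements("supplements/")` on a folder of downloaded files returns a dictionary keyed by file name, so a voucher listed only in "Table S1" becomes as findable as one in the main text.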
22. Why write such descriptive papers in natural language? Keep data as data!
The above was published in 2013(!)
23. Almost nothing in Nature & Science "full (short) text"
Context: 15 years' worth of full text research in Nature & Science examined
Science: only 11 NHM specimens found in 39,600 full texts.
Nature: similar story. <30 specimens in 14,132 full texts.
Clearly there are more, but it's all buried in supplementary materials :(
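To put those counts on a common scale, they can be expressed as specimens found per thousand full texts. The helper name `per_thousand` is made up for this sketch, and the Nature figure is an upper bound because the slide only gives "<30".

```python
def per_thousand(specimens, full_texts):
    """Specimens found per 1,000 full texts searched."""
    return 1000 * specimens / full_texts


science_rate = per_thousand(11, 39_600)         # roughly 0.28 per thousand
nature_upper_bound = per_thousand(30, 14_132)   # below roughly 2.12 per thousand

print(f"Science: {science_rate:.2f} per 1,000 full texts")
print(f"Nature:  fewer than {nature_upper_bound:.2f} per 1,000 full texts")
```

Either way, that is well under one specimen per hundred papers in the main text.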
24. Blue globe represents a PLOS ONE paper; grey represents an NHMUK specimen
Very few specimens occur in more than one paper
Can you guess what BMNH 37001 is?
Hint: it's a very famous specimen!
25. Huge variation in how specimens are cited (not helpful!)
PI AZ 8459            TEXSpruce6067
BM000922891           NYRaz054
BMNH(E)609062         MSB00509
Belize_CW_All_1071    F1629082
BM-BRIT-EURO 3948     OR.5379
"BMNH" is not necessarily British Museum of Natural History (UK).
It can also be Beijing Museum of Natural History (CN) or Bell Museum of Natural History (US)
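The sketch below shows, under stated assumptions, how a pattern might pick up a few of the BMNH spellings seen in this talk, and why the acronym alone cannot settle which institution is meant. The pattern and the names `BMNH_RE`, `BMNH_CANDIDATES` and `find_bmnh_codes` are hypothetical; it would miss many of the other formats listed above.

```python
import re

# Candidate expansions of "BMNH" named on the slide; the acronym alone is ambiguous.
BMNH_CANDIDATES = [
    "British Museum of Natural History (UK)",
    "Beijing Museum of Natural History (CN)",
    "Bell Museum of Natural History (US)",
]

# Illustrative pattern for spellings such as
# "BMNH 37001", "BMNH.76.3.15.14" and "BMNH(E)609062".
BMNH_RE = re.compile(r"\bBMNH\s*(?:\(([A-Z])\))?[\s.]*(\d+(?:\.\d+)*)")


def find_bmnh_codes(text):
    """Return (bracketed_letter_or_None, number) pairs for BMNH-style codes."""
    return [(m.group(1), m.group(2)) for m in BMNH_RE.finditer(text)]


sample = "Specimens BMNH 37001, BMNH.76.3.15.14 and BMNH(E)609062 were examined."
for letter, number in find_bmnh_codes(sample):
    print(letter, number, "-> one of", len(BMNH_CANDIDATES), "possible institutions")
```

Resolving a match to one institution needs extra context, such as the authors' stated repository or a lookup against the portal records.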
26. Where possible use standard/permanent identifiers
Want to discuss a particular collection? Use its official identifier from the Global Registry of Scientific Collections (GRSciColl)
http://grscicoll.org/
Which for the Natural History Museum, London (UK) is: NHMUK
http://biocol.org/urn:lsid:biocol.org:col:34665
Want to cite the BM Archaeopteryx specimen?
NHMUK PV OR 37001
http://data.nhm.ac.uk/object/57ee3bf1-0a74-4ae4-a588-ba9ea8dc5265
27. Beyond linking: repatriation of knowledge
Openly-licensed data on specimens, published elsewhere, could be re-incorporated back into the online museum catalogue: a one-stop shop for information.
This is a CT scan of "BMNH 76.3.15.14".
Without mining, I wouldn't know this data exists.
Perhaps it could also be made available on the portal?
http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
Credit: Davies KTJ, Bates PJJ, Maryanto I, Cotton JA, Rossiter SJ (2013) The Evolution of Bat Vestibular Systems in the Face of Potential Antagonistic Selection Pressures for Flight and Echolocation. PLoS ONE 8(4): e61998. doi:10.1371/journal.pone.0061998
28. Does published info make it back "home" to the collections?
BMNH 2013.2.13.3 is on the portal as "Petrochromis nov.sp. Takahashi"
I found it (by text mining) here: http://dx.doi.org/10.1007/s10228-014-0396-9
It's now called Petrochromis horii n. sp., according to the paper.
What mechanisms are there to feed newer information back into the collection?
Content mining could definitely help keep collections data up-to-date!
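A minimal sketch of how mined mentions might flag records for a curator, assuming both sides are available as simple dictionaries. The field names and the function `needs_review` are hypothetical; this is not how the Data Portal handles updates, and any real change would need human checking.

```python
def _normalise(name):
    # Case- and whitespace-insensitive comparison; real name matching needs more care.
    return " ".join(name.lower().split())


def needs_review(portal_name, published_name):
    """True when the name mined from a paper differs from the portal's current name."""
    return _normalise(portal_name) != _normalise(published_name)


portal_record = {"code": "BMNH 2013.2.13.3", "name": "Petrochromis nov.sp. Takahashi"}
mined_record = {
    "code": "BMNH 2013.2.13.3",
    "name": "Petrochromis horii",
    "doi": "10.1007/s10228-014-0396-9",
}

if portal_record["code"] == mined_record["code"] and needs_review(
    portal_record["name"], mined_record["name"]
):
    print(f"{portal_record['code']}: check name against doi:{mined_record['doi']}")
```

The output is a prompt for review, with the DOI as evidence, not an automatic overwrite of the catalogue.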
29. Can we create a (better) digital NHM metadata catalogue
entirely from the literature, hundreds of years before the NHM
themselves complete their own digitisation programme?
Given funding and time, perhaps...
30. Acknowledgements
Sincere thanks to:
Aime Rankin for help with the project
The NHM Library staff, particularly Sarah Vincent for actively supporting my content mining
Nancy Chillingsworth (IPR, NHM London)
Mark Wilkinson (Life Sciences, NHM London)
Peter Murray-Rust & the ContentMine team
Vince Smith (Life Sciences, NHM London)
Ben Scott (NHM Data Portal Lead Architect)
Rod Page (University of Glasgow)
All of the Biodiversity Informatics team
http://contentmine.org/
For a more detailed version of this talk on YouTube see: bit.ly/nhmlink