Museum Impact
Linking-up our specimens with
research published on them
Dr Ross Mounce
@rmounce
Talk Structure
● Background: the collections, the research literature
● Interesting things you should know about access to research
○ The costs of knowledge $$$
● Examples of content mining
○ Including a video demo!
● My work (in progress) on finding NHM specimens in recent literature
Source: http://www.nhm.ac.uk/our-science/collections.html © The Trustees of the Natural History Museum, London
● New
● Open Data
● Easy-to-use
● Quick
● Images
● Audio
● Interactive Maps
● Citable
● API access
● Open Source Infrastructure
It’s not KE Emu :)
What I want to do:
link specimen records to their mentions in the literature
“Micro-computed tomography scan slice through four bat skulls, displaying the relative
position of the three semicircular canals within the skull. Scans are from the following
species: (A) Pteropus rodricensis (BMNH.76.3.15.14); …”
NHM Data Portal Link (Stable, Unique Identifier)
http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
Article DOI (Stable, Unique Identifier)
http://dx.doi.org/10.1371/journal.pone.0061998
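A minimal sketch of the linking idea, assuming a simplified pattern for registration numbers: the prefix BMNH or NHMUK, an optional dot or space, then dot-separated groups of digits. Real registration formats vary far more than this. The pattern and the function name `find_specimen_codes` are illustrative assumptions, not the pipeline used for this work.

```python
import re

# Assumed, simplified pattern: institutional prefix, optional separator,
# then dot-separated digit groups, e.g. "BMNH.76.3.15.14" or "BMNH 37001".
SPECIMEN_RE = re.compile(r"\b(BMNH|NHMUK)[.\s]?(\d+(?:\.\d+)*)")


def find_specimen_codes(text):
    """Return normalised 'PREFIX number' strings for each match in text."""
    return [f"{m.group(1)} {m.group(2)}" for m in SPECIMEN_RE.finditer(text)]


caption = ("Scans are from the following species: "
           "(A) Pteropus rodricensis (BMNH.76.3.15.14); ...")
doi = "10.1371/journal.pone.0061998"

# Each link pairs a specimen mention with the article it was found in.
links = [(code, doi) for code in find_specimen_codes(caption)]
print(links)
```

Each pair could then be resolved against a Data Portal record to give the specimen-to-article link shown above.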
114,000,000
scholarly papers available online
36,000,000 of which are
‘Biology’ / ‘Environmental Studies’ / ‘Geosciences’ / ‘Multidisciplinary’
Khabsa, M. and Giles, C. L. 2014. The number of scholarly documents on the public web. PLoS ONE
Sadly, the vast majority of papers are only ‘available’ online to paying subscribers and
no institution in the world has access to everything. Not even close to everything!
Cheryl Hall (2014) FOI request https://www.whatdotheyknow.com/request/academic_journal_subscription_co
We rent access to knowledge. Companies profiteer from it
2004/05 £357,197.79
2005/06 £383,214.29
2006/07 £340,690.33
2007/08 £381,526.57
2008/09 £441,706.36
2009/10 £437,539.71
2010/11 £430,105.08
2011/12 £449,515.12
2012/13 £469,007.50
2013/14 £494,913.01
10-year-total: £4,185,415.76
Tax Year   Revenue   Profit   Profit Margin
2004       £1363m    £460m    33.75%
2005       £1436m    £449m    31.25%
2006       £1521m    £465m    30.57%
2007       £1507m    £477m    31.65%
2008       £1700m    £568m    33.41%
2009       £1985m    £693m    34.91%
2010       £2026m    £724m    35.74%
2011       £2058m    £768m    37.30%
2012       £2063m    £780m    37.81%
2013       £2126m    £826m    38.85%
Source: RELX Group (Parent company of Elsevier) Company Reports
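For readers who want to check the margin column, a short sketch of the arithmetic (profit divided by revenue), using two rows copied from the table. The function name `margin_percent` is illustrative. A couple of the other rows differ from the printed margins in the second decimal place, presumably because the company reports use unrounded figures.

```python
# (revenue, profit) in millions of pounds, copied from the table above
figures = {2004: (1363, 460), 2013: (2126, 826)}


def margin_percent(revenue, profit):
    """Profit as a percentage of revenue, rounded to two decimal places."""
    return round(100 * profit / revenue, 2)


for year, (revenue, profit) in figures.items():
    print(year, margin_percent(revenue, profit))
```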
Actually, the NHM’s annual bill isn’t bad compared to others
Source: Lawson S and Meghreblian B. (2015) Journal subscription
expenditure of UK higher education institutions. F1000Research
http://shiny.retr0.me/journal_costs/
Content Mining provides more bang for your buck
Making fuller use of our expensively provisioned access
● If the NHM is going to pay £500,000 per year to rent journals, why not use the access to this resource to its fullest?
● I can’t read everything with my human eyes but… computers can!
● If you can process one document with a computer, you can process a million: content mining
Recent examples of Content Mining
Fig. 6 from the paper: brachiopod body-size estimates
Red line: humans
Grey bars: machines (PaleoDeepDive)
Better than PaleoDB?
I think so. PDD is more clearly linked to evidence than PDB.
Provenance matters.
Recent examples of Content Mining (Images)
3-second image analysis
source: 10.1099/ijs.0.65212-0
(Zymobacter_palmae:261,((((Chromohalobacter_canadensis:42,
(Chromohalobacter_sarecensis:96,Chromohalobacter_nigrandesensls:154):41):80,
(Chromohalobacter_marismortui:125,Chromohalobacter_beijerinckii:103):164):61,
(Chromohalobacter_israelensis:11,Chromohalobacter_salexigens:11):92):293,
((Halomonas_halodurans:328,(Halomonas_ventosae:100,(Halomonas_pacifica:116,
(Halomonas_halophila:223,(Halomonas_eurihalina:27,Halomonas_elongate:58):236):
79):41):46):72,(Halomonas_desiderata:187,(Halomonas_pantelleriensis:173,
Halomonas_muralis:190):70):30):110):187);
Outputs re-usable Newick & NeXML; no manual input required
Can replot data, re-analyse, or combine many to make a supertree!
PLUTo Project
Mounce, Murray-Rust, Wills (in prep.)
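To illustrate why machine-readable Newick is re-usable, here is a small sketch that pulls leaf names and branch lengths out of a fragment of the tree above. It uses a regular expression rather than a full Newick parser, which is an assumption made for brevity; a dedicated phylogenetics library would be the safer choice for real work.

```python
import re

# Fragment of the Newick string shown above, closed with ';' so it stands alone.
newick = "(Chromohalobacter_israelensis:11,Chromohalobacter_salexigens:11):92;"

# A leaf is a name directly preceded by '(' or ',' and followed by ':length'.
# Internal branch lengths follow ')' and are therefore skipped.
LEAF_RE = re.compile(r"[(,]\s*([A-Za-z_]+):(\d+(?:\.\d+)?)")


def leaf_lengths(tree):
    """Map each leaf name in a Newick string to its branch length."""
    return {name: float(length) for name, length in LEAF_RE.findall(tree)}


print(leaf_lengths(newick))
```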
How to get a sufficient volume of journal articles?
● The ContentMine (CM) team are actively developing new tools & training workshops to help researchers get into content mining: be it text, data, or image mining
● CM are a not-for-profit Shuttleworth-funded project led by Peter Murray-Rust
● All the software tools are open source and available on github:
https://github.com/ContentMine/
● I’m a Scientific Advisor with the ContentMine
● Try getpapers OR quickscrape to get journal content en masse
https://github.com/ContentMine/getpapers
https://github.com/ContentMine/quickscrape
http://contentmine.org/
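A hedged sketch of driving getpapers from a script. The flags used here (`--query`, `--outdir`, `--xml`, `--limit`) are those listed in the getpapers README at the time of writing; check `getpapers --help` for your installed version. The query string and output folder name are made-up examples.

```python
import shlex


def getpapers_command(query, outdir, limit=None):
    """Build (but do not run) a getpapers command line as a list of arguments."""
    cmd = ["getpapers", "--query", query, "--outdir", outdir, "--xml"]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


# Hypothetical query and output folder, for illustration only.
cmd = getpapers_command('"Natural History Museum" AND BMNH', "nhm_papers", limit=50)
print(shlex.join(cmd))

# To run it for real (requires getpapers to be installed and on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```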
Want to download more than a million (OA) papers?
● No problem. PMC to the rescue!
● PMC has a full-text, Open Access-only subset which you can download easily for free
● >1,100,000 full texts in XML (compressed) is just 16.6GB
Source: Neil Saunders (2014) https://rpubs.com/neilfws/45828
http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
Are there NHM specimens in the PMC OA subset?
PMC is medically focused, so one wouldn’t expect it to be
rich in organismal biology. However, it does hold some relevant content:
ALL of PLOS ONE is in the PMC OA subset.
Over 100,000 articles in that journal alone!
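A minimal sketch of scanning downloaded full-text XML for specimen mentions. It assumes the files carry a `.nxml` extension, that they parse with the standard-library XML parser, and that the simplified BMNH/NHMUK pattern below is good enough. The function names are illustrative.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed, simplified pattern for registration numbers.
SPECIMEN_RE = re.compile(r"\b(?:BMNH|NHMUK)[.\s]?\d+(?:\.\d+)*")


def specimens_in_xml(xml_string):
    """Return the sorted, de-duplicated specimen codes found in one XML document."""
    root = ET.fromstring(xml_string)
    text = " ".join(root.itertext())  # drop the markup, keep the words
    return sorted(set(SPECIMEN_RE.findall(text)))


def scan_directory(folder):
    """Apply the same search to every .nxml file under folder."""
    hits = {}
    for path in Path(folder).rglob("*.nxml"):
        found = specimens_in_xml(path.read_text(encoding="utf-8"))
        if found:
            hits[path.name] = found
    return hits


# Tiny made-up document standing in for a real article.
sample = ("<article><body><p>Holotype <italic>X</italic> (BMNH 37001) "
          "and BMNH.76.3.15.14.</p></body></article>")
print(specimens_in_xml(sample))
```

The same function runs unchanged over one file or a million, which is the point of content mining.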
https://github.com/rossmounce/NHM-specimens
Version-controlled data on github
open for scrutiny & collaboration
Searching ALL full texts is not enough!!!
A significant number of specimens are probably ‘hiding out’ in supplementary data files in all sorts of formats.
Google Scholar does not index supplementary information (SI)
Web of Science doesn’t either
Nor does Scopus
At scale, journal-held supplementary data files are the ‘darkest corners’ of science
“Specimens were deposited in the collections of the California Academy of Sciences' Department of Herpetology (CAS), the British Museum of Natural History (BMNH) and of author GJM (Table S1)” 10.1371/journal.pone.0104628
http://rossmounce.co.uk/2015/06/20/deep-indexing-supplementary-data-files/
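A sketch of what indexing supplementary files could look like, assuming the supplement arrives as a zip archive and that only plain-text members (CSV, TSV, TXT, XML) are searched. Binary formats such as PDF or spreadsheets would need their own text extractors. The function name, the suffix list, and the sample data are all illustrative.

```python
import io
import re
import zipfile

SPECIMEN_RE = re.compile(r"\b(?:BMNH|NHMUK)[.\s]?\d+(?:\.\d+)*")
TEXT_SUFFIXES = (".csv", ".tsv", ".txt", ".xml")


def search_supplement_zip(zip_bytes):
    """Map each text-like member of a zip archive to the specimen codes it mentions."""
    hits = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(TEXT_SUFFIXES):
                continue  # skip formats this sketch cannot read
            text = archive.read(name).decode("utf-8", errors="replace")
            found = SPECIMEN_RE.findall(text)
            if found:
                hits[name] = found
    return hits


# Build a small made-up supplement in memory to try it out.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("Table_S1.csv", "species,voucher\nExample sp.,BMNH 1900.1.1.1\n")
    archive.writestr("Figure_S1.pdf", b"%PDF- not searched")
print(search_supplement_zip(buffer.getvalue()))
```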
I don’t just find in-text mentions.
I’m trying to match them up to our
NHM Data Portal records too!
Specimens in RED do not
appear to be on the Data Portal
...yet
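A minimal sketch of the matching step, assuming in-text mentions differ from catalogue entries only by prefix and separator (for example "BMNH.76.3.15.14" versus "BMNH 76.3.15.14"). The set `portal_numbers` is a hypothetical stand-in for catalogue numbers exported from the Data Portal, and the last mention is made up to show an unmatched case.

```python
import re


def normalise(code):
    """Strip the institutional prefix and separators: 'BMNH.76.3.15.14' -> '76.3.15.14'."""
    code = re.sub(r"^(BMNH|NHMUK)[.\s]*", "", code.strip(), flags=re.IGNORECASE)
    return code.replace(" ", "")


# Hypothetical stand-in for catalogue numbers exported from the Data Portal.
portal_numbers = {"76.3.15.14", "2013.2.13.3"}

mentions = ["BMNH.76.3.15.14", "BMNH 2013.2.13.3", "BMNH 99999"]  # last one is made up
matched = [m for m in mentions if normalise(m) in portal_numbers]
unmatched = [m for m in mentions if normalise(m) not in portal_numbers]

print("matched:", matched)
print("not on the portal (yet):", unmatched)
```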
Blue globe represents a PLOS ONE paper
Very few specimens occur in more than one paper
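The specimen-to-paper network can be summarised with a simple grouping. This sketch assumes the mining output is a list of (specimen, DOI) pairs; the second and third pairs are invented so that one specimen appears in two papers.

```python
from collections import defaultdict

# (specimen, DOI) pairs; the second and third rows are invented for illustration.
pairs = [
    ("BMNH 76.3.15.14", "10.1371/journal.pone.0061998"),
    ("BMNH 0000.0.0.1", "10.0000/example.1"),
    ("BMNH 0000.0.0.1", "10.0000/example.2"),
]

# Group the papers by the specimen they mention.
papers_by_specimen = defaultdict(set)
for specimen, doi in pairs:
    papers_by_specimen[specimen].add(doi)

# Keep only specimens mentioned in more than one paper.
shared = {s: dois for s, dois in papers_by_specimen.items() if len(dois) > 1}
print(shared)
```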
Can you guess what BMNH 37001 is?
Hint: it’s famous!
Grey represents an NHMUK specimen
Mining over 200 subscription access / non-PMC journals
from 2000 to 2015 inclusive
Nature + Science + PNAS + Phytotaxa + Zootaxa
BioOne Journals (131)
Springer Journals (32)
Wiley Journals (22)
Taylor & Francis Journals (14)
Elsevier Journals (12)
Oxford University Press Journals (8)
SciELO Journals (7) [Open Access but not in PMC]
Ecological Society of America Journals (6)
Geological Society Journals (4)
CSIRO Journals (4)
Cambridge University Press Journals (3)
Royal Society Journals (2)
Journal-omics!
Thanks to a recent change in UK copyright law, text and data mining for non-commercial research purposes is legal (in the UK), provided that you have legitimate access to the resource you want to mine, e.g. a paid-for institutional subscription.
http://blogs.lse.ac.uk/impactofsocialsciences/2014/06/04/the-right-to-read-is-the-right-to-mine-tdm/
Image credit: Ubiquity Press
http://ubiquitypress.tumblr.com/post/96012592921/the-right-to-read-is-the-right-to-mine
So far… (very much still in progress)
Almost nothing in Nature & Science ‘full (short) text’
Context: 15 years’ worth of full text research in Nature & Science examined.
Science: only 11 NHM specimens found in ~39,600 texts.
Nature: similar story. <30 specimens in 14,132 ‘full’ texts.
Clearly there are more, but it’s all buried in supplementary materials :(
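To put those counts on a common scale, a quick calculation of specimens found per thousand texts. The Science text count is approximate and the Nature specimen count is an upper bound ("<30"), so both results are rough; the function name is illustrative.

```python
def per_thousand(specimens, texts):
    """Specimens found per 1,000 texts, rounded to two decimal places."""
    return round(1000 * specimens / texts, 2)


science_rate = per_thousand(11, 39_600)   # ~39,600 texts is itself approximate
nature_upper = per_thousand(30, 14_132)   # "<30" specimens, so an upper bound

print(science_rate, nature_upper)
```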
Shoving all the research details into non-searchable
supplementary materials is bad for science
● For the avoidance of doubt, this is not a criticism of authors. This is squarely
aimed at journals that artificially restrict the ‘length’ of research articles online.
e.g. Prüfer, K. et al. 2014. The complete genome sequence of a Neanderthal from
the Altai Mountains. Nature 505, 43-49.
7 pages (in paper), 12 pages (in PDF, with extra data tables & figures)
The supplementary data file?
249 pages!
Someone needs to build a searchable index of
supplementary data. ASAP
Interactive plot: https://plot.ly/~rossmounce/22.embed
“Micro-computed tomography scan slice through four bat skulls, displaying the relative
position of the three semicircular canals within the skull. Scans are from the following
species: (A) Pteropus rodricensis (BMNH.76.3.15.14); …”
NHM Data Portal Link (Stable, Unique Identifier)
http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
Article DOI (Stable, Unique Identifier)
http://dx.doi.org/10.1371/journal.pone.0061998
Huge potential to go beyond mere linking-up of identifiers.
This specimen & others have been CT scanned in the PLOS ONE paper.
We could do data, media and knowledge ‘repatriation’ back to the museum/portal.
Credit: Davies KTJ, Bates PJJ, Maryanto I, Cotton JA, Rossiter SJ (2013)
The Evolution of Bat Vestibular Systems in the Face of Potential
Antagonistic Selection Pressures for Flight and Echolocation. PLoS ONE 8
(4): e61998. doi:10.1371/journal.pone.0061998
Openly-licensed data on specimens, published elsewhere,
could be re-incorporated back into the online museum
catalogue. A one-stop shop for information.
Beyond-linking:
repatriation of knowledge
This is a CT-scan of “BMNH 76.3.15.14”.
Without mining, I wouldn’t know this data exists.
Perhaps it could also be made available on the portal?
http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
Does published info make it back ‘home’ to the collections?
BMNH 2013.2.13.3 on the portal as “Petrochromis nov.sp. Takahashi”
I found it (by text mining) here: http://dx.doi.org/10.1007/s10228-014-0396-9
It’s now called Petrochromis horii n. sp., according to the paper.
What mechanisms are there to update newer information back into the collection?
Content mining could definitely help keep collections data up-to-date!
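One possible mechanism, sketched under assumptions: compare the name held in the catalogue with the name mined from a paper for the same registration number, and flag any difference for a curator to review. The record layout and field names are hypothetical; the values come from the example above.

```python
# Hypothetical record layouts; the values are taken from the example above.
portal_record = {
    "catalogue": "BMNH 2013.2.13.3",
    "name": "Petrochromis nov.sp. Takahashi",
}
mined_mention = {
    "catalogue": "BMNH 2013.2.13.3",
    "name": "Petrochromis horii",
    "doi": "10.1007/s10228-014-0396-9",
}


def propose_update(record, mention):
    """Return a suggested change if the paper's name differs from the catalogue's."""
    if record["catalogue"] != mention["catalogue"]:
        return None  # different specimens, nothing to compare
    if record["name"] == mention["name"]:
        return None  # catalogue already agrees with the paper
    return {
        "catalogue": record["catalogue"],
        "current_name": record["name"],
        "proposed_name": mention["name"],
        "evidence": mention["doi"],
    }


print(propose_update(portal_record, mined_mention))
```

The output is a suggestion for human review, not an automatic edit: a curator would still decide whether the published name should replace the catalogue entry.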
Acknowledgements
● Sincere thanks to:
○ The NHM Library staff, particularly Sarah Vincent for actively supporting my content mining
○ Nancy Chillingsworth (IPR, NHM London)
○ Mark Wilkinson (Life Sciences, NHM London)
○ Peter Murray-Rust & the ContentMine team
○ Vince Smith (Life Sciences, NHM London)
○ Ben Scott (NHM Data Portal Lead Architect)
○ Rod Page (University of Glasgow)
○ All of the Biodiversity Informatics team
http://contentmine.org/
Please ask me questions!
Feedback appreciated :)
@rmounce
ross.mounce@nhm.ac.uk

Weitere ähnliche Inhalte

Was ist angesagt?

Text and Data Mining explained at FTDM
Text and Data Mining explained at FTDMText and Data Mining explained at FTDM
Text and Data Mining explained at FTDMpetermurrayrust
 
Sharing re-usable phylogenetic data: we're not there yet
Sharing re-usable phylogenetic data: we're not there yetSharing re-usable phylogenetic data: we're not there yet
Sharing re-usable phylogenetic data: we're not there yetRoss Mounce
 
Workshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data miningWorkshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data miningRoss Mounce
 
Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literatureAmanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literaturepetermurrayrust
 
Cochrane workshop 2016
Cochrane workshop 2016Cochrane workshop 2016
Cochrane workshop 2016TheContentMine
 
Automatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the LiteratureAutomatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the Literaturepetermurrayrust
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome Trustpetermurrayrust
 
High throughput mining of the scholarly literature
High throughput mining of the scholarly literature High throughput mining of the scholarly literature
High throughput mining of the scholarly literature TheContentMine
 
ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika!ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika!petermurrayrust
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSSOpen software and knowledge for MIOSS
Open software and knowledge for MIOSSpetermurrayrust
 
ContentMine (TDM) at JISC Digifest
ContentMine (TDM) at JISC DigifestContentMine (TDM) at JISC Digifest
ContentMine (TDM) at JISC Digifestpetermurrayrust
 
Open Access for Early Career Researchers
Open Access for Early Career ResearchersOpen Access for Early Career Researchers
Open Access for Early Career ResearchersRoss Mounce
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSS Open software and knowledge for MIOSS
Open software and knowledge for MIOSS TheContentMine
 
Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature TheContentMine
 
High throughput mining of the scholarly literature
High throughput mining of the scholarly literatureHigh throughput mining of the scholarly literature
High throughput mining of the scholarly literaturepetermurrayrust
 
Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literatureAutomatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literaturepetermurrayrust
 
Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature TheContentMine
 
Content Mining of Science in Europe
Content Mining of Science in EuropeContent Mining of Science in Europe
Content Mining of Science in Europepetermurrayrust
 
MESUR: Making sense and use of usage data
MESUR: Making sense and use of usage dataMESUR: Making sense and use of usage data
MESUR: Making sense and use of usage dataHerbert Van de Sompel
 

Was ist angesagt? (20)

Text and Data Mining explained at FTDM
Text and Data Mining explained at FTDMText and Data Mining explained at FTDM
Text and Data Mining explained at FTDM
 
Sharing re-usable phylogenetic data: we're not there yet
Sharing re-usable phylogenetic data: we're not there yetSharing re-usable phylogenetic data: we're not there yet
Sharing re-usable phylogenetic data: we're not there yet
 
Workshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data miningWorkshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data mining
 
Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literatureAmanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature
 
Cochrane workshop 2016
Cochrane workshop 2016Cochrane workshop 2016
Cochrane workshop 2016
 
Automatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the LiteratureAutomatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the Literature
 
Cochrane workshop2016
Cochrane workshop2016Cochrane workshop2016
Cochrane workshop2016
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome Trust
 
High throughput mining of the scholarly literature
High throughput mining of the scholarly literature High throughput mining of the scholarly literature
High throughput mining of the scholarly literature
 
ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika!ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika!
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSSOpen software and knowledge for MIOSS
Open software and knowledge for MIOSS
 
ContentMine (TDM) at JISC Digifest
ContentMine (TDM) at JISC DigifestContentMine (TDM) at JISC Digifest
ContentMine (TDM) at JISC Digifest
 
Open Access for Early Career Researchers
Open Access for Early Career ResearchersOpen Access for Early Career Researchers
Open Access for Early Career Researchers
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSS Open software and knowledge for MIOSS
Open software and knowledge for MIOSS
 
Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature
 
High throughput mining of the scholarly literature
High throughput mining of the scholarly literatureHigh throughput mining of the scholarly literature
High throughput mining of the scholarly literature
 
Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literatureAutomatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature
 
Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature
 
Content Mining of Science in Europe
Content Mining of Science in EuropeContent Mining of Science in Europe
Content Mining of Science in Europe
 
MESUR: Making sense and use of usage data
MESUR: Making sense and use of usage dataMESUR: Making sense and use of usage data
MESUR: Making sense and use of usage data
 

Andere mochten auch

The PLUTo project @iEvoBio 2014
The PLUTo project @iEvoBio 2014The PLUTo project @iEvoBio 2014
The PLUTo project @iEvoBio 2014Ross Mounce
 
How can repositories support the text-mining of their content and why?
How can repositories support the text-mining of their content and why? How can repositories support the text-mining of their content and why?
How can repositories support the text-mining of their content and why? Nancy Pontika
 
Leveraging the power of the web - Rocky Mountain Advanced Computing Conference
Leveraging the power of the web - Rocky Mountain Advanced Computing Conference Leveraging the power of the web - Rocky Mountain Advanced Computing Conference
Leveraging the power of the web - Rocky Mountain Advanced Computing Conference Kaitlin Thaney
 
Subscription costs versus open access costs, & Dissolving journals' boundaries
Subscription costs versus open access costs, & Dissolving journals' boundariesSubscription costs versus open access costs, & Dissolving journals' boundaries
Subscription costs versus open access costs, & Dissolving journals' boundariesAlex Holcombe
 
The OpenCon Intro to Open Data
The OpenCon Intro to Open DataThe OpenCon Intro to Open Data
The OpenCon Intro to Open DataRoss Mounce
 
SocialCite makes its debut at the HighWire Press meeting
SocialCite makes its debut at the HighWire Press meetingSocialCite makes its debut at the HighWire Press meeting
SocialCite makes its debut at the HighWire Press meetingKent Anderson
 
Research publication support for scholars in Brazil: Rising to the challenge
Research publication support for scholars in Brazil: Rising to the challengeResearch publication support for scholars in Brazil: Rising to the challenge
Research publication support for scholars in Brazil: Rising to the challengeRon Martinez
 
Open Access: Which Side Are You On
Open Access: Which Side Are You OnOpen Access: Which Side Are You On
Open Access: Which Side Are You OnJill Cirasella
 
Fifty shades of green and gold: open access to scholarly information
Fifty shades of green and gold: open access to scholarly informationFifty shades of green and gold: open access to scholarly information
Fifty shades of green and gold: open access to scholarly informationhierohiero
 

Andere mochten auch (10)

The PLUTo project @iEvoBio 2014
The PLUTo project @iEvoBio 2014The PLUTo project @iEvoBio 2014
The PLUTo project @iEvoBio 2014
 
How can repositories support the text-mining of their content and why?
How can repositories support the text-mining of their content and why? How can repositories support the text-mining of their content and why?
How can repositories support the text-mining of their content and why?
 
Open Access Publishing, Threat or Opportunity?
Open Access Publishing, Threat or Opportunity?Open Access Publishing, Threat or Opportunity?
Open Access Publishing, Threat or Opportunity?
 
Leveraging the power of the web - Rocky Mountain Advanced Computing Conference
Leveraging the power of the web - Rocky Mountain Advanced Computing Conference Leveraging the power of the web - Rocky Mountain Advanced Computing Conference
Leveraging the power of the web - Rocky Mountain Advanced Computing Conference
 
Subscription costs versus open access costs, & Dissolving journals' boundaries
Subscription costs versus open access costs, & Dissolving journals' boundariesSubscription costs versus open access costs, & Dissolving journals' boundaries
Subscription costs versus open access costs, & Dissolving journals' boundaries
 
The OpenCon Intro to Open Data
The OpenCon Intro to Open DataThe OpenCon Intro to Open Data
The OpenCon Intro to Open Data
 
SocialCite makes its debut at the HighWire Press meeting
SocialCite makes its debut at the HighWire Press meetingSocialCite makes its debut at the HighWire Press meeting
SocialCite makes its debut at the HighWire Press meeting
 
Research publication support for scholars in Brazil: Rising to the challenge
Research publication support for scholars in Brazil: Rising to the challengeResearch publication support for scholars in Brazil: Rising to the challenge
Research publication support for scholars in Brazil: Rising to the challenge
 
Open Access: Which Side Are You On
Open Access: Which Side Are You OnOpen Access: Which Side Are You On
Open Access: Which Side Are You On
 
Fifty shades of green and gold: open access to scholarly information
Fifty shades of green and gold: open access to scholarly informationFifty shades of green and gold: open access to scholarly information
Fifty shades of green and gold: open access to scholarly information
 

Ähnlich wie Museum Impact: Linking Specimens with Research

ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKpetermurrayrust
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neurosciencepetermurrayrust
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in NeuroscienceTheContentMine
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in NeuroscienceTheContentMine
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome TrustTheContentMine
 
Rapid biomedical search
Rapid biomedical search Rapid biomedical search
Rapid biomedical search petermurrayrust
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic BiologyTheContentMine
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biologypetermurrayrust
 
Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016Jisc
 
THOR Workshop - Data Publishing PLOS
THOR Workshop - Data Publishing PLOSTHOR Workshop - Data Publishing PLOS
THOR Workshop - Data Publishing PLOSMaaike Duine
 
ContentMine: Mining the Scientific Literature
ContentMine: Mining the Scientific LiteratureContentMine: Mining the Scientific Literature
ContentMine: Mining the Scientific Literaturepetermurrayrust
 
Why ContentMining is useful
Why ContentMining is usefulWhy ContentMining is useful
Why ContentMining is usefulTheContentMine
 
Why ContentMining is useful
Why ContentMining is usefulWhy ContentMining is useful
Why ContentMining is usefulpetermurrayrust
 
High throughput mining of the scholarly literature: journals and theses
High throughput mining of the scholarly literature: journals and thesesHigh throughput mining of the scholarly literature: journals and theses
High throughput mining of the scholarly literature: journals and thesespetermurrayrust
 
The culture of researchData
The culture of researchDataThe culture of researchData
The culture of researchDatapetermurrayrust
 
Big Data and ContentMining for Libraries
Big Data and ContentMining for LibrariesBig Data and ContentMining for Libraries
Big Data and ContentMining for Librariespetermurrayrust
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesTheContentMine
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social Machinespetermurrayrust
 
High throughput mining of the scholarly literature; talk at NIH
High throughput mining of the scholarly literature; talk at NIHHigh throughput mining of the scholarly literature; talk at NIH
High throughput mining of the scholarly literature; talk at NIHpetermurrayrust
 

Ähnlich wie Museum Impact: Linking Specimens with Research (20)

ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UK
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome Trust
 
Rapid biomedical search
Rapid biomedical search Rapid biomedical search
Rapid biomedical search
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biology
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biology
 
BHL Tech Report
BHL Tech ReportBHL Tech Report
BHL Tech Report
 
Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016
 
THOR Workshop - Data Publishing PLOS
THOR Workshop - Data Publishing PLOSTHOR Workshop - Data Publishing PLOS
THOR Workshop - Data Publishing PLOS
 
ContentMine: Mining the Scientific Literature
ContentMine: Mining the Scientific LiteratureContentMine: Mining the Scientific Literature
ContentMine: Mining the Scientific Literature
 
Why ContentMining is useful
Why ContentMining is usefulWhy ContentMining is useful
Why ContentMining is useful
 
Why ContentMining is useful
Why ContentMining is usefulWhy ContentMining is useful
Why ContentMining is useful
 
High throughput mining of the scholarly literature: journals and theses
High throughput mining of the scholarly literature: journals and thesesHigh throughput mining of the scholarly literature: journals and theses
High throughput mining of the scholarly literature: journals and theses
 
The culture of researchData
The culture of researchDataThe culture of researchData
The culture of researchData
 
Big Data and ContentMining for Libraries
Big Data and ContentMining for LibrariesBig Data and ContentMining for Libraries
Big Data and ContentMining for Libraries
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social Machines
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social Machines
 
High throughput mining of the scholarly literature; talk at NIH
High throughput mining of the scholarly literature; talk at NIHHigh throughput mining of the scholarly literature; talk at NIH
High throughput mining of the scholarly literature; talk at NIH
 

Mehr von Ross Mounce

Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)Ross Mounce
 
Social Media For Researchers
Social Media For ResearchersSocial Media For Researchers
Social Media For ResearchersRoss Mounce
 
Social Media for Science
Social Media for ScienceSocial Media for Science
Social Media for ScienceRoss Mounce
 
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...Ross Mounce
 

Mehr von Ross Mounce (7)

Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
 
Social Media For Researchers
Social Media For ResearchersSocial Media For Researchers
Social Media For Researchers
 
Social Media for Science
Social Media for ScienceSocial Media for Science
Social Media for Science
 
Herding Cats
Herding CatsHerding Cats
Herding Cats
 
Content Mining
Content MiningContent Mining
Content Mining
 
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
 
ProgPal2011
ProgPal2011ProgPal2011
ProgPal2011
 

Museum Impact: Linking Specimens with Research

  • 1. Museum Impact: Linking-up our specimens with research published on them. Dr Ross Mounce, @rmounce
  • 2. Talk Structure ● Background: the collections, the research literature ● Interesting things you should know about access to research ○ The costs of knowledge $$$ ● Examples of content mining ○ Including a video demo! ● My work (in progress) on finding NHM specimens in recent literature
  • 3. Source: http://www.nhm.ac.uk/our-science/collections.html © The Trustees of the Natural History Museum, London
  • 4. ● New ● Open Data ● Easy-to-use ● Quick ● Images ● Audio ● Interactive Maps ● Citable ● API access ● Open Source Infrastructure It’s not KE Emu :)
  • 5. What I want to do: link specimen records to their mentions in the literature “Micro-computed tomography scan slice through four bat skulls, displaying the relative position of the three semicircular canals within the skull. Scans are from the following species: (A) Pteropus rodricensis (BMNH.76.3.15.14); …” NHM Data Portal Link (Stable, Unique Identifier) http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408 Article DOI (Stable, Unique Identifier) http://dx.doi.org/10.1371/journal.pone.0061998
  • 6. 114,000,000 scholarly papers available online 36,000,000 of which are ‘Biology’ / ‘Environmental Studies’ / ‘Geosciences’ / ‘Multidisciplinary’ Khabsa, M. and Giles, C. L. 2014. The number of scholarly documents on the public web. PLoS ONE
  • 7. Sadly, the vast majority of papers are only ‘available’ online to paying subscribers and no institution in the world has access to everything. Not even close to everything!
  • 8. Cheryl Hall (2014) FOI request https://www.whatdotheyknow.com/request/academic_journal_subscription_co We rent access to knowledge. Companies profiteer from it.
       2004/05   £357,197.79
       2005/06   £383,214.29
       2006/07   £340,690.33
       2007/08   £381,526.57
       2008/09   £441,706.36
       2009/10   £437,539.71
       2010/11   £430,105.08
       2011/12   £449,515.12
       2012/13   £469,007.50
       2013/14   £494,913.01
       10-year total: £4,185,415.76
       Tax Year   Revenue   Profit   Profit Margin
       2004       £1363m    £460m    33.75%
       2005       £1436m    £449m    31.25%
       2006       £1521m    £465m    30.57%
       2007       £1507m    £477m    31.65%
       2008       £1700m    £568m    33.41%
       2009       £1985m    £693m    34.91%
       2010       £2026m    £724m    35.74%
       2011       £2058m    £768m    37.30%
       2012       £2063m    £780m    37.81%
       2013       £2126m    £826m    38.85%
       Source: RELX Group (Parent company of Elsevier) Company Reports
  • 9. Actually, the NHM’s annual bill isn’t bad compared to others Source: Lawson S and Meghreblian B. (2015) Journal subscription expenditure of UK higher education institutions. F1000Research http://shiny.retr0.me/journal_costs/
  • 10. Content Mining provides more bang for your buck Making fuller use of our expensively provisioned access ● If the NHM is going to pay £500,000 per year to rent journals, why not use the access to this resource to its fullest? ● I can’t read everything with my human eyes but… computers can! ● If you can process one document with a computer, you can process a million: content mining
  • 11. Recent examples of Content Mining. Fig. 6 from the paper: Brachiopod body-size estimates. Red line: humans. Grey bars: machines (PaleoDeepDive). Better than PaleoDB? I think so. PDD is more clearly linked to evidence than PDB. Provenance matters.
  • 12. Recent examples of Content Mining (Images). 3-second image analysis. Source: 10.1099/ijs.0.65212-0. Output: (Zymobacter_palmae:261,((((Chromohalobacter_canadensis:42, (Chromohalobacter_sarecensis:96,Chromohalobacter_nigrandesensls:154):41):80, (Chromohalobacter_marismortui:125,Chromohalobacter_beijerinckii:103):164):61, (Chromohalobacter_israelensis:11,Chromohalobacter_salexigens:11):92):293, ((Halomonas_halodurans:328,(Halomonas_ventosae:100,(Halomonas_pacifica:116, (Halomonas_halophila:223,(Halomonas_eurihalina:27,Halomonas_elongate:58):236): 79):41):46):72,(Halomonas_desiderata:187,(Halomonas_pantelleriensis:173, Halomonas_muralis:190):70):30):110):187); Outputs re-usable Newick & NeXML; no manual input required. Can replot data, re-analyse, combine many to make a supertree! PLUTo Project: Mounce, Murray-Rust, Wills (in prep.)
  • 13. How to get a sufficient volume of journal articles? ● The ContentMine (CM) team are actively developing new tools & training workshops to help researchers get into content mining: be it text, data, or image mining ● CM are a not-for-profit Shuttleworth-funded project led by Peter Murray-Rust ● All the software tools are open source and available on github: https://github.com/ContentMine/ ● I’m a Scientific Advisor with the ContentMine ● Try getpapers OR quickscrape to get journal content en masse https://github.com/ContentMine/getpapers https://github.com/ContentMine/quickscrape http://contentmine.org/
  • 14.
  • 15. Want to download more than a million (OA) papers? ● No problem. PMC to the rescue! ● PMC has a full text Open Access-only subset which you can download easily for free ● >1,100,000 full texts in XML (compressed) is just 16.6GB Source: Neil Saunders (2014) https://rpubs.com/neilfws/45828 http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
  • 16. Are there NHM specimens in the PMC OA subset? PMC is medically-focused, so one wouldn’t expect it to be rich in organismal biology, however …some relevant content ALL of PLOS ONE is in the PMC OA subset. Over 100,000 articles in that journal alone!
  • 18. Searching ALL full texts is not enough!!! A significant number of specimens are probably ‘hiding-out’ in supplementary data files of all sorts of formats. Google Scholar does not index SI. Web of Science doesn’t either. Nor does Scopus. At scale, journal-held supplementary data files are the ‘darkest corners’ of science. “Specimens were deposited in the collections of the California Academy of Sciences' Department of Herpetology (CAS), the British Museum of Natural History (BMNH) and of author GJM (Table S1)” 10.1371/journal.pone.0104628 http://rossmounce.co.uk/2015/06/20/deep-indexing-supplementary-data-files/
  • 19. I don’t just find in-text mentions. I’m trying to match them up to our NHM Data Portal records too! Specimens in RED do not appear to be on the Data Portal ...yet Blue globe represents a PLOS ONE paper
  • 20. Blue globe represents a PLOS ONE paper Very few specimens occur in more than one paper Can you guess what BMNH 37001 is? Hint: it’s famous! Grey represents an NHMUK specimen
  • 21. Mining over 200 subscription access / non-PMC journals from 2000 to 2015 inclusive: Nature + Science + PNAS + Phytotaxa + Zootaxa; BioOne Journals (131); Springer Journals (32); Wiley Journals (22); Taylor & Francis Journals (14); Elsevier Journals (12); Oxford University Press Journals (8); SciELO Journals (7) [Open Access but not in PMC]; Ecological Society of America Journals (6); Geological Society Journals (4); CSIRO Journals (4); Cambridge University Press Journals (3); Royal Society Journals (2). Journal-omics!
  • 22. Thanks to a recent change in UK copyright law, text and data mining for non-commercial research purposes is legal in the UK, provided that you have legitimate access to the resource you want to mine (e.g. a paid-for institutional subscription). http://blogs.lse.ac.uk/impactofsocialsciences/2014/06/04/the-right-to-read-is-the-right-to-mine-tdm/
  • 23. Image credit: Ubiquity Press http://ubiquitypress.tumblr.com/post/96012592921/the-right-to-read-is-the-right-to-mine
  • 24. So far… (very much still in progress)
  • 25. Almost nothing in Nature & Science ‘full (short) text’. Context: 15 years’ worth of full text research in Nature & Science examined. Science: only 11 NHM specimens found in ~39,600 texts. Nature: similar story. <30 specimens in 14,132 ‘full’ texts. Clearly there are more, but it’s all buried in supplementary materials :(
  • 26. Shoving all the research details into non-searchable supplementary materials is bad for science ● For the avoidance of doubt, this is not a criticism of authors. This is squarely aimed at journals that artificially restrict the ‘length’ of research articles online. e.g. Prüfer, K. et al. 2014. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505, 43-49. 7 pages (in paper), 12 pages (in PDF, with extra data tables & figures). The supplementary data file? 249 pages!
  • 27. Someone needs to build a searchable index of supplementary data. ASAP
  • 29. “Micro-computed tomography scan slice through four bat skulls, displaying the relative position of the three semicircular canals within the skull. Scans are from the following species: (A) Pteropus rodricensis (BMNH.76.3.15.14); …” NHM Data Portal Link (Stable, Unique Identifier) http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408 Article DOI (Stable, Unique Identifier) http://dx.doi.org/10.1371/journal.pone.0061998 Huge potential to go beyond mere linking-up of identifiers. This specimen & others have been CT scanned in the PLOS ONE paper. We could do data, media and knowledge ‘repatriation’ back to the museum/portal.
  • 30. Beyond linking: repatriation of knowledge. This is a CT-scan of “BMNH 76.3.15.14”. Without mining, I wouldn’t know this data exists. Perhaps it could also be made available on the portal? http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408 Openly-licensed data on specimens, published elsewhere, could be re-incorporated back into the online museum catalogue. A one-stop shop for information. Credit: Davies KTJ, Bates PJJ, Maryanto I, Cotton JA, Rossiter SJ (2013) The Evolution of Bat Vestibular Systems in the Face of Potential Antagonistic Selection Pressures for Flight and Echolocation. PLoS ONE 8 (4): e61998. doi:10.1371/journal.pone.0061998
  • 31. Does published info make it back ‘home’ to the collections? BMNH 2013.2.13.3 is on the portal as “Petrochromis nov.sp. Takahashi”. I found it (by text mining) here: http://dx.doi.org/10.1007/s10228-014-0396-9 According to the paper, it’s now called Petrochromis horii n. sp. What mechanisms are there to feed newer information back into the collection? Content mining could definitely help keep collections data up-to-date!
  • 32. Acknowledgements ● Sincere thanks to: ○ The NHM Library staff, particularly Sarah Vincent for actively supporting my content mining ○ Nancy Chillingsworth (IPR, NHM London) ○ Mark Wilkinson (Life Sciences, NHM London) ○ Peter Murray-Rust & the ContentMine team ○ Vince Smith (Life Sciences, NHM London) ○ Ben Scott (NHM Data Portal Lead Architect) ○ Rod Page (University of Glasgow) ○ All of the Biodiversity Informatics team http://contentmine.org/
  • 33. Please ask me questions! Feedback appreciated :) @rmounce ross.mounce@nhm.ac.uk
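
The matching step described in slides 5, 19 and 31 (finding specimen numbers in article text and tying them to article DOIs) can be sketched roughly as follows. This is a minimal illustration, not the author's actual pipeline: the regular expression, the function names and the second sample sentence are assumptions, based only on the identifier shapes quoted in the slides ("BMNH.76.3.15.14", "BMNH 37001", "BMNH 2013.2.13.3", and the "NHMUK" prefix).

```python
import re

# Assumed pattern: a "BMNH" or "NHMUK" prefix, an optional space or dot,
# then digit groups separated by dots (e.g. "BMNH.76.3.15.14", "BMNH 37001").
SPECIMEN_RE = re.compile(r"\b(BMNH|NHMUK)[ .]?(\d+(?:\.\d+)*)")


def find_specimens(text):
    """Return normalised specimen numbers mentioned in text, in order of appearance."""
    return [f"{m.group(1)} {m.group(2)}" for m in SPECIMEN_RE.finditer(text)]


def index_mentions(papers):
    """Map each specimen number to the set of article DOIs that mention it."""
    index = {}
    for doi, text in papers.items():
        for specimen in find_specimens(text):
            index.setdefault(specimen, set()).add(doi)
    return index


if __name__ == "__main__":
    papers = {
        # Quoted from slide 5.
        "10.1371/journal.pone.0061998": "(A) Pteropus rodricensis (BMNH.76.3.15.14); ...",
        # Placeholder sentence (hypothetical), not a quote from the paper.
        "10.1007/s10228-014-0396-9": "placeholder text mentioning BMNH 2013.2.13.3",
    }
    for specimen, dois in sorted(index_mentions(papers).items()):
        print(specimen, sorted(dois))
```

A bare institutional acronym such as "(BMNH)" with no number after it is deliberately not matched. Identifiers with extra collection codes between the prefix and the number would need a broader pattern, and linking each hit to its Data Portal record is a separate lookup step not shown here.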