BioCreative IV workshop, DoubleTree by Hilton Hotel, Washington DC, USA 8th October 2013
In grammars we trust: LeadMine,
a knowledge driven solution
Daniel Lowe and Roger Sayle
NextMove Software
Cambridge, UK
Approaches to Entity
recognition
• Dictionary based
• Grammar based
• Machine Learning
LeadMine
Normalization
Input                                              Normalized
œstradiol                                          oestradiol
5` or 5’ or 5′ (backtick/quotation mark/prime)     5'
<p>H<sub>2</sub>O</p>                              H2O
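The three rows above can be reproduced with a few string operations. This is a minimal sketch, not LeadMine's actual normalizer: the function name `normalize` and the two lookup tables are assumptions made for illustration, and only the cases shown on the slide are covered.

```python
import re

# Hypothetical lookup tables covering only the slide's examples.
LIGATURES = {"œ": "oe"}
PRIME_LIKE = "`\u2019\u2032"  # backtick, right single quotation mark, prime


def normalize(text):
    """Strip markup tags, expand ligatures, and unify prime-like characters."""
    text = re.sub(r"<[^>]+>", "", text)  # drop tags such as <sub>...</sub>
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    # Map every prime-like character to a plain ASCII apostrophe.
    return text.translate({ord(ch): "'" for ch in PRIME_LIKE})


print(normalize("<p>H<sub>2</sub>O</p>"))
```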
Blue: Grammars
Green: Traditional dictionaries
Orange: Blocking dictionaries
Advantages of grammars
• Don’t require annotated corpora
• Encode knowledge about the domain
• Very fast recognition
• Allow spelling correction if an entity is a near
match to one recognized by the grammar
Simple grammar Example
Digit1to9 : '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
Digit : Digit1to9 | '0'
Cid : 'CID:' Digit1to9 Digit*
[Figure: state machine accepting 'C', 'I', 'D', ':', one digit 1..9, then any number of digits 0..9]
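Because this grammar is regular, it can be written as an equivalent regular expression. The sketch below only illustrates what the `Cid` rule accepts; the name `CID` is chosen here, and LeadMine's own grammar compiler is not shown.

```python
import re

# 'CID:' followed by a digit 1-9, then any number of digits 0-9.
CID = re.compile(r"CID:[1-9][0-9]*")

for candidate in ("CID:2244", "CID:0123", "CID:"):
    print(candidate, bool(CID.fullmatch(candidate)))
```

The leading `[1-9]` is what rejects identifiers with a leading zero.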
Grammar for IUPAC names
• Grammar for complete molecules: 485 rules
– trivialRing : 'aceanthren'|'aceanthrylen'|'acenaphthen'...
– ringGroup : trivialRing | hantzschWidmanRing | vonBaeyerSystem ...
• Generally aims to match a superset of the
nomenclature covered by IUPAC
• Specifically, this is the superset that can
theoretically be converted to structures
Grammar inheritance
• Molecule grammar serves as a good starting
point for a substituent grammar or generic
chemical grammar
– Inherit rules rather than duplicate them
– Allow overriding of rules
pluralizedChemical : chemical 's'
elementaryMetalAtom : 'lanthanide' | 'lanthanoid' | 'transition metal' | 'transuranic element' | _elementaryMetalAtom
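One way to picture inheritance with overriding is a child grammar that copies the parent's rules and keeps the parent's version of any overridden rule under an underscore-prefixed name. This sketch assumes that is what `_elementaryMetalAtom` means; the dict-based grammar format, the `<name>` reference syntax, and the toy `MOLECULE` rules are all hypothetical.

```python
import re

# Hypothetical parent grammar: rule name -> list of alternatives.
# Plain text is a literal; <name> refers to another rule.
MOLECULE = {
    "elementaryMetalAtom": ["lithium", "sodium", "iron"],
    "chemical": ["<elementaryMetalAtom>"],
}


def inherit(parent, overrides):
    """Copy the parent; an overridden rule stays reachable as '_' + name."""
    child = dict(parent)
    for name, alternatives in overrides.items():
        if name in parent:
            child["_" + name] = parent[name]
        child[name] = alternatives
    return child


def to_regex(grammar, rule):
    """Expand a rule into a regular expression (assumes no recursive rules)."""
    alternatives = []
    for alt in grammar[rule]:
        pieces = [p for p in re.split(r"(<\w+>)", alt) if p]
        alternatives.append("".join(
            to_regex(grammar, p[1:-1]) if re.fullmatch(r"<\w+>", p) else re.escape(p)
            for p in pieces))
    return "(?:" + "|".join(alternatives) + ")"


GENERIC = inherit(MOLECULE, {
    "elementaryMetalAtom": ["lanthanide", "lanthanoid", "transition metal",
                            "transuranic element", "<_elementaryMetalAtom>"],
    "pluralizedChemical": ["<chemical>s"],
})

print(bool(re.fullmatch(to_regex(GENERIC, "pluralizedChemical"), "lanthanides")))
```

The inherited `chemical` rule picks up the overridden `elementaryMetalAtom` without being redefined, which is the point of inheriting rather than duplicating.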
Dictionaries… bigger is better
• For high recall of trivial names, dictionaries
with high coverage are required.
• The largest publicly available dictionary is
PubChem with over 94 million terms
• However most of these terms are either not
useful or actually detrimental to text mining
Aggressive filtering
• “what you don't see won't hurt you”
• Hence remove terms that are also English words or start with an
English word
– Accomplished using a large English dictionary with
chemistry terms removed
• Remove internal identifiers used by depositors
• Remove terms that are matched by our grammars
• Ultimate result: 94 million → 2.94 million
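The three filters above can be sketched as a single predicate applied to each dictionary term. Everything concrete here is an assumption for illustration: the `DEPOSITOR_ID` pattern, the tiny English word list, and the stand-in grammar check are hypothetical and much cruder than a real filter.

```python
import re

# Hypothetical pattern for depositor-style identifiers such as "MLS000123456".
DEPOSITOR_ID = re.compile(r"[A-Z]{2,5}[- ]?\d{5,}")


def keep_term(term, english_words, matched_by_grammar):
    """Return True if a dictionary term survives all three filters."""
    words = term.split()
    if not words:
        return False
    if words[0].lower() in english_words:   # is, or starts with, an English word
        return False
    if DEPOSITOR_ID.fullmatch(term):        # internal identifier
        return False
    if matched_by_grammar(term):            # already covered by a grammar
        return False
    return True


english = {"can", "fast"}                                   # toy English dictionary
grammar = lambda term: term.lower().endswith("benzene")    # stand-in grammar check
terms = ["santalol", "Fast Blue B", "CAN", "MLS000123456", "chlorobenzene"]
print([t for t in terms if keep_term(t, english, grammar)])
```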
Structure Aware filtering
• “Do not tag proteins, polypeptides (> 15aa),
nucleic acid polymers, polysaccharides,
oligosaccharides [tetrasaccharide or longer] and other
biochemicals.”
• About 40,000 polypeptides and
oligosaccharides excluded from PubChem
using these criteria
Entity Extension
• Even PubChem is far from comprehensive hence it can be
useful to extend the start and/or end of entities to avoid
partial hits
– α-santalol can be recognized from santalol in the
dictionary
• Extension is bracketing aware and blocked by English words
• Entity trimming also performed to comply with the
annotation guidelines
– ‘Allura Red AC dye’ → ‘Allura Red AC’
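Extension and trimming can be sketched as two small span operations. This is a simplified illustration under stated assumptions: the word lists are hypothetical, extension simply grows the hit to the surrounding whitespace-delimited token, and the bracket awareness mentioned above is left out.

```python
ENGLISH_WORDS = {"non", "the", "and"}   # hypothetical blocking words
TRIM_WORDS = ("dye",)                   # hypothetical trailing words to trim


def extend_entity(text, start, end):
    """Grow a dictionary hit to the enclosing token unless an English word blocks it."""
    s, e = start, end
    while s > 0 and not text[s - 1].isspace():
        s -= 1
    while e < len(text) and not text[e].isspace():
        e += 1
    while e > end and text[e - 1] in ".,;:":   # leave sentence punctuation out
        e -= 1
    if text[s:start].strip("-").lower() in ENGLISH_WORDS:
        s = start
    if text[end:e].strip("-").lower() in ENGLISH_WORDS:
        e = end
    return s, e


def trim_entity(name):
    """Drop a trailing generic word that the annotation guidelines exclude."""
    for word in TRIM_WORDS:
        if name.lower().endswith(" " + word):
            return name[: -len(word) - 1]
    return name


text = "Oil rich in α-santalol, was distilled."
hit = text.index("santalol")
s, e = extend_entity(text, hit, hit + len("santalol"))
print(text[s:e], "|", trim_entity("Allura Red AC dye"))
```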
Entity Merging
• Adjacent entities may actually be part of one
entity
– Ethyl ester one entity
– (+)-limonene epoxide  one entity
BUT
– Hexane-benzene two entities
Using an ontology to determine
when terms add information
• Genistein isoflavone → two entities
• Glycine ester → one entity
[Figure: genistein, showing the isoflavone core structure]
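The merge decision can be sketched as an ancestor lookup: if the second term is a class the first term already belongs to, it adds no information and the two stay separate. The `IS_A` table below is a hypothetical toy ontology, and `should_merge` is a name chosen for this sketch.

```python
# Hypothetical is-a relations: term -> set of direct parent classes.
IS_A = {
    "genistein": {"isoflavone"},
    "isoflavone": {"flavonoid"},
    "glycine": {"amino acid"},
}


def ancestors(term):
    """Collect every class reachable from term through is-a links."""
    seen, stack = set(), [term]
    while stack:
        for parent in IS_A.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def should_merge(first, second):
    """Merge adjacent terms only if the second one adds information."""
    return second.lower() not in ancestors(first.lower())


print(should_merge("Genistein", "isoflavone"), should_merge("Glycine", "ester"))
```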
Abbreviation detection
• Based on the Schwartz and Hearst algorithm
• Detects abbreviations of the following forms:
– Tetrahydrofuran (THF)
– THF (tetrahydrofuran)
– Tetrahydrofuran (THF;
– Tetrahydrofuran (THF,
– (tetrahydrofuran, THF)
– THF = tetrahydrofuran
Schwartz, A.; Hearst, M. Proceedings of the Pacific Symposium on Biocomputing 2003.
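The core of the cited algorithm is a right-to-left character match between the short form and the text preceding it. The sketch below is a simplified Python rendering of that matching step and handles only the "long form (SHORT)" pattern; the other patterns listed above, and whatever refinements LeadMine adds, are not covered.

```python
import re


def find_best_long_form(short_form, candidate):
    """Match short-form characters right to left; return the long form or None."""
    s_index = len(short_form) - 1
    l_index = len(candidate) - 1
    while s_index >= 0:
        ch = short_form[s_index].lower()
        if not ch.isalnum():
            s_index -= 1
            continue
        # The first short-form character must match at the start of a word.
        while (l_index >= 0 and candidate[l_index].lower() != ch) or (
                s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
        s_index -= 1
    return candidate[candidate.rfind(" ", 0, l_index + 1) + 1:]


def find_abbreviations(sentence):
    """Detect 'long form (SHORT)' definitions in a sentence."""
    found = {}
    for match in re.finditer(r"\(([^()]{2,10})\)", sentence):
        short_form = match.group(1).strip()
        words = sentence[:match.start()].split()
        # Search window size as in the cited paper: min(|A| + 5, 2 * |A|) words.
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        long_form = find_best_long_form(short_form, " ".join(words[-max_words:]))
        if long_form:
            found[short_form] = long_form
    return found


print(find_abbreviations("The residue was dissolved in tetrahydrofuran (THF) and stirred."))
```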
Domain-specific abbreviations
• Some abbreviations are not acronyms
• Can use string replacements to recognize
them e.g.
– Sodium → Na
– Estradiol → E2
Hence can recognize: 17α-ethinylestradiol → EE2
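A rough way to see why the replacements help: after rewriting the long form, the abbreviation's characters appear in order where they did not before. This sketch uses a plain subsequence check as a stand-in for the real abbreviation matcher, and the function names are assumptions; only the two replacements from the slide are included.

```python
import re

# Replacements taken from the slide.
REPLACEMENTS = {"sodium": "Na", "estradiol": "E2"}


def apply_replacements(name):
    """Rewrite domain-specific fragments to their conventional symbols."""
    for long_form, symbol in REPLACEMENTS.items():
        name = re.sub(long_form, symbol, name, flags=re.IGNORECASE)
    return name


def is_subsequence(short_form, text):
    """True if the short form's characters occur in order within text."""
    remaining = iter(text.lower())
    return all(ch in remaining for ch in short_form.lower())


def could_abbreviate(short_form, long_form):
    return is_subsequence(short_form, apply_replacements(long_form))


print(apply_replacements("17α-ethinylestradiol"),
      could_abbreviate("EE2", "17α-ethinylestradiol"))
```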
Non-entity abbreviation
removal
• Finds entities detected as abbreviations of
unrecognized entities
– Can mean a common chemical abbreviation has
been redefined in the scope of the document
– e.g. "current good manufacturing practice (cGMP)" redefines cGMP,
which would otherwise be tagged as cyclic guanosine monophosphate
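In code, this step amounts to dropping any tagged entity whose text was defined in the document as an abbreviation of something that is not itself recognized. A minimal sketch, with hypothetical names and a toy recognizer standing in for the full entity recognition step:

```python
def remove_redefined_abbreviations(entities, abbreviations, is_entity):
    """Drop entities that the document defines as abbreviations of non-entities."""
    redefined = {short for short, long_form in abbreviations.items()
                 if not is_entity(long_form)}
    return [entity for entity in entities if entity not in redefined]


# Abbreviation definitions detected in one document (short form -> long form).
abbreviations = {
    "cGMP": "current good manufacturing practice",
    "THF": "tetrahydrofuran",
}
known_chemicals = {"tetrahydrofuran"}   # toy stand-in for the recognizer
print(remove_redefined_abbreviations(
    ["cGMP", "THF"], abbreviations, lambda name: name in known_chemicals))
```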
Making the most of the
knowledge provided
• Use training data to identify:
– Terms that are not currently recognized (whitelist)
– Terms that are often false positives (blacklist)
• Each false positive and false negative is placed
into such a list if its inclusion increased F-score
(harmonic mean of precision and recall)
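The inclusion test can be sketched directly from the F-score definition. The counts and the helper name `improves_if_blacklisted` are assumptions for illustration: blacklisting a term removes its false positives, but also turns any of its true positives into false negatives.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def improves_if_blacklisted(tp, fp, fn, term_tp, term_fp):
    """Would removing this term from the output raise the overall F-score?"""
    after = f_score(tp - term_tp, fp - term_fp, fn + term_tp)
    return after > f_score(tp, fp, fn)


# Hypothetical corpus counts: 80 true positives, 12 false positives, 18 false negatives.
print(improves_if_blacklisted(80, 12, 18, term_tp=0, term_fp=3))
print(improves_if_blacklisted(80, 12, 18, term_tp=5, term_fp=1))
```

The whitelist test is the mirror image: adding a term converts its false negatives to true positives at the cost of any new false positives.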
CEM Task Results
(on development set)
Configuration          Precision  Recall  F-score
Baseline               0.87       0.82    0.84
WhiteList              0.86       0.85    0.86
BlackList              0.88       0.80    0.84
WhiteList + BlackList  0.87       0.83    0.85
CDI task ranking
• Uses precision of entities when running
against the development set with the results
broken down by:
– Title vs abstract?
– Which dictionary matched?
– Were the entity’s bounds modified?
– Did the entity occur more than once in the
document?
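The ranking idea can be sketched as a lookup table: measure precision on the development set for each combination of the four features, then sort new entities by the precision of their combination. The feature tuple layout and function names are assumptions made for this sketch.

```python
from collections import defaultdict


def bucket_precisions(dev_entities):
    """dev_entities: iterable of (features, is_correct) from the development set."""
    counts = defaultdict(lambda: [0, 0])   # features -> [correct, total]
    for features, is_correct in dev_entities:
        counts[features][0] += int(is_correct)
        counts[features][1] += 1
    return {features: correct / total for features, (correct, total) in counts.items()}


def rank(entities, precisions, default=0.0):
    """entities: list of (name, features); highest expected precision first."""
    return sorted(entities, key=lambda e: precisions.get(e[1], default), reverse=True)


# features = (section, matching resource, bounds modified, occurs more than once)
dev = [
    (("title", "grammar", False, True), True),
    (("title", "grammar", False, True), True),
    (("abstract", "dictionary", True, False), True),
    (("abstract", "dictionary", True, False), False),
]
precisions = bucket_precisions(dev)
ranked = rank([("aspirin", ("abstract", "dictionary", True, False)),
               ("2-chloroethanol", ("title", "grammar", False, True))], precisions)
print([name for name, _ in ranked])
```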
Conclusions
• Grammars complement dictionaries to allow recognition
of novel entities
• Both the coverage and quality of dictionaries are
important
• The meaning of novel abbreviations can be determined
algorithmically
• Entities can be classified based on the resource that
recognized them
Thank you for your time!
http://nextmovesoftware.com
http://nextmovesoftware.com/blog
daniel@nextmovesoftware.com

Weitere ähnliche Inhalte

Was ist angesagt?

ICIC 2016: New Product Introduction CAS
ICIC 2016: New Product Introduction CASICIC 2016: New Product Introduction CAS
ICIC 2016: New Product Introduction CASDr. Haxel Consult
 
2020 scifinder-n manual (2020) english
2020 scifinder-n manual (2020) english2020 scifinder-n manual (2020) english
2020 scifinder-n manual (2020) englishPOSTECH Library
 
ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery?
ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery?ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery?
ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery?Dr. Haxel Consult
 
An Identifier Scheme for the Digitising Scotland Project
An Identifier Scheme for the Digitising Scotland ProjectAn Identifier Scheme for the Digitising Scotland Project
An Identifier Scheme for the Digitising Scotland ProjectAlasdair Gray
 
Classification, representation and analysis of cyclic peptides and peptide-li...
Classification, representation and analysis of cyclic peptides and peptide-li...Classification, representation and analysis of cyclic peptides and peptide-li...
Classification, representation and analysis of cyclic peptides and peptide-li...NextMove Software
 
Ontologies neo4j-graph-workshop-berlin
Ontologies neo4j-graph-workshop-berlinOntologies neo4j-graph-workshop-berlin
Ontologies neo4j-graph-workshop-berlinSimon Jupp
 
Standardized Representations of ELN Reactions for Categorization and Duplicat...
Standardized Representations of ELN Reactions for Categorization and Duplicat...Standardized Representations of ELN Reactions for Categorization and Duplicat...
Standardized Representations of ELN Reactions for Categorization and Duplicat...NextMove Software
 
Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...Valery Tkachenko
 
ICIC 2016: Mind the Gap: The novel benefits of human-curated substance locat...
ICIC 2016: Mind the Gap:  The novel benefits of human-curated substance locat...ICIC 2016: Mind the Gap:  The novel benefits of human-curated substance locat...
ICIC 2016: Mind the Gap: The novel benefits of human-curated substance locat...Dr. Haxel Consult
 
Importing life science at a into Neo4j
Importing life science at a into Neo4jImporting life science at a into Neo4j
Importing life science at a into Neo4jSimon Jupp
 
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...Dr. Haxel Consult
 
Implementing chemistry platform for OpenPHACTS
Implementing chemistry platform for OpenPHACTSImplementing chemistry platform for OpenPHACTS
Implementing chemistry platform for OpenPHACTSValery Tkachenko
 
Building (and traveling) the data-brick road: A report from the front lines ...
Building (and traveling) the data-brick road:  A report from the front lines ...Building (and traveling) the data-brick road:  A report from the front lines ...
Building (and traveling) the data-brick road: A report from the front lines ...mhaendel
 
Equivalence is in the (ID) of the beholder
Equivalence is in the (ID) of the beholderEquivalence is in the (ID) of the beholder
Equivalence is in the (ID) of the beholdermhaendel
 

Was ist angesagt? (20)

ICIC 2016: New Product Introduction CAS
ICIC 2016: New Product Introduction CASICIC 2016: New Product Introduction CAS
ICIC 2016: New Product Introduction CAS
 
2020 scifinder-n manual (2020) english
2020 scifinder-n manual (2020) english2020 scifinder-n manual (2020) english
2020 scifinder-n manual (2020) english
 
The importance of the InChI identifier as a foundation technology for eScienc...
The importance of the InChI identifier as a foundation technology for eScienc...The importance of the InChI identifier as a foundation technology for eScienc...
The importance of the InChI identifier as a foundation technology for eScienc...
 
ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery?
ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery?ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery?
ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery?
 
An Identifier Scheme for the Digitising Scotland Project
An Identifier Scheme for the Digitising Scotland ProjectAn Identifier Scheme for the Digitising Scotland Project
An Identifier Scheme for the Digitising Scotland Project
 
Classification, representation and analysis of cyclic peptides and peptide-li...
Classification, representation and analysis of cyclic peptides and peptide-li...Classification, representation and analysis of cyclic peptides and peptide-li...
Classification, representation and analysis of cyclic peptides and peptide-li...
 
ChemSpider – A Community Platform for Chemistry and Resources Supporting the ...
ChemSpider – A Community Platform for Chemistry and Resources Supporting the ...ChemSpider – A Community Platform for Chemistry and Resources Supporting the ...
ChemSpider – A Community Platform for Chemistry and Resources Supporting the ...
 
Why Chemistry and the Web Will Benefit from a ChemSpider
Why Chemistry and the Web Will Benefit from a ChemSpiderWhy Chemistry and the Web Will Benefit from a ChemSpider
Why Chemistry and the Web Will Benefit from a ChemSpider
 
Data Mining Dissertations and Adventures and Experiences in the World of Chem...
Data Mining Dissertations and Adventures and Experiences in the World of Chem...Data Mining Dissertations and Adventures and Experiences in the World of Chem...
Data Mining Dissertations and Adventures and Experiences in the World of Chem...
 
Ontologies neo4j-graph-workshop-berlin
Ontologies neo4j-graph-workshop-berlinOntologies neo4j-graph-workshop-berlin
Ontologies neo4j-graph-workshop-berlin
 
ChemSpider - Building a Foundation for the Semantic Web by Hosting a Crowd So...
ChemSpider - Building a Foundation for the Semantic Web by Hosting a Crowd So...ChemSpider - Building a Foundation for the Semantic Web by Hosting a Crowd So...
ChemSpider - Building a Foundation for the Semantic Web by Hosting a Crowd So...
 
Standardized Representations of ELN Reactions for Categorization and Duplicat...
Standardized Representations of ELN Reactions for Categorization and Duplicat...Standardized Representations of ELN Reactions for Categorization and Duplicat...
Standardized Representations of ELN Reactions for Categorization and Duplicat...
 
Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...
 
ICIC 2016: Mind the Gap: The novel benefits of human-curated substance locat...
ICIC 2016: Mind the Gap:  The novel benefits of human-curated substance locat...ICIC 2016: Mind the Gap:  The novel benefits of human-curated substance locat...
ICIC 2016: Mind the Gap: The novel benefits of human-curated substance locat...
 
Importing life science at a into Neo4j
Importing life science at a into Neo4jImporting life science at a into Neo4j
Importing life science at a into Neo4j
 
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
 
Value of the mediawiki platform for providing content to the chemistry community
Value of the mediawiki platform for providing content to the chemistry communityValue of the mediawiki platform for providing content to the chemistry community
Value of the mediawiki platform for providing content to the chemistry community
 
Implementing chemistry platform for OpenPHACTS
Implementing chemistry platform for OpenPHACTSImplementing chemistry platform for OpenPHACTS
Implementing chemistry platform for OpenPHACTS
 
Building (and traveling) the data-brick road: A report from the front lines ...
Building (and traveling) the data-brick road:  A report from the front lines ...Building (and traveling) the data-brick road:  A report from the front lines ...
Building (and traveling) the data-brick road: A report from the front lines ...
 
Equivalence is in the (ID) of the beholder
Equivalence is in the (ID) of the beholderEquivalence is in the (ID) of the beholder
Equivalence is in the (ID) of the beholder
 

Andere mochten auch

Infografik: Wie fit ist Deutschland für die Zukunft?
Infografik: Wie fit ist Deutschland für die Zukunft?Infografik: Wie fit ist Deutschland für die Zukunft?
Infografik: Wie fit ist Deutschland für die Zukunft?Bertelsmann Stiftung
 
Scaling mondrian
Scaling mondrianScaling mondrian
Scaling mondrianlucboudreau
 
8th grade list 2014
8th grade list 20148th grade list 2014
8th grade list 2014Liz Slavens
 
Receta pinxto banderilla olmeda origenes
Receta pinxto banderilla olmeda origenesReceta pinxto banderilla olmeda origenes
Receta pinxto banderilla olmeda origenesOlmeda Orígenes
 
Revolutionising the Journal through Big Data Computational Research
Revolutionising the Journal through Big Data Computational ResearchRevolutionising the Journal through Big Data Computational Research
Revolutionising the Journal through Big Data Computational ResearchAmye Kenall
 
Daily Newsletter: 15th December, 2010
Daily Newsletter: 15th December, 2010Daily Newsletter: 15th December, 2010
Daily Newsletter: 15th December, 2010Fullerton Securities
 
From Macro to Micro: Greening Your Campus HANDOUT
From Macro to Micro: Greening Your Campus HANDOUTFrom Macro to Micro: Greening Your Campus HANDOUT
From Macro to Micro: Greening Your Campus HANDOUTPaul Brown
 
Applying testing mindset to software development
Applying testing mindset to software developmentApplying testing mindset to software development
Applying testing mindset to software developmentAndrii Dzynia
 
Prueba de portada
Prueba de portadaPrueba de portada
Prueba de portadapatricio
 
Digital badging at the OU
Digital badging at the OUDigital badging at the OU
Digital badging at the OUDr Patrina Law
 
presentation for BPC
presentation for BPCpresentation for BPC
presentation for BPCjjoyce
 
Story Testimonial Pitch
Story Testimonial PitchStory Testimonial Pitch
Story Testimonial PitchGaurav Gaur
 
Information Architecture class13 04 10
Information Architecture class13 04 10Information Architecture class13 04 10
Information Architecture class13 04 10Marti Gukeisen
 

Andere mochten auch (18)

Infografik: Wie fit ist Deutschland für die Zukunft?
Infografik: Wie fit ist Deutschland für die Zukunft?Infografik: Wie fit ist Deutschland für die Zukunft?
Infografik: Wie fit ist Deutschland für die Zukunft?
 
Scaling mondrian
Scaling mondrianScaling mondrian
Scaling mondrian
 
8th grade list 2014
8th grade list 20148th grade list 2014
8th grade list 2014
 
Receta pinxto banderilla olmeda origenes
Receta pinxto banderilla olmeda origenesReceta pinxto banderilla olmeda origenes
Receta pinxto banderilla olmeda origenes
 
asdfasdf
asdfasdfasdfasdf
asdfasdf
 
Revolutionising the Journal through Big Data Computational Research
Revolutionising the Journal through Big Data Computational ResearchRevolutionising the Journal through Big Data Computational Research
Revolutionising the Journal through Big Data Computational Research
 
Narmada Kannan_Resume
Narmada Kannan_ResumeNarmada Kannan_Resume
Narmada Kannan_Resume
 
Daily Newsletter: 15th December, 2010
Daily Newsletter: 15th December, 2010Daily Newsletter: 15th December, 2010
Daily Newsletter: 15th December, 2010
 
Peter Kunzlik
Peter KunzlikPeter Kunzlik
Peter Kunzlik
 
From Macro to Micro: Greening Your Campus HANDOUT
From Macro to Micro: Greening Your Campus HANDOUTFrom Macro to Micro: Greening Your Campus HANDOUT
From Macro to Micro: Greening Your Campus HANDOUT
 
Applying testing mindset to software development
Applying testing mindset to software developmentApplying testing mindset to software development
Applying testing mindset to software development
 
Prueba de portada
Prueba de portadaPrueba de portada
Prueba de portada
 
Digital badging at the OU
Digital badging at the OUDigital badging at the OU
Digital badging at the OU
 
API-diskusjonen
API-diskusjonenAPI-diskusjonen
API-diskusjonen
 
National and global public inclusive infrastructures
National and global public inclusive infrastructuresNational and global public inclusive infrastructures
National and global public inclusive infrastructures
 
presentation for BPC
presentation for BPCpresentation for BPC
presentation for BPC
 
Story Testimonial Pitch
Story Testimonial PitchStory Testimonial Pitch
Story Testimonial Pitch
 
Information Architecture class13 04 10
Information Architecture class13 04 10Information Architecture class13 04 10
Information Architecture class13 04 10
 

Ähnlich wie In grammars we trust: LeadMine, a knowledge driven solution

Tackling the difficult areas of chemical entity extraction
Tackling the difficult areas of chemical entity extractionTackling the difficult areas of chemical entity extraction
Tackling the difficult areas of chemical entity extractionNextMove Software
 
Engl313 ada project4_slidedoc2 (1)
Engl313 ada project4_slidedoc2 (1)Engl313 ada project4_slidedoc2 (1)
Engl313 ada project4_slidedoc2 (1)KatieKrahn
 
1 ASSIGNMENT 1 REVIEWING RESEARCH AND MAKIN.docx
1  ASSIGNMENT 1   REVIEWING RESEARCH AND MAKIN.docx1  ASSIGNMENT 1   REVIEWING RESEARCH AND MAKIN.docx
1 ASSIGNMENT 1 REVIEWING RESEARCH AND MAKIN.docxoswald1horne84988
 
Engl313 ada project4_slidedoc2
Engl313 ada project4_slidedoc2Engl313 ada project4_slidedoc2
Engl313 ada project4_slidedoc2ScottDorsch
 
FHIR tutorial - Afternoon
FHIR tutorial - AfternoonFHIR tutorial - Afternoon
FHIR tutorial - AfternoonEwout Kramer
 
Ethics reproducibility and data stewardship
Ethics reproducibility and data stewardshipEthics reproducibility and data stewardship
Ethics reproducibility and data stewardshipRussell Jarvis
 
FHIR intro and background at HL7 Germany 2014
FHIR intro and background at HL7 Germany 2014FHIR intro and background at HL7 Germany 2014
FHIR intro and background at HL7 Germany 2014Ewout Kramer
 
The Killer Question(s) and Associated Experiment(s)
The Killer Question(s) and Associated Experiment(s)The Killer Question(s) and Associated Experiment(s)
The Killer Question(s) and Associated Experiment(s)CIMIT
 
The OpenCon Intro to Open Data
The OpenCon Intro to Open DataThe OpenCon Intro to Open Data
The OpenCon Intro to Open DataRoss Mounce
 
BIO 1030, Principles of Biology 1 Course Description .docx
BIO 1030, Principles of Biology 1 Course Description .docxBIO 1030, Principles of Biology 1 Course Description .docx
BIO 1030, Principles of Biology 1 Course Description .docxAASTHA76
 
2018 Bio-IT World Agile in Wet Labs Speeds Big Data
2018 Bio-IT World Agile in Wet Labs Speeds Big Data2018 Bio-IT World Agile in Wet Labs Speeds Big Data
2018 Bio-IT World Agile in Wet Labs Speeds Big DataBruce Kozuma
 
Week 6 Discussion Putting it All Together - Revising the Justif.docx
Week 6 Discussion Putting it All Together - Revising the Justif.docxWeek 6 Discussion Putting it All Together - Revising the Justif.docx
Week 6 Discussion Putting it All Together - Revising the Justif.docxcockekeshia
 
Optimizing the project portfolio oracle Instantis enterprise track and crys...
Optimizing the project portfolio   oracle Instantis enterprise track and crys...Optimizing the project portfolio   oracle Instantis enterprise track and crys...
Optimizing the project portfolio oracle Instantis enterprise track and crys...p6academy
 
BEM 3701, Hazardous Waste Management 1 Course Descriptio.docx
BEM 3701, Hazardous Waste Management 1 Course Descriptio.docxBEM 3701, Hazardous Waste Management 1 Course Descriptio.docx
BEM 3701, Hazardous Waste Management 1 Course Descriptio.docxAASTHA76
 
How Free is Free?: Building courses with OERs
How Free is Free?: Building courses with OERsHow Free is Free?: Building courses with OERs
How Free is Free?: Building courses with OERsBCcampus
 
Agile User Studies (Agile & Beyond 2012)
Agile User Studies (Agile & Beyond 2012)Agile User Studies (Agile & Beyond 2012)
Agile User Studies (Agile & Beyond 2012)Derek Poppink CXA CUA
 

Ähnlich wie In grammars we trust: LeadMine, a knowledge driven solution (20)

Tackling the difficult areas of chemical entity extraction
Tackling the difficult areas of chemical entity extractionTackling the difficult areas of chemical entity extraction
Tackling the difficult areas of chemical entity extraction
 
Engl313 ada project4_slidedoc2 (1)
Engl313 ada project4_slidedoc2 (1)Engl313 ada project4_slidedoc2 (1)
Engl313 ada project4_slidedoc2 (1)
 
1 ASSIGNMENT 1 REVIEWING RESEARCH AND MAKIN.docx
1  ASSIGNMENT 1   REVIEWING RESEARCH AND MAKIN.docx1  ASSIGNMENT 1   REVIEWING RESEARCH AND MAKIN.docx
1 ASSIGNMENT 1 REVIEWING RESEARCH AND MAKIN.docx
 
Engl313 ada project4_slidedoc2
Engl313 ada project4_slidedoc2Engl313 ada project4_slidedoc2
Engl313 ada project4_slidedoc2
 
dScribe Workshop - U-M
dScribe Workshop - U-MdScribe Workshop - U-M
dScribe Workshop - U-M
 
FHIR tutorial - Afternoon
FHIR tutorial - AfternoonFHIR tutorial - Afternoon
FHIR tutorial - Afternoon
 
Ethics reproducibility and data stewardship
Ethics reproducibility and data stewardshipEthics reproducibility and data stewardship
Ethics reproducibility and data stewardship
 
FHIR intro and background at HL7 Germany 2014
FHIR intro and background at HL7 Germany 2014FHIR intro and background at HL7 Germany 2014
FHIR intro and background at HL7 Germany 2014
 
From OER to Open Culture
From OER to Open CultureFrom OER to Open Culture
From OER to Open Culture
 
Identifying Keywords and Searching Techniques
Identifying Keywords and Searching TechniquesIdentifying Keywords and Searching Techniques
Identifying Keywords and Searching Techniques
 
The Killer Question(s) and Associated Experiment(s)
The Killer Question(s) and Associated Experiment(s)The Killer Question(s) and Associated Experiment(s)
The Killer Question(s) and Associated Experiment(s)
 
The OpenCon Intro to Open Data
The OpenCon Intro to Open DataThe OpenCon Intro to Open Data
The OpenCon Intro to Open Data
 
BIO 1030, Principles of Biology 1 Course Description .docx
BIO 1030, Principles of Biology 1 Course Description .docxBIO 1030, Principles of Biology 1 Course Description .docx
BIO 1030, Principles of Biology 1 Course Description .docx
 
2018 Bio-IT World Agile in Wet Labs Speeds Big Data
2018 Bio-IT World Agile in Wet Labs Speeds Big Data2018 Bio-IT World Agile in Wet Labs Speeds Big Data
2018 Bio-IT World Agile in Wet Labs Speeds Big Data
 
Week 6 Discussion Putting it All Together - Revising the Justif.docx
Week 6 Discussion Putting it All Together - Revising the Justif.docxWeek 6 Discussion Putting it All Together - Revising the Justif.docx
Week 6 Discussion Putting it All Together - Revising the Justif.docx
 
Optimizing the project portfolio oracle Instantis enterprise track and crys...
Optimizing the project portfolio   oracle Instantis enterprise track and crys...Optimizing the project portfolio   oracle Instantis enterprise track and crys...
Optimizing the project portfolio oracle Instantis enterprise track and crys...
 
BEM 3701, Hazardous Waste Management 1 Course Descriptio.docx
BEM 3701, Hazardous Waste Management 1 Course Descriptio.docxBEM 3701, Hazardous Waste Management 1 Course Descriptio.docx
BEM 3701, Hazardous Waste Management 1 Course Descriptio.docx
 
How Free is Free?: Building courses with OERs
How Free is Free?: Building courses with OERsHow Free is Free?: Building courses with OERs
How Free is Free?: Building courses with OERs
 
Agile User Studies (Agile & Beyond 2012)
Agile User Studies (Agile & Beyond 2012)Agile User Studies (Agile & Beyond 2012)
Agile User Studies (Agile & Beyond 2012)
 
Technical Errors
Technical ErrorsTechnical Errors
Technical Errors
 

Mehr von NextMove Software

CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...NextMove Software
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedNextMove Software
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESNextMove Software
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionNextMove Software
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...NextMove Software
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsNextMove Software
 
In grammars we trust: LeadMine, a knowledge driven solution

  • 1. In grammars we trust: LeadMine, a knowledge driven solution. Daniel Lowe and Roger Sayle, NextMove Software, Cambridge, UK. BioCreative IV workshop, DoubleTree by Hilton Hotel, Washington DC, USA, 8th October 2013
  • 2. Approaches to Entity recognition
      • Dictionary based
      • Grammar based
      • Machine Learning
      [Figure label: LeadMine]
  • 3. [Figure: only the label "Optional" is recoverable]
  • 4. Normalization
      Input | Normalized
      œstradiol | oestradiol
      5` or 5’ or 5′ (backtick/quotation mark/prime) | 5'
      <p>H<sub>2</sub>O</p> | H2O
  • 5. [Figure legend] Blue: Grammars; Green: Traditional dictionaries; Orange: Blocking dictionaries
  • 6. Advantages of grammars
      • Don’t require annotated corpora
      • Encode knowledge about the domain
      • Very fast recognition
      • Allow spelling correction if an entity is a near match to one recognized by the grammar
  • 7. Simple grammar example
      Digit1to9 : '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
      Digit : Digit1to9 | '0'
      Cid : 'CID:' Digit1to9 Digit*
      [Figure: the corresponding automaton: C, I, D, ':', then 1..9, then a loop on 0..9]
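The Cid rule above describes a regular language, so it can be sketched as a regular expression. This is only an illustration of the rule in Python, not LeadMine's grammar compiler; the names `CID` and `find_cids` are hypothetical.

```python
import re

# Each grammar rule becomes a regular-expression fragment.
DIGIT_1_TO_9 = "[1-9]"   # Digit1to9
DIGIT = "[0-9]"          # Digit : Digit1to9 | '0'
# Cid : 'CID:' Digit1to9 Digit*
CID = re.compile("CID:" + DIGIT_1_TO_9 + DIGIT + "*")


def find_cids(text):
    """Return every substring of text matched by the Cid rule."""
    return CID.findall(text)
```

An identifier with a leading zero, such as `CID:0123`, is rejected because the first digit must come from Digit1to9.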
  • 8. Grammar for IUPAC names
      • Grammar for complete molecules: 485 rules
        – trivialRing : 'aceanthren'|'aceanthrylen'|'acenaphthen'...
        – ringGroup : trivialRing | hantzschWidmanRing | vonBaeyerSystem ...
      • Generally aims to match a superset of the nomenclature covered by IUPAC
      • Specifically, this is the superset that can theoretically be converted to structures
  • 9. Grammar inheritance
      • Molecule grammar serves as a good starting point for a substituent grammar or generic chemical grammar
        – Inherit rules rather than duplicate them
        – Allow overriding of rules
      pluralizedChemical : chemical 's'
      elementaryMetalAtom : 'lanthanide'|'lanthanoid'|'transition metal'|'transuranic element' | _elementaryMetalAtom
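Rule inheritance with overriding can be sketched with plain dictionaries. The sketch assumes that `_elementaryMetalAtom` refers to the inherited definition of the rule being overridden; the slide does not spell this out. The parent rule contents (`iron`, `copper`) and the `inherit` helper are hypothetical placeholders.

```python
# Hypothetical parent grammar: rule name -> list of alternatives.
MOLECULE_GRAMMAR = {
    "elementaryMetalAtom": ["iron", "copper"],  # placeholder alternatives
    "chemical": ["elementaryMetalAtom"],
}


def inherit(parent, overrides):
    """Copy the parent's rules, then apply overrides.

    An alternative named '_' + rule name is expanded to the parent's
    alternatives for that rule (assumed meaning of the underscore).
    """
    child = dict(parent)
    for name, alternatives in overrides.items():
        expanded = []
        for alt in alternatives:
            if alt == "_" + name:
                expanded.extend(parent[name])
            else:
                expanded.append(alt)
        child[name] = expanded
    return child


GENERIC_GRAMMAR = inherit(MOLECULE_GRAMMAR, {
    "elementaryMetalAtom": ["lanthanide", "lanthanoid", "transition metal",
                            "transuranic element", "_elementaryMetalAtom"],
})
```

Rules that are not overridden, such as `chemical` here, are shared with the parent rather than duplicated.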
  • 10. Dictionaries… bigger is better
      • For high recall of trivial names, dictionaries with high coverage are required.
      • The largest publicly available dictionary is PubChem, with over 94 million terms
      • However, most of these terms are either not useful or actually detrimental to text mining
  • 11. Aggressive filtering
      • “what you don't see won't hurt you”
      • Hence remove terms that are also English words or start with an English word
        – Accomplished using a large English dictionary with chemistry terms removed
      • Remove internal identifiers used by depositors
      • Remove terms that are matched by our grammars
      • Ultimate result: 94 million → 2.94 million
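The three filtering steps can be sketched as a single pass over the term list. This is a minimal illustration under assumptions: the English word list, the internal-identifier test and the grammar test are passed in as hypothetical callables, and "starts with an English word" is approximated by the first whitespace-separated token.

```python
def filter_terms(terms, english_words, is_internal_id, grammar_matches):
    """Keep only terms that survive all three filters from the slide."""
    kept = []
    for term in terms:
        words = term.lower().split()
        if not words:
            continue  # skip empty terms
        # Remove terms that are English words or start with one.
        if term.lower() in english_words or words[0] in english_words:
            continue
        # Remove depositor identifiers and terms the grammars already cover.
        if is_internal_id(term) or grammar_matches(term):
            continue
        kept.append(term)
    return kept
```

For example, with `english_words = {"lead", "water"}`, the terms "lead" and "water blue" are dropped while "santalol" is kept.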
  • 12. Structure Aware filtering
      • “Do not tag proteins, polypeptides (> 15aa), nucleic acid polymers, polysaccharides, oligosaccharides [tetrasaccharide or longer] and other biochemicals.”
      • About 40,000 polypeptides and oligosaccharides excluded from PubChem using these criteria
  • 13. Entity Extension
      • Even PubChem is far from comprehensive, hence it can be useful to extend the start and/or end of entities to avoid partial hits
        – α-santalol can be recognized from santalol in the dictionary
      • Extension is bracketing aware and blocked by English words
      • Entity trimming also performed to comply with the annotation guidelines
        – ‘Allura Red AC dye’ → ‘Allura Red AC’
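A rough sketch of entity extension follows. It assumes that extension runs to the nearest whitespace, that trailing sentence punctuation is not part of the entity, and that "bracketing aware" can be approximated by requiring balanced parentheses. None of these details are given on the slide, and `extend_entity` is a hypothetical name.

```python
def extend_entity(text, start, end, english_words):
    """Extend the span [start, end) to the surrounding token boundaries."""
    new_start, new_end = start, end
    while new_start > 0 and not text[new_start - 1].isspace():
        new_start -= 1
    while new_end < len(text) and not text[new_end].isspace():
        new_end += 1
    # Do not absorb trailing sentence punctuation.
    while new_end > end and text[new_end - 1] in ".,;:":
        new_end -= 1

    def blocked(fragment):
        # An extension consisting of an English word is not allowed.
        return fragment.strip("-()[]").lower() in english_words

    if blocked(text[new_start:start]):
        new_start = start
    if blocked(text[end:new_end]):
        new_end = end
    candidate = text[new_start:new_end]
    # Crude bracket awareness: reject extensions that unbalance parentheses.
    if candidate.count("(") != candidate.count(")"):
        return start, end
    return new_start, new_end
```

With the dictionary hit "santalol" in "Oil rich in α-santalol.", the span grows to cover "α-santalol" and stops before the full stop.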
  • 14. Entity Merging
      • Adjacent entities may actually be part of one entity
        – Ethyl ester → one entity
        – (+)-limonene epoxide → one entity
        BUT
        – Hexane-benzene → two entities
  • 15. Using an ontology to determine when terms add information
      • Genistein isoflavone → two entities
      • Glycine ester → one entity
      [Figure: Genistein showing isoflavone core structure]
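One way to read the two examples above: adjacent terms are merged only when the second term adds information, that is, when it is not already a class the first term belongs to. The sketch below encodes that reading with a toy is-a ontology; the `IS_A` table and the function names are hypothetical, and the actual ontology used is not described on the slide.

```python
# Toy is-a ontology (hypothetical entries, child -> parent class).
IS_A = {
    "genistein": "isoflavone",
    "isoflavone": "flavonoid",
    "glycine": "amino acid",
}


def ancestors(term):
    """Walk up the is-a chain and collect every ancestor class."""
    found = []
    while term in IS_A:
        term = IS_A[term]
        found.append(term)
    return found


def should_merge(first, second):
    """Merge when the second term is not already implied by the first."""
    return second.lower() not in ancestors(first.lower())
```

Genistein is already an isoflavone, so "isoflavone" adds nothing and the two terms stay separate, whereas "ester" changes what "glycine" denotes and the terms are merged.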
  • 16. Abbreviation detection
      • Based on the Hearst and Schwartz algorithm
      • Detects abbreviations of the following forms:
        – Tetrahydrofuran (THF)
        – THF (tetrahydrofuran)
        – Tetrahydrofuran (THF;
        – Tetrahydrofuran (THF,
        – (tetrahydrofuran, THF)
        – THF = tetrahydrofuran
      Schwartz, A.; Hearst, M. Proceedings of the Pacific Symposium on Biocomputing 2003.
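The core of the Schwartz and Hearst approach is a right-to-left character match between the short form and the text preceding it. The sketch below is a simplified Python rendering of that matching step and handles only the "long form (SHORT" patterns from the list above. The function names and the regular expression are assumptions for illustration, not LeadMine's implementation.

```python
import re


def find_long_form(short, candidate):
    """Match short-form characters right to left against the candidate text."""
    l = len(candidate) - 1
    for s in range(len(short) - 1, -1, -1):
        c = short[s].lower()
        if not c.isalnum():
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the start of a word.
        while (l >= 0 and candidate[l].lower() != c) or \
              (s == 0 and l > 0 and candidate[l - 1].isalnum()):
            l -= 1
        if l < 0:
            return None
        l -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def detect_abbreviation(sentence):
    """Find 'long form (SHORT)' style definitions, e.g. '(THF)' or '(THF;'."""
    match = re.search(r"\(([^()\s;,]+)[;,)]", sentence)
    if match is None:
        return None
    short = match.group(1)
    long_form = find_long_form(short, sentence[:match.start()].rstrip())
    return (short, long_form) if long_form else None
```

The remaining forms, such as "THF (tetrahydrofuran)" and "THF = tetrahydrofuran", would need additional patterns that swap the roles of the two sides.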
  • 17. Domain-specific abbreviations
      • Some abbreviations are not acronyms
      • Can use string replacements to recognize them, e.g.
        – Sodium → Na
        – Estradiol → E2
      Hence can recognize: 17α-ethinylestradiol → EE2
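The string-replacement idea can be sketched as follows: rewrite the long form with the known replacements, then test whether the short form's characters occur in order. The in-order (subsequence) test is a deliberately loose stand-in for a real abbreviation matcher, and all names below are hypothetical.

```python
import re

# Replacements taken from the slide.
REPLACEMENTS = {"sodium": "Na", "estradiol": "E2"}


def apply_replacements(long_form):
    """Substitute domain-specific symbols into the long form."""
    result = long_form
    for name, symbol in REPLACEMENTS.items():
        result = re.sub(re.escape(name), symbol, result, flags=re.IGNORECASE)
    return result


def is_subsequence(short, text):
    """True if the characters of short occur in text in the same order."""
    remaining = iter(text.lower())
    return all(ch in remaining for ch in short.lower())


def could_abbreviate(short, long_form):
    """Try the plain long form first, then the rewritten one."""
    return (is_subsequence(short, long_form)
            or is_subsequence(short, apply_replacements(long_form)))
```

"17α-ethinylestradiol" contains no "2", so EE2 only matches once "estradiol" has been rewritten to "E2".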
  • 18. Non-entity abbreviation removal
      • Finds entities detected as abbreviations of unrecognized entities
        – Can mean a common chemical abbreviation has been redefined in the scope of the document
      Example: “current good manufacturing practice (cGMP)”, whereas cGMP = cyclic guanosine monophosphate [structure image]
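A minimal sketch of this step, assuming the abbreviation detector yields a mapping from short form to long form for the document and that a predicate says whether a long form is a recognized entity. Both inputs and the function name are hypothetical.

```python
def remove_redefined_abbreviations(entities, definitions, is_recognized):
    """Drop entity mentions whose abbreviation was defined in this
    document as something that is not a recognized entity."""
    redefined = {
        short for short, long_form in definitions.items()
        if not is_recognized(long_form)
    }
    return [entity for entity in entities if entity not in redefined]
```

In a document that defines cGMP as "current good manufacturing practice", every cGMP mention is dropped, while THF defined as "tetrahydrofuran" is kept.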
  • 19. Making the most of the knowledge provided
      • Use training data to identify:
        – Terms that are not currently recognized (whitelist)
        – Terms that are often false positives (blacklist)
      • Each false positive and false negative is placed into such a list if its inclusion increased F-score (harmonic mean of precision and recall)
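The selection criterion can be sketched as below. The F-score is the standard harmonic mean; the greedy, one-pass selection and the representation of each candidate term by the change it causes in true positives, false positives and false negatives are assumptions made for illustration.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f_from_counts(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return f_score(precision, recall)


def select_terms(candidates, tp, fp, fn):
    """candidates maps term -> (delta_tp, delta_fp, delta_fn).

    A term is kept only if adding it raises the F-score (greedy, in order).
    """
    chosen = []
    for term, (d_tp, d_fp, d_fn) in candidates.items():
        if f_from_counts(tp + d_tp, fp + d_fp, fn + d_fn) > f_from_counts(tp, fp, fn):
            chosen.append(term)
            tp, fp, fn = tp + d_tp, fp + d_fp, fn + d_fn
    return chosen
```

As a check on the formula, a precision of 0.87 and a recall of 0.82 give an F-score that rounds to 0.84.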
  • 20. CEM Task Results (on development set)
      Configuration | Precision | Recall | F-score
      Baseline | 0.87 | 0.82 | 0.84
      WhiteList | 0.86 | 0.85 | 0.86
      BlackList | 0.88 | 0.80 | 0.84
      WhiteList + BlackList | 0.87 | 0.83 | 0.85
  • 21. CDI task ranking
      • Uses precision of entities when running against the development set, with the results broken down by:
        – Title vs abstract?
        – Which dictionary matched?
        – Were the entity’s bounds modified?
        – Did the entity occur more than once in the document?
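One plausible way to turn these breakdowns into a ranking is to measure precision per feature combination on the development set and sort new entities by the precision of their combination. The sketch below assumes exactly that; the feature tuples, the default score for unseen combinations and the function names are hypothetical.

```python
from collections import defaultdict


def bucket_precisions(dev_examples):
    """dev_examples: iterable of (features, is_correct) pairs,
    where features is a hashable tuple such as ("title", "grammar")."""
    totals = defaultdict(lambda: [0, 0])  # features -> [correct, seen]
    for features, is_correct in dev_examples:
        totals[features][0] += int(is_correct)
        totals[features][1] += 1
    return {f: correct / seen for f, (correct, seen) in totals.items()}


def rank_entities(entities, precisions, default=0.0):
    """entities: list of (name, features); highest expected precision first."""
    return sorted(entities,
                  key=lambda entity: precisions.get(entity[1], default),
                  reverse=True)
```

An entity whose feature combination was always correct on the development set is ranked above one whose combination was right only half of the time.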
  • 22. Conclusions
      • Grammars complement dictionaries to allow recognition of novel entities
      • Both the coverage and quality of dictionaries are important
      • The meaning of novel abbreviations can be determined algorithmically
      • Entities can be classified based on the resource that recognized them
  • 23. Thank you for your time! http://nextmovesoftware.com http://nextmovesoftware.com/blog daniel@nextmovesoftware.com