BioCreative IV workshop, DoubleTree by Hilton Hotel, Washington DC, USA 8th October 2013
In grammars we trust: LeadMine, a knowledge driven solution

We present a system employing large grammars and dictionaries to recognize a broad range of chemical entities. The system utilizes these resources to identify chemical entities without an explicit tokenization step. To allow recognition of terms slightly outside the coverage of these resources we employ spelling correction, entity extension, and merging of adjacent entities. Recall is enhanced by the use of abbreviation detection and precision is enhanced by the removal of abbreviations of non-entities. With the use of training data to produce further dictionaries of terms to recognize/ignore, our system achieved 86.2% precision and 85.0% recall on an unused development set.


  1. In grammars we trust: LeadMine, a knowledge driven solution
     Daniel Lowe and Roger Sayle
     NextMove Software, Cambridge, UK
  2. Approaches to entity recognition
     • Dictionary based
     • Grammar based
     • Machine learning
     (Slide graphic: the label "LeadMine" appears twice.)
  3. (Figure; the only surviving label is "Optional".)
  4. Normalization
     Input                                             Normalized
     œstradiol                                         oestradiol
     5` or 5’ or 5′ (backtick/quotation mark/prime)    5'
     <p>H<sub>2</sub>O</p>                             H2O
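The three rows of the normalization table can be reproduced with a small sketch. The character map and the tag-stripping regex below are assumptions chosen to cover only the slide's examples; the slides do not describe how LeadMine implements normalization.

```python
import re

# Hypothetical replacement table covering only the examples on the slide.
CHAR_MAP = {
    "œ": "oe",       # ligature
    "`": "'",        # backtick
    "\u2019": "'",   # right single quotation mark
    "\u2032": "'",   # prime
}

# Naive pattern for markup tags such as <p> or </sub> (an assumption).
TAG = re.compile(r"<[^>]+>")


def normalize(text: str) -> str:
    """Strip markup tags, then map variant characters to a canonical form."""
    text = TAG.sub("", text)
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


print(normalize("œstradiol"))
print(normalize("<p>H<sub>2</sub>O</p>"))
```

A real normalizer would need a much larger character table and a proper markup parser; this only shows the shape of the step.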
  5. (Figure legend)
     • Blue: Grammars
     • Green: Traditional dictionaries
     • Orange: Blocking dictionaries
  6. Advantages of grammars
     • Don’t require annotated corpora
     • Encode knowledge about the domain
     • Very fast recognition
     • Allow spelling correction if an entity is a near match to one recognized by the grammar
  7. Simple grammar example
     Digit1to9 : '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
     Digit : Digit1to9 | '0'
     Cid : 'CID:' Digit1to9 Digit*
     (Diagram of the corresponding automaton: C, I, D, ':', one digit 1..9, then any number of digits 0..9.)
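Because the Cid rule has no recursion, it describes a regular language, so an equivalent regular expression gives a quick way to try it out. This is only an illustration of what the rule accepts; the slides do not say how LeadMine compiles its grammars, and `find_cids` is a hypothetical helper name.

```python
import re

# Regex equivalent of: Cid : 'CID:' Digit1to9 Digit*
CID = re.compile(r"CID:[1-9][0-9]*")


def find_cids(text: str) -> list[str]:
    """Return every substring of text that the Cid rule accepts."""
    return [match.group(0) for match in CID.finditer(text)]


print(find_cids("see CID:2244, CID:0 and CID:70"))  # ['CID:2244', 'CID:70']
```

"CID:0" is rejected because the first digit must come from Digit1to9.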
  8. Grammar for IUPAC names
     • Grammar for complete molecules: 485 rules
       – trivialRing : 'aceanthren'|'aceanthrylen'|'acenaphthen'...
       – ringGroup : trivialRing | hantzschWidmanRing | vonBaeyerSystem ...
     • Generally aims to match a superset of the nomenclature covered by IUPAC
     • Specifically, this is the superset that can theoretically be converted to structures
  9. Grammar inheritance
     • Molecule grammar serves as a good starting point for a substituent grammar or generic chemical grammar
       – Inherit rules rather than duplicate them
       – Allow overriding of rules
     pluralizedChemical : chemical 's'
     elementaryMetalAtom : 'lanthanide'|'lanthanoid'|'transition metal'|'transuranic element' | _elementaryMetalAtom
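One way to picture inheritance with overriding is a dictionary of rules that a child grammar copies and selectively replaces. The sketch below assumes that `_elementaryMetalAtom` refers to the inherited definition of the rule being overridden; the slide suggests this but does not state it. The `inherit` function and the parent rule contents are hypothetical.

```python
def inherit(parent: dict[str, list[str]],
            overrides: dict[str, list[str]]) -> dict[str, list[str]]:
    """Copy the parent's rules, then add or override rules in the child.

    An alternative spelled '_<ruleName>' inside an overriding rule is
    expanded to the parent's alternatives for that rule (assumed semantics).
    """
    child = {name: list(alts) for name, alts in parent.items()}
    for name, alternatives in overrides.items():
        expanded: list[str] = []
        for alt in alternatives:
            if alt == "_" + name:
                expanded.extend(parent.get(name, []))
            else:
                expanded.append(alt)
        child[name] = expanded
    return child


# Hypothetical parent rules, for illustration only.
molecule = {"elementaryMetalAtom": ["iron", "copper"], "chemical": ["benzene"]}
generic = inherit(molecule, {
    "elementaryMetalAtom": ["lanthanide", "transition metal", "_elementaryMetalAtom"],
})
print(generic["elementaryMetalAtom"])
```

Rules the child does not mention, such as `chemical` here, are inherited unchanged rather than duplicated by hand.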
  10. Dictionaries… bigger is better
      • For high recall of trivial names, dictionaries with high coverage are required
      • The largest publicly available dictionary is PubChem, with over 94 million terms
      • However, most of these terms are either not useful or actually detrimental to text mining
  11. Aggressive filtering
      • “What you don't see won't hurt you”
      • Hence remove terms that are also English words or start with an English word
        – Accomplished using a large English dictionary with chemistry terms removed
      • Remove internal identifiers used by depositors
      • Remove terms that are matched by our grammars
      • Ultimate result: 94 million → 2.94 million
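Two of the filters above can be sketched as a single predicate. Everything here is an assumption for illustration: "starts with an English word" is read as "the first whitespace-separated token is an English word", the word list is a tiny stand-in for the large English dictionary, and the grammar check is a placeholder callable. The depositor-identifier filter is omitted because the slides give no criteria for it.

```python
from typing import Callable


def keep_term(term: str,
              english_words: set[str],
              grammar_matches: Callable[[str], bool]) -> bool:
    """Return True if a dictionary term survives the filters."""
    lowered = term.lower()
    tokens = lowered.split()
    first_word = tokens[0] if tokens else ""
    # Drop terms that are English words or that start with one.
    if lowered in english_words or first_word in english_words:
        return False
    # Drop terms the grammars already recognize; the dictionary adds nothing.
    if grammar_matches(term):
        return False
    return True


# Hypothetical inputs.
english = {"impact"}
matched_by_grammar = lambda t: t.startswith("CID:")
terms = ["aspirin", "Impact", "impact 3000", "CID:2244"]
print([t for t in terms if keep_term(t, english, matched_by_grammar)])  # ['aspirin']
```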
  12. Structure aware filtering
      • “Do not tag proteins, polypeptides (> 15aa), nucleic acid polymers, polysaccharides, oligosaccharides [tetrasaccharide or longer] and other biochemicals.”
      • About 40,000 polypeptides and oligosaccharides excluded from PubChem using these criteria
  13. Entity extension
      • Even PubChem is far from comprehensive, hence it can be useful to extend the start and/or end of entities to avoid partial hits
        – α-santalol can be recognized from santalol in the dictionary
      • Extension is bracketing aware and blocked by English words
      • Entity trimming is also performed to comply with the annotation guidelines
        – ‘Allura Red AC dye’ → ‘Allura Red AC’
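The α-santalol case can be illustrated by growing a dictionary hit outwards until whitespace is reached. This is a deliberately reduced sketch: the bracket awareness, the blocking by English words, and the trimming step mentioned above are all omitted, and `extend_entity` is a hypothetical name.

```python
def extend_entity(text: str, start: int, end: int) -> tuple[int, int]:
    """Grow the span [start, end) left and right to the nearest whitespace."""
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end


sentence = "isolated α-santalol from the oil"
hit_start = sentence.index("santalol")          # dictionary hit
hit_end = hit_start + len("santalol")
new_start, new_end = extend_entity(sentence, hit_start, hit_end)
print(sentence[new_start:new_end])  # α-santalol
```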
  14. Entity merging
      • Adjacent entities may actually be part of one entity
        – Ethyl ester → one entity
        – (+)-limonene epoxide → one entity
      • BUT
        – Hexane-benzene → two entities
  15. Using an ontology to determine when terms add information
      • Genistein isoflavone → two entities
      • Glycine ester → one entity
      (Figure: genistein showing isoflavone core structure)
  16. Abbreviation detection
      • Based on the Hearst and Schwartz algorithm
      • Detects abbreviations of the following forms:
        – Tetrahydrofuran (THF)
        – THF (tetrahydrofuran)
        – Tetrahydrofuran (THF;
        – Tetrahydrofuran (THF,
        – (tetrahydrofuran, THF)
        – THF = tetrahydrofuran
      Schwartz, A.; Hearst, M. Proceedings of the Pacific Symposium on Biocomputing 2003.
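The core of the cited algorithm is a right-to-left search that aligns each character of the short form with the candidate long form, requiring the first character to start a word. The sketch below follows that long-form search step only. Extracting candidates for the six patterns listed above is not shown, and the slides do not say how LeadMine adapts the algorithm.

```python
from typing import Optional


def find_best_long_form(short_form: str, long_form: str) -> Optional[str]:
    """Return the shortest suffix of long_form that short_form abbreviates."""
    s_index = len(short_form) - 1
    l_index = len(long_form) - 1
    while s_index >= 0:
        curr = short_form[s_index].lower()
        if not curr.isalnum():
            s_index -= 1  # ignore punctuation in the short form
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the start of a word.
        while ((l_index >= 0 and long_form[l_index].lower() != curr)
               or (s_index == 0 and l_index > 0
                   and long_form[l_index - 1].isalnum())):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
        s_index -= 1
    # Back up to the beginning of the word containing the first match.
    begin = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[begin:]


print(find_best_long_form("THF", "the solvent tetrahydrofuran"))  # tetrahydrofuran
```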
  17. Domain-specific abbreviations
      • Some abbreviations are not acronyms
      • Can use string replacements to recognize them, e.g.
        – Sodium → Na
        – Estradiol → E2
      • Hence can recognize: 17α-ethinylestradiol → EE2
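The EE2 example can be reproduced by rewriting the long form with the replacements before testing whether the short form fits. The sketch below uses a plain in-order character (subsequence) test as a stand-in for the real abbreviation matcher, which is a simplification; the replacement table holds only the two pairs from the slide, and the function names are hypothetical.

```python
# The two replacement pairs shown on the slide.
REPLACEMENTS = {"sodium": "Na", "estradiol": "E2"}


def apply_replacements(long_form: str) -> str:
    """Lower-case the long form and substitute domain-specific symbols."""
    result = long_form.lower()
    for name, symbol in REPLACEMENTS.items():
        result = result.replace(name, symbol.lower())
    return result


def is_subsequence(short: str, long: str) -> bool:
    """True if the characters of short occur in long in the same order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def could_abbreviate(short_form: str, long_form: str) -> bool:
    return is_subsequence(short_form.lower(), apply_replacements(long_form))


print(could_abbreviate("EE2", "17α-ethinylestradiol"))  # True
```

Without the replacement step there is no "2" after the second "e" in the long form, so the same test fails.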
  18. Non-entity abbreviation removal
      • Finds entities detected as abbreviations of unrecognized entities
        – Can mean a common chemical abbreviation has been redefined in the scope of the document
      • Example: cGMP usually stands for cyclic guanosine monophosphate, but in “current good manufacturing practice (cGMP)” it abbreviates a non-entity
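The cGMP case can be sketched as a post-processing filter: if a document defines an abbreviation whose long form is not itself a recognized entity, mentions of that abbreviation are dropped. The function name, the data shapes, and the stand-in recognizer are all assumptions for illustration.

```python
from typing import Callable


def remove_redefined(entities: list[str],
                     definitions: dict[str, str],
                     is_entity: Callable[[str], bool]) -> list[str]:
    """Drop entity mentions that the document defines as non-entity abbreviations.

    definitions maps each short form found in the document to its long form.
    """
    blocked = {short for short, long_form in definitions.items()
               if not is_entity(long_form)}
    return [entity for entity in entities if entity not in blocked]


# Hypothetical recognizer standing in for the grammars and dictionaries.
known = {"aspirin", "cyclic guanosine monophosphate"}
found = ["cGMP", "aspirin"]
doc_definitions = {"cGMP": "current good manufacturing practice"}
print(remove_redefined(found, doc_definitions, lambda t: t in known))  # ['aspirin']
```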
  19. Making the most of the knowledge provided
      • Use training data to identify:
        – Terms that are not currently recognized (whitelist)
        – Terms that are often false positives (blacklist)
      • Each false positive and false negative is placed into such a list if its inclusion increased F-score (harmonic mean of precision and recall)
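The inclusion test can be written down directly from the definition of F-score. In this sketch the counts are hypothetical, and `worth_whitelisting` models a whitelist candidate as turning some false negatives into true positives at the cost of some new false positives; the slides do not give the exact bookkeeping used.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f_from_counts(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return f_score(precision, recall)


def worth_whitelisting(tp: int, fp: int, fn: int,
                       extra_tp: int, extra_fp: int) -> bool:
    """True if recognizing the term raises F-score on the training data."""
    before = f_from_counts(tp, fp, fn)
    after = f_from_counts(tp + extra_tp, fp + extra_fp, fn - extra_tp)
    return after > before


# The abstract's figures: 86.2% precision and 85.0% recall.
print(round(f_score(0.862, 0.850), 3))  # 0.856
```

A blacklist candidate can be scored the same way, with the change in counts reversed.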
  20. CEM task results (on development set)
      Configuration            Precision   Recall   F-score
      Baseline                 0.87        0.82     0.84
      WhiteList                0.86        0.85     0.86
      BlackList                0.88        0.80     0.84
      WhiteList + BlackList    0.87        0.83     0.85
  21. CDI task ranking
      • Uses precision of entities when running against the development set, with the results broken down by:
        – Title vs abstract?
        – Which dictionary matched?
        – Were the entity’s bounds modified?
        – Did the entity occur more than once in the document?
  22. Conclusions
      • Grammars complement dictionaries to allow recognition of novel entities
      • Both the coverage and quality of dictionaries are important
      • The meaning of novel abbreviations can be determined algorithmically
      • Entities can be classified based on the resource that recognized them
  23. Thank you for your time!
      http://nextmovesoftware.com
      http://nextmovesoftware.com/blog
      daniel@nextmovesoftware.com
