Sujan Perera's Dissertation Defense: Friday, August 12, 2016
Ph.D. Committee: Drs. Amit Sheth, Advisor; T.K. Prasad, Michael Raymer, and Pablo Mendes (IBM Research)
Video: https://youtu.be/pbjJ1zb8ayY
ABSTRACT:
Natural language is a powerful tool developed by humans over hundreds of thousands of years. Extensive usage, the flexibility of the language, the creativity of human beings, and the social, cultural, and economic changes that have taken place in daily life have added new constructs, styles, and features to the language. One such feature is the ability to express ideas, opinions, and facts in an implicit manner. This feature is used extensively in day-to-day communication in situations such as: 1) expressing sarcasm, 2) trying to recall forgotten things, 3) conveying descriptive information, 4) emphasizing the features of an entity, and 5) communicating a common understanding.
Consider the tweet 'New Sandra Bullock astronaut lost in space movie looks absolutely terrifying' and the text snippet extracted from a clinical narrative 'He is suffering from nausea and severe headaches. Dolasteron was prescribed.' The tweet has an implicit mention of the entity Gravity, and the clinical text snippet has an implicit mention of the relationship between the medication Dolasteron and the clinical condition nausea. Such implicit references to entities and relationships are common in daily communication and add unique value to conversations. However, extracting implicit constructs has not received enough attention. This dissertation focuses on extracting implicit entities and relationships from clinical narratives and extracting implicit entities from tweets.
This dissertation demonstrates manifestations of implicit constructs in text, studies their characteristics, and develops a solution that is capable of extracting implicit factual information from text. The developed solution starts by acquiring relevant knowledge to solve the implicit information extraction problem. The relevant knowledge includes domain knowledge, contextual knowledge, and linguistic knowledge. The acquired knowledge can take different syntactic forms such as a text snippet, structured knowledge represented in standard knowledge representation languages like Resource Description Framework (RDF) or custom formats. Hence, the acquired knowledge is processed to create models that can be understood by machines. Such models provide the infrastructure to perform implicit information extraction of interest.
This dissertation focuses on three different use cases of implicit information and demonstrates the applicability of the developed solution in these use cases. They are:
- implicit entity linking in clinical narratives,
- implicit entity linking in Twitter,
- implicit relationship extraction from clinical narratives.
2. 2
Information Extraction
• More than 70% of data in organizations exist in unstructured form1
• Extraction of structured information from unstructured data is a
fundamental task
“All home medications although his insulin
dose (nph 20 qPM) was halved (--> NPH
10 qPM) on the floor, and his sugars were
running in the 150s-250s range.”
Insulin
Cisapride
Diabetes Mellitus
Hyperglycemia
Proinsulin
Porcine Insulin Insulin Glulisine
is a is a
is a
1https://en.wikipedia.org/wiki/Unstructured_data
3. 3
Information Extraction
• Almost exclusively focused on explicit information
“Bob Smith is a 61-year-old man referred by Dr. Davis for outpatient cardiac
catheterization because of a positive exercise tolerance test. Recently, he started
to have left shoulder twinges and tingling in his hands. A stress test done on
2013-06-02 revealed that the patient exercised for 6 1/2 minutes, stopped due to
fatigue. However, Mr. Smith is comfortably breathing in room air. He also showed
accumulation of fluid in his extremities. He does not have any chest pain.”
4. 4
Information Extraction
• Almost exclusively focused on explicit information
Named Entity Recognition Relationship Extraction
Entity Linking
“Bob Smith is a 61-year-old man referred by Dr. Davis for outpatient cardiac
catheterization because of a positive exercise tolerance test. Recently, he started
to have left shoulder twinges and tingling in his hands. A stress test done on
2013-06-02 revealed that the patient exercised for 6 1/2 minutes, stopped due to
fatigue. However, Mr. Smith is comfortably breathing in room air. He also showed
accumulation of fluid in his extremities. He does not have any chest pain.”
Person Person C0018795
C0015672
C0008031
5. 5
Information Extraction
• Misses the implicit information
“Bob Smith is a 61-year-old man referred by Dr. Davis for outpatient cardiac
catheterization because of a positive exercise tolerance test. Recently, he started
to have left shoulder twinges and tingling in his hands. A stress test done on
2013-06-02 revealed that the patient exercised for 6 1/2 minutes, stopped due to
fatigue. However, Mr. Smith is comfortably breathing in room air. He also showed
accumulation of fluid in his extremities. He does not have any chest pain.”
Person Person C0018795
C0015672
C0008031
No shortness of breath
edema
Named Entity Recognition Relationship Extraction
Entity Linking Implicit information extraction
6. 6
Thesis Statement
Implicit factual information in unstructured text can be efficiently
extracted by bridging syntactic and semantic gaps in natural language
usage and augmenting information extraction techniques with
relevant domain knowledge.
7. 7
• Express sarcasm/sentiment
• “I'm striving to be positive in what I say on Twitter. So I'll refrain
from making a comment about the latest Michael Bay movie.”
• Provide descriptive information
• “small fluid adjacent to the gallbladder with gallstones which may
represent inflammation”
• Emphasize features of the entity
• “Mason Evans 12 year long shoot won big in golden globe”
• Communicate the common understanding
• “He is suffering from nausea and severe headaches. Dolasteron was
prescribed.”
• Stylistic Preferences
• “Democratic candidate Bernie Sanders … The Vermont senator …”
Credit:http://bit.ly/2b9Bnjk
8. 8
Significance
• Volume
• 20% movie references and 40% book references in tweets
• 35% edema and 40% shortness of breath references in clinical
narratives
• Value
Explicit Information
Computer Assisted Coding
30-day Readmission
Prediction
Sentiment Analysis
Structured
Information
9. 9
Significance
• Volume
• 20% movie references and 40% book references in tweets
• 35% edema and 40% shortness of breath references in clinical
narratives
• Value
Ignoring implicit information in text would adversely affect
downstream applications
Explicit Information
Implicit Information
Computer Assisted Coding
30-day Readmission
Prediction
Sentiment Analysis
Structured
Information
10. 10
Role of Knowledge
New Sandra Bullock astronaut lost in space
movie looks absolutely terrifying
The patient showed accumulation of fluid in his
extremities, but respirations were unlabored and there
were no use of accessory muscles.
Edema Accumulation of an excessive amount
of watery fluid in cells or intercellular
tissues
Shortness
of breath
Labored or difficult breathing
associated with a variety of disorders
Sandra
Bullock
Gravity
Knowledge Bases
Image credits: http://bit.ly/2b5HPDQ and Icon made by Freepik from www.flaticon.com
Credit: http://bit.ly/2bi34FG
Credit: http://bit.ly/1x3sack
Credit: http://bit.ly/2b9CejW
Credit: http://bit.ly/2aXM97v
12. 12
Dissertation Focus
Implicit Information Extraction
Entities Relationships
Organized Text Unorganized Text
Clinical Narratives Tweets
Disorders Symptoms Movies Books
Clinical Narratives
Disorders and Symptoms
14. 14
Sentence Entity
“small fluid adjacent to the gallbladder with gallstones which may represent
inflammation.”
Cholecystitis
“His tip of the appendix is inflamed.” Appendicitis
“The respirations were unlabored and there were no use of accessory muscles.” Shortness of breath (NEG)
Implicit Entities in Clinical Documents
• One should know the physiological observations that characterize
particular entity
• Negations are embedded in the phrases indicating entities
• “Patient denies shortness of breath”
• “The respirations were unlabored”
15. 15
Knowledge Acquisition
• Unified Medical Language System – integrate many health and
biomedical vocabularies
• Linguistic Knowledge – WordNet
• Synonyms/antonyms
• Syntactic variations of the same term
CUI AUI STR
CUI TUI
CUI STR DEF SAB
Definitions for shortness of breath
A disorder characterized by an uncomfortable sensation of difficulty
breathing
Difficult or labored breathing
Labored or difficult breathing associated with a variety of disorders,
indicating inadequate ventilation or low blood oxygen or a subjective
experience of breathing discomfort
16. 16
Knowledge Modeling
• Each entity has multiple definitions
• Each definition is processed to create an entity indicator
• The representative power of each term (r_i) is calculated with a measure inspired by TF-IDF
• A collection of entity indicators constitutes the entity model
[Figure: definition1, definition2, and definition3 are processed into Entity Indicator1, Entity Indicator2, and Entity Indicator3, which together form the Entity Model]

| Definition | Entity Indicator |
|---|---|
| A disorder characterized by an uncomfortable sensation of difficulty breathing | (uncomfortable, r1), (sensation, r2), (difficulty, r3), (breathing, r4) |
| Difficult or labored breathing | (difficult, r5), (labored, r6), (breathing, r4) |
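The slide does not give the exact formula for representative power, so the following is only a minimal sketch of how entity indicators could be built from definitions. The weighting (share of the entity's definitions that use the term, times a log-scaled inverse entity frequency), the stop-word list, and all function names are assumptions made for illustration, not the dissertation's implementation.

```python
import math
import re

# Hypothetical stop list; "disorder" and "characterized" are treated as
# generic definitional words so the output resembles the slide's example.
STOP_WORDS = {"a", "an", "the", "by", "of", "or", "and", "with", "in", "to",
              "disorder", "characterized"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_entity_models(definitions):
    """Map {entity: [definition, ...]} to {entity: [indicator, ...]}.

    Each indicator is a {term: representative_power} dict built from one
    definition; the list of indicators is the entity model.
    """
    # IDF-like part: in how many entities' definitions does each term occur?
    entity_freq = {}
    for defs in definitions.values():
        for term in {t for d in defs for t in tokenize(d)}:
            entity_freq[term] = entity_freq.get(term, 0) + 1
    n_entities = len(definitions)

    models = {}
    for entity, defs in definitions.items():
        token_lists = [tokenize(d) for d in defs]
        indicators = []
        for tokens in token_lists:
            indicator = {}
            for term in set(tokens):
                # TF-like part: share of this entity's definitions using the term.
                tf = sum(term in tl for tl in token_lists) / len(token_lists)
                idf = math.log(1 + n_entities / entity_freq[term])
                indicator[term] = tf * idf
            indicators.append(indicator)
        models[entity] = indicators
    return models
```

With this weighting, a term such as "breathing" that appears in every definition of one entity and in no other entity receives the same high weight in each indicator, mirroring the shared r4 in the table above.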
17. 17
Detecting Sentences with Implicit Entities
• Sentences that contain an entity representative term but not the entity
name may mention the entity implicitly.
“However, Mr. Smith is comfortably breathing in room air.”
Candidate sentence for shortness of breath
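The candidate detection rule can be sketched as a simple filter. This is an illustrative sketch only: the function name is hypothetical, and the exact-word matching without stemming is an assumption.

```python
import re


def is_candidate(sentence, entity_name, representative_terms):
    """Return True when the sentence uses a representative term of the
    entity but does not contain the entity name itself."""
    text = sentence.lower()
    if entity_name.lower() in text:
        return False  # explicit mention, not a candidate for implicit linking
    words = set(re.findall(r"[a-z]+", text))
    # Assumption: exact word match; a real system might stem or lemmatize.
    return any(term in words for term in representative_terms)
```

For the slide's example, the sentence about Mr. Smith breathing comfortably is selected for shortness of breath because it contains "breathing" but not the entity name.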
18. 18
• The similarity between the entity model and the pruned sentence is
measured to annotate the sentence with a positive or negative label
• We developed a semantic similarity measure that accounts for
synonyms and antonyms
Information Extraction – Entity Linking
Candidate Sentence
Indicator1
Indicator2
Indicator3
Entity Model
sim1
sim2
sim3
19. 19
Information Extraction – Entity Linking
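[Figure: the terms of the candidate sentence (ct1 to ct4) are compared with the terms of the entity indicator (et5 to et7) using WordNet]
Term similarity: if antonym then -1, else max similarity
$$\mathrm{similarity} = \frac{\sum_{et} sim(et) \cdot rp(et)}{\sum_{et} rp(et)}$$
similarity > t1: Positive Annotation
similarity < t2: Negative Annotation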
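A minimal sketch of the weighted similarity and thresholding described on this slide follows. The tiny synonym and antonym tables stand in for WordNet lookups and are hypothetical, as are the threshold values and the choice to use the indicator with the largest absolute score; the slides do not specify how the per-indicator similarities are combined.

```python
# Toy stand-ins for WordNet lookups (hypothetical entries, illustration only).
SYNONYMS = {("labored", "difficult")}
ANTONYMS = {("labored", "unlabored")}


def term_similarity(et, ct):
    """Similarity between an entity-indicator term and a sentence term."""
    if (et, ct) in ANTONYMS or (ct, et) in ANTONYMS:
        return -1.0
    if et == ct or (et, ct) in SYNONYMS or (ct, et) in SYNONYMS:
        return 1.0
    return 0.0


def best_match(et, sentence_terms):
    """-1 if any sentence term is an antonym of et, else the max similarity."""
    sims = [term_similarity(et, ct) for ct in sentence_terms]
    if not sims:
        return 0.0
    return -1.0 if -1.0 in sims else max(sims)


def indicator_similarity(indicator, sentence_terms):
    """Term similarities weighted by representative power and normalized,
    so the result lies between -1 and +1."""
    total_rp = sum(indicator.values())
    if total_rp == 0:
        return 0.0
    weighted = sum(best_match(et, sentence_terms) * rp for et, rp in indicator.items())
    return weighted / total_rp


def annotate(indicators, sentence_terms, t1=0.4, t2=-0.2):
    """Label a pruned sentence; t1 and t2 are illustrative thresholds."""
    scores = [indicator_similarity(ind, sentence_terms) for ind in indicators]
    score = max(scores, key=abs)  # assumption: strongest indicator decides
    if score > t1:
        return "positive"
    if score < t2:
        return "negative"
    return "none"
```

Because an antonym contributes -1, a sentence such as "the respirations were unlabored" pulls the score below zero, which is how a negated mention can be recognized without an explicit negation word.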
20. 20
Evaluation
• Re-annotated the SemEval-2014 Task 7 dataset for implicit entities
• Entities were selected considering the frequency of appearance and expert feedback
• 857 sentences selected for 8 entities
• Annotated by three domain experts
• Annotation agreement 0.58

| Entity | Positive Annotations | Negative Annotations |
|---|---|---|
| Shortness of Breath | 93 | 94 |
| Edema | 115 | 35 |
| Syncope | 96 | 92 |
| Cholecystitis | 78 | 36 |
| Gastrointestinal Gas | 18 | 14 |
| Colitis | 12 | 11 |
| Cellulitis | 8 | 2 |
| Fasciitis | 7 | 3 |
21. 21
Annotation Performance

| Algorithm | Positive Precision | Positive Recall | Positive F1 | Negative Precision | Negative Recall | Negative F1 |
|---|---|---|---|---|---|---|
| Our | 0.66 | 0.87 | 0.75 | 0.73 | 0.73 | 0.73 |
| MCS | 0.50 | 0.93 | 0.65 | 0.31 | 0.76 | 0.44 |
| SVM | 0.73 | 0.82 | 0.77 | 0.66 | 0.67 | 0.67 |
| SVM+MCS | 0.73 | 0.82 | 0.77 | 0.66 | 0.66 | 0.66 |
| SVM+Our | 0.77 | 0.85 | 0.81 | 0.72 | 0.75 | 0.73 |

SVM+MCS and SVM+Our: adding the similarity value as a feature for the supervised algorithm
• Baselines
• MCS algorithm (Mihalcea 2006)
• SVM (trained on n-grams)
• Our algorithm outperforms the selected baselines in the negative category.
• SVM is able to leverage the supervision to beat our algorithm in the
positive category.
22. 22
Similarity as a Feature to Supervised Algorithm
• Added similarity value of unsupervised algorithms as a feature to the
SVM.
Positive Annotations Negative Annotations
23. 23
Annotation Performance – A Study with the Confidence
• Each annotation has a confidence score ranging from 1 to 5
• Low confidence reflects incomplete or ambiguous information
• Annotation performance increases as the confidence increases
• The negative class shows a significant increase
24. 24
Dissertation Focus
Implicit Information Extraction
Entities Relationships
Organized Text Unorganized Text
Clinical Narratives Tweets
Disorders Symptoms Movies Books
Clinical Narratives
Disorders and Symptoms
25. 25
• Use diverse characteristics of the entity
– “New Sandra Bullock astronaut lost in space movie looks absolutely
terrifying”
– “ISRO sends probe to Mars for less money than it takes Hollywood to send a
woman to space.”
– “oh yeah there is that new space movie coming out that looks terrifying i am
going to go see it”
• Use time-sensitive phrases
[Timeline figure: Gravity, Furious 7, and The Martian placed on a timeline (Fall 2013, April 2014, Fall 2015) with the time-sensitive phrases “space movie”, “fastest movie to earn $1 billion”, and “Paul Walker’s last movie”]
Tweets with Implicit Entities
Credit: http://bit.ly/2bkePJ6
26. 26
• Use diverse characteristics of the entity
– “… Richard Linklater movie …”
– “… Ellar Coltrane on his 12-year movie …”
– “… 12-year long movie shoot …”
– “… Mason Evan's childhood movie …”
• Use time-sensitive phrases
[Timeline figure: Gravity, Furious 7, and The Martian placed on a timeline (Fall 2013, April 2014, Fall 2015) with the time-sensitive phrases “space movie”, “fastest movie to earn $1 billion”, and “Paul Walker’s last movie”]
Tweets with Implicit Entities
Credit: http://bit.ly/2bk8xdp
27. 27
Knowledge Acquisition
• Acquiring factual knowledge
• Source – DBpedia
• Not all factual knowledge is important – movie has ‘starring’ and
‘director’ as well as ‘billed‘ and ‘license’
• Rank the relationships based on joint probability with the entity type
• Values of top-k relationships and the value of rdfs:comment are obtained
• Acquiring contextual knowledge
• Source – contemporary tweets
• We collect 1000 tweets with explicit mentions of the entity
• Number of views for the entity’s Wikipedia page within last t days
28. 28
Knowledge Acquisition
Wikipedia page titles
and anchor texts
Contemporary
tweets
Generate
semantic cues
Factual
knowledge
Clean tweets
Generate n-grams
• Need to extract meaningful phrases from acquired
knowledge
• Meaningful phrases = Wikipedia titles + anchor texts
• Matching n-grams are added to semantic cues
• Non-matching n-grams are added to semantic cues
after removing stop words
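The cue generation steps on this slide can be sketched as follows. This is a simplified illustration: the stop-word list, the maximum n-gram length, and the function names are assumptions, and `meaningful_phrases` stands for a lowercase set of Wikipedia page titles and anchor texts that would have to be loaded separately.

```python
import re

# Hypothetical stop-word list for illustration.
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "is", "and", "for", "on"}


def ngrams(tokens, max_n=3):
    """All word n-grams of length 1..max_n, as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def semantic_cues(text, meaningful_phrases, max_n=3):
    """Generate semantic cues from a cleaned tweet or a factual-knowledge value."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    cues = set()
    for gram in ngrams(tokens, max_n):
        if gram in meaningful_phrases:
            cues.add(gram)  # matches a Wikipedia title or anchor text
        else:
            # Non-matching n-gram: keep it only after removing stop words.
            kept = [w for w in gram.split() if w not in STOP_WORDS]
            if kept:
                cues.add(" ".join(kept))
    return cues
```

Matching against titles and anchor texts first keeps multi-word names such as "sandra bullock" intact instead of splitting them into unrelated single words.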
29. 29
Knowledge Modeling – Entity Model Network
Sandra Bullock
Alfonso Cuarón
Mars orbiter mission
Woman in space
astronaut
• A property graph reflecting the topical relationships between entities
$$\text{specificity}(c_j) = \frac{|N|}{|N_{c_j}|}$$
where $|N|$ is the total number of entities and $|N_{c_j}|$ is the number of entities adjacent to $c_j$
frequency = total number of times the phrase appears in tweets
time salience = number of Wikipedia views
Factual Knowledge
Contextual Knowledge
Entity
Gravity
Christopher Nolan
Matt Damon
Interstellar
The Martian
30. 30
Detecting Tweets with Implicit Entities
• Tweets are filtered with keywords – movie, film, book, novel
• A simple annotation technique, dictionary matching, is applied
• Tweets that are not annotated with an entity of the types we are
looking for are considered to have implicit entity mentions
Keywords
Entity
Dictionary
Annotating
Tweets
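The filtering described on this slide could look roughly like the sketch below. The keyword set comes from the slide; the function name and the plain substring matching against the entity dictionary are assumptions for illustration.

```python
import re

KEYWORDS = {"movie", "film", "book", "novel"}


def has_implicit_candidate(tweet, entity_dictionary):
    """True when the tweet passes the keyword filter but dictionary matching
    finds no explicit entity name of the target types."""
    text = tweet.lower()
    words = set(re.findall(r"[a-z]+", text))
    if not (KEYWORDS & words):
        return False  # not about a movie or book at all
    # Assumption: case-insensitive substring match against known entity names.
    return not any(name.lower() in text for name in entity_dictionary)
```

A tweet that names the movie explicitly is left to ordinary entity linking, while a tweet that only describes it is passed on to implicit entity linking.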
31. 31
Information Extraction – Entity Linking
• Two Step Process
• Step 1: Candidate selection and filtering
• Objective - prune the search space of the Entity Model Network (EMN) to
reduce the number of entities considered in the disambiguation step
• Step 2: Disambiguation
• Objective - sort the selected candidate entities to place the implicitly
mentioned entity in the top position
32. 32
Entity Linking - Candidate selection and filtering
[Figure: Entity Model Network with entity nodes m1 to m8 and cue nodes c1 to c9; legend: Entity, Factual Knowledge, Contextual Knowledge]
“ISRO sends probe to Mars for less money than it takes Hollywood movie to send a woman to space”
33. 33
Entity Linking - Candidate selection and filtering
[Figure: Entity Model Network with entity nodes m1 to m8 and cue nodes c1 to c9; the cues c2, c5, c7, and c8 are shown next to the tweet; legend: Factual Knowledge, Contextual Knowledge, Entity]
“ISRO sends probe to Mars for less money than it takes Hollywood movie to send a woman to space”
35. 35
Entity Linking - Candidate selection and filtering
[Figure: Entity Model Network with entity nodes m1 to m8 and cue nodes c1 to c9; the matching cues c2, c5, c7, and c8 point to the candidate entities m1 to m7; legend: Factual Knowledge, Contextual Knowledge, Entity]
“ISRO sends probe to Mars for less money than it takes Hollywood movie to send a woman to space”
$$score(m_i) = \sum_{c_j \in \mathbb{C}} \text{specificity}(c_j) \cdot \text{frequency}(c_j, m_i)$$
ℂ is the set of matching cues
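The candidate scoring above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the dictionary-based data structures (`cue_to_entities`, `frequency`) and the function names are hypothetical, and a cue-entity pair without a recorded frequency is assumed to contribute nothing.

```python
def specificity(cue, cue_to_entities, n_entities):
    """|N| / |N_cj|: cues attached to few entities are more specific."""
    return n_entities / len(cue_to_entities[cue])


def select_candidates(tweet_cues, cue_to_entities, frequency, n_entities, top_k=25):
    """Score entities by the summed specificity * frequency of the matching
    cues and return the top_k entity names together with all scores."""
    scores = {}
    for cue in tweet_cues:
        if cue not in cue_to_entities:
            continue  # cue is not in the network, so it matches nothing
        weight = specificity(cue, cue_to_entities, n_entities)
        for entity in cue_to_entities[cue]:
            # Assumption: a missing (cue, entity) count contributes 0.
            scores[entity] = scores.get(entity, 0.0) + weight * frequency.get((cue, entity), 0)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores
```

A generic cue such as "space" is adjacent to many entities and therefore counts for little, while a cue attached to a single entity can lift that entity to the top of the candidate list.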
36. 36
• Formulated as a ranking problem
• SVMrank to rank candidates
• Similarity between the candidate entity and the tweet: feature vector $(x_1, x_2, x_3, \ldots, x_n)$ with
$$x_j = \text{specificity}(c_j) \cdot \text{frequency}(c_j, e_i)$$
• Temporal salience of the candidate entity:
$$\frac{\text{temporal salience}(e_i)}{\sum_{e \in E_c} \text{temporal salience}(e)}$$
$E_c$ is the selected candidate set
[Figure: candidate entities m2, m3, m4, m6, and m7]
Entity Linking - Disambiguation
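The two feature groups used for ranking can be sketched as below. This only builds the feature vector for one candidate; training and applying SVMrank is not shown. The data structures, the function name, and the use of Wikipedia page views as the temporal salience value (as the Entity Model Network slide suggests) are assumptions for illustration.

```python
def feature_vector(entity, candidates, tweet_cues, cue_to_entities,
                   frequency, n_entities, page_views):
    """Features for one candidate entity e_i.

    One similarity feature x_j per tweet cue c_j, followed by the
    candidate's temporal salience normalized over the candidate set E_c.
    """
    features = []
    for cue in tweet_cues:
        linked = cue_to_entities.get(cue, [])
        if entity in linked:
            # x_j = specificity(c_j) * frequency(c_j, e_i)
            features.append(n_entities / len(linked) * frequency.get((cue, entity), 0))
        else:
            features.append(0.0)  # cue is not connected to this candidate
    total_views = sum(page_views[e] for e in candidates)
    features.append(page_views[entity] / total_views if total_views else 0.0)
    return features
```

Normalizing the page views over the candidate set makes the salience feature comparable across tweets whose candidates differ widely in overall popularity.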
37. 37
Evaluation Dataset

| Entity Type | Annotation | Tweets | Entity |
|---|---|---|---|
| Movie | Explicit | 391 | 107 |
| Movie | Implicit | 207 | 54 |
| Movie | NULL | 117 | 0 |
| Book | Explicit | 200 | 24 |
| Book | Implicit | 190 | 53 |
| Book | NULL | 70 | 0 |

• Tweets were collected in August 2014 using keywords
• The tweets were manually annotated with the DBpedia URLs of entities
• The tweets annotated with NULL have neither an explicit nor an implicit
mention of an entity
38. Entity Model Network Creation
• 15,000 tweets for movies and books in July 2014
• 617 movies and 102 books
• Recent 1000 tweets per entity to build its contextual knowledge
• May 2014 version of DBpedia used to extract factual knowledge
• Temporal salience is obtained for July 2014
[Figure: Entity Model Network with entity nodes m1 to m7 and cue nodes c1 to c9; legend: Factual Knowledge, Contextual Knowledge, Entity]
39. 39
Evaluation - Implicit Entity Linking
• How many tweets had the correct entity within the selected candidate set (top-25)?
• How many entities were correctly linked by our disambiguation approach?

| Entity Type | Candidate Selection Recall | Disambiguation Accuracy |
|---|---|---|
| Movie | 90.33% | 60.97% |
| Book | 94.73% | 61.05% |

• Importance of contextual knowledge

| Step | Entity Type | Without Contextual Knowledge | With Contextual Knowledge |
|---|---|---|---|
| Candidate Selection Recall | Movie | 77.29% | 90.33% |
| Candidate Selection Recall | Book | 76.84% | 94.73% |
| Disambiguation Accuracy | Movie | 51.7% | 60.97% |
| Disambiguation Accuracy | Book | 50.0% | 61.05% |
40. 40
Qualitative Error Analysis

| Error | Tweet | Entity |
|---|---|---|
| Lack of contextual knowledge | ‘That Movie Where Shailene Woodley Has Her First Nude Scene? The Trailer Is RIGHT HERE!: No one can say Shailene Woodley isn't brave!’ | White Bird in a Blizzard |
| Novel entities | ‘"hey, what's wrawng widdis goose?" RT @TIME: Mark Wahlberg could be starring in a movie about the BP oil spill http://ti.me/1oZh55V’ | Deepwater Horizon |
| Cold start of entities | ‘Video: George R.R. Martin's Children's Book Gets Re-release http://bit.ly/1qNNH5r’ | The Ice Dragon |
| Multiple implicit entity mentions | ‘That moment when you realize that hazel grace and Augustus are brother and sister in one movie and in love battling cancer in another’ | Divergent, The Fault in Our Stars |
41. 41
Dissertation Focus
Implicit Information Extraction
Entities Relationships
Organized Text Unorganized Text
Clinical Narratives Tweets
Disorders Symptoms Movies Books
Clinical Narratives
Disorders and Symptoms
43. 43
• Implicit relationships:
• Exist between symptoms, disorders, medications, and procedures
• Can be established by leveraging domain knowledge
• The existing knowledge bases fall short in eliciting relationships
• Data + Knowledge can help to elicit such implicit relationships
efficiently
Implicit Relationships in Clinical Narratives
45. 45
A Scenario
Atrial fibrillation
Hypertension
Diabetes
Fatigue
Syncope
Weight loss
Chest pain
Discomfort in chest
Dizzy
Shortness of Breath
Nausea
Vomiting
Headache
Cough
Weight gain
Atrial fibrillation
Hypertension
Diabetes
Chest pain
Weight gain
Discomfort in chest
Cough
Headache
Edema
Shortness of Breath
Knowledge base does not know about edema. Now edema can
be a symptom of any disorder in the document.
Observed
Disorders
Observed
Symptoms
46. 46
Knowledge Acquisition
• Hierarchical knowledge and non-hierarchical knowledge
Hierarchical Knowledge
Retrieved from UMLS
Non-hierarchical Knowledge
Extracted from Web Resources
+
Feedback from domain expert
www.nlm.nih.gov www.en.wikipedia.org
www.webmd.com www.mayoclinic.com
www.clevelandclinic.org www.healthline.org
MRHIER

| CUI | AUI | PAUI | PTR |
|---|---|---|---|
| C0013404 | A0052186 | A0111363 | A0434168.A2367943. … |
| C0013604 | A0052723 | A0135504 | A0434168.A2367943 |

MRCONSO

| CUI | AUI | SAB | STR |
|---|---|---|---|
| C0013404 | A0052186 | MSH | Shortness of breath |
| C0013604 | A0052723 | MSH | Edema |
48. 48
Detecting Unexplained Symptoms
• Clinical documents were semantically annotated for entities using
cTAKES
• Known relationships are populated
• Unexplained symptoms were detected
[Figure label: Modeled Knowledge]
Credit:http://bit.ly/2aMWVAd
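Detecting unexplained symptoms in one annotated document can be sketched as a set difference against the known relationships. This is a minimal illustration: the function names and the dictionary layout of the knowledge base are assumptions, and any relationships used to exercise it are toy data rather than medical knowledge.

```python
def unexplained_symptoms(observed_disorders, observed_symptoms, known_relationships):
    """Symptoms in a document that no observed disorder explains.

    known_relationships maps a disorder to the set of symptoms the
    knowledge base already associates with it.
    """
    explained = set()
    for disorder in observed_disorders:
        explained |= known_relationships.get(disorder, set())
    return [s for s in observed_symptoms if s not in explained]


def naive_hypotheses(unexplained, observed_disorders):
    """Naive baseline: any observed disorder may explain each unexplained symptom."""
    return {symptom: list(observed_disorders) for symptom in unexplained}
```

In the scenario slide, edema is the only observed symptom the knowledge base cannot attach to an observed disorder, so the naive baseline would pair it with atrial fibrillation, hypertension, and diabetes alike.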
49. 49
Information Extraction – Unknown Relationships
• A naïve method would assume a relationship between the unexplained
symptom and every disorder in the clinical narrative
• Can we leverage the knowledge we have about the symptom to find
the most plausible disorders?
• Intuition: a symptom is most likely to be shared by similar disorders
54. 54
Information Extraction – Unknown Relationships
5. Eliminate non-matching candidate disorders
[Figure: unexplained symptom S with the remaining candidate disorders D2 and D4]
We are left with the most plausible disorders for the unexplained symptom. If
this scenario occurs frequently, it increases the confidence in this
relationship.
55. 55
Evaluation
• A corpus of 1,500 electronic medical records was used
• The records were annotated with cTAKES and the most frequent entities
were selected
• UMLS semantic types were used to categorize disorders and
symptoms
• Initial knowledge base - 86 disorders, 42 symptoms, 255 disorder-symptom
relationships
56. 56
Evaluation – Relationship Prediction
• There were 29 distinct unexplained symptoms
• Precision of the questions generated
• 1st iteration - 105 correct from 142 (73.94%)
• 2nd iteration - 20 correct from 29 (68.96%)
• 3rd iteration - 4 correct from 9 (44.44%)

Top 5 unexplained symptoms

| Symptom | Number of unexplained instances |
|---|---|
| Edema | 910 |
| Syncope | 336 |
| Systolic Murmur | 168 |
| Tachycardia | 143 |
| Angina | 136 |

Top 5 co-occurring disorders with edema

| Disorder | Number of co-occurrences |
|---|---|
| Hypertension | 647 |
| Hyperlipidemia | 641 |
| Claudication | 454 |
| Coronary atherosclerosis | 395 |
| Coronary artery disease | 242 |
57. 57
Evaluation – Increment in Explainability
| Knowledge base | Number of unexplained relationships | Increment in explainability |
|---|---|---|
| Initial knowledge base | 2251 | 0% |
| After 1st iteration | 878 | 60.99% |
| After 2nd iteration | 806 | 64.19% |
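The slide does not state how the increment is computed. The reported percentages are consistent with measuring the reduction in unexplained relationships relative to the initial knowledge base, which the short sketch below checks; the function name is hypothetical.

```python
def explainability_increment(initial_unexplained, remaining_unexplained):
    """Percentage of initially unexplained relationships that are now explained."""
    return (initial_unexplained - remaining_unexplained) / initial_unexplained * 100


# Counts taken from the table above.
after_first = explainability_increment(2251, 878)
after_second = explainability_increment(2251, 806)
print(f"{after_first:.3f}% {after_second:.3f}%")
```

Under this reading, the first iteration explains 1,373 of the 2,251 initially unexplained relationships and the second iteration explains a further 72.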
58. 58
Summary
• Implicit information occurs frequently in text, and ignoring it
would adversely affect downstream applications.
• Linguistic and world knowledge play an important role in decoding
implicit information.
• This dissertation demonstrated the characteristics of implicit information
and developed a solution to capture factual implicit constructs.
Knowledge Acquisition Knowledge Modeling Detecting Implicit
Information
Information Extraction
59. 59
Contributions
• Identify and demonstrate the value of implicit information.
• Study the characteristics of the implicit information manifestation.
• Demonstrate the value of knowledge in extracting factual implicit
information.
- Linguistic - Domain - Contextual
• Developed a framework for factual implicit information extraction.
• Demonstrated the usage of the framework to solve three implicit
information extraction problems.
60. 60
Graduate Life@Kno.e.sis
Journal Publications:
• Sujan Perera, Cory Henson, Krishnaprasad Thirunarayan, Amit Sheth, Suhas
Nair, Semantics Driven Approach for Knowledge Acquisition from EMRs, IEEE Journal of
Biomedical and Health Informatics.
• Raminta Daniulaityte, Robert Carlson, Russel Falck, Delroy Cameron, Sujan Perera, Lu
Chen and Amit Sheth. 'I just wanted to tell you that loperamide WILL WORK': A Web-
Based Study of Extra-Medical Use of Loperamide.
Conference Publications:
• Sujan Perera, Pablo Mendes, Adarsh Alex, Amit Sheth, Krishnaprasad
Thirunarayan, Implicit Entity Linking in Tweets, ESWC 2016
• Sujan Perera, Pablo Mendes, Amit Sheth, Krishnaprasad Thirunarayan, Adarsh Alex,
Christopher Heid, Greg Mott, Implicit Entity Recognition in Clinical Documents, *SEM
2015
• Sujan Perera, Cory Henson, Krishnaprasad Thirunarayan, Amit Sheth, Suhas Nair, Data
Driven Knowledge Acquisition Method for Domain Knowledge Enrichment in the
Healthcare, BIBM 2012
• Menasha Thilakaratne, Ruvan Weerasinghe, Sujan Perera, Knowledge-driven Approach
to Predict Personality Traits by Leveraging Social Media Data, WI 2016
Workshop and Posters:
• Sujan Perera, Amit Sheth, Krishnaprasad Thirunarayan, Challenges in Understanding
Clinical Notes: Why NLP Engines Fall Short and Where Background Knowledge Can
Help, DARE 2013
• Raminta Daniulaityte, Robert Carlson, Russel Falck, Delroy Cameron, Sujan Perera, Lu
Chen, Amit Sheth. A Web-Based Study of Self-Treatment of Opioid Withdrawal
Symptoms with Loperamide, CPDD 2012
Internships:
• ezDI Summer 2012
• IBM Watson Summer 2014 and 2015
Awards and grants:
• George Thomas Graduate Fellowship
• NSF travel grants: BIBM and ICHI
PC Committee:
• DARE (2013), EKAW (2014, 2016), ISWC
2015, IJCAI 2016
External Reviewer:
• ISWC, ESWC, IJSWIS, IEEE Intelligent
Systems, Applied Ontology, ODBASE
Proposal Contributions:
• eDrugTrends (NIH R01)
• Healthcare Outcome Prediction (NSF-SCH)
Mentoring:
• Adarsh Alex (MSc)
• Menasha Thilakaratne (BSc)
62. 62
Coffee Mates and Colleagues
Thank You
Funding
• ezDI
• George Thomas Fellowship
• NSF: CNS 1513721 Context-
Aware Harassment Detection
on Social Media
Editor's Notes
Hi Good morning everyone, thanks for attending my dissertation defense.
My dissertation topic is “Knowledge-driven Implicit Information Extraction”
My dissertation committee is Dr. Sheth, my advisor, Dr. Prasad, Dr. Raymer, and joining us remotely Dr. Mendes from IBM research
It is a well-known fact that more than 70% of data are unstructured
The field of information extraction focuses on extracting structured information from unstructured data
The information extraction can add more semantics to the text
For example, when we know that the insulin here is the medication and not a regular English term we know more about the text
We know it is probably treating diabetes and it should not be used with Cisapride
We have exclusively focused on explicit information extraction
Here is a text snippet extracted from a clinical narrative
With the current information extraction techniques we can identify these entities and relationships, namely we can perform named entity recognition, entity linking, and relationship extraction on this text
However, this snippet had these two phrases and they say that the patient does not have shortness of breath and he has edema.
These were implicit and we don’t know how to identify these implicit entities in this text
This is the problem I address in my dissertation work
My thesis statement is …
Implicit communication is not a random occurrence; people have genuine reasons for it
The first example here expresses negative sentiment towards a Michael Bay movie and is sarcastic in nature
Sometimes the term used for an entity does not convey complete and important information. In the second example, the term cholecystitis alone does not provide enough information. In the medical domain it is important to know these nuances, because a better idea of the causes helps accurate treatment
We use implicit mentions when we forget things and when we want to emphasize features of the entity rather than the entity itself
One more instance is when we know the other party has the knowledge required to interpret the statement
Implicit information has enough volume to get noticed and enough value, at least in the domains we looked at.
Our studies showed that some entities are mentioned implicitly between 20%-40% of the time
In terms of value, extracting structured information is very important to applications like computer assisted coding and to secondary data analysis such as prediction tasks on medical data and sentiment analysis on tweets
But all of them have been functioning without the implicit information
Sometimes, they are handicapped by this
For example a sentiment analysis application would not know the target of the positive sentiment in the tweet “new Sandra Bullock space movie is terrific”
Hence ignoring implicit entity mentions would adversely affect downstream applications
Whenever we talk implicitly we assume a common understanding between us and the audience
For example, in the first scenario here, the speaker assumes that the audience knows the physiological observations that characterize edema and shortness of breath
In the second scenario, the speaker assumes that the audience knows Sandra Bullock starred in Gravity
This knowledge is available in different formats.
Such resources contain linguistic knowledge, common sense knowledge, domain-specific and cross-domain world knowledge
http://icons.mysitemyway.com/legacy-icon/115791-magic-marker-icon-people-things-people-audience/
Icon made by Freepik from www.flaticon.com
In this dissertation I address the problem of implicit information extraction.
Here are the four main components of the solution I developed.
As demonstrated, domain knowledge is an important component; hence the solution acquires relevant domain knowledge.
The acquired knowledge is not always readable by machine algorithms
Hence we model it in a way that can be consumed by algorithms.
Then, given a text, the solution identifies whether it has implicit information of interest and applies techniques developed to extract such information.
Within this talk I will demonstrate 3 applications of implicit information extraction and show how each of these high level components are realized to solve the problem at hand.
Before going into the solutions, let me define the focus of the dissertation
So many things can be expressed implicitly: ideas, opinions, and facts.
What we observe is that when the entities are mentioned implicitly in clinical documents they describe the entities with physiological observations.
Sometimes you need semantics of the language to understand what is in the description
One of the prominent places to obtain medical knowledge is UMLS.
It provides human-curated definitions of the entities in terms of their physiological observations; this is ideal input for extracting implicit entities.
Linguistic knowledge about the semantics of the terms can be obtained from WordNet
The definitions are text snippets and cannot be used as they are to understand implicit entity mentions.
Hence, we process them and create models that can be read by machines.
The idea is that the entity model consists of the essential characteristics described in the definitions and is machine readable
An entity can have multiple definitions. Each definition uses different terminology to explain the entity, and this is important because the entity's manifestations in text can also use different terminology.
So the entity model benefits from this and uses all definitions. It consists of entity indicators, which are created from the definitions; a collection of entity indicators is called an entity model.
Here is an example for shortness of breath…
The r_i in the entity indicator is called the representative power; it reflects the term’s ability to discriminately identify a mention of the entity
The entity representative term (ERT) can be used to identify the candidate sentences with implicit entity mentions
The sentences with the ERT but without the entity name are identified as candidate sentences, and they are pruned to keep the essential portion of the sentence.
The entity linking step takes the pruned sentence and the entity model as inputs and compares them to see whether the sentence mentions the entity of the given model.
Basically, the pruned sentence is compared with each entity indicator and the similarity between them is measured
Similarity is calculated by comparing the terms in entity indicator and candidate sentence
WordNet is used to understand the lexical semantics
The term similarity is weighted with its representative power and normalized
Total similarity value lies between -1 and +1
We evaluated this work on a dataset created by re-annotating a SemEval dataset.
We selected 8 entities based on their frequency of appearance and selected 857 sentences with implicit mentions of these entities.
The table shows the distribution of the sentences for each entity in two classes.
In order to assess how our proposed algorithm works, we used one unsupervised and one supervised baseline.
As shown in the first three rows, our algorithm does well in identifying the negated mentions of the entities, and the SVM does slightly better in identifying positive mentions.
So as the next step we added the similarity value we calculated as a feature to the supervised solution.
The supervised solution was able to leverage the semantics of the similarity value to enhance its results.
Example: ‘he was placed on mechanical ventilation shortly after presentation’ – ‘mechanical ventilation’ is not in the definitions, but experts annotate such sentences positively – the SVM captures such bigrams in training
Although the supervised algorithm does well with the added features, it needs to be trained for each clinical entity.
This means it needs labeled data for each entity. We checked whether this can be compensated for with our features
Then we studied our results against annotation confidence. Each annotation has a confidence marked from 1 to 5.
Low confidence reflects incomplete or ambiguous information
So we expect our results to increase as the annotation confidence increases.
The high-confidence negations are mostly explicit. For the low-confidence negations, it is unclear whether they are negations or absences.
For example, annotators often argue whether ‘patient breathing at a rate of 15-20’ means the negation of entity ‘shortness of breath’ (because that is a normal breathing pattern) or just lacks a mention of the entity.
Before going into the solutions, let me define the focus of the dissertation
So many things can be expressed implicitly: ideas, opinions, and facts.
Twitter is a very dynamic environment, and it reacts to real-world events quickly
This is reflected in implicit mention of entities in tweets
Tweets use diverse characteristics of the entities to mention them in an implicit manner
The phrases they use are time-sensitive
Static, manually curated definitions do not give sufficient information to decipher them
Look at these examples:
The same entity can be referred to using different characteristics – all these phrases refer to the movie Gravity using different features of the movie
Knowledge should be up to date – the same phrase can be used to refer to different entities at different times
The same entity can be referred to with different phrases at different times
The first example is similar to medical entity definitions, where each entity had more than one way to express it, hence multiple definitions. But in this case the phrases listed here are not paraphrases; they are literally different ways to express the entity.
Knowledge acquisition should be very comprehensive – we use three different sources to acquire three different types of knowledge
We acquire factual knowledge from DBpedia,
contextual knowledge that reflects the topics temporally relevant to the entity is acquired from daily communications on Twitter
knowledge about the temporal relevance of the entity is obtained from the number of hits to its Wikipedia page
The collected knowledge has unstructured text components, especially tweets, so we need to identify meaningful phrases in them
In order to do that we create a dictionary of meaningful phrases using Wikipedia page titles and anchor texts
Then these phrases are used to spot meaningful phrases in unstructured text – they are called semantic cues
Knowledge is modelled reflecting the association of the semantic cues towards the entities
Some semantic cues are shared by the entities
The modelled knowledge reflects the topical relationship between entities
The semantic cues are weighted based on their specificity towards entities – the cues that are common among multiple entities are weighted low
The entity model network provides the necessary infrastructure to perform entity linking
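A minimal sketch of how such a network could be built and weighted. The weighting formula here, log(1 + N / n) with N entities in total and n entities sharing the cue, is an IDF-style assumption chosen for illustration; the notes only say that cues common among multiple entities are weighted low. The function name and the toy cues are hypothetical.

```python
import math
from collections import defaultdict


def build_entity_model_network(entity_cues):
    """Map each semantic cue to the entities it points to, with a weight.

    entity_cues: dict of entity -> set of semantic cues.
    Returns: dict of cue -> {entity: weight}.
    """
    cue_to_entities = defaultdict(set)
    for entity, cues in entity_cues.items():
        for cue in cues:
            cue_to_entities[cue].add(entity)

    total = len(entity_cues)
    network = {}
    for cue, entities in cue_to_entities.items():
        # Assumed IDF-style specificity: shared cues get lower weights.
        weight = math.log(1 + total / len(entities))
        network[cue] = {entity: weight for entity in entities}
    return network


# Toy cues, invented for illustration.
entity_cues = {
    "Gravity": {"Sandra Bullock", "astronaut", "space"},
    "The Heat": {"Sandra Bullock", "FBI"},
    "The Blind Side": {"Sandra Bullock", "football"},
}
network = build_entity_model_network(entity_cues)
```

Here "Sandra Bullock" is shared by all three movies and gets a lower weight than "astronaut", which points only to Gravity.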
Now we need to identify tweets with implicit entities
We do this by first filtering tweets from the stream of tweets using keywords and then annotating them for explicit entity mentions. The tweets that are not annotated with entities are assumed to have implicit entity mentions
The information extraction phase links the tweets with implicit entities to the correct entity
It consists of two phases
The first phase finds the candidate entities, and the second phase ranks the candidate entities so that the correct entity is ranked at the top
Let’s walk through one example here
First it matches the phrases in the incoming tweet with the nodes in the entity model network
All entities that have a matching node as a neighbor become candidates
The matching entities are scored based on the strength of the evidence – the strength captures how many semantic cues match and how specific those cues are for the particular entity
Then we take the top-k entities as candidates for the ranking step
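A small sketch of the candidate selection step, assuming the evidence strength is simply the sum of the weights of the matching cues. That scoring rule, the function name, and the toy network with its weights are assumptions for illustration; k=25 follows the value used in the evaluation.

```python
def select_candidates(tweet_cues, network, k=25):
    """Score entities by the summed weights of the cues found in the tweet
    and return the top-k (entity, score) pairs (hypothetical helper)."""
    scores = {}
    for cue in tweet_cues:
        for entity, weight in network.get(cue, {}).items():
            scores[entity] = scores.get(entity, 0.0) + weight
    # Sort by descending score, breaking ties by entity name.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Toy network of cue -> {entity: weight}; the weights are invented.
network = {
    "Sandra Bullock": {"Gravity": 0.7, "The Heat": 0.7},
    "astronaut": {"Gravity": 1.4},
    "space": {"Gravity": 1.1, "Apollo 13": 1.1},
}
tweet_cues = ["Sandra Bullock", "astronaut", "space", "terrifying"]
candidates = select_candidates(tweet_cues, network)
```

Gravity matches three cues, two of them fairly specific, so it heads the candidate list; cues that are not in the network are ignored.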
The ranking step considers the similarity between the tweet and the entity and temporal salience of the entity to rank candidates
Isn’t the candidate ranking enough? No, candidate ranking considers only the matching phrases. What about the impact of the non-matching phrases? If the non-matching phrases are more specific to an entity, that entity should be ranked lower.
In order to evaluate this work we annotated tweets of two domains – movies and books
Here are the statistics of the annotation –
When you filter the tweets with keywords there are three types of tweets – tweets with explicit entities, tweets with implicit entities, and tweets with no entities
For example, the evaluation dataset has 391 tweets with explicit movie entities, 207 tweets with implicit movie entities, and 117 with no movie mentions
Total tweets 1175
The evaluation dataset was collected in August 2014, and we created the entity model network (EMN) for July 2014
We collected entities of interest by collecting 15,000 tweets and annotating them
So the EMN consists of 617 movies and 102 books
We collected a maximum of 1000 tweets per entity to extract contextual knowledge
Wikipedia statistics for July 2014 are used to estimate the temporal salience
We evaluated both candidate selection and disambiguation
Candidate selection achieved more than 90% recall, and disambiguation achieved around 61% accuracy
Then we evaluated the contribution of the knowledge extracted from Twitter conversations – it is observed that there is a clear improvement in results in both steps when entity models are built with contextual knowledge
Candidate Selection Recall - proportion of the tweets that had correct entity within top k (k=25) candidates
Disambiguation accuracy – proportion of the tweets that had correct entity in the top position
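The two measures defined above can be computed as follows. This is a sketch with hypothetical function names and invented toy data, assuming each tweet has exactly one gold entity and a ranked list of predicted entities.

```python
def candidate_selection_recall(gold, ranked, k=25):
    """Proportion of tweets whose correct entity is within the top-k candidates."""
    if not gold:
        return 0.0
    hits = sum(
        1 for tweet_id, entity in gold.items()
        if entity in ranked.get(tweet_id, [])[:k]
    )
    return hits / len(gold)


def disambiguation_accuracy(gold, ranked):
    """Proportion of tweets whose correct entity is in the top position."""
    if not gold:
        return 0.0
    hits = sum(
        1 for tweet_id, entity in gold.items()
        if ranked.get(tweet_id, [])[:1] == [entity]
    )
    return hits / len(gold)


# Toy gold annotations and ranked predictions, invented for illustration.
gold = {"t1": "Gravity", "t2": "The Heat", "t3": "Gravity", "t4": "Apollo 13"}
ranked = {
    "t1": ["Gravity", "Apollo 13"],
    "t2": ["Gravity", "The Heat"],
    "t3": ["Apollo 13"],
    "t4": ["Apollo 13"],
}
recall = candidate_selection_recall(gold, ranked, k=25)
accuracy = disambiguation_accuracy(gold, ranked)
```

In this toy data the correct entity is among the candidates for three of four tweets but on top for only two, which mirrors why recall can be high while accuracy stays lower.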
Why disambiguation is hard: the number of candidates is high and they are all relevant. Compare finding candidates for the term ‘Gravity’ with finding candidates for movies with Sandra Bullock: the second one has more candidates
Also, when disambiguating the term ‘Gravity’, the candidates are entities of different types, but in the second scenario all of them are movies. Disambiguation is harder among entities of the same type than among entities of different types
Here are the four types of errors we found in the entity linking step
Correct examples and show how EMN is activated with examples
Look at this example document snippet. It lists disorders, symptoms, medications, and procedures
But it does not tell how they are related
A medical professional will create a graph like this after reading this document
These relationships exist between all types of entities in clinical documents.
If you have relevant domain knowledge about the relationships between these entities, it is possible to establish these relationships
Let’s have a look at a scenario
The task boils down to finding the knowledge that is absent from the knowledge base in an efficient manner
Assume that you have this kind of knowledge which has the relationships between disorders and symptoms
Now given a document with entities shown on your right, you can establish the relationships between disorders and symptoms.
However, it seems the knowledge base does not know why edema is present in the document.
This is frequently the case with clinical knowledge bases – they lack non-hierarchical relationships
Now you have the choice of asking a domain expert for an explanation of the presence of edema in this document.
Instead, we developed an algorithm that leverages what is known about edema to make the best guess among the three disorders here
Essentially, it analyzes the clinical narratives with the available partial domain knowledge and identifies the gaps in the knowledge base.
Then it suggests links that should be in the knowledge base. These links ultimately help to establish implicit relationships between entities in clinical records
We start with collecting the knowledge about clinical entities in standard knowledge bases
This knowledge includes hierarchical and non-hierarchical relationships between the entities.
Hierarchical knowledge is obtained from UMLS and non-hierarchical knowledge is obtained from credible Web resources
Then we model the entities of interest using collected knowledge.
Entity models look like this.
In order to detect unexplained symptoms, we assume that all symptoms in a clinical document should be explained by the disorders present in the same document.
Whenever we see symptoms present without explanation, we flag them.
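A minimal sketch of the flagging step using plain sets. The actual work uses the IntellegO ontology with perceptual reasoning; this set difference only illustrates the assumption stated above. The function name and the toy knowledge base are invented, and the knowledge base is deliberately incomplete.

```python
def unexplained_symptoms(doc_disorders, doc_symptoms, knowledge_base):
    """Return symptoms in a document that no co-occurring disorder explains.

    knowledge_base: dict of disorder -> set of symptoms it is known to cause.
    """
    explained = set()
    for disorder in doc_disorders:
        explained |= knowledge_base.get(disorder, set())
    return sorted(set(doc_symptoms) - explained)


# Toy, deliberately incomplete knowledge base (illustrative only).
knowledge_base = {
    "hypertension": {"headache"},
    "diabetes mellitus": {"fatigue"},
}
flagged = unexplained_symptoms(
    ["hypertension", "diabetes mellitus"],
    ["headache", "fatigue", "edema"],
    knowledge_base,
)
```

Edema is flagged because none of the disorders in the document is linked to it in this toy knowledge base.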
The first step to do this is to identify the entities in clinical documents and establish known relationships
We used cTAKES to annotate clinical documents and then used acquired knowledge to establish relationships
All this is done using the IntellegO ontology with perceptual reasoning
Now our task is to find the relationship that is not present in the knowledge base
A naïve method would say that any of the co-occurring disorders can have the relationship
But can we use the knowledge that we have about the symptoms to make smart guesses?
Let’s walk through an exercise that essentially does this
Assume S is the unexplained symptom and D1 to D5 are co-occurring disorders
We know that S is a symptom of D6 and D7
We know more about D6 and D7, so let’s use this knowledge
Now it seems that there is an overlap between the co-occurring disorders and the disorders that are similar to D6 and D7
This tells us that D2 and D4 are the best candidates to have a relationship with S
So we prune the search space and consult a domain expert about the relationship of S with D2 and D4
Now, if you see this combination frequently, you have more confidence in your guesses. This is exactly what we did
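The exercise above can be sketched as a set intersection plus a frequency count. How "similar" disorders are derived is not specified in these notes, so the `similar_to` mapping is a hypothetical input, and the function names and toy data are assumptions.

```python
from collections import Counter


def candidate_disorders(symptom, cooccurring, symptom_to_disorders, similar_to):
    """Co-occurring disorders that resemble disorders known to cause the symptom."""
    related = set()
    for disorder in symptom_to_disorders.get(symptom, set()):
        related |= similar_to.get(disorder, set())
    return set(cooccurring) & related


def rank_suggestions(documents, symptom_to_disorders, similar_to):
    """Count how often each (symptom, disorder) suggestion recurs in a corpus.

    documents: list of (unexplained_symptom, co-occurring disorders) pairs.
    """
    counts = Counter()
    for symptom, cooccurring in documents:
        for disorder in candidate_disorders(
            symptom, cooccurring, symptom_to_disorders, similar_to
        ):
            counts[(symptom, disorder)] += 1
    return counts.most_common()


# Toy data following the walk-through: S is a known symptom of D6 and D7.
symptom_to_disorders = {"S": {"D6", "D7"}}
similar_to = {"D6": {"D2", "D8"}, "D7": {"D4", "D9"}}  # hypothetical mapping
pruned = candidate_disorders(
    "S", ["D1", "D2", "D3", "D4", "D5"], symptom_to_disorders, similar_to
)
suggestions = rank_suggestions(
    [("S", ["D1", "D2", "D3", "D4", "D5"]), ("S", ["D2", "D5"])],
    symptom_to_disorders,
    similar_to,
)
```

The pruned set contains only D2 and D4, and the pair (S, D2) ranks first because it recurs in both toy documents.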
We evaluated this work with 1500 electronic medical records
Our initial knowledge base had 86 disorders, 42 symptoms, and 255 relationships
In our evaluation dataset there were 29 distinct unexplained symptoms, and the table on the right shows the most frequent disorders that appear when edema is found unexplained
Our algorithm runs in an iterative manner
Run the first round, find relationships, add them to the knowledge base, and run the same reasoning again on the corpus
The precision of the generated questions is around 73% in the first iteration and decreases in later iterations
Number of relationships –
1st iteration – 105 (105/142)
2nd iteration – 29 (20/29)
3rd iteration – 4 (4/9)
Adding more relationships improves the explainability of the documents
This is because we are identifying more implicit relationships in documents
Explainability of the corpus increased by 64% after 2nd iteration
In summary, we showed that we cannot afford to ignore implicit information in text
Knowledge about the world plays a significant role in extracting implicit information. I have demonstrated this with three applications.
Here is a summary of what we did in each component of each application demonstrated in this talk
Unsupervised – no labelled data
Here is the explicit list of the contributions of the dissertation
I have helped a few people in their ezDI internships