1
Knowledge-driven Implicit Information Extraction
Sujan Perera
Dissertation Committee: Drs. Amit P. Sheth (advisor), Krishnaprasad Thirunarayan, Michael Raymer, Pablo N. Mendes (IBM Research)
Ph.D. Dissertation Defense
2
Information Extraction
• More than 70% of data in organizations exists in unstructured form [1]
• Extraction of structured information from unstructured data is a
fundamental task
“All home medications although his insulin
dose (nph 20 qPM) was halved (--> NPH
10 qPM) on the floor, and his sugars were
running in the 150s-250s range.”
[Figure: knowledge-base concepts related to the sentence: Insulin, Cisapride, Diabetes Mellitus, Hyperglycemia, Proinsulin, Porcine Insulin, Insulin Glulisine; three of the links are labeled "is a"]
[1] https://en.wikipedia.org/wiki/Unstructured_data
3
Information Extraction
• Almost exclusively focused on explicit information
“Bob Smith is a 61-year-old man referred by Dr. Davis for outpatient cardiac
catheterization because of a positive exercise tolerance test. Recently, he started
to have left shoulder twinges and tingling in his hands. A stress test done on
2013-06-02 revealed that the patient exercised for 6 1/2 minutes, stopped due to
fatigue. However, Mr. Smith is comfortably breathing in room air. He also showed
accumulation of fluid in his extremities. He does not have any chest pain.”
4
Information Extraction
• Almost exclusively focused on explicit information
Tasks: Named Entity Recognition, Relationship Extraction, Entity Linking
[Passage as on the previous slide, annotated with the labels Person, Person, C0018795, C0015672, and C0008031]
5
Information Extraction
• Misses the implicit information
[Passage as on the earlier slides, annotated with the labels Person, Person, C0018795, C0015672, and C0008031]
Implicit information added to the annotations: No shortness of breath, edema
Tasks: Named Entity Recognition, Relationship Extraction, Entity Linking, Implicit information extraction
6
Thesis Statement
Implicit factual information in unstructured text can be efficiently
extracted by bridging syntactic and semantic gaps in natural language
usage and augmenting information extraction techniques with
relevant domain knowledge.
7
• Express sarcasm/sentiment
  • “I'm striving to be positive in what I say on Twitter. So I'll refrain from making a comment about the latest Michael Bay movie.”
• Provide descriptive information
  • “small fluid adjacent to the gallbladder with gallstones which may represent inflammation”
• Emphasize features of the entity
  • “Mason Evans 12 year long shoot won big in golden globe”
• Communicate the common understanding
  • “He is suffering from nausea and severe headaches. Dolasteron was prescribed.”
• Stylistic preferences
  • “Democratic candidate Bernie Sanders … The Vermont senator …”
Credit: http://bit.ly/2b9Bnjk
8
9
Significance
• Volume
• 20% movie references and 40% book references in tweets
• 35% edema and 40% shortness of breath references in clinical
narratives
• Value
Ignoring implicit information in text would adversely affect
downstream applications
[Figure: Explicit Information and Implicit Information feed the Structured Information used by Computer Assisted Coding, 30-day Readmission Prediction, and Sentiment Analysis]
10
Role of Knowledge
Tweet: “New Sandra Bullock astronaut lost in space movie looks absolutely terrifying”
Knowledge: Sandra Bullock, Gravity
Clinical text: “The patient showed accumulation of fluid in his extremities, but respirations were unlabored and there were no use of accessory muscles.”
Knowledge:
• Edema: Accumulation of an excessive amount of watery fluid in cells or intercellular tissues
• Shortness of breath: Labored or difficult breathing associated with a variety of disorders
Knowledge Bases
Image credits: http://bit.ly/2b5HPDQ and Icon made by Freepik from www.flaticon.com
Credit: http://bit.ly/2bi34FG Credit: http://bit.ly/1x3sack Credit: http://bit.ly/2b9CejW Credit: http://bit.ly/2aXM97v
11
Implicit Information Extraction
Knowledge Acquisition → Knowledge Modeling → Detecting Implicit Information → Information Extraction
12
Dissertation Focus
Implicit Information Extraction
  Entities
    Organized Text: Clinical Narratives (Disorders, Symptoms)
    Unorganized Text: Tweets (Movies, Books)
  Relationships
    Clinical Narratives (Disorders and Symptoms)
13
14
Implicit Entities in Clinical Documents
Sentence: “small fluid adjacent to the gallbladder with gallstones which may represent inflammation.”
Entity: Cholecystitis
Sentence: “His tip of the appendix is inflamed.”
Entity: Appendicitis
Sentence: “The respirations were unlabored and there were no use of accessory muscles.”
Entity: Shortness of breath (NEG)
• One should know the physiological observations that characterize a particular entity
• Negations are embedded in the phrases indicating entities
  • “Patient denies shortness of breath”
  • “The respirations were unlabored”
15
Knowledge Acquisition
• Unified Medical Language System (UMLS) – integrates many health and biomedical vocabularies
• Linguistic Knowledge – WordNet
  • Synonyms/antonyms
  • Syntactic variations of the same term
[Figure: UMLS tables with the columns (CUI, AUI, STR), (CUI, TUI), and (CUI, STR, DEF, SAB)]
Definitions for shortness of breath:
• A disorder characterized by an uncomfortable sensation of difficulty breathing
• Difficult or labored breathing
• Labored or difficult breathing associated with a variety of disorders, indicating inadequate ventilation or low blood oxygen or a subjective experience of breathing discomfort
16
Knowledge Modeling
• Each entity has multiple definitions
• Each definition is processed to create an entity indicator
• The representative power of each term (r1 through r6 in the example) is calculated with a measure inspired by TF-IDF
• A collection of entity indicators constitutes the entity model
[Figure: definition1, definition2, definition3 → Entity Indicator1, Entity Indicator2, Entity Indicator3 → Entity Model]
Definition: A disorder characterized by an uncomfortable sensation of difficulty breathing
Entity Indicator: (uncomfortable, r1), (sensation, r2), (difficulty, r3), (breathing, r4)
Definition: Difficult or labored breathing
Entity Indicator: (difficult, r5), (labored, r6), (breathing, r4)
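The construction of entity indicators can be sketched roughly as follows. The slide only says the representative power is "inspired by TF-IDF", so the IDF-style weight, the stop-word list, and all function names here are assumptions made for illustration. Unlike the slide's example, this sketch also keeps terms such as "disorder" and "characterized".

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; the source does not give one).
STOP_WORDS = {"a", "an", "or", "of", "by", "the", "with", "to", "in"}

def tokenize(definition):
    """Lowercase a definition and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", definition.lower()) if t not in STOP_WORDS]

def build_entity_models(definitions_by_entity):
    """Return {entity: [indicator, ...]}; each indicator maps term -> weight.

    The weight is a hypothetical IDF-style score: terms that appear in the
    definitions of fewer entities get a higher representative power.
    """
    n_entities = len(definitions_by_entity)
    entity_freq = Counter()
    for definitions in definitions_by_entity.values():
        entity_freq.update({t for d in definitions for t in tokenize(d)})

    models = {}
    for entity, definitions in definitions_by_entity.items():
        models[entity] = [
            {t: math.log(1 + n_entities / entity_freq[t]) for t in set(tokenize(d))}
            for d in definitions
        ]
    return models

definitions = {
    "shortness of breath": [
        "A disorder characterized by an uncomfortable sensation of difficulty breathing",
        "Difficult or labored breathing",
    ],
    "edema": [
        "Accumulation of an excessive amount of watery fluid in cells or intercellular tissues",
    ],
}
models = build_entity_models(definitions)
```

Each definition yields one indicator, and the list of indicators per entity plays the role of the entity model.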
17
Detecting Sentences with Implicit Entities
• Sentences that contain an entity representative term but not the entity name may contain an implicit mention of the entity.
“However, Mr. Smith is comfortably breathing in room air.”
Candidate sentence for shortness of breath
• The similarity between the entity model and the pruned sentence is measured to annotate each candidate with a positive or negative label
• We developed a semantic similarity measure that accounts for synonyms and antonyms
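A minimal sketch of the candidate-sentence filter described on this slide, assuming exact token matching. The set of representative terms and the function name are hypothetical, and the syntactic variations that the slides handle with linguistic knowledge are not covered.

```python
import re

def is_candidate(sentence, entity_name, representative_terms):
    """True when the sentence has a representative term but not the entity name."""
    text = sentence.lower()
    if entity_name.lower() in text:
        return False  # explicit mention, so not an implicit candidate
    tokens = set(re.findall(r"[a-z]+", text))
    return any(term in tokens for term in representative_terms)

# Assumed representative terms taken from an entity model for shortness of breath.
representative_terms = {"breathing", "labored"}

s1 = "However, Mr. Smith is comfortably breathing in room air."
s2 = "Patient denies shortness of breath."
s3 = "He does not have any chest pain."
candidates = [
    s for s in (s1, s2, s3)
    if is_candidate(s, "shortness of breath", representative_terms)
]
```

Only the first sentence survives: the second names the entity explicitly and the third has no representative term.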
18
Information Extraction – Entity Linking
[Figure: the candidate sentence is compared with Indicator1, Indicator2, and Indicator3 of the entity model, giving the similarities sim1, sim2, and sim3]
19
Information Extraction – Entity Linking
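[Figure: candidate sentence terms ct1 to ct4 are compared with entity indicator terms et5 to et7 using WordNet]
For each entity indicator term: if antonym then -1, else max similarity
$$\frac{\sum_{et} sim(et) \cdot rp(et)}{\sum_{et} rp(et)}$$
Score > t1: Positive Annotation
Score < t2: Negative Annotation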
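The scoring on this slide can be read as a weighted average of per-term similarities, with the representative power as the weight. The sketch below follows that reading. The `exact_match` function stands in for a WordNet-based similarity, and the weights, antonym pairs, and thresholds are made up for illustration.

```python
def indicator_similarity(candidate_terms, indicator, term_sim, antonyms):
    """Weighted average of per-term similarities; weights are representative powers."""
    weighted, total = 0.0, 0.0
    for et, rp in indicator.items():
        if any((ct, et) in antonyms or (et, ct) in antonyms for ct in candidate_terms):
            sim = -1.0  # an antonym of the indicator term appears in the sentence
        else:
            sim = max((term_sim(ct, et) for ct in candidate_terms), default=0.0)
        weighted += sim * rp
        total += rp
    return weighted / total if total else 0.0

def annotate(score, t1, t2):
    """Map a score to a label using the two thresholds from the slide."""
    if score > t1:
        return "positive"
    if score < t2:
        return "negative"
    return None

def exact_match(a, b):
    """Stand-in for a WordNet-based term similarity (assumption)."""
    return 1.0 if a == b else 0.0

indicator = {"difficult": 1.0, "labored": 1.0, "breathing": 0.5}  # made-up weights
antonyms = {("unlabored", "labored")}                             # made-up pair

neg_score = indicator_similarity(["respirations", "unlabored"], indicator, exact_match, antonyms)
pos_score = indicator_similarity(["labored", "breathing"], indicator, exact_match, antonyms)
neg_label = annotate(neg_score, t1=0.3, t2=-0.2)
pos_label = annotate(pos_score, t1=0.3, t2=-0.2)
```

Treating antonyms as -1 is what lets a phrase such as "respirations were unlabored" pull the score below t2 and produce a negative annotation.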
20
Evaluation
• Re-annotated the SemEval-2014 Task 7 dataset for implicit entities
• Entities were selected considering the frequency of appearance and expert feedback
• 857 sentences selected for 8 entities
• Annotated by three domain experts
• Annotation agreement: 0.58

Entity                 Positive Annotations   Negative Annotations
Shortness of Breath    93                     94
Edema                  115                    35
Syncope                96                     92
Cholecystitis          78                     36
Gastrointestinal Gas   18                     14
Colitis                12                     11
Cellulitis             8                      2
Fasciitis              7                      3
21
Annotation Performance

Algorithm   Positive    Positive   Positive   Negative    Negative   Negative
            Precision   Recall     F1         Precision   Recall     F1
Our         0.66        0.87       0.75       0.73        0.73       0.73
MCS         0.50        0.93       0.65       0.31        0.76       0.44
SVM         0.73        0.82       0.77       0.66        0.67       0.67
Adding similarity value as a feature for the supervised algorithm:
SVM+MCS     0.73        0.82       0.77       0.66        0.66       0.66
SVM+Our     0.77        0.85       0.81       0.72        0.75       0.73

• Baselines
  • MCS algorithm (Mihalcea 2006)
  • SVM (trained on n-grams)
• Our algorithm outperforms the selected baselines in the negative category.
• SVM is able to leverage the supervision to beat our algorithm in the positive category.
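As a quick sanity check on the table, F1 is the harmonic mean of precision and recall. The short sketch below recomputes the positive F1 column for the first three rows from the reported (already rounded) precision and recall values.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (positive precision, positive recall) as reported on the slide.
rows = {"Our": (0.66, 0.87), "MCS": (0.50, 0.93), "SVM": (0.73, 0.82)}
positive_f1 = {name: round(f1(p, r), 2) for name, (p, r) in rows.items()}
```

Because the inputs are rounded to two decimals, recomputed values for other rows may differ from the slide in the last digit.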
22
Similarity as a Feature to Supervised Algorithm
• Added the similarity value of the unsupervised algorithms as a feature to the SVM.
[Figure: two panels, Positive Annotations and Negative Annotations]
23
Annotation Performance – A Study with the Confidence
• Each annotation has a confidence rating from 1 to 5
• Low confidence reflects incomplete or ambiguous information
• Annotation performance increases as the confidence increases
• The negative class shows a significant increase
24
Dissertation Focus
Implicit Information Extraction
Entities Relationships
Organized Text Unorganized Text
Clinical Narratives Tweets
Disorders Symptoms Movies Books
Clinical Narratives
Disorders and Symptoms
25
Tweets with Implicit Entities
• Use diverse characteristics of the entity
  – “New Sandra Bullock astronaut lost in space movie looks absolutely terrifying”
  – “ISRO sends probe to Mars for less money than it takes Hollywood to send a woman to space.”
  – “oh yeah there is that new space movie coming out that looks terrifying i am going to go see it”
• Use time-sensitive phrases
[Timeline figure with the marks Fall 2013, April 2014, and Fall 2015; movies shown: Furious 7, Gravity, The Martian; phrases shown: "space movie", "fastest movie to earn $1 billion", "Paul Walker's last movie"]
Credit: http://bit.ly/2bkePJ6
26
Tweets with Implicit Entities
• Use diverse characteristics of the entity
  – “… Richard Linklater movie …”
  – “… Ellar Coltrane on his 12-year movie …”
  – “… 12-year long movie shoot …”
  – “… Mason Evan's childhood movie …”
• Use time-sensitive phrases
[Timeline figure with the marks Fall 2013, April 2014, and Fall 2015; movies shown: Furious 7, Gravity, The Martian; phrases shown: "space movie", "fastest movie to earn $1 billion", "Paul Walker's last movie"]
Credit: http://bit.ly/2bk8xdp
27
Knowledge Acquisition
• Acquiring factual knowledge
  • Source – DBpedia
  • Not all factual knowledge is important – a movie has ‘starring’ and ‘director’ as well as ‘billed’ and ‘license’
  • Relationships are ranked by their joint probability with the entity type
  • Values of the top-k relationships and the value of rdfs:comment are obtained
• Acquiring contextual knowledge
  • Source – contemporary tweets
  • We collect 1000 tweets with explicit mentions of the entity
  • Number of views of the entity’s Wikipedia page within the last t days
28
Knowledge Acquisition
[Figure: factual knowledge and contemporary tweets (clean tweets, generate n-grams) are combined with Wikipedia page titles and anchor texts to generate semantic cues]
• Need to extract meaningful phrases from the acquired knowledge
• Meaningful phrases = Wikipedia titles + anchor texts
• Matching n-grams are added to semantic cues
• Non-matching n-grams are added to semantic cues after removing stop words
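The cue-generation rule on this slide could look roughly like the sketch below. The whitespace tokenizer, the stop-word list, the n-gram limit, and the tiny phrase set standing in for Wikipedia titles and anchor texts are all assumptions.

```python
STOP_WORDS = {"in", "the", "a", "of", "to"}  # illustrative list (assumption)

def ngrams(tokens, max_n):
    """Yield all n-grams of length 1..max_n as tuples."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def semantic_cues(text, meaningful_phrases, max_n=2):
    """Matching n-grams are kept as-is; others are kept after stop-word removal."""
    tokens = text.lower().split()
    cues = set()
    for gram in ngrams(tokens, max_n):
        phrase = " ".join(gram)
        if phrase in meaningful_phrases:
            cues.add(phrase)
        else:
            kept = [t for t in gram if t not in STOP_WORDS]
            if kept:
                cues.add(" ".join(kept))
    return cues

# Stand-in for Wikipedia page titles and anchor texts (hypothetical subset).
meaningful = {"astronaut", "space"}
cues = semantic_cues("astronaut lost in space", meaningful)
```

An n-gram made up only of stop words, such as "in", produces no cue at all.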
29
Knowledge Modeling – Entity Model Network
• A property graph reflecting the topical relationships between entities
[Figure: Entity Model Network with the nodes Sandra Bullock, Alfonso Cuarón, Mars orbiter mission, Woman in space, astronaut, Christopher Nolan, Matt Damon, Gravity, Interstellar, and The Martian; legend: Factual Knowledge, Contextual Knowledge, Entity]
$$\mathrm{specificity}(c_j) = \frac{|N|}{|N_{c_j}|}$$
where $|N|$ is the total number of entities and $|N_{c_j}|$ is the number of entities adjacent to $c_j$
frequency = total number of times the phrase appears in tweets
time salience = number of Wikipedia views
30
Detecting Tweets with Implicit Entities
• Tweets are filtered with keywords – movie, film, book, novel
• A simple annotation technique, dictionary matching, is applied
• Tweets that are not annotated with an entity of the types we are looking for are considered to have implicit entity mentions
[Figure: Keywords and Entity Dictionary → Annotating Tweets]
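A minimal sketch of this detection step, assuming a keyword filter followed by case-insensitive substring matching against an entity dictionary. The dictionary contents and function names are hypothetical, and substring matching is cruder than a real annotator (a title such as "Gravity" would also match the ordinary word).

```python
import re

KEYWORDS = {"movie", "film", "book", "novel"}  # keywords listed on the slide

def has_keyword(tweet):
    tokens = set(re.findall(r"[a-z]+", tweet.lower()))
    return bool(tokens & KEYWORDS)

def explicit_entities(tweet, dictionary):
    """Entity names from the dictionary that literally appear in the tweet."""
    text = tweet.lower()
    return [name for name in dictionary if name.lower() in text]

def implicit_candidates(tweets, dictionary):
    """Keyword-bearing tweets with no explicit entity mention."""
    return [t for t in tweets if has_keyword(t) and not explicit_entities(t, dictionary)]

dictionary = ["Gravity", "The Martian"]  # hypothetical entity dictionary
tweets = [
    "New Sandra Bullock astronaut lost in space movie looks absolutely terrifying",
    "Just watched Gravity, what a movie",
    "ISRO sends probe to Mars",
]
implicit = implicit_candidates(tweets, dictionary)
```

Only the first tweet is kept: the second names the movie explicitly and the third has no keyword.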
31
Information Extraction – Entity Linking
• Two-step process
  • Step 1: Candidate selection and filtering
    • Objective: prune the search space to reduce the number of entities from the Entity Model Network (EMN) considered in the disambiguation step
  • Step 2: Disambiguation
    • Objective: sort the selected candidate entities to place the implicitly mentioned entity in the top position
32
Entity Linking - Candidate selection and filtering
“ISRO sends probe to Mars for less money than it takes Hollywood movie to send a woman to space”
[Figure: network of entity nodes m1 to m8 and knowledge nodes c1 to c9; legend: Entity, Factual Knowledge, Contextual Knowledge]
33
Entity Linking - Candidate selection and filtering
[Figure: the same network and tweet; the knowledge nodes c2, c5, c7, and c8 are picked out]
34
Entity Linking - Candidate selection and filtering
[Figure: the same network and tweet; c2 and c5 are shown with m1, m2, m3, m4, and m5, and c7 and c8 with m6 and m7]
35
Entity Linking - Candidate selection and filtering
[Figure: the same network and tweet; after scoring, m2, m3, m4, m6, and m7 are shown]
$$\mathrm{score}(m_i) = \sum_{c_j \in \mathbb{C}} \mathrm{specificity}(c_j) \cdot \mathrm{frequency}(c_j, m_i)$$
ℂ is the set of matching cues
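The candidate scoring can be sketched as follows, assuming the Entity Model Network is stored as a mapping from each cue to the entities it touches and their co-occurrence frequencies. The network contents, the frequencies, and the function names are made up for illustration. Specificity is taken as the total number of entities divided by the number of entities adjacent to the cue, as defined on the Entity Model Network slide.

```python
from collections import defaultdict

def specificity(cue, network, n_entities):
    """|N| divided by the number of entities adjacent to the cue."""
    return n_entities / len(network[cue])

def select_candidates(tweet_cues, network, n_entities, k=25):
    """Score entities reachable from matching cues and keep the top k."""
    scores = defaultdict(float)
    for cue in tweet_cues:
        if cue not in network:
            continue  # cue does not match anything in the network
        spec = specificity(cue, network, n_entities)
        for entity, freq in network[cue].items():
            scores[entity] += spec * freq
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]

# Toy network: cue -> {entity: frequency}. All numbers are invented.
network = {
    "woman in space": {"Gravity": 12},
    "space": {"Gravity": 30, "Interstellar": 25, "The Martian": 20},
    "mars": {"The Martian": 40},
}
ranked = select_candidates({"woman in space", "space", "hollywood"}, network, n_entities=3, k=2)
```

A cue that touches a single entity ("woman in space") counts three times as much per occurrence as one shared by all three entities ("space"), which is the point of the specificity term.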
36
Entity Linking - Disambiguation
• Formulated as a ranking problem
• SVMrank to rank candidates
• Similarity between the candidate entity and the tweet: x1 x2 x3 … xn
$$x_j = \mathrm{specificity}(c_j) \cdot \mathrm{frequency}(c_j, e_i)$$
• Temporal salience of the candidate entity
$$\frac{\text{temporal salience}(e_i)}{\sum_{e \in E_c} \text{temporal salience}(e)}$$
$E_c$ is the selected candidate set
[Figure: candidates m2, m3, m4, m6, and m7]
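A rough sketch of how the ranking features for one candidate might be assembled: one similarity feature per cue, followed by the candidate's temporal salience normalized over the candidate set. The toy network, the salience counts, and the function name are assumptions, and the call to SVMrank itself is not shown.

```python
def feature_vector(entity, cues, network, n_entities, salience, candidates):
    """Features x_1..x_n (one per cue) plus normalized temporal salience."""
    xs = []
    for cue in cues:
        adjacent = network.get(cue, {})
        spec = n_entities / len(adjacent) if adjacent else 0.0
        xs.append(spec * adjacent.get(entity, 0))
    total = sum(salience[e] for e in candidates)
    xs.append(salience[entity] / total if total else 0.0)
    return xs

# Toy data: cue -> {entity: frequency}, and page-view counts. All invented.
network = {
    "woman in space": {"Gravity": 12},
    "space": {"Gravity": 30, "Interstellar": 25, "The Martian": 20},
}
salience = {"Gravity": 300, "Interstellar": 100}
candidates = ["Gravity", "Interstellar"]
cues = ["woman in space", "space"]

gravity_features = feature_vector("Gravity", cues, network, 3, salience, candidates)
interstellar_features = feature_vector("Interstellar", cues, network, 3, salience, candidates)
```

Vectors of this shape, one per candidate and grouped by tweet, are the kind of input a learning-to-rank tool expects.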
Evaluation Dataset

Entity Type   Annotation   Tweets   Entities
Movie         Explicit     391      107
Movie         Implicit     207      54
Movie         NULL         117      0
Book          Explicit     200      24
Book          Implicit     190      53
Book          NULL         70       0

• Tweets were collected in August 2014 using keywords
• The tweets were manually annotated with the DBpedia URLs of the entities
• The tweets annotated with NULL have neither an explicit nor an implicit mention of an entity
37
Entity Model Network Creation
• 15,000 tweets for movies and books in July 2014
• 617 movies and 102 books
• Recent 1000 tweets per entity to build its contextual knowledge
• May 2014 version of DBpedia used to extract factual knowledge
• Temporal salience is obtained for July 2014
[Figure: network of entity nodes m1 to m7 and knowledge nodes c1 to c9; legend: Factual Knowledge, Contextual Knowledge, Entity]
38
Evaluation - Implicit Entity Linking
• How many tweets had the correct entity within the selected candidate set (top-25)?
• How many entities were correctly linked by our disambiguation approach?
• Importance of contextual knowledge

Entity Type   Candidate Selection Recall   Disambiguation Accuracy
Movie         90.33%                       60.97%
Book          94.73%                       61.05%
39
Step                         Entity Type   Without Contextual Knowledge   With Contextual Knowledge
Candidate Selection Recall   Movie         77.29%                         90.33%
Candidate Selection Recall   Book          76.84%                         94.73%
Disambiguation Accuracy      Movie         51.7%                          60.97%
Disambiguation Accuracy      Book          50.0%                          61.05%
Qualitative Error Analysis

Error: Lack of contextual knowledge
Tweet: ‘That Movie Where Shailene Woodley Has Her First Nude Scene? The Trailer Is RIGHT HERE!: No one can say Shailene Woodley isn't brave!’
Entity: White Bird in a Blizzard

Error: Novel entities
Tweet: ‘"hey, what's wrawng widdis goose?" RT @TIME: Mark Wahlberg could be starring in a movie about the BP oil spill http://ti.me/1oZh55V’
Entity: Deepwater Horizon

Error: Cold start of entities
Tweet: ‘Video: George R.R. Martin's Children's Book Gets Re-release http://bit.ly/1qNNH5r’
Entity: The Ice Dragon

Error: Multiple implicit entity mentions
Tweet: ‘That moment when you realize that hazel grace and Augustus are brother and sister in one movie and in love battling cancer in another’
Entities: Divergent, The Fault in Our Stars
40
41
Dissertation Focus
Implicit Information Extraction
Entities Relationships
Organized Text Unorganized Text
Clinical Narratives Tweets
Disorders Symptoms Movies Books
Clinical Narratives
Disorders and Symptoms
42
Implicit Relationships in Clinical Narratives
[Figure: graph of diseases (atrial fibrillation, hypertension, diabetes), symptoms (chest pain, weight gain, headache), and medications (lisinopril, warfarin, insulin, atenolol), with edges labeled is_treated_with and has_symptom]
43
Implicit Relationships in Clinical Narratives
• Implicit relationships:
  • Exist between symptoms, disorders, medications, and procedures
  • Can be established by leveraging domain knowledge
• The existing knowledge bases fall short in eliciting relationships
• Data + Knowledge can help to elicit such implicit relationships efficiently
44
A Scenario
[Figure: the disorders Atrial fibrillation, Hypertension, and Diabetes with the symptoms Fatigue, Syncope, Weight loss, Chest pain, Discomfort in chest, Dizzy, Shortness of Breath, Nausea, Vomiting, Headache, Cough, and Weight gain]
45
A Scenario
[Figure: the same disorders and symptoms as on the previous slide]
Observed Disorders: Atrial fibrillation, Hypertension, Diabetes
Observed Symptoms: Chest pain, Weight gain, Discomfort in chest, Cough, Headache, Edema, Shortness of Breath
The knowledge base does not know about edema. Now edema can be a symptom of any disorder in the document.
46
Knowledge Acquisition
• Hierarchical knowledge and non-hierarchical knowledge
Hierarchical Knowledge: retrieved from UMLS
Non-hierarchical Knowledge: extracted from Web resources + feedback from domain expert
Web resources: www.nlm.nih.gov, www.en.wikipedia.org, www.webmd.com, www.mayoclinic.com, www.clevelandclinic.org, ww.healthline.org

MRHIER
CUI        AUI        PAUI       PTR
C0013404   A0052186   A0111363   A0434168.A2367943. …
C0013604   A0052723   A0135504   A0434168.A2367943

MRCONSO
CUI        AUI        SAB   STR
C0013404   A0052186   MSH   Shortness of breath
C0013604   A0052723   MSH   Edema
47
Knowledge Modeling
[Figure: classes of disorders (Hypertension; Diastolic, Pulmonary, and Renal Hypertension; Episodic and Solitary Pulmonary Hypertension) and classes of symptoms (Breathing Problems, Shortness of Breath, Asthma) linked by rdfs:subclassOf; instances of disorders and instances of symptoms attached by rdf:type; the instance Shortness of Breath linked to the instance Hypertension by is_symptom_of]
48
Detecting Unexplained Symptoms
• Clinical documents were semantically annotated for entities using cTAKES
• Known relationships are populated
• Unexplained symptoms were detected
[Figure label: Modeled Knowledge]
Credit: http://bit.ly/2aMWVAd
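Detecting unexplained symptoms can be sketched as a set difference between the symptoms observed in a document and the symptoms that the knowledge base attributes to the observed disorders. The example reuses the names from the earlier scenario slides, but the assignment of symptoms to disorders below is arbitrary and only illustrative (it is not medical knowledge from the source), and the function name is hypothetical.

```python
def unexplained_symptoms(disorders, symptoms, has_symptom):
    """Observed symptoms not covered by any observed disorder in the knowledge base."""
    explained = set()
    for disorder in disorders:
        explained |= has_symptom.get(disorder, set())
    return sorted(s for s in symptoms if s not in explained)

# Arbitrary illustrative knowledge base: disorder -> known symptoms.
has_symptom = {
    "atrial fibrillation": {"fatigue", "syncope", "chest pain", "discomfort in chest",
                            "dizzy", "shortness of breath"},
    "hypertension": {"headache", "nausea", "vomiting", "cough"},
    "diabetes": {"weight loss", "weight gain"},
}
observed_disorders = ["atrial fibrillation", "hypertension", "diabetes"]
observed_symptoms = ["chest pain", "weight gain", "discomfort in chest", "cough",
                     "headache", "edema", "shortness of breath"]

unexplained = unexplained_symptoms(observed_disorders, observed_symptoms, has_symptom)
```

Edema is the only observed symptom that no observed disorder accounts for, so it becomes the starting point for relationship discovery.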
49
Information Extraction – Unknown Relationships
• A naïve method would assume a relationship between the unexplained symptom and all disorders in the clinical narrative
• Can we leverage the knowledge we have about the symptom to find the most plausible disorders?
• Intuition: a symptom is most likely to be shared by similar disorders
1. All co-occurring disorders are candidates
50
Information Extraction – Unknown Relationships
[Figure: symptom S with the disorders D1 to D5]
2. Find known disorders of the symptom
51
Information Extraction – Unknown Relationships
[Figure: symptom S with the disorders D1 to D5, and the disorders D6 and D7]
3. Collect more knowledge about known relationships
52
Information Extraction – Unknown Relationships
[Figure: as on the previous slide, plus a further group of disorders: D7, D8, D2, D10, D11, D12, D4, D14]
4. Compare co-occurring disorders with collected knowledge
53
Information Extraction – Unknown Relationships
[Figure: as on the previous slide]
5. Eliminate non-matching candidate disorders
54
Information Extraction – Unknown Relationships
[Figure: symptom S with the remaining disorders D2 and D4]
We are left with the most plausible disorders for the unexplained symptom. If this scenario occurs frequently, it increases the confidence in this relationship.
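The five steps can be sketched as a set intersection. How "more knowledge about known relationships" is collected is not spelled out on the slides, so the `related_disorders` mapping (for example, neighbors in a disorder hierarchy) is an assumption, as are the function name and the way the D-nodes are split between D6 and D7.

```python
def plausible_disorders(symptom, co_occurring, known_disorders_of, related_disorders):
    """Keep co-occurring disorders that also appear in the collected knowledge."""
    candidates = set(co_occurring)                    # step 1: all co-occurring disorders
    known = known_disorders_of.get(symptom, set())    # step 2: known disorders of the symptom
    collected = set()                                 # step 3: disorders related to the known ones
    for disorder in known:
        collected |= related_disorders.get(disorder, set())
    return sorted(candidates & collected)             # steps 4 and 5: compare and eliminate

co_occurring = ["D1", "D2", "D3", "D4", "D5"]
known_disorders_of = {"S": {"D6", "D7"}}
# Hypothetical grouping of the collected disorders shown in the figures.
related_disorders = {
    "D6": {"D7", "D8", "D2", "D10"},
    "D7": {"D11", "D12", "D4", "D14"},
}
plausible = plausible_disorders("S", co_occurring, known_disorders_of, related_disorders)
```

With these inputs the result is D2 and D4, the two disorders left on the final slide of the walkthrough.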
55
Evaluation
• A corpus of 1,500 electronic medical records was used
• The records were annotated with cTAKES and the most frequent entities were selected
• UMLS semantic types were used to categorize disorders and symptoms
• Initial knowledge base: 86 disorders, 42 symptoms, 255 disorder-symptom relationships
• There were 29 distinct unexplained symptoms
• Precision of the questions generated
  • 1st iteration: 105 correct from 142 (73.94%)
  • 2nd iteration: 20 correct from 29 (68.96%)
  • 3rd iteration: 4 correct from 9 (44.44%)
56
Evaluation – Relationship Prediction
Top 5 unexplained symptoms
Symptom           Number of unexplained instances
Edema             910
Syncope           336
Systolic Murmur   168
Tachycardia       143
Angina            136

Top 5 co-occurring disorders with edema
Disorder                   Number of co-occurrences
Hypertension               647
Hyperlipidemia             641
Claudication               454
Coronary atherosclerosis   395
Coronary artery disease    242
57
Evaluation – Increment in Explainability
Knowledge base           Number of unexplained relationships   Increment in explainability
Initial knowledge base   2251                                  0%
After 1st iteration      878                                   60.99%
After 2nd iteration      806                                   64.19%
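The increment in explainability appears to be the relative drop in unexplained relationships compared with the initial knowledge base. That reading is an inference from the numbers in the table rather than a definition given in the slides; the sketch below reproduces the reported percentages under it.

```python
def explainability_gain(initial_unexplained, remaining_unexplained):
    """Percentage of initially unexplained relationships that are now explained."""
    return 100 * (initial_unexplained - remaining_unexplained) / initial_unexplained

gain_after_first = explainability_gain(2251, 878)
gain_after_second = explainability_gain(2251, 806)
```

The two results are about 60.995% and 64.194%, in line with the 60.99% and 64.19% reported in the table.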
58
Summary
• Implicit information occurs frequently in text, and ignoring it would adversely affect downstream applications.
• Linguistic and world knowledge play an important role in decoding implicit information.
• This dissertation demonstrated the characteristics of implicit information and developed solutions to capture factual implicit constructs.
Knowledge Acquisition → Knowledge Modeling → Detecting Implicit Information → Information Extraction
59
Contributions
• Identified and demonstrated the value of implicit information.
• Studied the characteristics of how implicit information manifests.
• Demonstrated the value of knowledge (linguistic, domain, contextual) in extracting factual implicit information.
• Developed a framework for factual implicit information extraction.
• Demonstrated the use of the framework to solve three implicit information extraction problems.
60
Graduate Life@Kno.e.sis
Journal Publications:
• Sujan Perera, Cory Henson, Krishnaprasad Thirunarayan, Amit Sheth, Suhas
Nair, Semantics Driven Approach for Knowledge Acquisition from EMRs, IEEE Journal of
Biomedical and Health Informatics.
• Raminta Daniulaityte, Robert Carlson, Russel Falck, Delroy Cameron, Sujan Perera, Lu Chen and Amit Sheth. 'I just wanted to tell you that loperamide WILL WORK': A Web-Based Study of Extra-Medical Use of Loperamide.
Conference Publications:
• Sujan Perera, Pablo Mendes, Adarsh Alex, Amit Sheth, Krishnaprasad
Thirunarayan, Implicit Entity Linking in Tweets, ESWC 2016
• Sujan Perera, Pablo Mendes, Amit Sheth, Krishnaprasad Thirunarayan, Adarsh Alex,
Christopher Heid, Greg Mott, Implicit Entity Recognition in Clinical Documents, *SEM
2015
• Sujan Perera, Cory Henson, Krishnaprasad Thirunarayan, Amit Sheth, Suhas Nair, Data
Driven Knowledge Acquisition Method for Domain Knowledge Enrichment in the
Healthcare, BIBM 2012
• Menasha Thilakaratne, Ruvan Weerasinghe, Sujan Perera, Knowledge-driven Approach
to Predict Personality Traits by Leveraging Social Media Data, WI 2016
Workshop and Posters:
• Sujan Perera, Amit Sheth, Krishnaprasad Thirunarayan, Challenges in Understanding
Clinical Notes: Why NLP Engines Fall Short and Where Background Knowledge Can
Help, DARE 2013
• Raminta Daniulaityte, Robert Carlson, Russel Falck, Delroy Cameron, Sujan Perera, Lu
Chen, Amit Sheth. A Web-Based Study of Self-Treatment of Opioid Withdrawal
Symptoms with Loperamide, CPDD 2012
Internships:
• ezDI Summer 2012
• IBM Watson Summer 2014 and 2015
Awards and grants:
• George Thomas Graduate Fellowship
• NSF travel grants: BIBM and ICHI
PC Committee:
• DARE (2013), EKAW (2014, 2016), ISWC
2015, IJCAI 2016
External Reviewer:
• ISWC, ESWC, IJSWIS, IEEE Intelligent
Systems, Applied Ontology, ODBASE
Proposal Contributions:
• eDrugTrends (NIH R01)
• Healthcare Outcome Prediction (NSF-SCH)
Mentoring:
• Adarsh Alex (MSc)
• Menasha Tilakaratne (BSc)
61
Thank You
Mentors Collaborators
62
Coffee Mates and Colleagues
Thank You
Funding
• ezDI
• George Thomas Fellowship
• NSF: CNS 1513721 Context-Aware Harassment Detection on Social Media

 
Hemant Purohit PhD Defense: Mining Citizen Sensor Communities for Cooperation...
Hemant Purohit PhD Defense: Mining Citizen Sensor Communities for Cooperation...Hemant Purohit PhD Defense: Mining Citizen Sensor Communities for Cooperation...
Hemant Purohit PhD Defense: Mining Citizen Sensor Communities for Cooperation...Artificial Intelligence Institute at UofSC
 
Delroy Cameron's Dissertation Defense: A Contenxt-Driven Subgraph Model for L...
Delroy Cameron's Dissertation Defense: A Contenxt-Driven Subgraph Model for L...Delroy Cameron's Dissertation Defense: A Contenxt-Driven Subgraph Model for L...
Delroy Cameron's Dissertation Defense: A Contenxt-Driven Subgraph Model for L...Artificial Intelligence Institute at UofSC
 
Kno.e.sis Approach to Impactful Research & Training for Exceptional Careers
Kno.e.sis Approach to Impactful Research & Training for Exceptional CareersKno.e.sis Approach to Impactful Research & Training for Exceptional Careers
Kno.e.sis Approach to Impactful Research & Training for Exceptional CareersAmit Sheth
 
Data Processing and Semantics for Advanced Internet of Things (IoT) Applicati...
Data Processing and Semantics for Advanced Internet of Things (IoT) Applicati...Data Processing and Semantics for Advanced Internet of Things (IoT) Applicati...
Data Processing and Semantics for Advanced Internet of Things (IoT) Applicati...Artificial Intelligence Institute at UofSC
 
Smart Data - How you and I will exploit Big Data for personalized digital hea...
Smart Data - How you and I will exploit Big Data for personalized digital hea...Smart Data - How you and I will exploit Big Data for personalized digital hea...
Smart Data - How you and I will exploit Big Data for personalized digital hea...Amit Sheth
 

Andere mochten auch (20)

Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
 
Automatic Emotion Identification from Text
Automatic Emotion Identification from TextAutomatic Emotion Identification from Text
Automatic Emotion Identification from Text
 
Personalized and Adaptive Semantic Information Filtering for Social Media - P...
Personalized and Adaptive Semantic Information Filtering for Social Media - P...Personalized and Adaptive Semantic Information Filtering for Social Media - P...
Personalized and Adaptive Semantic Information Filtering for Social Media - P...
 
Mining and Analyzing Subjective Experiences in User-generated Content
Mining and Analyzing Subjective Experiences in User-generated ContentMining and Analyzing Subjective Experiences in User-generated Content
Mining and Analyzing Subjective Experiences in User-generated Content
 
Ashutosh Jadhav PhD Defense: Knowledge Driven Search Intent Mining
Ashutosh Jadhav PhD Defense: Knowledge Driven Search Intent MiningAshutosh Jadhav PhD Defense: Knowledge Driven Search Intent Mining
Ashutosh Jadhav PhD Defense: Knowledge Driven Search Intent Mining
 
A Semantics-based Approach to Machine Perception
A Semantics-based Approach to Machine PerceptionA Semantics-based Approach to Machine Perception
A Semantics-based Approach to Machine Perception
 
User-Generated Content on Social Media
User-Generated Content on Social MediaUser-Generated Content on Social Media
User-Generated Content on Social Media
 
PhD thesis defense of Christopher Thomas
PhD thesis defense of Christopher ThomasPhD thesis defense of Christopher Thomas
PhD thesis defense of Christopher Thomas
 
Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...
Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...
Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...
 
Hemant Purohit PhD Defense: Mining Citizen Sensor Communities for Cooperation...
Hemant Purohit PhD Defense: Mining Citizen Sensor Communities for Cooperation...Hemant Purohit PhD Defense: Mining Citizen Sensor Communities for Cooperation...
Hemant Purohit PhD Defense: Mining Citizen Sensor Communities for Cooperation...
 
PhD thesis defense of Ajith Ranabahu
PhD thesis defense of Ajith RanabahuPhD thesis defense of Ajith Ranabahu
PhD thesis defense of Ajith Ranabahu
 
Delroy Cameron's Dissertation Defense: A Contenxt-Driven Subgraph Model for L...
Delroy Cameron's Dissertation Defense: A Contenxt-Driven Subgraph Model for L...Delroy Cameron's Dissertation Defense: A Contenxt-Driven Subgraph Model for L...
Delroy Cameron's Dissertation Defense: A Contenxt-Driven Subgraph Model for L...
 
Prateek Jain's Dissertation Defense - Linked Open Data Alignment and Querying
Prateek Jain's Dissertation Defense - Linked Open Data Alignment and QueryingPrateek Jain's Dissertation Defense - Linked Open Data Alignment and Querying
Prateek Jain's Dissertation Defense - Linked Open Data Alignment and Querying
 
Trust Management: A Tutorial
Trust Management: A TutorialTrust Management: A Tutorial
Trust Management: A Tutorial
 
Web and Complex Systems Lab @ Kno.e.sis
Web and Complex Systems Lab @ Kno.e.sisWeb and Complex Systems Lab @ Kno.e.sis
Web and Complex Systems Lab @ Kno.e.sis
 
2015 Kno.e.sis Center Annual Review
2015 Kno.e.sis Center Annual Review2015 Kno.e.sis Center Annual Review
2015 Kno.e.sis Center Annual Review
 
Kno.e.sis Approach to Impactful Research & Training for Exceptional Careers
Kno.e.sis Approach to Impactful Research & Training for Exceptional CareersKno.e.sis Approach to Impactful Research & Training for Exceptional Careers
Kno.e.sis Approach to Impactful Research & Training for Exceptional Careers
 
Data Processing and Semantics for Advanced Internet of Things (IoT) Applicati...
Data Processing and Semantics for Advanced Internet of Things (IoT) Applicati...Data Processing and Semantics for Advanced Internet of Things (IoT) Applicati...
Data Processing and Semantics for Advanced Internet of Things (IoT) Applicati...
 
Smart Data - How you and I will exploit Big Data for personalized digital hea...
Smart Data - How you and I will exploit Big Data for personalized digital hea...Smart Data - How you and I will exploit Big Data for personalized digital hea...
Smart Data - How you and I will exploit Big Data for personalized digital hea...
 
Kno.e.sis Review: late 2012 to mid 2013
Kno.e.sis Review: late 2012 to mid 2013Kno.e.sis Review: late 2012 to mid 2013
Kno.e.sis Review: late 2012 to mid 2013
 

Ähnlich wie Knowledge-driven Implicit Information Extraction

Implicit Entity Recognition in Clinical Documents
Implicit Entity Recognition in Clinical DocumentsImplicit Entity Recognition in Clinical Documents
Implicit Entity Recognition in Clinical DocumentsSujan Perera
 
Smart Data in Health – How we will exploit personal, clinical, and social “Bi...
Smart Data in Health – How we will exploit personal, clinical, and social “Bi...Smart Data in Health – How we will exploit personal, clinical, and social “Bi...
Smart Data in Health – How we will exploit personal, clinical, and social “Bi...Amit Sheth
 
Predicting Adverse Drug Reactions Using PubChem Screening Data
Predicting Adverse Drug Reactions Using PubChem Screening DataPredicting Adverse Drug Reactions Using PubChem Screening Data
Predicting Adverse Drug Reactions Using PubChem Screening DataYannick Pouliot
 
Using Data Analytics to Discover the 100 Trillion Bacteria Living Within Each...
Using Data Analytics to Discover the 100 Trillion Bacteria Living Within Each...Using Data Analytics to Discover the 100 Trillion Bacteria Living Within Each...
Using Data Analytics to Discover the 100 Trillion Bacteria Living Within Each...Larry Smarr
 
The Research Process
The Research ProcessThe Research Process
The Research ProcessK. Challinor
 
Big Data in Pharma - Overview and Use Cases
Big Data in Pharma - Overview and Use CasesBig Data in Pharma - Overview and Use Cases
Big Data in Pharma - Overview and Use CasesJosef Scheiber
 
Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Uni...
Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Uni...Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Uni...
Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Uni...Max L. Wilson
 
Clinician Decision Support By Clinicians, For Clinicians
Clinician Decision Support By Clinicians, For CliniciansClinician Decision Support By Clinicians, For Clinicians
Clinician Decision Support By Clinicians, For CliniciansDATA360US
 
Family Med Orientation July 2009
Family Med Orientation July 2009Family Med Orientation July 2009
Family Med Orientation July 2009Robin Featherstone
 
"Hacking the Software for Life" - Brad Perkins (Chief Medical Officer, Human ...
"Hacking the Software for Life" - Brad Perkins (Chief Medical Officer, Human ..."Hacking the Software for Life" - Brad Perkins (Chief Medical Officer, Human ...
"Hacking the Software for Life" - Brad Perkins (Chief Medical Officer, Human ...Hyper Wellbeing
 
GEMC- Advanced Cardiac Life Support- for Residents
GEMC- Advanced Cardiac Life Support- for ResidentsGEMC- Advanced Cardiac Life Support- for Residents
GEMC- Advanced Cardiac Life Support- for ResidentsOpen.Michigan
 
Lecture3-AI-Applications-In-Medicine-January23-2023.pptx
Lecture3-AI-Applications-In-Medicine-January23-2023.pptxLecture3-AI-Applications-In-Medicine-January23-2023.pptx
Lecture3-AI-Applications-In-Medicine-January23-2023.pptxUmayKulsoom2
 
Evidence Based Medicine
Evidence Based MedicineEvidence Based Medicine
Evidence Based MedicineCSN Vittal
 

Ähnlich wie Knowledge-driven Implicit Information Extraction (20)

Implicit Entity Recognition in Clinical Documents
Implicit Entity Recognition in Clinical DocumentsImplicit Entity Recognition in Clinical Documents
Implicit Entity Recognition in Clinical Documents
 
Implicit Entity Recognition in Clinical Documents
Implicit Entity Recognition in Clinical DocumentsImplicit Entity Recognition in Clinical Documents
Implicit Entity Recognition in Clinical Documents
 
Smart Data in Health – How we will exploit personal, clinical, and social “Bi...
Smart Data in Health – How we will exploit personal, clinical, and social “Bi...Smart Data in Health – How we will exploit personal, clinical, and social “Bi...
Smart Data in Health – How we will exploit personal, clinical, and social “Bi...
 
Predicting Adverse Drug Reactions Using PubChem Screening Data
Predicting Adverse Drug Reactions Using PubChem Screening DataPredicting Adverse Drug Reactions Using PubChem Screening Data
Predicting Adverse Drug Reactions Using PubChem Screening Data
 
Patterns: the language of the subluxation
Patterns: the language of the subluxationPatterns: the language of the subluxation
Patterns: the language of the subluxation
 
Final_Presentation.pptx
Final_Presentation.pptxFinal_Presentation.pptx
Final_Presentation.pptx
 
Using Data Analytics to Discover the 100 Trillion Bacteria Living Within Each...
Using Data Analytics to Discover the 100 Trillion Bacteria Living Within Each...Using Data Analytics to Discover the 100 Trillion Bacteria Living Within Each...
Using Data Analytics to Discover the 100 Trillion Bacteria Living Within Each...
 
The Research Process
The Research ProcessThe Research Process
The Research Process
 
Big Data in Pharma - Overview and Use Cases
Big Data in Pharma - Overview and Use CasesBig Data in Pharma - Overview and Use Cases
Big Data in Pharma - Overview and Use Cases
 
Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Uni...
Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Uni...Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Uni...
Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Uni...
 
Clinician Decision Support By Clinicians, For Clinicians
Clinician Decision Support By Clinicians, For CliniciansClinician Decision Support By Clinicians, For Clinicians
Clinician Decision Support By Clinicians, For Clinicians
 
Family Med Orientation July 2009
Family Med Orientation July 2009Family Med Orientation July 2009
Family Med Orientation July 2009
 
Gpt buchman
Gpt buchmanGpt buchman
Gpt buchman
 
Gpt buchman
Gpt buchmanGpt buchman
Gpt buchman
 
"Hacking the Software for Life" - Brad Perkins (Chief Medical Officer, Human ...
"Hacking the Software for Life" - Brad Perkins (Chief Medical Officer, Human ..."Hacking the Software for Life" - Brad Perkins (Chief Medical Officer, Human ...
"Hacking the Software for Life" - Brad Perkins (Chief Medical Officer, Human ...
 
GEMC- Advanced Cardiac Life Support- for Residents
GEMC- Advanced Cardiac Life Support- for ResidentsGEMC- Advanced Cardiac Life Support- for Residents
GEMC- Advanced Cardiac Life Support- for Residents
 
Lecture3-AI-Applications-In-Medicine-January23-2023.pptx
Lecture3-AI-Applications-In-Medicine-January23-2023.pptxLecture3-AI-Applications-In-Medicine-January23-2023.pptx
Lecture3-AI-Applications-In-Medicine-January23-2023.pptx
 
sleep-residents(1).ppt
sleep-residents(1).pptsleep-residents(1).ppt
sleep-residents(1).ppt
 
Evidence-Based Medicine
Evidence-Based Medicine Evidence-Based Medicine
Evidence-Based Medicine
 
Evidence Based Medicine
Evidence Based MedicineEvidence Based Medicine
Evidence Based Medicine
 

Kürzlich hochgeladen

convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 

Kürzlich hochgeladen (20)

convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 

Knowledge-driven Implicit Information Extraction

  • 1. 1 Knowledge-driven Implicit Information Extraction Sujan Perera Dissertation Committee : Drs. Amit P. Sheth (advisor), Krishnaprasad Thirunarayan, Michael Raymer, Pablo N. Mendes (IBM Research) Ph.D. Dissertation Defense
  • 2. 2 Information Extraction • More than 70% of data in organizations exist in unstructured form [1] • Extraction of structured information from unstructured data is a fundamental task “All home medications although his insulin dose (nph 20 qPM) was halved (--> NPH 10 qPM) on the floor, and his sugars were running in the 150s-250s range.” [Figure: concept graph with nodes Insulin, Cisapride, Diabetes Mellitus, Hyperglycemia, Proinsulin, Porcine Insulin and Insulin Glulisine, with “is a” edges] [1] https://en.wikipedia.org/wiki/Unstructured_data
  • 3. 3 Information Extraction • Almost exclusively focused on explicit information “Bob Smith is a 61-year-old man referred by Dr. Davis for outpatient cardiac catheterization because of a positive exercise tolerance test. Recently, he started to have left shoulder twinges and tingling in his hands. A stress test done on 2013-06-02 revealed that the patient exercised for 6 1/2 minutes, stopped due to fatigue. However, Mr. Smith is comfortably breathing in room air. He also showed accumulation of fluid in his extremities. He does not have any chest pain.”
  • 4. 4 Information Extraction • Almost exclusively focused on explicit information Named Entity Recognition Relationship Extraction Entity Linking “Bob Smith is a 61-year-old man referred by Dr. Davis for outpatient cardiac catheterization because of a positive exercise tolerance test. Recently, he started to have left shoulder twinges and tingling in his hands. A stress test done on 2013-06-02 revealed that the patient exercised for 6 1/2 minutes, stopped due to fatigue. However, Mr. Smith is comfortably breathing in room air. He also showed accumulation of fluid in his extremities. He does not have any chest pain.” Person Person C0018795 C0015672 C0008031
  • 5. 5 Information Extraction • Misses the implicit information “Bob Smith is a 61-year-old man referred by Dr. Davis for outpatient cardiac catheterization because of a positive exercise tolerance test. Recently, he started to have left shoulder twinges and tingling in his hands. A stress test done on 2013-06-02 revealed that the patient exercised for 6 1/2 minutes, stopped due to fatigue. However, Mr. Smith is comfortably breathing in room air. He also showed accumulation of fluid in his extremities. He does not have any chest pain.” Person Person C0018795 C0015672 C0008031 No shortness of breath edema Named Entity Recognition Relationship Extraction Entity Linking Implicit information extraction
  • 6. 6 Thesis Statement Implicit factual information in unstructured text can be efficiently extracted by bridging syntactic and semantic gaps in natural language usage and augmenting information extraction techniques with relevant domain knowledge.
  • 7. 7 • Express sarcasm/sentiment • “I'm striving to be positive in what I say on Twitter. So I'll refrain from making a comment about the latest Michael Bay movie.” • Provide descriptive information • “small fluid adjacent to the gallbladder with gallstones which may represent inflammation” • Emphasize features of the entity • “Mason Evans 12 year long shoot won big in golden globe” • Communicate the common understanding • “He is suffering from nausea and severe headaches. Dolasetron was prescribed.” • Stylistic Preferences • “Democratic candidate Bernie Sanders … The Vermont senator …” Credit: http://bit.ly/2b9Bnjk
  • 8. 8 Significance • Volume • 20% of movie references and 40% of book references in tweets are implicit • 35% of edema references and 40% of shortness of breath references in clinical narratives are implicit • Value Explicit Information Computer Assisted Coding 30-day Readmission Prediction Sentiment Analysis Structured Information
  • 9. 9 Significance • Volume • 20% of movie references and 40% of book references in tweets are implicit • 35% of edema references and 40% of shortness of breath references in clinical narratives are implicit • Value Ignoring implicit information in text would adversely affect downstream applications Explicit Information Implicit Information Computer Assisted Coding 30-day Readmission Prediction Sentiment Analysis Structured Information
  • 10. 10 Role of Knowledge • Tweet: “New Sandra Bullock astronaut lost in space movie looks absolutely terrifying” (knowledge base entities: Sandra Bullock, Gravity) • Clinical narrative: “The patient showed accumulation of fluid in his extremities, but respirations were unlabored and there were no use of accessory muscles.” • Edema: Accumulation of an excessive amount of watery fluid in cells or intercellular tissues • Shortness of breath: Labored or difficult breathing associated with a variety of disorders • Knowledge Bases • Image credits: http://bit.ly/2b5HPDQ and Icon made by Freepik from www.flaticon.com • Credit: http://bit.ly/2bi34FG • Credit: http://bit.ly/1x3sack • Credit: http://bit.ly/2b9CejW • Credit: http://bit.ly/2aXM97v
  • 12. 12 Dissertation Focus Implicit Information Extraction Entities Relationships Organized Text Unorganized Text Clinical Narratives Tweets Disorders Symptoms Movies Books Clinical Narratives Disorders and Symptoms
  • 13. 13 Dissertation Focus Implicit Information Extraction Entities Relationships Organized Text Unorganized Text Clinical Narratives Tweets Disorders Symptoms Movies Books Clinical Narratives Disorders and Symptoms
  • 14. 14 Implicit Entities in Clinical Documents • Sentence: “small fluid adjacent to the gallbladder with gallstones which may represent inflammation.” Entity: Cholecystitis • Sentence: “His tip of the appendix is inflamed.” Entity: Appendicitis • Sentence: “The respirations were unlabored and there were no use of accessory muscles.” Entity: Shortness of breath (NEG) • One should know the physiological observations that characterize a particular entity • Negations are embedded in the phrases indicating entities • “Patient denies shortness of breath” • “The respirations were unlabored”
  • 15. 15 Knowledge Acquisition • Unified Medical Language System (UMLS) – integrates many health and biomedical vocabularies • Linguistic knowledge – WordNet • Synonyms/antonyms • Syntactic variations of the same term • UMLS table columns shown: (CUI, AUI, STR); (CUI, TUI); (CUI, STR, DEF, SAB) • Definitions for shortness of breath: • “A disorder characterized by an uncomfortable sensation of difficulty breathing” • “Difficult or labored breathing” • “Labored or difficult breathing associated with a variety of disorders, indicating inadequate ventilation or low blood oxygen or a subjective experience of breathing discomfort”
  • 16. 16 Knowledge Modeling • Each entity has multiple definitions • Each definition is processed to create an entity indicator • The representative power of each term (r_i) is calculated with a measure inspired by TF-IDF • A collection of entity indicators constitutes the entity model • [Diagram: definition1, definition2, definition3 → Entity Indicator1, Entity Indicator2, Entity Indicator3 → Entity Model] • Definition: “A disorder characterized by an uncomfortable sensation of difficulty breathing” Entity indicator: (uncomfortable, r1), (sensation, r2), (difficulty, r3), (breathing, r4) • Definition: “Difficult or labored breathing” Entity indicator: (difficult, r5), (labored, r6), (breathing, r4)
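The modeling step on this slide can be sketched roughly as follows. This is a minimal illustration, not the dissertation's implementation: the slide only says the representative power is "inspired by TF-IDF", so the IDF-style weight, the stop-word list, and all function names below are assumptions chosen to reproduce the example indicators.

```python
import math
import re

# Hypothetical stop-word list; the slides do not say how definitions are pruned.
STOP_WORDS = {"a", "an", "the", "of", "or", "by", "in", "and", "with",
              "disorder", "characterized"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def representative_power(term, corpus):
    """IDF-style weight (assumed): terms found in fewer definitions score higher."""
    containing = sum(1 for definition in corpus if term in tokenize(definition))
    return math.log((1 + len(corpus)) / (1 + containing)) + 1.0

def build_entity_model(definitions, corpus):
    """Return one entity indicator ({term: weight}) per definition."""
    model = []
    for definition in definitions:
        terms = dict.fromkeys(tokenize(definition))  # de-duplicate, keep order
        model.append({t: representative_power(t, corpus) for t in terms})
    return model

sob_definitions = [
    "A disorder characterized by an uncomfortable sensation of difficulty breathing",
    "Difficult or labored breathing",
]
edema_definition = ("Accumulation of an excessive amount of watery fluid "
                    "in cells or intercellular tissues")
corpus = sob_definitions + [edema_definition]
entity_model = build_entity_model(sob_definitions, corpus)
```

As on the slide, the term "breathing" gets the same weight in both indicators because its weight depends only on the term and the definition collection.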
  • 17. 17 Detecting Sentences with Implicit Entities • Sentences that contain an entity representative term but not the entity name may have an implicit mention of the entity. • “However, Mr. Smith is comfortably breathing in room air.” is a candidate sentence for shortness of breath
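A minimal sketch of this detection rule, assuming the entity names and representative terms are already available as plain sets. The function name and the sample sets are hypothetical.

```python
import re

def is_candidate_sentence(sentence, entity_names, representative_terms):
    """True if the sentence has a representative term but no explicit entity name."""
    text = sentence.lower()
    if any(name in text for name in entity_names):
        return False  # explicit mention, so not an implicit candidate
    tokens = set(re.findall(r"[a-z]+", text))
    return bool(tokens & representative_terms)

# Illustrative inputs for the entity "shortness of breath".
names = {"shortness of breath", "dyspnea"}
terms = {"breathing"}
implicit = is_candidate_sentence(
    "However, Mr. Smith is comfortably breathing in room air.", names, terms)
explicit = is_candidate_sentence("Patient denies shortness of breath", names, terms)
```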
  • 18. 18 Information Extraction – Entity Linking • The similarity between the entity model and the pruned sentence is measured to annotate the sentence with a positive or negative label • We developed a semantic similarity measure that takes care of synonyms and antonyms • [Diagram: the candidate sentence is compared with Indicator1, Indicator2, and Indicator3 of the entity model, giving sim1, sim2, and sim3]
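  • 19. 19 Information Extraction – Entity Linking • Candidate sentence terms: ct1, ct2, ct3, ct4 • Entity indicator terms: et5, et6, et7 • WordNet: for each entity indicator term, if antonym then -1 else max similarity • Similarity: $\frac{\sum_{et} \text{sim}_{et} \cdot \text{rp}_{et}}{\sum_{et} \text{rp}_{et}}$ • Similarity > t1: Positive Annotation • Similarity < t2: Negative Annotation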
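The weighted similarity on this slide can be illustrated with a small sketch. The toy antonym and similarity tables stand in for WordNet, and the thresholds `t1` and `t2` and the weights in the example indicator are made-up values; none of them come from the original work.

```python
# Toy lexical resources standing in for WordNet (hypothetical values).
ANTONYMS = {frozenset(("labored", "unlabored"))}
SIMILARITY = {frozenset(("breathing", "respirations")): 0.8}

def term_similarity(a, b):
    """1.0 for identical terms, -1.0 for antonyms, else a table lookup."""
    if a == b:
        return 1.0
    pair = frozenset((a, b))
    if pair in ANTONYMS:
        return -1.0
    return SIMILARITY.get(pair, 0.0)

def best_match(entity_term, sentence_terms):
    """-1 if the sentence contains an antonym, else the maximum similarity."""
    sims = [term_similarity(entity_term, s) for s in sentence_terms]
    if -1.0 in sims:
        return -1.0
    return max(sims, default=0.0)

def indicator_similarity(indicator, sentence_terms):
    """Term similarities weighted by representative power, then normalized."""
    total_rp = sum(indicator.values())
    weighted = sum(best_match(t, sentence_terms) * rp for t, rp in indicator.items())
    return weighted / total_rp

def annotate(similarity, t1=0.3, t2=-0.1):
    """Positive above t1, negative below t2, otherwise no annotation."""
    if similarity > t1:
        return "positive"
    if similarity < t2:
        return "negative"
    return None

indicator = {"difficult": 1.5, "labored": 1.5, "breathing": 1.0}
negated = indicator_similarity(indicator, ["respirations", "were", "unlabored"])
affirmed = indicator_similarity(indicator, ["labored", "breathing"])
```

Because each term similarity lies in [-1, 1] and the weights are normalized, the result also lies in [-1, 1], which is what lets a single score carry both positive and negated mentions.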
  • 20. 20 Evaluation • Re-annotated the SemEval-2014 Task 7 dataset for implicit entities • Entities were selected considering the frequency of appearance and expert feedback • 857 sentences selected for 8 entities • Annotated by three domain experts • Annotation agreement 0.58 • Entity (positive annotations / negative annotations): Shortness of Breath (93 / 94); Edema (115 / 35); Syncope (96 / 92); Cholecystitis (78 / 36); Gastrointestinal Gas (18 / 14); Colitis (12 / 11); Cellulitis (8 / 2); Fasciitis (7 / 3)
  • 21. 21 Annotation Performance • Algorithm (positive precision / recall / F1; negative precision / recall / F1): Our (0.66 / 0.87 / 0.75; 0.73 / 0.73 / 0.73); MCS (0.50 / 0.93 / 0.65; 0.31 / 0.76 / 0.44); SVM (0.73 / 0.82 / 0.77; 0.66 / 0.67 / 0.67) • Adding the similarity value as a feature for the supervised algorithm: SVM+MCS (0.73 / 0.82 / 0.77; 0.66 / 0.66 / 0.66); SVM+Our (0.77 / 0.85 / 0.81; 0.72 / 0.75 / 0.73) • Baselines • MCS algorithm (Mihalcea 2006) • SVM (trained on n-grams) • Our algorithm outperforms the selected baselines in the negative category. • SVM is able to leverage the supervision to beat our algorithm in the positive category.
  • 22. 22 Similarity as a Feature to Supervised Algorithm • Added similarity value of unsupervised algorithms as a feature to the SVM. Positive Annotations Negative Annotations
  • 23. 23 Annotation Performance – A Study with the Confidence • Each annotation has a confidence ranging from 1 to 5 • Low confidence reflects incomplete or ambiguous information • Annotation performance increases as the confidence increases • The negative class shows a significant increment
  • 24. 24 Dissertation Focus Implicit Information Extraction Entities Relationships Organized Text Unorganized Text Clinical Narratives Tweets Disorders Symptoms Movies Books Clinical Narratives Disorders and Symptoms
  • 25. 25 Tweets with Implicit Entities • Use diverse characteristics of the entity – “New Sandra Bullock astronaut lost in space movie looks absolutely terrifying” – “ISRO sends probe to Mars for less money than it takes Hollywood to send a woman to space.” – “oh yeah there is that new space movie coming out that looks terrifying i am going to go see it” • Use time-sensitive phrases • [Timeline figure labels: Furious 7, Gravity, The Martian; Fall 2013, April 2014, Fall 2015; “space movie”, “fastest movie to earn $1 billion”, “Paul Walker’s last movie”] • Credit: http://bit.ly/2bkePJ6
  • 26. 26 Tweets with Implicit Entities • Use diverse characteristics of the entity – “… Richard Linklater movie …” – “… Ellar Coltrane on his 12-year movie …” – “… 12-year long movie shoot …” – “… Mason Evan's childhood movie …” • Use time-sensitive phrases • [Timeline figure labels: Furious 7, Gravity, The Martian; Fall 2013, April 2014, Fall 2015; “space movie”, “fastest movie to earn $1 billion”, “Paul Walker’s last movie”] • Credit: http://bit.ly/2bk8xdp
  • 27. 27 Knowledge Acquisition • Acquiring factual knowledge • Source – DBpedia • Not all factual knowledge is important – a movie has ‘starring’ and ‘director’ as well as ‘billed’ and ‘license’ • Rank the relationships based on joint probability with the entity type • Values of the top-k relationships and the value of rdfs:comment are obtained • Acquiring contextual knowledge • Source – contemporary tweets • We collect 1000 tweets with explicit mentions of the entity • Temporal salience – number of views for the entity’s Wikipedia page within the last t days
  • 28. 28 Knowledge Acquisition • [Pipeline diagram labels: Wikipedia page titles and anchor texts; Contemporary tweets; Factual knowledge; Clean tweets; Generate n-grams; Generate semantic cues] • Need to extract meaningful phrases from the acquired knowledge • Meaningful phrases = Wikipedia titles + anchor texts • Matching n-grams are added to semantic cues • Non-matching n-grams are added to semantic cues after removing stop words
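The cue-generation rule on this slide might look like the sketch below. It assumes whitespace tokenization and n-grams up to length three; both choices, the function names, and the tiny phrase dictionary are illustrative rather than taken from the original work.

```python
def ngrams(tokens, max_n=3):
    """Yield all n-grams of length 1..max_n as token lists."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tokens[i:i + n]

def semantic_cues(text, phrase_dictionary, stop_words):
    """Keep dictionary n-grams as they are; strip stop words from the rest."""
    tokens = text.lower().split()
    cues = set()
    for gram in ngrams(tokens):
        phrase = " ".join(gram)
        if phrase in phrase_dictionary:
            cues.add(phrase)
        else:
            kept = [t for t in gram if t not in stop_words]
            if kept:
                cues.add(" ".join(kept))
    return cues

# The phrase dictionary stands in for Wikipedia titles and anchor texts.
cues = semantic_cues("sandra bullock lost in space", {"sandra bullock"}, {"in"})
```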
  • 29. 29 Knowledge Modeling – Entity Model Network • A property graph reflecting the topical relationships between entities • [Graph node labels: Sandra Bullock, Alfonso Cuarón, Mars orbiter mission, Woman in space, astronaut, Gravity, Christopher Nolan, Matt Damon, Interstellar, The Martian; legend: Factual Knowledge, Contextual Knowledge, Entity] • $\text{specificity} = \frac{|N|}{|N_{c_j}|}$, where $N$ is the total number of entities and $N_{c_j}$ is the number of adjacent entities • $\text{frequency}$ = total number of times the phrase appears in tweets • $\text{time salience}$ = number of Wikipedia views
  • 30. 30 Detecting Tweets with Implicit Entities • Tweets are filtered with keywords – movie, film, book, novel • A simple annotation technique is applied – dictionary matching • Tweets that are not annotated with an entity of the types we are looking for are considered to have implicit entity mentions • [Diagram labels: Keywords; Entity Dictionary; Annotating Tweets]
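A rough sketch of this filter, assuming the entity dictionary is a plain list of surface forms for the target types and that matching is done on word boundaries. The function names and example tweets other than the slide's own are hypothetical.

```python
import re

KEYWORDS = {"movie", "film", "book", "novel"}

def has_keyword(tweet):
    """True if the tweet contains one of the filter keywords as a word."""
    return bool(set(re.findall(r"[a-z]+", tweet.lower())) & KEYWORDS)

def explicit_mentions(tweet, entity_dictionary):
    """Dictionary matching: entity names found in the tweet on word boundaries."""
    text = tweet.lower()
    return [name for name in entity_dictionary
            if re.search(r"\b" + re.escape(name.lower()) + r"\b", text)]

def may_have_implicit_entity(tweet, entity_dictionary):
    """Keyword present but no explicit entity of the target types."""
    return has_keyword(tweet) and not explicit_mentions(tweet, entity_dictionary)

movies = ["Gravity", "The Martian"]
implicit = may_have_implicit_entity(
    "New Sandra Bullock astronaut lost in space movie looks absolutely terrifying",
    movies)
```

Tweets that pass this filter may still contain no entity at all, which corresponds to the NULL annotations in the evaluation dataset.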
  • 31. 31 Information Extraction – Entity Linking • Two-step process • Step 1: Candidate selection and filtering • Objective - prune the search space to reduce the number of entities from the entity model network (EMN) to be considered in the disambiguation step • Step 2: Disambiguation • Objective - sort the selected candidate entities to place the implicitly mentioned entity in the top position
  • 32. 32 Entity Linking - Candidate selection and filtering • “ISRO sends probe to Mars for less money than it takes Hollywood movie to send a woman to space” • [Figure: entity model network with entity nodes m1–m8 and semantic cue nodes c1–c9; legend: Factual Knowledge, Contextual Knowledge, Entity]
  • 33. 33 Entity Linking - Candidate selection and filtering • “ISRO sends probe to Mars for less money than it takes Hollywood movie to send a woman to space” • [Figure: the same network with the cues c5, c2, c7, and c8 highlighted as matching the tweet]
  • 34. 34 Entity Linking - Candidate selection and filtering • “ISRO sends probe to Mars for less money than it takes Hollywood movie to send a woman to space” • [Figure: the same network with the matching cues c5, c2, c7, and c8 and their adjacent entities m1, m2, m4, m5, m3, m6, m7, and m8 highlighted]
  • 35. 35 Entity Linking - Candidate selection and filtering • “ISRO sends probe to Mars for less money than it takes Hollywood movie to send a woman to space” • $\text{score}(m_i) = \sum_{c_j \in \mathbb{C}} \text{specificity}(c_j) \cdot \text{frequency}(c_j, m_i)$, where $\mathbb{C}$ is the set of matching cues • [Figure: the matching cues c5, c2, c7, and c8 and their adjacent entities m1–m8; the selected candidates are m2, m4, m6, m7, and m3; legend: Factual Knowledge, Contextual Knowledge, Entity]
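The candidate scoring formula can be sketched as below. The entity model network is represented here as a plain dictionary from cue to `{entity: frequency}`, which is an assumed data layout; the toy network and its counts are invented for illustration, and specificity follows the |N| / |N_cj| definition from the knowledge modeling slide.

```python
from collections import defaultdict

def specificity(cue, emn, num_entities):
    """Total number of entities divided by the number of entities adjacent to the cue."""
    return num_entities / len(emn[cue])

def select_candidates(tweet_cues, emn, num_entities, k=25):
    """Score entities adjacent to matching cues and return the top k."""
    scores = defaultdict(float)
    for cue in tweet_cues:
        for entity, frequency in emn.get(cue, {}).items():
            scores[entity] += specificity(cue, emn, num_entities) * frequency
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Toy entity model network: cue -> {entity: frequency of the cue for that entity}.
emn = {
    "space": {"Gravity": 5, "Interstellar": 4, "The Martian": 3},
    "sandra bullock": {"Gravity": 2},
    "mars": {"The Martian": 6},
    "richard linklater": {"Boyhood": 3},
}
candidates = select_candidates({"space", "sandra bullock"}, emn, num_entities=4)
```

A cue shared by many entities, such as "space" here, contributes less per occurrence than a cue attached to a single entity, so specific evidence dominates the score.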
  • 36. 36 Entity Linking - Disambiguation • Formulated as a ranking problem • SVMrank to rank candidates • Similarity between the candidate entity and the tweet: features $x_1, x_2, x_3, \ldots, x_n$ with $x_j = \text{specificity}(c_j) \cdot \text{frequency}(c_j, e_i)$ • Temporal salience of the candidate entity: $\frac{\text{temporal salience}(e_i)}{\sum_{e \in E_c} \text{temporal salience}(e)}$, where $E_c$ is the selected candidate set • [Figure: candidates m2, m6, m4, m3, m7]
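The normalized temporal salience feature can be illustrated with a short sketch. The page-view counts are invented, and returning zeros when no candidate has any views is an assumption added to avoid division by zero.

```python
def normalized_temporal_salience(page_views, candidates):
    """Each candidate's page views divided by the total over the candidate set."""
    total = sum(page_views[entity] for entity in candidates)
    if total == 0:
        return {entity: 0.0 for entity in candidates}  # assumed fallback
    return {entity: page_views[entity] / total for entity in candidates}

views = {"Gravity": 600, "Interstellar": 300, "The Martian": 100}
salience = normalized_temporal_salience(views, ["Gravity", "Interstellar", "The Martian"])
```

Normalizing over the selected candidate set makes the feature comparable across tweets whose candidates differ widely in absolute popularity.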
  • 37. 37 Evaluation Dataset • Entity type and annotation (tweets / entities): Movie, Explicit (391 / 107); Movie, Implicit (207 / 54); Movie, NULL (117 / 0); Book, Explicit (200 / 24); Book, Implicit (190 / 53); Book, NULL (70 / 0) • Tweets were collected in August 2014 using keywords • The tweets were manually annotated with the DBpedia URLs of entities • The tweets annotated with NULL have neither an explicit nor an implicit mention of an entity
  • 38. 38 Entity Model Network Creation • 15,000 tweets for movies and books in July 2014 • 617 movies and 102 books • Recent 1000 tweets per entity to build its contextual knowledge • May 2014 version of DBpedia used to extract factual knowledge • Temporal salience is obtained for July 2014 • [Figure: entity model network with entity nodes m1–m7 and semantic cue nodes c1–c9; legend: Factual Knowledge, Contextual Knowledge, Entity]
  • 39. 39 Evaluation - Implicit Entity Linking • How many tweets had the correct entity within the selected candidate set (top-25)? • How many entities were correctly linked by our disambiguation approach? • Entity type (candidate selection recall / disambiguation accuracy): Movie (90.33% / 60.97%); Book (94.73% / 61.05%) • Importance of contextual knowledge (without / with contextual knowledge): Candidate selection recall, Movie (77.29% / 90.33%); Candidate selection recall, Book (76.84% / 94.73%); Disambiguation accuracy, Movie (51.7% / 60.97%); Disambiguation accuracy, Book (50.0% / 61.05%)
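The two measures reported on this slide can be computed as in the sketch below, following the definitions in the speaker notes: recall is the proportion of tweets with the correct entity within the top k candidates, and accuracy is the proportion with the correct entity in the top position. The data layout, function names, and toy rankings are assumptions.

```python
def candidate_selection_recall(rankings, gold, k=25):
    """Proportion of tweets whose correct entity is within the top-k candidates."""
    hits = sum(1 for tweet_id, ranked in rankings.items()
               if gold[tweet_id] in ranked[:k])
    return hits / len(rankings)

def disambiguation_accuracy(rankings, gold):
    """Proportion of tweets whose correct entity is ranked first."""
    hits = sum(1 for tweet_id, ranked in rankings.items()
               if ranked[:1] == [gold[tweet_id]])
    return hits / len(rankings)

# Toy ranked candidate lists per tweet and their gold entities.
rankings = {
    "t1": ["Gravity", "Interstellar"],
    "t2": ["Interstellar", "Gravity"],
    "t3": ["Boyhood"],
    "t4": ["The Martian"],
}
gold = {"t1": "Gravity", "t2": "Gravity", "t3": "Divergent", "t4": "The Martian"}
recall = candidate_selection_recall(rankings, gold)
accuracy = disambiguation_accuracy(rankings, gold)
```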
  • 40. 40 Qualitative Error Analysis • Error: Lack of contextual knowledge. Tweet: ‘That Movie Where Shailene Woodley Has Her First Nude Scene? The Trailer Is RIGHT HERE!: No one can say Shailene Woodley isn't brave!’ Entity: White Bird in a Blizzard • Error: Novel entities. Tweet: ‘”hey, what's wrawng widdis goose?" RT @TIME: Mark Wahlberg could be starring in a movie about the BP oil spill http://ti.me/1oZh55V' Entity: Deepwater Horizon • Error: Cold start of entities. Tweet: ‘Video: George R.R. Martin's Children's Book Gets Re-release http://bit.ly/1qNNH5r’ Entity: The Ice Dragon • Error: Multiple implicit entity mentions. Tweet: ‘That moment when you realize that hazel grace and Augustus are brother and sister in one movie and in love battling cancer in another’ Entities: Divergent, The Fault in Our Stars
  • 41. 41 Dissertation Focus Implicit Information Extraction Entities Relationships Organized Text Unorganized Text Clinical Narratives Tweets Disorders Symptoms Movies Books Clinical Narratives Disorders and Symptoms
  • 42. 42 Implicit Relationships in Clinical Narratives atrial fibrillation hypertension diabetes chest pain weight gain headache lisinopril warfarin insulin atenolol medication disease symptom is_treated_with has_symptom
  • 43. 43 Implicit Relationships in Clinical Narratives • Implicit relationships: • Exist between symptoms, disorders, medications, and procedures • Can be established by leveraging domain knowledge • The existing knowledge bases fall short in eliciting relationships • Data + Knowledge can help to elicit such implicit relationships efficiently
  • 44. 44 A Scenario Atrial fibrillation Hypertension Diabetes Fatigue Syncope Weight loss Chest pain Discomfort in chest Dizzy Shortness of Breath Nausea Vomiting Headache Cough Weight gain
  • 45. 45 A Scenario Atrial fibrillation Hypertension Diabetes Fatigue Syncope Weight loss Chest pain Discomfort in chest Dizzy Shortness of Breath Nausea Vomiting Headache Cough Weight gain Atrial fibrillation Hypertension Diabetes Chest pain Weight gain Discomfort in chest Cough Headache Edema Shortness of Breath Knowledge base does not know about edema. Now edema can be a symptom of any disorder in the document. Observed Disorders Observed Symptoms
  • 46. 46 Knowledge Acquisition • Hierarchical knowledge and non-hierarchical knowledge • Hierarchical knowledge: retrieved from UMLS • MRHIER (CUI, AUI, PAUI, PTR): (C0013404, A0052186, A0111363, A0434168.A2367943. …); (C0013604, A0052723, A0135504, A0434168.A2367943) • MRCONSO (CUI, AUI, SAB, STR): (C0013404, A0052186, MSH, Shortness of breath); (C0013604, A0052723, MSH, Edema) • Non-hierarchical knowledge: extracted from Web resources + feedback from domain expert • www.nlm.nih.gov • www.en.wikipedia.org • www.webmd.com • www.mayoclinic.com • www.clevelandclinic.org • ww.healthline.org
  • 47. 47 Knowledge Modeling • [Diagram labels: Hypertension, Diastolic Hypertension, Pulmonary Hypertension, Renal Hypertension, Episodic Pulmonary Hypertension, Solitary Pulmonary Hypertension; Breathing Problems, Shortness of Breath, Asthma; is_symptom_of; Instances of symptoms; Instances of disorders; Shortness of Breath; Hypertension; Classes of disorders; Classes of symptoms; rdfs:subClassOf; rdf:type]
  • 48. 48 Detecting Unexplained Symptoms • Clinical documents were semantically annotated for entities using cTAKES • Known relationships are populated • Unexplained symptoms were detected Modeled Knowledge Credit:http://bit.ly/2aMWVAd
  • 49. 49 Information Extraction – Unknown Relationships • A naïve method would assume a relationship between the unexplained symptom and all disorders in the clinical narrative • Can we leverage the knowledge we have about the symptom to find the most plausible disorders? • Intuition: a symptom is most likely to be shared by similar disorders
  • 50. 50 Information Extraction – Unknown Relationships • 1. All co-occurring disorders are candidates • [Figure: symptom S with co-occurring disorders D1, D2, D3, D4, D5]
  • 51. 51 Information Extraction – Unknown Relationships • 2. Find known disorders of the symptom • [Figure: symptom S with known disorders D6, D7 and co-occurring disorders D1, D2, D3, D4, D5]
  • 52. 52 Information Extraction – Unknown Relationships • 3. Collect more knowledge about known relationships • [Figure: symptom S with known disorders D6, D7; collected disorders D7, D8, D2, D10, D11, D12, D4, D14; co-occurring disorders D1, D2, D3, D4, D5]
  • 53. 53 Information Extraction – Unknown Relationships • 4. Compare co-occurring disorders with collected knowledge • [Figure: symptom S with known disorders D6, D7; collected disorders D7, D8, D2, D10, D11, D12, D4, D14; co-occurring disorders D1, D2, D3, D4, D5]
  • 54. 54 Information Extraction – Unknown Relationships • 5. Eliminate non-matching candidate disorders • [Figure: symptom S with remaining disorders D2, D4] • We are left with the most plausible disorders for the unexplained symptom. If this scenario occurs frequently, it increases the confidence in this relationship.
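The five steps on these slides can be sketched with plain set operations. The slides do not say how the additional knowledge in step 3 is collected, so the `related_disorders` map (for example, hierarchical neighbours of a known disorder) is an assumption, as is the split of the figure's collected disorders between D6 and D7.

```python
def plausible_disorders(symptom, co_occurring, known_disorders, related_disorders):
    """Keep only co-occurring disorders that resemble known disorders of the symptom."""
    candidates = set(co_occurring)                      # step 1
    known = known_disorders.get(symptom, set())         # step 2
    collected = set()
    for disorder in known:                              # step 3
        collected |= related_disorders.get(disorder, set())
    return candidates & collected                       # steps 4 and 5

# Labels follow the figure; which disorders relate to D6 or D7 is assumed.
known_disorders = {"S": {"D6", "D7"}}
related_disorders = {
    "D6": {"D7", "D8", "D2", "D10"},
    "D7": {"D11", "D12", "D4", "D14"},
}
remaining = plausible_disorders(
    "S", ["D1", "D2", "D3", "D4", "D5"], known_disorders, related_disorders)
```

Running this over many documents and counting how often each (symptom, disorder) pair survives would give the frequency signal the last slide mentions.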
  • 55. 55 Evaluation • A corpus of 1,500 electronic medical records was used • The records were annotated with cTAKES, and the most frequent entities were selected • UMLS semantic types were used to categorize disorders and symptoms • Initial knowledge base - 86 disorders, 42 symptoms, 255 disorder-symptom relationships
  • 56. 56 Evaluation – Relationship Prediction • There were 29 distinct unexplained symptoms • Precision of the questions generated • 1st iteration - 105 correct from 142 (73.94%) • 2nd iteration - 20 correct from 29 (68.96%) • 3rd iteration - 4 correct from 9 (44.44%) • Top 5 unexplained symptoms (number of unexplained instances): Edema (910); Syncope (336); Systolic Murmur (168); Tachycardia (143); Angina (136) • Top 5 co-occurring disorders with edema (number of co-occurrences): Hypertension (647); Hyperlipidemia (641); Claudication (454); Coronary atherosclerosis (395); Coronary artery disease (242)
  • 57. 57 Evaluation – Increment in Explainability • Knowledge base (number of unexplained relationships / increment in explainability): Initial knowledge base (2251 / 0%); After 1st iteration (878 / 60.99%); After 2nd iteration (806 / 64.19%)
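The increments on this slide appear to be the relative drop in unexplained relationships with respect to the initial knowledge base. The sketch below checks that reading; the function names are hypothetical, and the percentages are truncated to two decimals because that is what reproduces the slide's figures (rounding would give 61.00% for the first iteration).

```python
import math

def explainability_increment(initial_unexplained, remaining_unexplained):
    """Percentage of initially unexplained relationships that are now explained."""
    return 100.0 * (initial_unexplained - remaining_unexplained) / initial_unexplained

def truncate(value, places=2):
    """Drop digits beyond the given number of decimal places."""
    factor = 10 ** places
    return math.floor(value * factor) / factor

first = truncate(explainability_increment(2251, 878))
second = truncate(explainability_increment(2251, 806))
```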
  • 58. 58 Summary • Implicit information occurs frequently in text, and ignoring it would adversely affect downstream applications. • Linguistic and world knowledge play an important role in decoding implicit information. • This dissertation demonstrated the characteristics of implicit information and developed a solution to capture factual implicit constructs. • [Diagram labels: Knowledge Acquisition; Knowledge Modeling; Detecting Implicit Information; Information Extraction]
  • 59. 59 Contributions • Identify and demonstrate the value of implicit information. • Study the characteristics of the implicit information manifestation. • Demonstrate the value of knowledge in extracting factual implicit information. - Linguistic - Domain - Contextual • Developed a framework for factual implicit information extraction. • Demonstrated the usage of the framework to solve three implicit information extraction problems.
  • 60. 60 Graduate Life@Kno.e.sis Journal Publications: • Sujan Perera, Cory Henson, Krishnaprasad Thirunarayan, Amit Sheth, Suhas Nair, Semantics Driven Approach for Knowledge Acquisition from EMRs, IEEE Journal of Biomedical and Health Informatics. • Raminta Daniulaityte, Robert Carlson, Russel Falck, Delroy Cameron, Sujan Perera, Lu Chen and Amit Sheth. I just wanted to tell you that loperamide WILL WORK': A Web- Based Study of Extra-Medical Use of Loperamide. Conference Publications: • Sujan Perera, Pablo Mendes, Adarsh Alex, Amit Sheth, Krishnaprasad Thirunarayan, Implicit Entity Linking in Tweets, ESWC 2016 • Sujan Perera, Pablo Mendes, Amit Sheth, Krishnaprasad Thirunarayan, Adarsh Alex, Christopher Heid, Greg Mott, Implicit Entity Recognition in Clinical Documents, *SEM 2015 • Sujan Perera, Cory Henson, Krishnaprasad Thirunarayan, Amit Sheth, Suhas Nair, Data Driven Knowledge Acquisition Method for Domain Knowledge Enrichment in the Healthcare, BIBM 2012 • Menasha Thilakaratne, Ruvan Weerasinghe, Sujan Perera, Knowledge-driven Approach to Predict Personality Traits by Leveraging Social Media Data, WI 2016 Workshop and Posters: • Sujan Perera, Amit Sheth, Krishnaprasad Thirunarayan, Challenges in Understanding Clinical Notes: Why NLP Engines Fall Short and Where Background Knowledge Can Help, DARE 2013 • Raminta Daniulaityte, Robert Carlson, Russel Falck, Delroy Cameron, Sujan Perera, Lu Chen, Amit Sheth. A Web-Based Study of Self-Treatment of Opioid Withdrawal Symptoms with Loperamide, CPDD 2012 Internships: • ezDI Summer 2012 • IBM Watson Summer 2014 and 2015 Awards and grants: • George Thomas Graduate Fellowship • NSF travel grants: BIBM and ICHI PC Committee: • DARE (2013), EKAW (2014, 2016), ISWC 2015, IJCAI 2016 External Reviewer: • ISWC, ESWC, IJSWIS, IEEE Intelligent Systems, Applied Ontology, ODBASE Proposal Contributions: • eDrugTrends (NIH R01) • Healthcare Outcome Prediction (NSF-SCH) Mentoring: • Adarsh Alex (MSc) • Menasha Tilakaratne (BSc)
  • 62. 62 Coffee Mates and Colleagues Thank You Funding • ezDI • George Thomas Fellowship • NSF: CNS 1513721 Context- Aware Harassment Detection on Social Media

Editor's notes

  1. Hi, good morning everyone, and thanks for attending my dissertation defense. My dissertation topic is “Knowledge-driven Implicit Information Extraction”. My dissertation committee is Dr. Sheth, my advisor; Dr. Prasad; Dr. Raymer; and, joining us remotely, Dr. Mendes from IBM Research.
  2. It is a well-known fact that more than 70% of data is unstructured. The field of information extraction focuses on extracting structured information from unstructured data. Information extraction can add more semantics to the text. For example, when we know that insulin here is a medication and not a regular English term, we know more about the text: it is probably treating diabetes, and it should not be used with Cisapride.
  3. We have exclusively focused on explicit information extraction. Here is a text snippet extracted from a clinical narrative.
  4. With current information extraction techniques we can identify these entities and relationships; namely, we can perform named entity recognition, entity linking, and relationship extraction on this text.
  5. However, this snippet has these two phrases, and they say that the patient does not have shortness of breath and that he has edema. These were implicit, and we don’t know how to identify these implicit entities in this text. This is the problem I address in my dissertation work.
  6. My thesis statement is …
  7. Implicit communication is not a random occurrence; people have genuine reasons for it. The first example here expresses negative sentiment towards a Michael Bay movie and is sarcastic in nature. Sometimes the term used to name an entity does not provide complete and important information: in the second example, the term cholecystitis alone would not provide enough information, and having a better idea about the causes helps accurate treatment. In the medical domain it is important to know these nuances. We also speak implicitly when we forget things and when we want to emphasize features of the entity rather than the entity itself. One more instance is when we know the other party has the knowledge required to interpret the statement.
  8. Implicit information has enough volume to get noticed, and it has value, at least in the domains we looked at. Our studies showed that some entities are mentioned implicitly between 20% and 40% of the time. In terms of value, extracting structured information is very important to applications like computer-assisted coding, secondary data analysis such as prediction tasks on medical data, and tasks like sentiment analysis on tweets.
  9. But all of them were functioning without the implicit information, and sometimes they were handicapped. For example, a sentiment analysis application would not know the target of the positive sentiment in the tweet “new Sandra Bullock space movie is terrific”. Hence, ignoring implicit entity mentions would adversely affect downstream applications.
  10. Whenever we talk implicitly, we assume a common understanding between us and the audience. For example, in the first scenario here, the speaker assumes that the audience knows about the physiological observations that characterize edema and shortness of breath. In the second scenario, the speaker assumes that the audience knows Sandra Bullock starred in Gravity. This knowledge is available in different formats. Such resources contain linguistic knowledge, common-sense knowledge, and domain-specific and cross-domain world knowledge. Image credits: http://icons.mysitemyway.com/legacy-icon/115791-magic-marker-icon-people-things-people-audience/; icon made by Freepik from www.flaticon.com
  11. In this dissertation I address the problem of implicit information extraction. Here are the four main components of the solution I developed. As demonstrated, domain knowledge is an important component, hence the solution acquires relevant domain knowledge. The acquired knowledge is not always readable by machine algorithms, hence we model it in a way that can be consumed by algorithms. Then, given a text, the solution identifies whether it has implicit information of interest and applies techniques to extract such information. Within this talk I will demonstrate three applications of implicit information extraction and show how each of these high-level components is realized to solve the problem at hand.
  12. Before going into the solutions, let me define the focus of the dissertation. There are so many things that can be expressed implicitly: ideas, opinions, and facts.
  14. What we observe is that when entities are mentioned implicitly in clinical documents, they are described with physiological observations. Sometimes you need the semantics of the language to understand what is in the description.
  15. One of the prominent places where you can obtain medical knowledge is UMLS. It provides human-curated definitions of the entities in terms of their physiological observations, which is ideal input for extracting implicit entities. Linguistic knowledge about the semantics of the terms can be obtained from WordNet.
  16. The definitions are text snippets and cannot be used as they are to understand implicit entity mentions. Hence, we process them and create models that can be read by machines. The idea is that the entity model consists of the essential characteristics described in the definitions and is machine readable. An entity can have multiple definitions; each definition uses different terminology to explain the entity, and this is important since the entity's manifestations in text can also use different terminology. So the entity model benefits from this and uses all definitions: it consists of entity indicators, which are created from the definitions. A collection of entity indicators is called an entity model. Here is an example for shortness of breath… The r_i in the entity indicator is called representative power, and it reflects the term’s ability to discriminately identify a mention of the entity.
  17. The entity representative term (ERT) can be used to identify the candidate sentences with implicit entity mentions. Sentences with an ERT but without the entity name are identified as candidate sentences, and they are pruned to get the essential portion of the sentence.
  18. The entity linking step takes the pruned sentence and the entity model as inputs and compares them to see whether the sentence mentions the entity of the given model. Basically, the pruned sentence is compared with each entity indicator, and the similarity between them is measured.
  19. Similarity is calculated by comparing the terms in the entity indicator and the candidate sentence. WordNet is used to understand the lexical semantics. The term similarity is weighted with its representative power and normalized. The total similarity value lies between -1 and +1.
  20. We evaluated this work by creating a dataset, re-annotating a SemEval dataset. We selected 8 entities based on their frequency of appearance and selected 857 sentences with implicit mentions of these entities. The table shows the distribution of the sentences for each entity in the two classes.
  21. In order to assess how our proposed algorithm works, we used one unsupervised and one supervised baseline. As shown in the first three rows, our algorithm does well in identifying the negated mentions of the entities, and SVM does slightly better in identifying positive mentions. So as the next step we added the similarity value we calculated as a feature to the supervised solution. The supervised solution was able to leverage the semantics of the similarity value to enhance its results. Example: “he was placed on mechanical ventilation shortly after presentation”. Mechanical ventilation is not in the definitions, but experts annotate such sentences positively; SVM captures such bigrams in training.
  22. Although the supervised algorithm does well with the added features, it needs to be trained on each clinical entity. This means it needs labeled data for each entity; we checked whether this can be compensated for with our features.
  23. Then we studied our results against annotation confidence; each annotation has a confidence marked from 1 to 5. Low confidence reflects incomplete or ambiguous information, so we expect our results to improve as the annotation confidence increases. The high-confidence negations are mostly explicit; for low-confidence negations, it is unclear whether the sentence expresses a negation or simply lacks a mention. For example, annotators often argue whether ‘patient breathing at a rate of 15-20’ means the negation of the entity ‘shortness of breath’ (because that is a normal breathing pattern) or just lacks a mention of the entity.
  25. Twitter is a very dynamic environment, and it reacts to real-world events quickly. This is reflected in implicit mentions of entities in tweets. Tweets use diverse characteristics of the entities to mention them in an implicit manner, and the phrases they use are time-sensitive. Static, manually curated definitions do not give sufficient information to decipher them. Look at these examples. The same entity can be referred to using different characteristics: all these phrases refer to the movie Gravity using different features of the movie. Knowledge should be up to date: the same phrase can be used to refer to different entities at different times, and the same entity can be referred to with different phrases at different times. The first example is similar to medical entity definitions, where each entity had more than one way to express it, hence multiple definitions. But in this case the phrases listed here are not paraphrases; they are literally different ways to express the entity.
  27. Knowledge acquisition should acquire very comprehensive knowledge, so we use three different sources to acquire three different types of knowledge. We acquire factual knowledge from DBpedia. Contextual knowledge, which reflects the topics temporally relevant to the entity, is acquired from daily communications on Twitter. Knowledge about the temporal relevance of the entity is obtained through the number of hits to its Wikipedia page.
  28. The collected knowledge has unstructured text components, especially tweets, and we need to identify meaningful phrases in them. In order to do that, we create a dictionary of meaningful phrases using Wikipedia page titles and anchor texts. Then these phrases are used to spot meaningful phrases in unstructured text; they are called semantic cues.
  29. Knowledge is modeled to reflect the association of the semantic cues with the entities. Some semantic cues are shared by several entities. The modeled knowledge reflects the topical relationships between entities. The semantic cues are weighted based on their specificity towards entities: the cues that are common among multiple entities are weighted low.
  30. The entity model network provides the necessary infrastructure to perform entity linking. Now we need to identify tweets with implicit entities. We do this by first filtering tweets from the stream of tweets using keywords and then annotating them for explicit entity mentions; the tweets that are not annotated with entities are assumed to have implicit entity mentions.
  31. The information extraction phase links the tweets that have implicit entities with the correct entity. It consists of two phases: the first phase finds the candidate entities, and the second phase ranks the candidate entities so that the correct entity is in the top place.
  32. Let's walk through one example here.
  33. First, it matches the phrases in the incoming tweet with the nodes in the entity model network.
  34. All entities that have a matching node as a neighbor become candidates.
  35. The matching entities are scored based on the strength of the evidence; the strength captures how many semantic cues match and how specific those cues are for the particular entity. Then we take the top-k entities as candidates for the ranking step.
  36. The ranking step considers the similarity between the tweet and the entity, and the temporal salience of the entity, to rank candidates. Isn’t the candidate ranking enough? No: candidate ranking considers only the matching phrases. What about the impact of the non-matching phrases? If the non-matching phrases are more specific to an entity, that entity should be ranked lower.
  37. To evaluate this work, we annotated tweets from two domains: movies and books. Here are the statistics of the annotation. When you filter tweets with keywords, you get three types: tweets with explicit entities, tweets with implicit entities, and tweets with no entities. For example, the evaluation dataset has 391 tweets with explicit movie entities, 207 tweets with implicit movie entities, and 117 with no movie mentions. Total tweets: 1175.
  38. The evaluation dataset was collected in August 2014, and we created the entity model network for July 2014. We collected the entities of interest by collecting 15,000 tweets and annotating them, so the EMN consists of 617 movies and 102 books. We collected a maximum of 1000 tweets per entity to extract contextual knowledge. Wikipedia statistics for July 2014 are used to estimate temporal salience.
  39. We evaluated both candidate selection and disambiguation. Candidate selection achieves more than 90% recall, and disambiguation achieves around 61% accuracy. We then evaluated the contribution of the knowledge extracted from Twitter conversations: there is a clear improvement in both steps when entity models are built with contextual knowledge. Candidate selection recall is the proportion of tweets that had the correct entity within the top k (k=25) candidates. Disambiguation accuracy is the proportion of tweets that had the correct entity in the top position. Why disambiguation is hard: the number of candidates is high and they are all relevant. Compare finding candidates for the term "Gravity" with finding candidates for "movies with Sandra Bullock": the second has more candidates. Also, when disambiguating "Gravity" the candidates are entities of different types, whereas in the second scenario all of them are movies, and disambiguation among entities of the same type is harder than among entities of different types.
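The two measures defined in this note can be written down directly. The sketch below follows those definitions; the function names and the toy rankings are hypothetical, and it assumes one gold entity per tweet.

```python
def recall_at_k(ranked, gold, k=25):
    """Proportion of tweets whose correct entity is within the top k candidates.

    ranked: dict of tweet id -> list of entities, best first.
    gold: dict of tweet id -> correct entity (assumed: exactly one per tweet).
    """
    hits = sum(1 for tid, entities in ranked.items() if gold[tid] in entities[:k])
    return hits / len(ranked)

def top1_accuracy(ranked, gold):
    """Proportion of tweets whose correct entity is in the top position."""
    hits = sum(1 for tid, entities in ranked.items()
               if entities and entities[0] == gold[tid])
    return hits / len(ranked)

# Toy rankings, for illustration only.
toy_ranked = {
    "t1": ["Gravity", "Interstellar"],
    "t2": ["Interstellar", "Gravity"],
    "t3": ["The Blind Side"],
    "t4": ["Interstellar"],
}
toy_gold = {"t1": "Gravity", "t2": "Gravity", "t3": "The Blind Side", "t4": "Gravity"}
print(recall_at_k(toy_ranked, toy_gold, k=2))  # 3 of 4 tweets
print(top1_accuracy(toy_ranked, toy_gold))     # 2 of 4 tweets
```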
  40. Here are the four types of errors we found in the entity linking step. Correct the examples and show how the EMN is activated with examples.
  41. Before going into the solutions, let me define the focus of the dissertation. Many things can be expressed implicitly: ideas, opinions, and facts.
  42. Look at this example document snippet: it lists disorders, symptoms, medications, and procedures, but does not say how they are related. A medical professional reading this document will build a graph like this.
  43. These relationships exist between all types of entities in clinical documents. If you have the relevant domain knowledge about the relationships between these entities, it is possible to establish them. Let's look at a scenario. The task boils down to efficiently finding the knowledge that is absent from the knowledge base.
  44. Assume that you have this kind of knowledge, which captures the relationships between disorders and symptoms.
  45. Now, given a document with the entities shown on your right, you can establish the relationships between disorders and symptoms. However, it seems the knowledge base does not know why edema is present in the document. This is frequently the case with clinical knowledge bases: they lack non-hierarchical relationships. You could ask a domain expert to explain the presence of edema in this document. Instead, we developed an algorithm that leverages what is known about edema to make the best guess among the three disorders here. Essentially, it analyses the clinical narratives with the available partial domain knowledge and identifies the gaps in the knowledge base. Then it suggests links that should be in the knowledge base; these links ultimately help to establish implicit relationships between entities in clinical records.
  46. We start by collecting the knowledge about clinical entities from standard knowledge bases. This knowledge includes hierarchical and non-hierarchical relationships between the entities. Hierarchical knowledge is obtained from UMLS, and non-hierarchical knowledge is obtained from credible Web resources.
  47. Then we model the entities of interest using collected knowledge. Entity models look like this.
  48. To detect unexplained symptoms, we assume that all symptoms in a clinical document should be explained by the disorders present in the same document. Whenever we see symptoms present without an explanation, we flag them. The first step is to identify the entities in clinical documents and establish the known relationships. We used cTAKES to annotate clinical documents and then used the acquired knowledge to establish relationships. All this is done using the IntellegO ontology with perceptual reasoning.
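The flagging assumption in this note can be illustrated with plain set operations. This is a simplified stand-in, not the IntellegO-based reasoning used in the work; the function name, the dictionary-shaped knowledge base, and the placeholder entities D1, D2, S1, S2 are assumptions.

```python
def unexplained_symptoms(doc_disorders, doc_symptoms, knowledge_base):
    """Return symptoms in a document that none of its disorders explain.

    knowledge_base: dict of disorder -> set of symptoms it is known to cause.
    Assumption: entities were already extracted from the document (the talk
    uses cTAKES for that step).
    """
    explained = set()
    for disorder in doc_disorders:
        explained |= knowledge_base.get(disorder, set())
    return set(doc_symptoms) - explained

# Placeholder knowledge base, for illustration only (not clinical facts).
toy_kb = {"D1": {"S1"}, "D2": {"S2"}}
print(unexplained_symptoms({"D1", "D2"}, {"S1", "S2", "edema"}, toy_kb))
```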
  49. Now our task is to find the relationship that is not present in the knowledge base. A naïve method would say that any of the co-occurring disorders can have the relationship. But can we use the knowledge that we have about the symptoms to make smart guesses? Let's walk through an exercise that does essentially this.
  50. Assume S is the unexplained symptom and D1 to D5 are the co-occurring disorders.
  51. We know that S is a symptom of D6 and D7.
  52. We know more about D6 and D7; let's use this knowledge.
  53. Now it seems that there is an overlap between the co-occurring disorders and the disorders that are similar to D6 and D7. This tells us that D2 and D4 are the best candidates to have a relationship with S.
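The exercise in notes 50 to 53 amounts to intersecting the co-occurring disorders with the disorders similar to those already known to cause S. In the sketch below, the similarity table is simply given as input; how similarity is actually computed is not stated in these notes, and the function name and toy data are hypothetical.

```python
def candidate_disorders(symptom, cooccurring, symptom_to_disorders, similar_disorders):
    """Prune co-occurring disorders to those similar to known causes of symptom.

    symptom_to_disorders: dict of symptom -> disorders known to cause it.
    similar_disorders: dict of disorder -> set of similar disorders
    (assumption: provided by some similarity measure over the knowledge base).
    """
    known_causes = symptom_to_disorders.get(symptom, set())
    related = set()
    for disorder in known_causes:
        related |= similar_disorders.get(disorder, set())
    return set(cooccurring) & related

# Toy data mirroring the walkthrough: S is known for D6 and D7.
cooccurring = {"D1", "D2", "D3", "D4", "D5"}
known = {"S": {"D6", "D7"}}
similar = {"D6": {"D2", "D8"}, "D7": {"D4", "D9"}}
print(sorted(candidate_disorders("S", cooccurring, known, similar)))
```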
  54. So we pruned the search space and consult a domain expert only about the relationships of S with D2 and D4. If you see this combination frequently, you have more confidence in your guess. This is exactly what we did.
  55. We evaluated this work with 1500 electronic medical records. Our initial knowledge base had 86 disorders, 42 symptoms, and 255 relationships.
  56. In our evaluation dataset there were 29 distinct unexplained symptoms, and the table on the right shows the most frequent disorders that appear when edema was found unexplained. Our algorithm ran in an iterative manner: run the first round and find relationships, add them to the knowledge base, and run the same reasoning again on the corpus. The precision of the generated questions is around 73% in the first iteration and decreases in later iterations. Number of relationships: 1st iteration: 105 (105/142); 2nd iteration: 29 (20/29); 3rd iteration: 4 (4/9).
  57. Adding more relationships improves the explainability of the documents, because we are identifying more implicit relationships in them. Explainability of the corpus increased by 64% after the 2nd iteration.
  58. In summary, we showed that we cannot afford to ignore implicit information in text. Knowledge about the world plays a significant role in extracting implicit information. I have demonstrated this with three applications. Here is a summary of what we did in each component of each application demonstrated in this talk. Unsupervised: no labelled data.
  59. Here are the explicitly listed contributions of the dissertation.
  60. I have helped a few people in their ezDI internships.
  61. Tonya and Jibril