Ohio Center of Excellence in Knowledge-Enabled Computing
Automatic Emotion Identification
from Text
Wenbo Wang
Kno.e.sis Center
Advisor:
Dr. Amit P. Sheth
Committee members:
Dr. Keke Chen
Kevin Haas
Dr. T.K. Prasad
Dr. Ramakanth Kavuluru
Ph.D. Dissertation Defense
Sadness
Anger
Fear
Joy
Your emotions are the slaves to your thoughts,
and you are the slave to your emotions.
--Elizabeth Gilbert
S&P 500 dropped 1% …
Jon C. Ogg, credit
Stock Market
Employee Productivity
Subjective Well-being
Happiness Index
ECG
Physical State
Emotional State
Emotion Identification
• Emotion
– “a strong feeling (such as love, anger, joy, hate, or fear)” --
Merriam-Webster Online Dictionary
• Emotion Identification
– the task of automatically identifying and extracting the
emotions expressed in a given text.
• Examples
“I hate when my mom compares me to my friends” -> Anger
“When I see a cop, no matter where I am or what I’m doing,
I always feel like every law I’ve broken is stamped all over
my body” -> Fear
Proposed Questions
• How to glean people’s emotions from their texts using machine
learning techniques?
• How to create large self-labeled emotion data from social media?
• How to improve emotion identification in target domains (e.g.,
blog, diary) by leveraging large self-labeled emotion data from
social media?
1. EMOTION CLASSIFICATION
Wenbo Wang, Lu Chen, Ming Tan, Shaojun Wang, Amit P. Sheth. Discovering Fine-
grained Sentiment in Suicide Notes. Biomedical Informatics Insights, 2012
Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Harnessing
Twitter ‘Big Data’ for Automatic Emotion Identification. 2012 ASE International
Conference on Social Computing (SocialCom 2012)
Background - Classification
Credit: nltk
Dataset Description
• Suicide notes
– 15 fine-grained emotions
– Training: 4,633 sentences
– Testing: 2,086 sentences
• Twitter data
– 7 emotions
– Training: ~250 K tweets
– Testing: 250 K tweets
Suicide Notes Dataset
Sentence example:
“I loved you and was proud of
you.”
Unigrams: i, love, you, and,
be, proud, of, you, .
Bigrams: I love, love you, you
and, and be, be proud, proud
of, of you, you .
The combination of unigrams and
bigrams performs the best among n-gram
features.
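A minimal sketch of how the unigram and bigram features above could be extracted. It assumes the sentence has already been lower-cased and lemmatized into the tokens shown on the slide; the helper name ngrams is illustrative and not taken from the original work.

```python
def ngrams(tokens, n):
    """Return all n-grams over a token list, each joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Lemmatized, lower-cased tokens of the example sentence from the slide.
tokens = ["i", "love", "you", "and", "be", "proud", "of", "you", "."]

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# Combined unigram + bigram feature list, the best-performing n-gram setting.
features = unigrams + bigrams
```

Repeated n-grams (such as "you") are kept here; whether features are binary or counted is a separate modeling choice that the slides do not state.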
Suicide Notes Dataset
Sentence example:
“I loved you and was proud of
you.”
LIWC Knowledge:
Posemo: 2 (love, proud)
Negemo: 0
Anger: 0
Sad: 0
Adding knowledge-based features
further increases the performance.
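A small sketch of how lexicon category counts like the ones above could be turned into features. LIWC is a licensed dictionary, so TOY_LEXICON below is a hypothetical stand-in: only the two "posemo" entries come from the slide, and the remaining entries are invented for illustration.

```python
from collections import Counter

# Hypothetical stand-in lexicon, NOT LIWC data. Only "love" and "proud"
# (posemo) are taken from the slide; the other entries are made up.
TOY_LEXICON = {
    "love": ["posemo"],
    "proud": ["posemo"],
    "hate": ["negemo", "anger"],
    "cry": ["negemo", "sad"],
}
CATEGORIES = ["posemo", "negemo", "anger", "sad"]


def lexicon_counts(lemmas):
    """Count how many lemmas fall into each lexicon category."""
    counts = Counter()
    for lemma in lemmas:
        counts.update(TOY_LEXICON.get(lemma, []))
    # Report every category, including those with zero hits.
    return {category: counts[category] for category in CATEGORIES}


lemmas = ["i", "love", "you", "and", "be", "proud", "of", "you", "."]
knowledge_features = lexicon_counts(lemmas)
```

The resulting counts can be appended to the n-gram feature vector of the same sentence.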
Suicide Notes Dataset
Sentence example:
“I loved you and was proud of
you .”
POS count:
Adjective: 1 (proud)
Noun: 0 ()
Pronoun: 3 (i, you)
…
Sentence tense:
Simple past tense: 2 (I loved,
was proud)
Adding sentence tenses and POS counts
further increases the performance
Twitter Dataset – Supervised Classifier
Applying only adjectives performs poorly because
emotions can be implicitly expressed in text.
Twitter Dataset – Supervised Classifier
The combination of unigrams and bigrams
performs the best among n-gram features.
Twitter Dataset – Supervised Classifier
Knowledge features and syntactic features
become less important on Twitter data.
Challenge: The Lack of Training Data
• Emotion annotation is typically time-consuming,
expensive and error-prone.
– multiple emotion categories
– subtle and ambiguous emotion expressions
– Human judgement of emotion tends to be subjective and
varied.
• Most existing datasets are small, e.g.,
– Blog: 1,890 sentences (Aman and Szpakowicz 2008)
– Experience: 1,000 sentences (Neviarouskaya et al. 2010)
– Diary: 700 sentences (Neviarouskaya et al. 2011)
Why do We Need More Training Data? (I)
[Figure: Learning Curves for Confusion Set Disambiguation (Figure 1 of Banko and Brill 2001). x-axis: Millions of Words (0.1 to 1000); y-axis: Test Accuracy (0.70 to 1.00); learners: Memory-Based, Winnow, Perceptron, Naïve Bayes. The excerpt shown mentions a 1-billion-word training corpus collected from a variety of English texts.]
“We may want to reconsider the
trade-off between spending
time and money on algorithm
development versus spending it
on corpus development”
-- (Banko and Brill 2001)
From (Banko and Brill 2001)
Why do We Need More Training Data? (II)
• Emotions arise in various situations, which leads to very
diverse expressions conveying the emotions.
“I hate when my mom compares me to my friends”
“When I see a cop, no matter where I am or what
I’m doing, I always feel like every law I’ve broken is
stamped all over my body”
“I hate when I get the hiccups in class”
“Omg I finally fit into one pair of my jeans from last
year!!”
“A dog barked at me!”
The Use of Hashtags on Twitter
“I hate when my mom compares me to my friends
#annoying”
“When I see a cop, no matter where I am or what
I’m doing, I always feel like every law I’ve broken is
stamped all over my body #nervous”
“I hate when I get the hiccups in class
#embarrassing”
“Omg I finally fit into one pair of my jeans from last
year!! #excited”
“A dog barked at me! #scared #weak”
2. SELF-LABELED DATA CREATION
Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Harnessing
Twitter ‘Big Data’ for Automatic Emotion Identification. 2012 ASE International
Conference on Social Computing (SocialCom 2012)
Emotion Hashtags
• From existing psychology literature (Shaver et al.
1987), collected 7 sets of emotion words for 7 different
emotions – joy, sadness, anger, love, fear,
thankfulness, and surprise.
Emotion       Hashtag word examples (number of words)     Number of tweets
Joy           excited, happy, elated, proud (36)          706,182
Sadness       sorrow, unhappy, depressing, lonely (36)    616,471
Anger         irritating, annoyed, frustrate, fury (23)   574,170
Love          affection, lovin, loving, fondness (7)      301,759
Fear          fear, panic, fright, worry, scare (22)      135,154
Thankfulness  thankfulness, thankful (2)                  131,340
Surprise      surprised, astonished, unexpected (5)       23,906
Total         131                                         2,488,982
Removing Irrelevant Tweets
Hashtag count > 2
Emotion hashtag is not at the end
Word count < 5
Has URL or quotations
About 5 million tweets -> 2,488,982 tweets
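The removal rules above, together with hashtag-based labeling, can be sketched as follows. This is an illustration under stated assumptions, not the original pipeline: EMOTION_HASHTAGS is a tiny hypothetical subset built from the slide examples, the word count is taken over whitespace tokens excluding the label hashtag, and the function name self_label is invented here.

```python
import re

# Hypothetical subset of emotion hashtags, mapped using the slide examples.
EMOTION_HASHTAGS = {
    "annoying": "anger",
    "nervous": "fear",
    "scared": "fear",
    "excited": "joy",
}

URL_PATTERN = re.compile(r"https?://\S+")
QUOTE_CHARS = ('"', "\u201c", "\u201d")  # straight and curly double quotes


def self_label(tweet):
    """Return (text, emotion) for a usable tweet, or None if it is removed."""
    tokens = tweet.split()
    hashtags = [t for t in tokens if t.startswith("#")]
    if len(hashtags) > 2:
        return None  # rule: hashtag count > 2
    if URL_PATTERN.search(tweet) or any(q in tweet for q in QUOTE_CHARS):
        return None  # rule: has URL or quotations
    if not tokens or not tokens[-1].startswith("#"):
        return None  # rule: emotion hashtag is not at the end
    emotion = EMOTION_HASHTAGS.get(tokens[-1][1:].lower())
    if emotion is None:
        return None  # the final hashtag is not an emotion hashtag
    body = tokens[:-1]  # strip the label so the classifier cannot see it
    if len(body) < 5:
        return None  # rule: word count < 5
    return " ".join(body), emotion


kept = self_label(
    "Omg I finally fit into one pair of my jeans from last year!! #excited"
)
```

Stripping the label hashtag from the text matters: leaving it in would let a classifier learn the label directly from the hashtag.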
Results with Increasing Training Data
[Figure: accuracy (y-axis, 0.40 to 0.65) vs. number of tweets in training data (x-axis, 1,000 to 1,991,184); series: LIBLINEAR (labeled LR) and MNB. Values labeled on the chart: 0.4341, 0.5292, 0.6156, 0.6557.]
Logistic Regression (LR)
Training instance: 1K -> 2M
Accuracy: 0.4341 -> 0.6557
Percentage gain = 51.05%
Results with Increasing Training Data
[Figure: accuracy (y-axis, 0.40 to 0.65) vs. number of tweets in training data (x-axis, 1,000 to 1,991,184); series: LIBLINEAR (labeled LR) and MNB. Values labeled on the chart: 0.4580, 0.5426, 0.6113, 0.6350.]
Multinomial Naive Bayes (MNB)
Training instance: 1K -> 2M
Accuracy: 0.4580 -> 0.6350
Percentage gain = 38.65%
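The two percentage gains quoted for LR and MNB are relative improvements in accuracy between the smallest and largest training sets. The short check below reproduces them from the accuracies on the slides; the function name percentage_gain is illustrative.

```python
def percentage_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Accuracies at 1K and at about 2M training tweets, as given on the slides.
lr_gain = percentage_gain(0.4341, 0.6557)
mnb_gain = percentage_gain(0.4580, 0.6350)
```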
Detailed Results
For three popular emotions (76.2% of the tweets), the classifier
achieves F-measures of over 64%.
For three less popular emotions (22.8% of the tweets), precision
is higher than recall, and the F-measures are over 43%.
What Have We Learned?
• We can automatically create training datasets for
emotion identification by leveraging emotion hashtags
on Twitter.
– A large amount of labeled data are collected with little effort
and cost
– Covers a variety of situations that elicit emotions
– Performance gain with increasing size of training data
• However, there is still a lack of labeled data in many
other domains/data sources.
New Challenge
Lots of labeled tweets
Far less labeled data in
many other domains
Can we use emotion-labeled tweets to help emotion
identification in other domains?
3. DOMAIN ADAPTATION FOR EMOTION IDENTIFICATION
Wenbo Wang, Lu Chen, Keke Chen, Krishnaprasad Thirunarayan, Amit P. Sheth.
Domain Adaptation for Emotion Identification via Data Selection. Technical paper
(under review) 2015
Problem Definition
• Input
– Large amount of emotion-labeled tweets
– Small amount of labeled sentences from target
domains (e.g., blogs, fairy tales)
• Objective
– Select informative tweets and add them to target
domain training data, and train an adaptive classifier
for the target domain
The Bootstrapping Framework
Inputs: self-labeled tweets; target domain labeled data
• Train classifier c
• Apply c to tweets (correctly classified vs. misclassified)
• Identify informative tweets from misclassified tweets
• Add them to target domain training data
Why select from misclassified tweets?
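The bootstrapping loop can be sketched as below. This is a schematic under assumptions, not the dissertation's implementation: the train and informativeness callables, the per_round and rounds parameters, and the toy keyword classifier used to exercise the loop are all hypothetical.

```python
def bootstrap(target_train, tweets, train, informativeness, per_round, rounds):
    """Grow the target training set with informative misclassified tweets.

    target_train, tweets: lists of (text, label) pairs.
    train: takes labeled pairs and returns a predict(text) function.
    informativeness: scores a (text, label) pair; higher means more useful.
    """
    train_set = list(target_train)
    pool = list(tweets)
    for _ in range(rounds):
        predict = train(train_set)  # train classifier c
        # Apply c to the tweets and keep the ones it gets wrong.
        misclassified = [ex for ex in pool if predict(ex[0]) != ex[1]]
        if not misclassified:
            break
        misclassified.sort(key=informativeness, reverse=True)
        selected = misclassified[:per_round]
        train_set.extend(selected)  # add them to target domain training data
        pool = [ex for ex in pool if ex not in selected]
    return train_set


def toy_train(pairs):
    """Toy keyword classifier, used only to exercise the loop."""
    vocab = {}
    for text, label in pairs:
        for word in text.split():
            vocab.setdefault(word, label)

    def predict(text):
        for word in text.split():
            if word in vocab:
                return vocab[word]
        return "joy"  # arbitrary default label

    return predict


target = [("happy day", "joy")]
source = [("so scared", "fear"), ("happy times", "joy"), ("very scared now", "fear")]
result = bootstrap(target, source, toy_train, lambda ex: len(ex[0]), per_round=1, rounds=3)
```

In the toy run, one misclassified fear tweet is added in the first round; after retraining, the remaining tweets are classified correctly and the loop stops.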
Informativeness Overview
Consistency Diversity Similarity
Consistency
• Fear: “Amazing night with my baby. Hope she liked our
anniversary present. Alil early but whatever. :) hopefully tmmrw
goes as planned.”
– Top supporting features for emotion fear
– Top supporting features for any emotion other than fear
– Use the margin to estimate consistency:
0.5094 – 0.5962 = -0.0868
Consistency measures how consistent a tweet’s label is with its content.
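The margin in the fear example can be reproduced with a small helper. How the two support scores are derived from the top supporting features is not shown in the transcript, so the sketch takes them as given; the dictionary keys and the function name label_margin are assumptions made for illustration.

```python
def label_margin(scores, label):
    """Support for the assigned label minus the best support for any other label."""
    best_other = max(score for name, score in scores.items() if name != label)
    return scores[label] - best_other


# Both numbers are from the slide; the key "other_than_fear" is a placeholder
# for the strongest competing emotion.
scores = {"fear": 0.5094, "other_than_fear": 0.5962}
consistency = label_margin(scores, "fear")
```

A negative margin, as here, indicates that the content supports some other emotion more strongly than the hashtag label.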
Diversity
• Sadness: “Searching for vinyl proved to be quite disappointing”
– “disappoint” occurs 2 times
• Sadness: “I'm about to lose everything I've ever wanted, my
whole world, and it's all my fault..”
– “lose” occurs 15 times
[Figure: diversity (y-axis, 0.00 to 1.00) vs. term_freq (x-axis, 0 to 100). Marked points: 0.9048 (disappoint), 0.4724 (lose).]
Diversity of a feature is an exponential decay of its term frequency in target domain training data.
Diversity encourages the selection of source instances containing
discriminative features that are infrequent or underrepresented in
the target domain.
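The two diversity values on the slide are consistent with an exponential decay exp(-term_freq / 20). The decay scale of 20 is inferred here from those two data points and is not stated in the transcript, so treat it as an assumption.

```python
import math

# Inferred from the slide's two plotted points; not stated in the source.
DECAY_SCALE = 20.0


def diversity(term_freq):
    """Exponential decay of a feature's frequency in target training data."""
    return math.exp(-term_freq / DECAY_SCALE)


rare = diversity(2)     # "disappoint" occurs 2 times
common = diversity(15)  # "lose" occurs 15 times
```

A feature that never occurs in the target training data gets the maximum diversity of 1.0, and the score approaches 0 as the feature becomes frequent.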
Similarity Intuition
• Inspired by domain adaptation for machine translation
studies that select source instances similar to test
instances (Eck et al., 2004; Lu et al., 2007)
• Given a target test sentence
– Disgust: “im sick of look at a comput screen.”
• Retrieve most similar tweets
– Anger: “im sick and tire of look like a fool”
– Joy: “i have get usb fairi light around my comput screen .”
Content Similarity is not sufficient!
Similarity Overview
Content similarity
Label similarity
Uncertainty
Content Similarity
• Upweight important words
– Source instance:
– Target test instance: inverse document frequency
(idf)
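The weighting formulas on this slide did not survive extraction, so the following is only one plausible reading: an idf-weighted cosine similarity between a target test sentence and a source tweet. The idf definition log(N / df), the use of cosine similarity, and all function names are assumptions; in practice idf would be computed over a large corpus rather than three sentences.

```python
import math
from collections import Counter


def idf_table(documents):
    """Map each term to log(N / df), where df counts documents containing it."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n_docs / freq) for term, freq in doc_freq.items()}


def idf_vector(tokens, idf):
    """Term-frequency vector with each term upweighted by its idf."""
    return {term: count * idf.get(term, 0.0) for term, count in Counter(tokens).items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Stemmed example sentences from the Similarity Intuition slide (punctuation dropped).
test_sentence = "im sick of look at a comput screen".split()
tweet_anger = "im sick and tire of look like a fool".split()
tweet_joy = "i have get usb fairi light around my comput screen".split()

idf = idf_table([test_sentence, tweet_anger, tweet_joy])
test_vec = idf_vector(test_sentence, idf)
sim_anger = cosine(test_vec, idf_vector(tweet_anger, idf))
sim_joy = cosine(test_vec, idf_vector(tweet_joy, idf))
```

Both tweets receive a non-zero content similarity to the test sentence even though only one carries a related emotion, which is why content similarity alone is not sufficient.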
Label Similarity
• Target test sentence
• Disgust: “im sick of look at a comput screen.”
• Source tweet
• Anger: “im sick and tire of look like a fool”
• How likely will the test sentence express anger?
• Apply the same formula used for Consistency factor
• Top supporting features for emotion anger
• Top supporting features for any emotion other than anger
• Use the margin to estimate consistency: 0.5838 – 0.625 = -0.0412
Uncertainty
Sentence                                        Label  Predicted label  Classifier confidence  Uncertainty
the second day i go in and i be so paranoid .  Fear   Sadness          0.2352                 0.7648
we are total awesome!                           Joy    Joy              0.8683                 0.1317
The more confident the classifier is, the more likely the prediction
is correct, the less focus we should give to this sentence.
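In the table, uncertainty is one minus the classifier's confidence in its predicted label. A short sketch that reproduces the two rows and ranks test sentences so the least confident prediction comes first; the function and variable names are illustrative.

```python
def uncertainty(confidence):
    """Uncertainty of a prediction, given the classifier's confidence in it."""
    return 1.0 - confidence


# (sentence, predicted label, classifier confidence) from the slide.
predictions = [
    ("the second day i go in and i be so paranoid .", "Sadness", 0.2352),
    ("we are total awesome!", "Joy", 0.8683),
]

# Sentences the classifier is least sure about should get the most focus.
ranked = sorted(predictions, key=lambda row: uncertainty(row[2]), reverse=True)
```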
Similarity Revisit
• Encourage the selection of source instances that share high
content and label similarities with target domain test instances
that classifier c is most uncertain about.
Content similarity
Label similarity
Uncertainty
Informativeness Revisit
• A tweet is informative when
– 1) its label is consistent with its content
– AND 2) it contains a discriminative feature that is infrequent in
target training data
– AND 3) it is similar to a target domain test instance whose
label cannot be predicted by the classifier c with high
confidence.
Consistency Diversity Similarity
Our proposed approach: CDS
Baseline approaches
• Source Only (SO): train classifiers using only Twitter data
• Target Only (TO): train classifiers using only target domain
training data
• Feature Injection (FI): first train a source classifier using only
source data (Daumé III, 2007)
• Feature Augmentation (FA) (Daumé III, 2007)
– Source instances: X -> (X, X, 0) (common, source, target)
– Target instances: X -> (X, 0, X) (common, source, target)
• Balance Weight (BW): assign larger weights to the target
instances so that the weight sum of target instances equals
that of source instances (Jiang and Zhai, 2007)
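Two of the baselines above are simple enough to sketch directly. The feature augmentation mapping follows the (common, source, target) layout on the slide; for Balance Weight, the sketch assumes every source instance has weight 1, so each target instance gets weight n_source / n_target. The function names are illustrative.

```python
def augment(features, domain):
    """Feature augmentation: copy x into (common, source, target) blocks."""
    zeros = [0.0] * len(features)
    if domain == "source":
        return features + features + zeros  # (x, x, 0)
    if domain == "target":
        return features + zeros + features  # (x, 0, x)
    raise ValueError("domain must be 'source' or 'target'")


def balance_weight(n_source, n_target):
    """Per-instance target weight that makes the two weight sums equal.

    Assumes every source instance has weight 1.
    """
    return n_source / n_target


source_vec = augment([1.0, 2.0], "source")
target_vec = augment([1.0, 2.0], "target")
target_weight = balance_weight(1000, 50)
```

With augmentation, a single linear classifier can learn weights shared across domains in the common block and domain-specific corrections in the other two blocks.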
Experimental settings
• Features
– Experimented with unigrams, bigrams, and unigrams+bigrams
– Used unigrams in the end
• Logistic regression
– Fast; supports probability output (used for uncertainty)
• Five-fold cross validation
– Four folds for training; one fold for testing
• Add-0.5 smoothing
Results on four target datasets*
[Figure: results chart. Percentage gains labeled on the chart: 8.01%, 24.07%, 36.53%, 3.62%, 16.45%.]
*: The numbers are different from those in the dissertation defense video, because I fixed a bug after that. Results
got slightly improved because of this.
Different Instance Selection Strategies
• CDS: select tweets from misclassified tweets
• CD: removed similarity factor from CDS
• CDS-ALL: select tweets from all source tweets
• CDS-CORR: select tweets from source tweets that can be
correctly classified by c
Comparing instance selection strategies
Among all the strategies, CDS
improves F1 the fastest.
Comparing instance selection strategies
CDS-ALL achieves a similar performance
as CDS does but takes more iterations,
because the input of CDS-ALL is a
superset of CDS.
Comparing instance selection strategies
CDS-CORR performs the worst because it
selects tweets from correctly classified tweets,
the knowledge of which might already exist in
target domains.
Summary
• People’s emotions can be gleaned from their texts using machine learning
techniques.
– The combination of n-grams (n=1,2), knowledge-based and syntactic features
achieves the best performance.
– Knowledge features and syntactic features become less important on large training
data.
• We can automatically create a large training dataset for emotion identification
by leveraging emotion hashtags on Twitter.
– A large amount of labeled data are collected with little effort and cost
– Covers a variety of situations that elicit emotions
– Performance gain with increasing size of training data
• This self-labeled emotion dataset can be used to improve emotion
identification in text from other domains/data sources.
– Domain adaptation via selecting tweets that are informative to the target domain
– Selecting from source instances that cannot be correctly classified works best.
– Informativeness of a source instance is measured by three factors: consistency,
diversity and similarity.
Publications
• Wenbo Wang, Lei Duan, Anirudh Koul, Amit P. Sheth. YouRank: Let User Engagement Rank
Microblog Search Results. In the Eighth International AAAI Conference on Weblogs and Social
Media (ICWSM'14) 2014
• Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Cursing in English on Twitter.
In ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW'14)
2014
• Amit Sheth, Ashutosh Jadhav, Pavan Kapanipathi, Lu Chen, Hemant Purohit, Gary Alan Smith, and
Wenbo Wang. "Twitris: A system for collective social intelligence." In Encyclopedia of Social
Network Analysis and Mining, pp. 2240-2253. Springer New York, 2014.
• Lu Chen, Wenbo Wang, Amit P. Sheth. Are Twitter Users Equal in Predicting Elections? A Study of
User Groups in Predicting 2012 U.S. Republican Presidential Primaries. In Proceedings of the
Fourth International Conference on Social Informatics (SocInfo'12) 2012
• Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Harnessing Twitter ‘Big Data’
for Automatic Emotion Identification. 2012 ASE International Conference on Social Computing
(SocialCom 2012), 2012
• Lu Chen, Wenbo Wang, Meenakshi Nagarajan, Shaojun Wang, Amit P. Sheth. Extracting Diverse
Sentiment Expressions with Target-dependent Polarity from Twitter. In Proceedings of the 6th
International AAAI Conference on Weblogs and Social Media (ICWSM), 2012
• Wenbo Wang, Lu Chen, Ming Tan, Shaojun Wang, Amit P. Sheth. Discovering Fine-grained
Sentiment in Suicide Notes. Biomedical Informatics Insights, 2012
• Ramakanth Kavuluru, Christopher Thomas, Amit Sheth, Victor Chan, Wenbo Wang, Alan Smith, An
Up-to-date Knowledge-Based Literature Search and Exploration Framework for Focused
Bioscience Domains, IHI 2012 - 2nd ACM SIGHIT Intl Health Informatics Symposium, January 28-
30, 2012.
• Wenbo Wang, Christopher Thomas, Amit Sheth, Victor Chan. Pattern-Based Synonym and
Antonym Extraction. 48th ACM Southeast Conference, ACMSE2010, Oxford Mississippi, April 15-
17, 2010
• Christopher J. Thomas, Wenbo Wang, Pankaj Mehra, Delroy Cameron, Pablo N. Mendes, and Amit
P. Sheth. What Goes Around Comes Around – Improving Linked Open Data through On-Demand
Model Creation. In: Proceedings of the WebSci10: Extending the Frontiers of Society On-Line, April
26-27th, 2010, Raleigh, NC: US.
• Ashutosh Jadhav, Wenbo Wang, Raghava Mutharaju, Pramod Anantharam, Vinh Nyugen, Amit P.
Sheth, Karthik Gomadam, Meenakshi Nagarajan, and Ajith Ranabahu, Twitris: Socially Influenced
Browsing, Semantic Web Challenge 2009, demo at 8th International Semantic Web Conference,
Oct. 25-29 2009, Washington, DC, USA
Patents & Proposal
• Wenbo Wang, Lei Duan. "Temporal User Engagement Features", U.S. Patent
No. 20,150,120,753. 30 Apr. 2015.
• Lu Chen, Wenbo Wang, Amit Sheth. "Topic-specific Sentiment Extraction", U.S.
Patent No. 20,140,358,523. 4 Dec. 2014.
• Context-Aware Harassment Detection on Social Media. NSF proposal
Special thanks to AFRL and NSF
*Part of this material is based upon work supported by the National Science Foundation under Grant IIS-1111182,
"SoCS: Collaborative Research: Social Media Enhanced Organizational Sensemaking in Emergency Response."
Thank You! & Questions?

Go Reboot Yourself: Get a Grip on Your TechGo Reboot Yourself: Get a Grip on Your Tech
Go Reboot Yourself: Get a Grip on Your Tech
 
Emotional Inteligence
Emotional InteligenceEmotional Inteligence
Emotional Inteligence
 
Emotionalinteligence
EmotionalinteligenceEmotionalinteligence
Emotionalinteligence
 
SAMA 2016 for LinkedIn
SAMA 2016 for LinkedInSAMA 2016 for LinkedIn
SAMA 2016 for LinkedIn
 

Kürzlich hochgeladen

GST rules for Assessment, inspection, seizue, arrest.pptx.pdf
GST rules for Assessment, inspection, seizue, arrest.pptx.pdfGST rules for Assessment, inspection, seizue, arrest.pptx.pdf
GST rules for Assessment, inspection, seizue, arrest.pptx.pdfGauravSharma214068
 
Sustaining Freshness- Tacking Food Waste and Plastic Packaging in the Supply ...
Sustaining Freshness- Tacking Food Waste and Plastic Packaging in the Supply ...Sustaining Freshness- Tacking Food Waste and Plastic Packaging in the Supply ...
Sustaining Freshness- Tacking Food Waste and Plastic Packaging in the Supply ...Jasper Colin
 
Supermarket Floral Ad Roundup- Week 15 2024.pdf
Supermarket Floral Ad Roundup- Week 15 2024.pdfSupermarket Floral Ad Roundup- Week 15 2024.pdf
Supermarket Floral Ad Roundup- Week 15 2024.pdfKarliNelson4
 
Clothing Display Racks Catlogue - Fidvi Steel
Clothing Display Racks Catlogue - Fidvi SteelClothing Display Racks Catlogue - Fidvi Steel
Clothing Display Racks Catlogue - Fidvi SteelFidvi Ferro
 
Top 5 Tech Trends Revolutionizing Retail in 2024
Top 5 Tech Trends Revolutionizing Retail in 2024Top 5 Tech Trends Revolutionizing Retail in 2024
Top 5 Tech Trends Revolutionizing Retail in 2024WWG
 
Calvin Klein Proposal/Analysis- Brand Overview
Calvin Klein Proposal/Analysis- Brand OverviewCalvin Klein Proposal/Analysis- Brand Overview
Calvin Klein Proposal/Analysis- Brand OverviewNathifa DeAndrade
 

Kürzlich hochgeladen (6)

GST rules for Assessment, inspection, seizue, arrest.pptx.pdf
GST rules for Assessment, inspection, seizue, arrest.pptx.pdfGST rules for Assessment, inspection, seizue, arrest.pptx.pdf
GST rules for Assessment, inspection, seizue, arrest.pptx.pdf
 
Sustaining Freshness- Tacking Food Waste and Plastic Packaging in the Supply ...
Sustaining Freshness- Tacking Food Waste and Plastic Packaging in the Supply ...Sustaining Freshness- Tacking Food Waste and Plastic Packaging in the Supply ...
Sustaining Freshness- Tacking Food Waste and Plastic Packaging in the Supply ...
 
Supermarket Floral Ad Roundup- Week 15 2024.pdf
Supermarket Floral Ad Roundup- Week 15 2024.pdfSupermarket Floral Ad Roundup- Week 15 2024.pdf
Supermarket Floral Ad Roundup- Week 15 2024.pdf
 
Clothing Display Racks Catlogue - Fidvi Steel
Clothing Display Racks Catlogue - Fidvi SteelClothing Display Racks Catlogue - Fidvi Steel
Clothing Display Racks Catlogue - Fidvi Steel
 
Top 5 Tech Trends Revolutionizing Retail in 2024
Top 5 Tech Trends Revolutionizing Retail in 2024Top 5 Tech Trends Revolutionizing Retail in 2024
Top 5 Tech Trends Revolutionizing Retail in 2024
 
Calvin Klein Proposal/Analysis- Brand Overview
Calvin Klein Proposal/Analysis- Brand OverviewCalvin Klein Proposal/Analysis- Brand Overview
Calvin Klein Proposal/Analysis- Brand Overview
 

Automatic Emotion Identification from Text

  • 1. Ohio Center of Excellence in Knowledge-Enabled Computing Automatic Emotion Identification from Text Wenbo Wang Kno.e.sis Center Advisor: Dr. Amit P. Sheth Committee members: Dr. Keke Chen Kevin Haas Dr. T.K. Prasad Dr. Ramakanth Kavuluru Ph.D. Dissertation Defense
  • 2. Sadness, Anger, Fear, Joy. “Your emotions are the slaves to your thoughts, and you are the slave to your emotions.” -- Elizabeth Gilbert
  • 3. Stock Market: S&P 500 dropped 1% … (credit: Jon C. Ogg)
  • 4. Employee Productivity
  • 5. Subjective Well-being: ECG (physical state), Happiness Index (emotional state)
  • 8. Emotion Identification • Emotion – “a strong feeling (such as love, anger, joy, hate, or fear)” -- Merriam-Webster Online Dictionary • Emotion Identification – the task of automatically identifying and extracting the emotions expressed in a given text. • Examples: “I hate when my mom compares me to my friends” -> Anger; “When I see a cop, no matter where I am or what I’m doing, I always feel like every law I’ve broken is stamped all over my body” -> Fear
  • 9. Proposed Questions • How to glean people’s emotions from their texts using machine learning techniques? • How to create large self-labeled emotion data from social media? • How to improve emotion identification in target domains (e.g., blog, diary) by leveraging large self-labeled emotion data from social media?
  • 10. 1. EMOTION CLASSIFICATION. Wenbo Wang, Lu Chen, Ming Tan, Shaojun Wang, Amit P. Sheth. Discovering Fine-grained Sentiment in Suicide Notes. Biomedical Informatics Insights, 2012. Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Harnessing Twitter ‘Big Data’ for Automatic Emotion Identification. 2012 ASE International Conference on Social Computing (SocialCom 2012)
  • 11. Background - Classification (credit: nltk)
  • 12. Dataset Description • Suicide notes – 15 fine-grained emotions – Training: 4,633 sentences – Testing: 2,086 sentences • Twitter data – 7 emotions – Training: ~250 K tweets – Testing: 250 K tweets
  • 13. Suicide Notes Dataset. Sentence example: “I loved you and was proud of you.” Unigrams: i, love, you, and, be, proud, of, you, . Bigrams: i love, love you, you and, and be, be proud, proud of, of you, you . The combination of unigrams and bigrams performs the best among n-gram features.
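The unigram and bigram lists on this slide can be reproduced with a few lines of Python. This is a minimal sketch that assumes the sentence has already been tokenized and lemmatized as shown on the slide; the helper name `ngrams` is illustrative and not taken from the deck.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Lemmatized tokens copied from the slide's example sentence.
tokens = ["i", "love", "you", "and", "be", "proud", "of", "you", "."]

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# The combined unigram + bigram feature set that the slide reports as the
# best-performing n-gram configuration.
features = unigrams + bigrams
print(bigrams)
```

Concatenating the two lists keeps every single word and every adjacent word pair as a separate feature, which is the combination the slide refers to.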
  • 14. Suicide Notes Dataset. Sentence example: “I loved you and was proud of you.” LIWC knowledge features: Posemo: 2 (love, proud); Negemo: 0; Anger: 0; Sad: 0. Adding knowledge-based features further increases the performance.
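Knowledge-based features of this kind are category counts over a word lexicon. The sketch below uses a tiny made-up lexicon as a stand-in, because the real LIWC dictionary is licensed and much larger; `TOY_LEXICON` and `category_counts` are hypothetical names, and only the resulting counts for the example sentence come from the slide.

```python
# Hypothetical stand-in lexicon; NOT the actual LIWC word lists.
TOY_LEXICON = {
    "posemo": {"love", "proud", "happy"},
    "negemo": {"hate", "sad", "angry"},
    "anger": {"hate", "angry"},
    "sad": {"sad", "lonely"},
}


def category_counts(tokens, lexicon=TOY_LEXICON):
    """Count how many tokens fall into each lexicon category."""
    counts = {category: 0 for category in lexicon}
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    return counts


tokens = ["i", "love", "you", "and", "be", "proud", "of", "you", "."]
liwc_features = category_counts(tokens)
print(liwc_features)
```

Each category count becomes one extra feature alongside the n-gram features.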
  • 15. Suicide Notes Dataset. Sentence example: “I loved you and was proud of you.” POS count: Adjective: 1 (proud); Noun: 0; Pronoun: 3 (i, you); … Sentence tense: Simple past tense: 2 (I loved, was proud). Adding sentence tenses and POS counts further increases the performance.
  • 16. Twitter Dataset – Supervised Classifier. Using only adjectives as features performs poorly because emotions can be implicitly expressed in text.
  • 17. Twitter Dataset – Supervised Classifier. The combination of unigrams and bigrams performs the best among n-gram features.
  • 18. Twitter Dataset – Supervised Classifier. Knowledge features and syntactic features become less important on Twitter data.
  • 19. Challenge: The Lack of Training Data • Emotion annotation is typically time-consuming, expensive and error-prone. – multiple emotion categories – subtle and ambiguous emotion expressions – Human judgement of emotion tends to be subjective and varied. • Most existing datasets are small, e.g., – Blog: 1,890 sentences (Aman and Szpakowicz 2008) – Experience: 1,000 sentences (Neviarouskaya et al. 2010) – Diary: 700 sentences (Neviarouskaya et al. 2011)
  • 20. Why do We Need More Training Data? (I) [Figure from (Banko and Brill 2001), Figure 1, “Learning Curves for Confusion Set Disambiguation”: test accuracy (0.70 to 1.00) versus millions of words of training data (0.1 to 1000, log scale) for Memory-Based, Winnow, Perceptron and Naïve Bayes learners.] “We may want to reconsider the trade-off between spending time and money on algorithm development versus spending it on corpus development” -- (Banko and Brill 2001)
  • 21. Why do We Need More Training Data? (II) • Emotions arise in various situations, which leads to very diverse expressions conveying the emotions. “I hate when my mom compares me to my friends” “When I see a cop, no matter where I am or what I’m doing, I always feel like every law I’ve broken is stamped all over my body” “I hate when I get the hiccups in class” “Omg I finally fit into one pair of my jeans from last year!!” “A dog barked at me!”
  • 22. The Use of Hashtags on Twitter. “I hate when my mom compares me to my friends #annoying” “When I see a cop, no matter where I am or what I’m doing, I always feel like every law I’ve broken is stamped all over my body #nervous” “I hate when I get the hiccups in class #embarrassing” “Omg I finally fit into one pair of my jeans from last year!! #excited” “A dog barked at me! #scared #weak”
  • 23. 2. SELF-LABELED DATA CREATION. Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Harnessing Twitter ‘Big Data’ for Automatic Emotion Identification. 2012 ASE International Conference on Social Computing (SocialCom 2012)
  • 24. Emotion Hashtags • From existing psychology literature (Shaver et al. 1987), collected 7 sets of emotion words for 7 different emotions – joy, sadness, anger, love, fear, thankfulness, and surprise.
      Emotion | Hashtag word examples (set size) | Number of tweets
      Joy | excited, happy, elated, proud (36) | 706,182
      Sadness | sorrow, unhappy, depressing, lonely (36) | 616,471
      Anger | irritating, annoyed, frustrate, fury (23) | 574,170
      Love | affection, lovin, loving, fondness (7) | 301,759
      Fear | fear, panic, fright, worry, scare (22) | 135,154
      Thankfulness | thankfulness, thankful (2) | 131,340
      Surprise | surprised, astonished, unexpected (5) | 23,906
      Total | (131) | 2,488,982
  • 25. Removing Irrelevant Tweets. Filters: hashtag count > 2; emotion hashtag is not at the end; word count < 5; has URL or quotations. About 5 million tweets -> 2,488,982 tweets
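The four filters on this slide can be sketched as a single predicate. This is an illustration under stated assumptions, not the author's implementation. It assumes whitespace tokenization, counts only non-hashtag tokens as words, and treats "at the end" as "within the trailing run of hashtags". The names `is_irrelevant`, `URL_RE` and `QUOTE_CHARS` are hypothetical.

```python
import re

# Assumed URL pattern and quotation characters; the slide does not specify them.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
QUOTE_CHARS = ('"', "“", "”")


def is_irrelevant(tweet, emotion_hashtags):
    """Return True when a tweet matches any removal filter from the slide."""
    tokens = tweet.split()
    hashtags = [t for t in tokens if t.startswith("#")]
    words = [t for t in tokens if not t.startswith("#")]

    # Filter 1: hashtag count > 2.
    if len(hashtags) > 2:
        return True

    # Filter 2: no emotion hashtag in the trailing run of hashtags.
    trailing = []
    for token in reversed(tokens):
        if not token.startswith("#"):
            break
        trailing.append(token.lstrip("#").lower())
    if not any(tag in emotion_hashtags for tag in trailing):
        return True

    # Filter 3: word count < 5 (hashtags excluded, an assumption).
    if len(words) < 5:
        return True

    # Filter 4: contains a URL or a quotation mark.
    if URL_RE.search(tweet) or any(q in tweet for q in QUOTE_CHARS):
        return True

    return False


emotion_tags = {"scared", "excited", "annoying"}
print(is_irrelevant("A dog barked at me! #scared #weak", emotion_tags))
```

A tweet that survives all four checks keeps its trailing emotion hashtag as its label.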
  • 26. Results with Increasing Training Data. [Figure: accuracy (0.40 to 0.65) versus number of tweets in training data (1,000; 10,000; 248,898; 497,796; 746,694; 995,592; 1,244,490; 1,493,388; 1,742,286; 1,991,184) for LIBLINEAR and MNB; annotated values: 0.4341, 0.5292, 0.6557, 0.6156.] Logistic Regression (LR): training instances 1K -> 2M, accuracy 0.4341 -> 0.6557, percentage gain = 51.05%.
  • 27. Results with Increasing Training Data. [Figure: same axes and series as the previous slide; annotated values: 0.4580, 0.5426, 0.6350, 0.6113.] Multinomial Naive Bayes (MNB): training instances 1K -> 2M, accuracy 0.4580 -> 0.6350, percentage gain = 38.65%.
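The percentage gains quoted on these two slides are relative improvements in accuracy between the smallest and largest training sets. A short check, using only the accuracy values printed on the slides (the function name `percentage_gain` is illustrative):

```python
def percentage_gain(baseline, final):
    """Relative improvement of `final` over `baseline`, in percent."""
    return (final - baseline) / baseline * 100.0


lr_gain = percentage_gain(0.4341, 0.6557)   # logistic regression, 1K -> 2M tweets
mnb_gain = percentage_gain(0.4580, 0.6350)  # multinomial naive Bayes, 1K -> 2M tweets
print(f"LR: {lr_gain:.2f}%  MNB: {mnb_gain:.2f}%")
```

Both results match the slide figures of 51.05% and 38.65% after rounding to two decimals.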
  • 28. Detailed Results. For three popular emotions (76.2% of the tweets), the classifier achieves F-measures of over 64%.
  • 29. Detailed Results. For three less popular emotions (22.8% of the tweets), precision is higher than recall, and the F-measures are over 43%.
  • 30. What Have We Learned? • We can automatically create training datasets for emotion identification by leveraging emotion hashtags on Twitter. – A large amount of labeled data is collected with little effort and cost – Covers a variety of situations that elicit emotions – Performance gain with increasing size of training data • However, there is still a lack of labeled data in many other domains/data sources.
  • 31. New Challenge. Lots of labeled tweets; far less labeled data in many other domains. Can we use emotion-labeled tweets to help emotion identification in other domains?
  • 32. 3. DOMAIN ADAPTATION FOR EMOTION IDENTIFICATION. Wenbo Wang, Lu Chen, Keke Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Domain Adaptation for Emotion Identification via Data Selection. Technical paper (under review) 2015
  • 33. Problem Definition • Input – Large amount of emotion-labeled tweets – Small amount of labeled sentences from target domains (e.g., blogs, fairy tales) • Objective – Select informative tweets and add them to target domain training data, and train an adaptive classifier for the target domain
  • 34. The Bootstrapping Framework. Inputs: self-labeled tweets; target domain labeled data. • Train classifier c • Apply c to tweets
  • 35. The Bootstrapping Framework. Target domain labeled data • Train classifier c • Apply c to tweets (correctly classified vs. misclassified) • Identify informative tweets from misclassified tweets • Add them to target domain training data. Why select from misclassified tweets?
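The loop described on these two slides can be sketched as follows. This is a minimal illustration, not the author's implementation. The trainer and the informativeness score are passed in as functions, and the toy majority-class trainer, the `toy_scores` table, and the `per_round` and `rounds` parameters are all assumptions made so the example runs on its own.

```python
from collections import Counter


def bootstrap(target_train, source_pool, train, informativeness,
              per_round=1, rounds=2):
    """Grow target training data with informative misclassified source tweets.

    target_train, source_pool: lists of (text, label) pairs.
    train(data) returns a function mapping text -> predicted label.
    informativeness(example, training) returns a score; higher is better.
    """
    training = list(target_train)
    pool = list(source_pool)
    for _ in range(rounds):
        classify = train(training)  # train classifier c
        # Apply c to the tweets and keep the misclassified ones.
        missed = [(x, y) for x, y in pool if classify(x) != y]
        if not missed:
            break
        # Rank misclassified tweets by informativeness and take the top ones.
        missed.sort(key=lambda ex: informativeness(ex, training), reverse=True)
        selected = missed[:per_round]
        training.extend(selected)  # add them to the target training data
        pool = [ex for ex in pool if ex not in selected]
    return training


def train_majority(data):
    """Toy trainer: always predicts the most frequent label in `data`."""
    majority = Counter(label for _, label in data).most_common(1)[0][0]
    return lambda text: majority


# Made-up informativeness scores, for illustration only.
toy_scores = {"t1": 0.1, "t2": 0.9, "t3": 0.5}
target = [("a", "joy"), ("b", "joy"), ("c", "joy"), ("d", "fear")]
tweets = [("t1", "joy"), ("t2", "fear"), ("t3", "anger")]
augmented = bootstrap(target, tweets, train_majority,
                      lambda ex, training: toy_scores[ex[0]])
print(augmented[len(target):])
```

In this toy run the correctly classified tweet `t1` is never added, while the two misclassified tweets are added in order of their scores.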
  • 36. Informativeness Overview: Consistency, Diversity, Similarity
  • 37. Consistency. Consistency measures how consistent a tweet’s label is with its content. • Example tweet labeled Fear: “Amazing night with my baby. Hope she liked our anniversary present. Alil early but whatever. :) hopefully tmmrw goes as planned.” – Top supporting features for emotion fear – Top supporting features for any emotion other than fear – Use the margin to estimate consistency: 0.5094 – 0.5962 = -0.0868
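The margin on this slide is the support for the tweet's own label minus the strongest support for any other emotion. The sketch below takes the two support scores from the slide as given, because the transcript does not say how they are derived from the top supporting features. The function name `consistency_margin` is hypothetical.

```python
def consistency_margin(support_for_label, support_for_others):
    """Margin between support for the assigned label and for competing labels."""
    return support_for_label - support_for_others


# Support scores copied from the slide's Fear example.
margin = consistency_margin(0.5094, 0.5962)
print(round(margin, 4))
```

A negative margin, as here, means the content supports some other emotion more strongly than the hashtag label, so the tweet scores low on consistency.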
  • 38. Diversity. Diversity encourages the selection of source instances containing discriminative features that are infrequent or underrepresented in the target domain. It is an exponential decay of a feature’s term frequency in target domain training data. • Sadness: “Searching for vinyl proved to be quite disappointing” – “disappoint” occurs 2 times: diversity 0.9048 • Sadness: “I'm about to lose everything I've ever wanted, my whole world, and it's all my fault..” – “lose” occurs 15 times: diversity 0.4724 [Figure: diversity (0.00 to 1.00) versus term_freq (0 to 100).]
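The transcript does not state the decay constant, but exp(-term_freq / 20) reproduces both values on the slide. The sketch below therefore assumes a scale of 20. `DECAY_SCALE` and `diversity` are illustrative names, and the constant is an inference from the two data points, not a documented parameter.

```python
import math

# Assumption: inferred from the slide's two values, not stated in the source.
DECAY_SCALE = 20.0


def diversity(term_freq, scale=DECAY_SCALE):
    """Exponential decay of a feature's frequency in target training data."""
    return math.exp(-term_freq / scale)


print(round(diversity(2), 4))   # "disappoint" occurs 2 times
print(round(diversity(15), 4))  # "lose" occurs 15 times
```

Rare features such as "disappoint" keep a score near 1, while features the target data already covers well decay toward 0.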
  • 39. Similarity Intuition • Inspired by domain adaptation for machine translation studies that select source instances similar to test instances (Eck et al., 2004; Lu et al., 2007) • Given a target test sentence – Disgust: “im sick of look at a comput screen.” • Retrieve most similar tweets – Anger: “im sick and tire of look like a fool” – Joy: “i have get usb fairi light around my comput screen .” Content similarity is not sufficient!
  • 40. Similarity Overview: Content similarity, Label similarity, Uncertainty
  • 41. Content Similarity • Upweight important words – Source instance: – Target test instance: inverse document frequency (idf)
  • 42. Label Similarity • Target test sentence – Disgust: “im sick of look at a comput screen.” • Source tweet – Anger: “im sick and tire of look like a fool” • How likely will the test sentence express anger? • Apply the same formula used for the Consistency factor – Top supporting features for emotion anger – Top supporting features for any emotion other than anger – Use the margin to estimate consistency: 0.5838 – 0.625 = -0.0412
  • 43. Uncertainty. The more confident the classifier is, the more likely the prediction is correct, and the less focus we should give to this sentence.
      Sentence | Label | Predicted label | Classifier confidence | Uncertainty
      the second day i go in and i be so paranoid . | Fear | Sadness | 0.2352 | 0.7648
      we are total awesome! | Joy | Joy | 0.8683 | 0.1317
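The uncertainty values in this table equal one minus the classifier confidence. The sketch below assumes that confidence is the highest class probability output by the classifier. The probability distributions are made up except for their top values, which are copied from the slide, and `confidence_and_uncertainty` is a hypothetical helper.

```python
def confidence_and_uncertainty(class_probs):
    """Return (predicted label, confidence, uncertainty) from class probabilities."""
    label = max(class_probs, key=class_probs.get)
    confidence = class_probs[label]
    return label, confidence, 1.0 - confidence


# Only the top probability in each distribution comes from the slide;
# the remaining values are invented so that each distribution sums to 1.
paranoid = {"sadness": 0.2352, "fear": 0.2148, "anger": 0.20,
            "joy": 0.15, "love": 0.10, "surprise": 0.10}
awesome = {"joy": 0.8683, "love": 0.1317}

for probs in (paranoid, awesome):
    label, conf, unc = confidence_and_uncertainty(probs)
    print(label, round(conf, 4), round(unc, 4))
```

The first sentence gets the higher uncertainty, so it is the test instance that should attract more similar source tweets.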
  • 44. Similarity Revisit • Encourage the selection of source instances that share high content and label similarities with target domain test instances that classifier c is most uncertain about. (Content similarity, Label similarity, Uncertainty)
  • 45. Informativeness Revisit • A tweet is informative when – 1) its label is consistent with its content – AND 2) it contains a discriminative feature that is infrequent in target training data – AND 3) it is similar to a target domain test instance whose label cannot be predicted by the classifier c with high confidence. Our proposed approach: CDS (Consistency, Diversity, Similarity)
  • 46. Baseline approaches • Source Only (SO): train classifiers using only Twit • Target Only (TO): train classifiers using only target domain training data • Feature Injection (FI): first train a source classifier using only source data (Daume III, 2007) • Feature Augmentation (FA) (Daume III, 2007) – Source instances: X -> XX0 (common, source, target) – Target instances: X -> X0X (common, source, target) • Balance Weight (BW): assign larger weights to the target instances so that the weight sum of target instances equals that of source instances (Jiang and Zhai, 2007)
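Two of these baselines are simple enough to sketch directly. The code below illustrates the Feature Augmentation mapping (a copy of the features in a common block plus a domain-specific block) and the Balance Weight rule, using plain Python lists as dense feature vectors. The function names and the choice to fix the source weight at 1.0 are assumptions made for illustration.

```python
def augment(features, domain):
    """Map X to <common, source, target> blocks as in feature augmentation."""
    zeros = [0.0] * len(features)
    if domain == "source":
        return features + features + zeros  # X -> <X, X, 0>
    if domain == "target":
        return features + zeros + features  # X -> <X, 0, X>
    raise ValueError("domain must be 'source' or 'target'")


def balance_weights(n_source, n_target):
    """Per-instance weights giving both domains the same total weight.

    Assumes each source instance has weight 1.0.
    """
    return 1.0, n_source / n_target


print(augment([1.0, 2.0], "source"))
print(augment([1.0, 2.0], "target"))
print(balance_weights(1000, 50))
```

With 1,000 source tweets and 50 target sentences, each target sentence gets weight 20, so both domains contribute a total weight of 1,000.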
  • 48. Experimental settings • Features – Experimented with unigrams, bigrams, unigrams+bigrams – Applied unigrams in the end • Logistic regression – Fast, supports probability output (uncertainty) • Five-fold cross validation – Four folds for training, one fold for testing • Add-0.5 smoothing
  • 49. Results on four target datasets* Percentage gain: 8.01%, 24.07%, 36.53%, 3.62%, 16.45% *: The numbers are different from those in the dissertation defense video, because I fixed a bug after that. Results improved slightly because of this.
  • 50. Different Instance Selection Strategies • CDS: select tweets from misclassified tweets • CD: CDS with the similarity factor removed • CDS-ALL: select tweets from all source tweets • CDS-CORR: select tweets from source tweets that can be correctly classified by c
  • 51. Comparing instance selection strategies. Among all the strategies, CDS improves F1 the fastest.
  • 52. Comparing instance selection strategies. CDS-ALL achieves performance similar to CDS but takes more iterations, because the input of CDS-ALL is a superset of that of CDS.
  • 53. Comparing instance selection strategies. CDS-CORR performs the worst because it selects tweets from correctly classified tweets, the knowledge of which might already exist in target domains.
  • 54. Summary • People’s emotions can be gleaned from their texts using machine learning techniques. – The combination of n-grams (n=1,2), knowledge-based and syntactic features achieves the best performance. – Knowledge features and syntactic features become less important on large training data. • We can automatically create a large training dataset for emotion identification by leveraging emotion hashtags on Twitter. – A large amount of labeled data is collected with little effort and cost – Covers a variety of situations that elicit emotions – Performance gain with increasing size of training data • This self-labeled emotion dataset can be used to improve emotion identification in text from other domains/data sources. – Domain adaptation via selecting tweets that are informative to the target domain – It is better to select source instances that cannot be correctly classified. – Informativeness of a source instance is measured by three factors: consistency, diversity and similarity.
  • 55. Publications • Wenbo Wang, Lei Duan, Anirudh Koul, Amit P. Sheth. YouRank: Let User Engagement Rank Microblog Search Results. In the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM'14) 2014 • Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Cursing in English on Twitter. In ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW'14) 2014 • Amit Sheth, Ashutosh Jadhav, Pavan Kapanipathi, Lu Chen, Hemant Purohit, Gary Alan Smith, and Wenbo Wang. "Twitris: A system for collective social intelligence." In Encyclopedia of Social Network Analysis and Mining, pp. 2240-2253. Springer New York, 2014. • Lu Chen, Wenbo Wang, Amit P. Sheth. Are Twitter Users Equal in Predicting Elections? A Study of User Groups in Predicting 2012 U.S. Republican Presidential Primaries. In Proceedings of the Fourth International Conference on Social Informatics (SocInfo'12) 2012 • Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Harnessing Twitter ‘Big Data’ for Automatic Emotion Identification. 2012 ASE International Conference on Social Computing (SocialCom 2012), 2012 • Lu Chen, Wenbo Wang, Meenakshi Nagarajan, Shaojun Wang, Amit P. Sheth. Extracting Diverse Sentiment Expressions with Target-dependent Polarity from Twitter. In Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (ICWSM), 2012
  • 56. Publications • Wenbo Wang, Lu Chen, Ming Tan, Shaojun Wang, Amit P. Sheth. Discovering Fine-grained Sentiment in Suicide Notes. Biomedical Informatics Insights, 2012 • Ramakanth Kavuluru, Christopher Thomas, Amit Sheth, Victor Chan, Wenbo Wang, Alan Smith. An Up-to-date Knowledge-Based Literature Search and Exploration Framework for Focused Bioscience Domains. IHI 2012 - 2nd ACM SIGHIT Intl Health Informatics Symposium, January 28-30, 2012. • Wenbo Wang, Christopher Thomas, Amit Sheth, Victor Chan. Pattern-Based Synonym and Antonym Extraction. 48th ACM Southeast Conference, ACMSE2010, Oxford Mississippi, April 15-17, 2010 • Christopher J. Thomas, Wenbo Wang, Pankaj Mehra, Delroy Cameron, Pablo N. Mendes, and Amit P. Sheth. What Goes Around Comes Around – Improving Linked Open Data through On-Demand Model Creation. In: Proceedings of the WebSci10: Extending the Frontiers of Society On-Line, April 26-27th, 2010, Raleigh, NC: US. • Ashutosh Jadhav, Wenbo Wang, Raghava Mutharaju, Pramod Anantharam, Vinh Nyugen, Amit P. Sheth, Karthik Gomadam, Meenakshi Nagarajan, and Ajith Ranabahu. Twitris: Socially Influenced Browsing. Semantic Web Challenge 2009, demo at 8th International Semantic Web Conference, Oct. 25-29 2009, Washington, DC, USA
  • 57. Patents & Proposal • Wenbo Wang, Lei Duan. "Temporal User Engagement Features", U.S. Patent No. 20,150,120,753. 30 Apr. 2015. • Lu Chen, Wenbo Wang, Amit Sheth. "Topic-specific Sentiment Extraction", U.S. Patent No. 20,140,358,523. 4 Dec. 2014. • Context-Aware Harassment Detection on Social Media. NSF proposal
  • 58. Special thanks to AFRL and NSF. *Part of this material is based upon work supported by the National Science Foundation under Grant IIS-1111182 “SoCS: Collaborative Research: Social Media Enhanced Organizational Sensemaking in Emergency Response.”
  • 59. Thank You! & Questions?