Dissertation Defense:
" Mining and Analyzing Subjective Experiences in User Generated Content "
By Lu Chen
Tuesday, April 9, 2016
Dissertation Committee: Dr. Amit Sheth (Advisor), Dr. T. K. Prasad, Dr. Keke Chen, Dr. Ingmar Weber, and Dr. Justin Martineau
Pictures: https://www.facebook.com/Kno.e.sis/photos/?tab=album&album_id=1225911137443732
Video: https://youtu.be/tzLEUB-hggQ
Lu's Home page: http://knoesis.wright.edu/researchers/luchen/
ABSTRACT
Web 2.0 and social media enable people to create, share, and discover information instantly, anywhere, anytime. A great amount of this information is subjective information -- the information about people's subjective experiences, ranging from feelings about what is happening in our daily lives to opinions on a wide variety of topics. Subjective information is useful to individuals, businesses, and government agencies to support decision making in areas such as product purchase, marketing strategy, and policy making. However, much useful subjective information is buried in the ever-growing user generated data on social media platforms, and it remains difficult to extract high-quality subjective information and make full use of it with current technologies.
Current subjectivity and sentiment analysis research has largely focused on classifying text polarity -- whether the opinion expressed about a specific topic in a given text is positive, negative, or neutral. This narrow definition does not take into account other types of subjective information, such as emotion, intent, and preference, which may prevent their exploitation from reaching its full potential. This dissertation extends the definition and introduces a unified framework for mining and analyzing diverse types of subjective information. We have identified four components of a subjective experience: an individual who holds it, a target that elicits it (e.g., a movie or an event), a set of expressions that describe it (e.g., "excellent", "exciting"), and a classification or assessment that characterizes it (e.g., positive vs. negative). Accordingly, this dissertation contributes novel and general techniques for identifying and extracting these components.
We first explore the task of extracting sentiment expressions from social media posts. We propose an optimization-based approach that extracts a diverse set of sentiment-bearing expressions, including formal and slang words/phrases, for a given target from an unlabeled corpus. Instead of associating an overall sentiment with a given text, this method assesses the more fine-grained, target-dependent polarity of each sentiment expression. Unlike pattern-based approaches, which often fail to capture the diversity of sentiment expressions due to the informal nature of language usage and writing style in social media posts, the proposed approach is capable of identifying diverse sentiment phrases.
Mining and Analyzing Subjective Experiences in User-generated Content
1. Lu Chen
Kno.e.sis Center
Ph.D. Dissertation Defense
Advisor:
Prof. Amit P. Sheth
Committee members:
Prof. T.K. Prasad
Prof. Keke Chen
Dr. Ingmar Weber (QCRI)
Dr. Justin Martineau (SRA)
Ohio Center of Excellence in Knowledge-Enabled Computing
2. Subjective Experience –
What We Experience in Our Mind
Hunger
Love
Happiness
Surprise
Embarrassment
Like
Dislike
Confused
Pain
Tired
Stressed
Nervous
Relaxed
Warm
Proud
Confident
Taste of ice cream
Feeling about sky
Perception of time
Appreciation of music
Opinion on climate change
Interest
Source: http://bit.ly/1DvofHX
Music preference
Purchase intent
3. Subjective Information – The Information
about People’s Subjective Experiences
Source: http://bit.ly/1GDD9Mb
Source: http://bit.ly/1KkJF2l
Source: http://bit.ly/1IjjBSX
Source: http://bit.ly/1KkK1Gc
The traditional way of collecting subjective information:
4. User Generated Content
• New opportunities arise as we now can obtain a wide variety of
subjective information from user generated content.
5. The Demand for Subjective Information
• Subjective information can be used to support better decision-
making.
Source: http://twitris2.knoesis.org/debate
Predicting election results
Source: http://bit.ly/1gQg5Fl
Monitoring social phenomena
Source: http://bit.ly/1niFkU7
Targeted advertising
Source: http://bit.ly/1l0ombo
Making purchase decision
Source: http://bit.ly/1VzYEZG
6. Different Types of Subjective Information
• "would like to watch The Secret Life Of Pets. I hope it's good."
‒ Intent: "would like to watch"; Expectation: "hope it's good"
• ""The Secret Life of Pets" was clever, adorable, funny and I already want to see it again."
‒ Sentiment: "clever, adorable, funny"; Intent: "want to see it again"
• "I don't think watching The Secret Life of Pets makes me childish. I laughed I cried and it was so touching for someone who has a pet like me."
‒ Opinion: "don't think watching … makes me childish"; Emotion: "I laughed I cried and it was so touching"
• "Finding Dory was much better than The Secret Life of Pets. Still not as good as Zootopia though."
‒ Preference: "much better than"; Preference: "not as good as"
• "The Secret Life of Pets soundtrack should be nominated for an Oscar"
‒ Opinion: "should be nominated for an Oscar"
7. Defining Subjective Information
Formally, a subjective experience can be represented as a quadruple ⟨h, s, e, c⟩:
𝒉 − a holder, the individual who holds the experience
𝒔 − a stimulus (or target), an entity, event, or situation that elicits the experience
𝒆 − a set of expressions that are used to describe the experience, e.g., sentiment words/phrases or opinion claims
𝒄 − a classification or assessment that categorizes or measures the experience, e.g., sentiment orientation (positive vs. negative), emotion type (joy, anger, sadness, surprise, etc.), or a score indicating the strength of sentiment
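As a concrete illustration of the quadruple ⟨h, s, e, c⟩, it can be modeled as a simple data structure (a hypothetical sketch, not code from the dissertation; the field values come from the slide-6 examples):

```python
from dataclasses import dataclass

@dataclass
class SubjectiveExperience:
    """A subjective experience as a quadruple (h, s, e, c)."""
    holder: str            # h: the individual who holds the experience
    stimulus: str          # s: the entity/event/situation that elicits it
    expressions: list      # e: the expressions that describe it
    classification: str    # c: a label or assessment that characterizes it

# One of the slide-6 tweets encoded as a quadruple
exp = SubjectiveExperience(
    holder="author of the tweet",
    stimulus="The Secret Life of Pets (movie)",
    expressions=["clever", "adorable", "funny"],
    classification="positive",
)
print(exp.classification)  # prints: positive
```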
8. Different Types of Subjective Information
Type | Holder (h) | Stimulus (s) | Expression (e) | Classification (c)
Sentiment | an individual who holds the sentiment | an entity | sentiment words/phrases | positive, negative, neutral
Opinion | an individual who holds the opinion | an entity | opinion claims (may not contain sentiment words) | positive, negative, neutral
Emotion | an individual who holds the emotion | an event or situation | emotion words/phrases, descriptions of events/situations | anger, disgust, fear, happiness, sadness, surprise
Preference | an individual who holds the preference | a set of alternatives | words/phrases that indicate comparison or preference | depends on the specific task
Intent | an individual who holds the intent | an action | words/phrases that show the presence of will, description of the act | depends on the specific task
Expectation | an individual who holds the expectation | an entity | words/phrases that express beliefs about what someone or something will be | depends on the specific task
9.
* The holders of these experiences are the authors of the messages.
Example | Type | Stimulus (s) | Expression (e) | Classification (c)
would like to watch The Secret Life Of Pets. I hope it's good. | Intent | watch the movie | "would like to" | transactional
(same message) | Expectation | The Secret Life of Pets movie | "hope" | optimistic
"The Secret Life of Pets" was clever, adorable, funny and I already want to see it again. | Sentiment | The Secret Life of Pets movie | "clever", "funny", "adorable" | positive
(same message) | Intent | see the movie | "want to" | transactional
I don't think watching The Secret Life of Pets makes me childish. I laughed I cried and it was so touching for someone who has a pet like me. | Opinion | The Secret Life of Pets movie | "don't think … makes me childish" | positive
(same message) | Emotion | The Secret Life of Pets movie | "laughed", "cried", "so touching" | funny, touching
Finding Dory was much better than The Secret Life of Pets. Still not as good as Zootopia though. | Preference | Finding Dory, The Secret Life of Pets | "much better than" | preferring Finding Dory
(same message) | Preference | Finding Dory, Zootopia | "not as good as" | preferring Zootopia
The Secret Life of Pets soundtrack should be nominated for an Oscar | Opinion | The Secret Life of Pets soundtrack | "should be nominated for an Oscar" | positive
10.
An overview of subjective information extraction. The box colored in orange indicates the scope of this dissertation.
11. Dissertation Focus
• Identifying and extracting subjective information from user generated content.
1. Extraction of Target-Specific Sentiment Expressions (ICWSM'12)
2. Discovery of Domain-Specific Features and Aspects (NAACL'16)
Emotion Identification (SocialCom'12, BII'12, CSCW'14, ACL'14)
3. Application: Predicting Election Results (SocInfo'12)
4. Application: Religiosity & Happiness (SocInfo'14)
[Figure: the four focus areas mapped onto the framework components Holder 𝒉, Stimulus 𝒔, Expression 𝒆, and Classification 𝒄 for the subjective information types sentiment, opinion, and emotion.]
12. Thesis Statement
• This dissertation presents a unified framework that characterizes a
subjective experience, such as sentiment, opinion, or emotion, in terms
of an individual holding it, a target eliciting it, a set of expressions
describing it, and a classification or assessment measuring it;
• it describes new algorithms that automatically identify and extract
sentiment expressions and opinion targets from user generated content
with minimal human supervision;
• it shows how to use social media data to predict election results and
investigate religion and subjective well-being, by classifying and
assessing subjective information in user generated content.
13. Sentiment in User Generated Content
Sources: Social media
Data: posts, messages
Targets: movies, persons,
brands, etc.
E1. Lights out definitely lived up to the hype! Great movie!
E2. I got my second Pikachu today this one was from 2k egg revitalised my
love for Pokemon go... Did not last long 😆 stoopid game
E3. Game of Thrones is a must watch.
E4. I find myself grateful that Hillary Clinton is predictable and steady. Like
her or don't, she's SAFE.
E5. Saw the avengers last night. Mad overrated. Cheesy lines and horrible
writing. Very predictable.
E6. I saw The Avengers yesterday evening. It was long but it was very good!
E7. Galaxy s7 edge battery life last so long it's almost unlimited battery life
xD
Target | Positive | Negative | Neutral
Lights out | 75% | 20% | 5%
Pokemon Go | 69% | 17% | 14%
Game of Thrones | 83% | 10% | 7%
Hillary Clinton | 49% | 35% | 16%
The Avengers | 70% | 24% | 6%
Galaxy S7 Edge | 68% | 16% | 16%
Sentiment Analysis → Predictive Models: business analytics, predicting financial performance, predicting election results, …
14. 1. Extraction of Target-
Specific Sentiment Expressions
Lu Chen, Wenbo Wang, Meenakshi Nagarajan, Shaojun Wang, Amit Sheth. Extracting Diverse Sentiment
Expressions with Target-dependent Polarity from Twitter. Proceedings of the 6th International AAAI
Conference on Weblogs and Social Media (ICWSM), 2012.
Given a set of unlabeled social media posts, how to
extract diverse forms of sentiment expressions with
respect to a specific target?
15. Example
E1. Lights out definitely lived up to the hype! Great movie!
E2. I got my second Pikachu today this one was from 2k egg revitalised my
love for Pokemon go... Did not last long 😆 stoopid game
E3. Game of Thrones is a must watch.
E4. I find myself grateful that Hillary Clinton is predictable and steady. Like
her or don't, she's SAFE.
E5. Saw the avengers last night. Mad overrated. Cheesy lines and horrible
writing. Very predictable.
E6. I saw The Avengers yesterday evening. It was long but it was very good!
E7. Galaxy s7 edge battery life last so long it's almost unlimited battery life
xD
Instance | Sentiment Expressions | Classification
E1 | lived up to the hype, great | positive
E2 | love, not last long, stoopid | positive, negative
E3 | must watch | positive
E4 | grateful, predictable, steady, safe | positive
E5 | mad overrated, cheesy, horrible, very predictable | negative
E6 | long, very good | negative, positive
E7 | last so long, unlimited | positive
Sources: Social media
Data: posts, messages
e.g., tweets
Targets: movies, persons,
brands, etc.
16. Challenges
• Sentiment expressions can be very diverse.
‒ Vary from single words (e.g., “good”, “predictable”) to multi-word phrases
of different lengths (“lived up to the hype”, “must see”)
‒ Can be formal or slang expressions, including abbreviations and spelling
variations (e.g., “gud”, “stoopid”).
• The polarity of a sentiment expression is sensitive to its target.
‒ E.g., “long” in “long river”, “long battery life”, or “long time for
downloading”.
‒ E.g., “predictable” regarding movies, or regarding stocks.
17. Contributions
We propose a novel optimization-based approach that:
• identifies a diverse and richer set of sentiment expressions,
including both formal and slang words/phrases;
• assesses the target-dependent polarity of each sentiment
expression; and
• does not require labeled data or hand-crafted patterns.
19. Extracting Candidate Expressions
Example:
"The Avengers movie was bloody amazing! A little cheesy at times, but I liked it. Mmm looking good Robert Downey Jr and Captain America ;)"
"On-target" subjective words: "bloody", "amazing", "cheesy", "liked"
Candidate expressions: "bloody", "amazing", "bloody amazing", "cheesy", "little cheesy", "cheesy at times", "little cheesy at times", "liked"
Method:
• For each message, select the "on-target" subjective words, and extract all the n-grams that contain at least one selected subjective word as candidates.
• A subjective word is selected as "on-target" if
(1) there is a dependency relation between the word and the target, or
(2) the word is proximate to the target (e.g., within a four-word distance).
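The candidate-extraction step can be sketched as follows. This is a toy illustration, not the dissertation's implementation: only the proximity criterion (2) is implemented here, the dependency-relation criterion is omitted, and the subjectivity word list, window size, and maximum n-gram length are illustrative assumptions:

```python
# Illustrative subjectivity lexicon (a real system would use a full dictionary).
SUBJECTIVE_WORDS = {"bloody", "amazing", "cheesy", "liked", "good"}

def on_target_words(tokens, target, window=4):
    """Select subjective words within `window` tokens of a target mention
    (the proximity criterion; dependency relations are not checked here)."""
    positions = [i for i, tok in enumerate(tokens) if tok == target]
    selected = set()
    for i, tok in enumerate(tokens):
        if tok in SUBJECTIVE_WORDS and any(abs(i - p) <= window for p in positions):
            selected.add(i)
    return selected

def candidate_expressions(tokens, target, max_n=4, window=4):
    """Extract all n-grams (up to max_n tokens) that contain at least one
    selected on-target subjective word."""
    selected = on_target_words(tokens, target, window)
    candidates = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            if any(j in selected for j in range(i, i + n)):
                candidates.add(" ".join(tokens[i:i + n]))
    return candidates

tokens = "the avengers movie was bloody amazing".split()
cands = candidate_expressions(tokens, "avengers")
```

Running this on the shortened example yields candidates such as "bloody", "amazing", and "bloody amazing", while n-grams with no on-target subjective word are excluded.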
20. Identifying Inter-Expression Relations
1. I saw The Avengers yesterday evening. It was long but it was very good!
2. I do enjoy The Avengers, but it's both overrated and problematic.
3. Saw the avengers last night. Mad overrated. Cheesy lines and horrible
writing. Very predictable.
4. The avengers was good but the plot was just simple minded and predictable.
5. The Avengers was good. I was not disappointed.
22. An Optimization Model (1)
• For each candidate expression c_i:
‒ P-Probability Pr_P(c_i) − the probability that c_i indicates positive sentiment
‒ N-Probability Pr_N(c_i) − the probability that c_i indicates negative sentiment
‒ Pr_P(c_i) + Pr_N(c_i) = 1
• For each pair of candidate expressions c_i and c_j:
‒ Consistency probability − the probability that c_i and c_j have the same polarity:
Pr_cons(c_i, c_j) = Pr_P(c_i) Pr_P(c_j) + Pr_N(c_i) Pr_N(c_j)
‒ Inconsistency probability − the probability that c_i and c_j have different polarities:
Pr_incons(c_i, c_j) = Pr_P(c_i) Pr_N(c_j) + Pr_N(c_i) Pr_P(c_j)
23. An Optimization Model (2)
• We want the consistency and inconsistency probabilities derived from the P-Probabilities and N-Probabilities of the candidates to be closest to their expectations suggested by the relation networks.
• Objective Function:
minimize Σ_{i=1}^{n} Σ_{j=i+1}^{n} [ w_cons(i,j) (1 − Pr_cons(c_i, c_j))² + w_incons(i,j) (1 − Pr_incons(c_i, c_j))² ]
where
Pr_cons(c_i, c_j) = Pr_P(c_i) Pr_P(c_j) + Pr_N(c_i) Pr_N(c_j),
Pr_incons(c_i, c_j) = Pr_P(c_i) Pr_N(c_j) + Pr_N(c_i) Pr_P(c_j),
w_cons(i,j) and w_incons(i,j) are the weights of the edges (strength of the relations) between c_i and c_j in the consistency and inconsistency relation networks, and n is the total number of candidate expressions.
24. Experiments: Datasets
Table: Description of four
target-specific datasets from
social media.
Tweet about movie New Star Trek movie is great! Highly recommend it!
Tweet about person Scarlett Johansson rocking a suit better than most men.
Forum post about
epilepsy treatment
I have an 11 month old who suffers from 0-8 seizures per day. We've tried 6
medications that have all failed and are now on The Ketogenic Diet. The diet has
been amazing at reducing the frequency and intensity of his seizures. However, I
want them GONE! I am wondering if infant chiropractic care or acupuncture is safe
and effective in eliminating seizures. Does anyone have any experience with either
of these?
Forum post about
cellular company
I click on Mobile Sync to move all my contacts from my phone to the Sprint website.
There are over 100 contacts in my phone, but it's only moving 59 of them? Help
Facebook post
about automobile
company
I have a 2006 Trailblazer that had a motor failure at 60,000 miles. GM refused to
help in any way. Poor customer service to say the least. I guess they don't care about
your car post warranty. With a driveway full of GM's its probably the last one I will
buy.
25. Experiments on Tweets
• Datasets:
‒ 168,005 tweets about movies
‒ 258,655 tweets about persons
• Gold standard: 1500 tweets were randomly sampled from each domain.
Human experts identified sentiment expressions and labeled each
expression and tweet with target-specific sentiment.
Table: Distributions of N-
grams and Part-of-speech of
the Sentiment Expressions in
the Gold Standard Data Set.
Table: Distribution of
Sentiment Categories of the
Tweets in the Gold Standard
Data Set.
26. Methods
COM -- Constrained Optimization Model
• COM-const: Assign 0.5 to all the candidates as their initial P-
Probabilities.
• COM-gelex: Initialize the candidates’ polarities according to the
subjectivity dictionary. (positive-1.0, negative-0.0, other-0.5)
• MPQA, GI, SWN: For each extracted subjective word regarding
the target, simply look up its polarity in MPQA, General Inquirer
and SentiWordNet, respectively.
• PROP: a propagation approach proposed by Qiu et al. (IJCAI’09)
27. Results
These results demonstrate the advantage of our optimization-based approach over lexicon-based and rule-based polarity assessment: our method extracts diverse sentiment expressions and captures their target-dependent polarity.
28. Results of Sentiment Expression Extraction with Various Corpora Sizes
Our approach improves both precision and recall when we increase the corpus size from 12,000 to 48,000 posts, because it benefits from the additional relations extracted from larger corpora.
29. Experiments on Other Social Media Posts
• Datasets:
‒ 100 forum posts about epilepsy treatment
‒ 162 forum posts about a cellular company
‒ 200 Facebook posts about an automobile company
• Gold standard: human experts identified sentiment expressions from posts, and labeled each expression and post sentence with target-specific sentiment.
Table: Characteristics of
sentiment expressions
in the Gold Standard
Data Set.
Table: Distribution of
Sentiment Categories of
post sentences in the
Gold Standard Data Set.
30. Results
Table: Quality of the extracted sentiment
expressions.
Figure: Sentence-level sentiment
classification accuracy using different
lexicons.
The stable performance on all five datasets provides a strong indication that
the proposed approach is not limited to a specific domain or a specific social
media data source.
32. Aspect-based Opinion Mining
It would be helpful to have an aspect-based opinion summarization for products.
Size: big screen, perfect size, fits big bedroom, …
Picture quality: full hd, best picture, blur reduction, …
Motion-smoothing: smooth motion, sensor tracing effects, …
Sound quality: loud, white noise, high pitched sound, …
33. 2. Discovery of Domain-
Specific Features and Aspects
Lu Chen, Justin Martineau, Doreen Cheng and Amit Sheth. Clustering for Simultaneous Extraction of
Aspects and Features from Reviews. Proceedings of the 15th Annual Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL),
2016.
Given a set of plain product reviews, how to efficiently
identify (both explicit and implicit) product features
and group them into aspects?
34. Example
Review Sentences
1. Phone is easy to use and has great features. Large
screen is great. Great speed makes smooth
viewing of tv programs or sports.
2. It has a big bright display, it's very fast and very
lightweight for its size.
3. Good features for an inexpensive android, light,
good signal, good sound, pretty quick for a
800MHz processor.
4. The phone runs extra fast and smooth, and has
great price.
Aspects
{screen, display, bright}
{size, large, big}
{lightweight, light}
{price, inexpensive}
{speed, processor, fast,
quick, smooth}
{easy, use}
{features}
{signal}
{sound}
Feature: components and attributes of a product.
• Explicit feature: mentioned as an opinion target
• Implicit feature: implied by opinion words
• Different feature expressions may be used to describe the same aspect of a product.
Aspect: represented as a group of features
35. Related Work
• Two-step approach: first identifying features, then clustering them
• Feature Identification
‒ Only extracts features but does not group them.
‒ Implicit features have been largely ignored.
‒ Requires seed terms, hand-crafted rules/patterns, or other annotation efforts.
• Feature Clustering/Aspect Discovery
‒ Assumes that features have been identified beforehand.
‒ Topic-model based approaches: not fine-grained aspects (Zhang and Liu, 2014), not directly interpretable as aspects (Chen et al., 2013; Bancken et al., 2014), not good at dealing with aspect sparsity (Xu et al., 2014), etc.
‒ Clustering-based approaches (Su et al., 2008; Lu et al., 2009; Bancken et al., 2014)
36. Contributions
We propose a new clustering-based approach that:
• identifies both features and aspects simultaneously;
• extracts both explicit and implicit features and groups them into
aspects; and
• does not require seed terms, hand-crafted patterns, or any other
labeling efforts.
37. The Clustering Algorithm
Notation:
• X = {x1, x2, …, xn}: the set of candidate features extracted from reviews of a given product.
‒ Candidates for explicit features: nouns and noun phrases
‒ Candidates for implicit features: adjectives and verbs
• k: the number of aspects.
• s: the number of most frequent candidates that are grouped first to generate the seed clusters.
• δ: the upper bound of the distance between two mergeable clusters.
Seed clusters:
(1) To generate high-quality seed clusters: frequent terms are more likely to be the actual features of customers' interest.
(2) Clustering only the most frequent candidates first speeds up the process.
Domain-specific similarity measure: determines how similar the members of two clusters are with respect to the particular domain/product.
Merging constraints: further ensure that terms from different aspects are not merged.
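The seed-based clustering above can be sketched as a greedy agglomerative merge. This is a simplified toy version under illustrative assumptions (average-link distance, a hand-made similarity function, no merging constraints); CAFE's actual procedure is richer:

```python
def cluster(candidates, freq, sim, k, s, delta):
    """Greedily merge the s most frequent candidates into at most k clusters,
    merging only while the cluster distance (1 - similarity) stays below delta."""
    seeds = sorted(candidates, key=lambda x: -freq[x])[:s]
    clusters = [[x] for x in seeds]

    def dist(a, b):  # average-link distance between two clusters
        return 1.0 - sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > k:
        pairs = [(dist(a, b), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        d, i, j = min(pairs)
        if d > delta:          # no mergeable pair left
            break
        clusters[i] += clusters.pop(j)
    return clusters

# Toy frequencies and pairwise similarities for four candidate features
freq = {"screen": 9, "display": 8, "price": 7, "inexpensive": 6}
sim = lambda x, y: {frozenset(("screen", "display")): 0.9,
                    frozenset(("price", "inexpensive")): 0.8}.get(frozenset((x, y)), 0.1)
out = cluster(list(freq), freq, sim, k=2, s=4, delta=0.5)
```

On this toy input the algorithm groups "screen" with "display" and "price" with "inexpensive", i.e., it merges explicit and implicit feature terms into the same aspect when their similarity is high.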
38. • General semantic similarities that are learned from thesaurus
dictionaries or a web corpus.
‒ The similarities between words/phrases are domain dependent.
E.g., “ice cream sandwich'' and “operating system” (cell-phone domain)
“smooth” and “speed” (cell-phone domain vs. hair dryer domain)
• Domain-dependent similarities that are learned from a domain-
specific corpus based on distributional information.
‒ Different aspects may share similar context.
E.g., “great display”, “great price”, “great speed”
‒ The words describing the same aspect may not share similar context or
co-occur.
E.g., people use “is inexpensive” or “has great price” instead of “has
inexpensive price”; “running fast” or “great speed” instead of “fast speed”
Similarity Measures
39. Domain-specific Similarity
• General similarity matrix G -- a n × n matrix, where Gij is the general semantic
similarity between xi and xj , Gij ∈ [0, 1], Gij = 1 when i=j, and Gij = Gji.
• Use UMBC Semantic Similarity Service to get G.
• Statistical association matrix T -- a n × n matrix, where Tij is the pairwise
statistical association between xi and xj in a domain-specific corpus, Tij ∈ [0,
1], Tij = 1 when i=j, and Tij = Tji.
• Use normalized pointwise mutual information (NPMI) to get T.
- f(xi) (or f(xj)) is the number of documents where xi (or xj) appears,
- f(xi, xj) is the number of documents where xi and xj co-occur in a sentence,
- N is the total number of documents in the corpus.
NPMI(xi, xj) ∈ [−1, 1], and we rescale the values of NPMI to the range of [0, 1].
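The NPMI computation from document-level counts can be written directly from the definitions above (a sketch; the dissertation counts sentence-level co-occurrence within a domain corpus, and the example counts below are invented):

```python
import math

def npmi(f_i, f_j, f_ij, N):
    """Normalized pointwise mutual information from counts, rescaled to [0, 1].
    f_i, f_j: documents containing x_i (resp. x_j); f_ij: documents where
    they co-occur in a sentence; N: total number of documents."""
    p_i, p_j, p_ij = f_i / N, f_j / N, f_ij / N
    if p_ij == 0:
        return 0.0  # no co-occurrence: minimum association after rescaling
    pmi = math.log(p_ij / (p_i * p_j))
    value = pmi / (-math.log(p_ij))   # NPMI in [-1, 1]
    return (value + 1.0) / 2.0        # rescale to [0, 1]

# e.g., "screen" in 50 docs, "display" in 40, co-occurring in 30, out of 1000 docs
print(round(npmi(50, 40, 30, 1000), 3))
```

Stronger co-occurrence relative to the individual frequencies yields a larger association value, which then fills the matrix T.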
40. Domain-specific Similarity
• A candidate xi can be represented by the i-th row in G or T.
• The domain-specific similarity between xi and xj is defined as the weighted sum of three similarity metrics: sim(xi, xj) = wg·simg(xi, xj) + wt·simt(xi, xj) + wgt·simgt(xi, xj), where
‒ simg captures semantically similar/relevant words, e.g., "screen" and "display", "speed" and "fast";
‒ simt captures words sharing similar context, e.g., "ice cream sandwich" and "operating system";
‒ simgt gets a high value when the terms strongly associated with xi (or xj) are semantically similar to xj (or xi), e.g., "smooth" and "speed".
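A minimal sketch of the combined measure, assuming cosine similarity over rows of the general matrix G and the association matrix T (the dissertation's exact definitions of simg, simt, and simgt may differ; the matrices and weights below are illustrative, with the weights taken from the CAFE default setting):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def domain_similarity(i, j, G, T, wg=0.2, wt=0.2, wgt=0.6):
    sim_g = cosine(G[i], G[j])   # general semantic similarity
    sim_t = cosine(T[i], T[j])   # shared-context (distributional) similarity
    # cross term: association profile of one candidate vs. semantic profile of the other
    sim_gt = (cosine(G[i], T[j]) + cosine(T[i], G[j])) / 2
    return wg * sim_g + wt * sim_t + wgt * sim_gt

# Toy 2-candidate matrices, e.g., rows for "screen" and "display"
G = [[1.0, 0.9], [0.9, 1.0]]
T = [[1.0, 0.2], [0.2, 1.0]]
score = domain_similarity(0, 1, G, T)
```

Because G and T are symmetric with unit diagonals, the combined score stays in [0, 1] and is symmetric in its two arguments.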
41. Data and Experimental Setting
• We evaluate this approach on reviews from three different domains.
• The default setting of CAFE (Clustering for Aspect and Feature Extraction):
‒ The number of aspects k = 50
‒ Distance upper bound δ = 0.8
‒ The number of candidates that are grouped first to generate seed clusters s = 500
‒ The weights of the three similarity measures wg = wt = 0.2, wgt = 0.6
42. • PROP: A double propagation approach that extracts features using hand-
crafted rules based on dependency relations between features and opinion
words. (Qiu et al., IJCAI’09)
• LRTBOOT: A bootstrapping approach that extracts features by mining
pairwise feature-feature, feature-opinion, opinion-opinion associations
between terms in the corpus, where the association is measured by the
likelihood ratio tests (Hai et al., CIKM’12)
Evaluations on Feature Extraction – Methods
44. • MuReinf: A clustering method that utilizes the mutual reinforcement
association between features and opinion words to iteratively group them
into feature clusters and opinion clusters. (Su et al., WWW’08)
• L-EM: A semi-supervised learning method that adapts Naive Bayesian-
based EM algorithm to group synonym features into categories. (Zhai et al.,
WSDM’11)
• L-LDA: This is a baseline method used in (Zhai et al., WSDM’11), which is
based on LDA.
* Because MuReinf, L-EM, and L-LDA need another algorithm to extract
features, both LRTBOOT and CAFE are applied.
Evaluations on Aspect Discovery – Methods
45. Evaluations on Aspect Discovery – Results
The results show the advantage of combining feature and aspect discovery over chaining them, and also indicate the effectiveness of our domain-specific similarity measure in identifying synonymous features in a particular domain.
46. Influence of Parameters
Based on the experiments on three domains, the best results can be achieved when
distance upper bound 𝜹 is set to a value between 0.76 and 0.84.
CAFE generates better results by first clustering the top 10%-30% most frequent
candidates.
The best F-score and Rand Index can be achieved when we set wgt to 0.5 or 0.6 across all
three domains.
48. 3. Harnessing Public Opinion
on Twitter to predict election
results
Lu Chen, Wenbo Wang, Amit P. Sheth. Are Twitter Users Equal in Predicting Elections? A Study of User
Groups in Predicting 2012 U.S. Republican Presidential Primaries. Proceedings of the 4th International
Conference on Social Informatics (SocInfo) 2012.
How to derive public opinion about election candidates?
Are opinion holders equal in predicting elections?
49. Overview
[Figure: the prediction pipeline. Each tweet is labeled with the candidate mentioned and the opinion expressed (e.g., candidate: XXX, opinion: positive). Each user, based on their tweets and network, is assigned a category along four dimensions: 1. Political Preference (e.g., right-leaning), 2. Engagement Degree (e.g., high engagement), 3. Content Type (e.g., opinion-prone), 4. Tweet Mode (e.g., original-tweet-prone). The pipeline then predicts which candidate each user supports, and aggregates the opinions of each user group to predict election results.]
50. Contributions
• We introduce a new method to predict the election results that:
‒ identifies which candidate is mentioned, and whether a positive or
negative opinion is expressed towards a candidate in a tweet;
‒ predicts which candidate a user supports based on the opinions extracted
from his/her tweets; and
‒ aggregates the opinions of all users from a group to predict which
candidate will win the election.
• We show that the opinion holders matter in predicting election
results.
‒ We group users based on their political preference, engagement degree,
tweet mode, and content type, and examine the predictive power of
different user groups in predicting Super Tuesday results in 10 states.
‒ We evaluate the results in terms of both the accuracy of predicting
winners and the error rate between the predicted votes and the actual
votes for each candidate.
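The aggregation step can be sketched as follows. This is a hypothetical simplification of the dissertation's method (a plain majority vote over positive mentions, with invented toy data; the 2012 Republican primary candidates are real names used for illustration):

```python
from collections import Counter

def predict_user_vote(opinions):
    """opinions: list of (candidate, polarity) pairs extracted from one user's
    tweets; the user's vote goes to the candidate with the most positive mentions."""
    support = Counter(c for c, polarity in opinions if polarity == "positive")
    return support.most_common(1)[0][0] if support else None

def predict_winner(users):
    """Aggregate the predicted votes of all users in a group."""
    votes = Counter(v for v in (predict_user_vote(o) for o in users) if v)
    return votes.most_common(1)[0][0]

# Toy group of three users with their extracted (candidate, opinion) pairs
users = [
    [("Romney", "positive"), ("Romney", "positive"), ("Santorum", "negative")],
    [("Santorum", "positive")],
    [("Romney", "positive"), ("Gingrich", "negative")],
]
winner = predict_winner(users)
print(winner)  # prints: Romney
```

Running the same aggregation separately per user group (e.g., right-leaning vs. left-leaning) is what allows comparing the predictive power of different holder groups.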
51. Findings
• Revealing the challenge of identifying the opinion of the "silent majority".
• Retweets may not necessarily reflect users' attitudes.
• Predicting a user's vote from more opinion tweets is not necessarily more accurate than predicting it from more information tweets.
• The right-leaning user group provides the most accurate prediction. In the best case (a 56-day time window), it correctly predicts the winners in 8 out of 10 states with an average prediction error of 0.1.
52. 4. Religion and Subjective Well-
being
Lu Chen, Ingmar Weber and Adam Okulicz-Kozaryn. U.S. Religious Landscape on Twitter. Proceedings of
the 6th International Conference on Social Informatics (SocInfo), 2014.
Lu Chen, Ingmar Weber, Adam Okulicz-Kozaryn, and Amit Sheth. Understanding the Effect of Religion
on Happiness by Examining the Topic Preferences and Word Usage on Twitter. (in submission to PLOS
ONE).
How to use Twitter data to measure subjective well-
being? How does the religious belief of users
(holders) affect their happiness expressed in tweets?
53. Overview
[Figure: the analysis pipeline. Each user, based on their tweets and network, is labeled with a religious belief (e.g., Buddhism) and measured by happiness_level h_avg(user), topic_preference p(topic | user), and word_preference p(word | topic, user). The measures of individual users are aggregated to obtain the group-level measures for each religion: h_avg(group), p(topic | group), p(word | topic, group).]
Research questions:
1. What is the effect of religion on happiness?
2. How do topic preference and word usage affect the happiness expressed by each group?
54. Contributions
• We provide a fresh perspective about happiness and religion,
complementing traditional survey-based studies, via analyzing the
topics and words naturally disclosed in people's social media messages.
• We introduce a framework and methodology that explore the effect of
social and demographic factors of a holder (e.g., a holder’s religious
belief) on subjective well-being.
• Our method also explores potential reasons for the variations in the
level of happiness, based on the holder's topic preferences and word
usage on those topics.
55. Findings
• There is a significant difference among the seven groups (atheist,
Buddhist, Christian, Hindu, Jew, Muslim, and random Twitter users) on
the level of happiness (pleasant/unpleasant emotions) expressed in
tweets.
• Each user group has different topic preferences and different word
usage on the same topic. However, the differences in word usage are small
compared with the differences in topic distributions.
• The users' topic preferences strongly correlate with their happiness
expressed in tweets.
56. Conclusion
• This dissertation presents a unified framework that characterizes a
subjective experience, such as sentiment, opinion, or emotion, in terms
of an individual holding it, a target eliciting it, a set of expressions
describing it, and a classification or assessment measuring it;
• it describes new algorithms that automatically identify and extract
sentiment expressions and opinion targets from user generated content
with minimal human supervision;
• it shows how to use social media data to predict election results and
investigate religion and subjective well-being, by classifying and
assessing subjective information in user generated content.
57. Future Directions
1. Detecting different types of subjectivity in text
2. Beyond sentiment and opinion
3. Towards dynamic modeling of subjective information: a subjective experience becomes a quintuple ⟨h, s, e, c, t⟩, where t is the time when the subjective experience occurs.
58. Publications
• Lu Chen, Justin Martineau, Doreen Cheng and Amit Sheth. Clustering for Simultaneous Extraction of Aspects and Features from
Reviews. Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies (NAACL), 2016. (Acceptance rate: 24%)
• Lu Chen, Ingmar Weber and Adam Okulicz-Kozaryn. U.S. Religious Landscape on Twitter. Proceedings of the 6th International
Conference on Social Informatics (SocInfo), 2014. (Acceptance rate: 23%)
• Justin Martineau, Lu Chen, Doreen Cheng and Amit Sheth. Active Learning with Efficient Feature Weighting Methods for Improving
Data Quality and Classification Accuracy. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics
(ACL), 2014. (Acceptance rate: 26%)
• Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Cursing in English on Twitter. Proceedings of the 17th ACM
Conference on Computer Supported Cooperative Work and Social Computing (CSCW) 2014. (Acceptance rate: 27%)
• Amit Sheth, Ashutosh Jadhav, Pavan Kapanipathi, Lu Chen, Hemant Purohit, Alan Smith, and Wenbo Wang. Chapter title: Twitris - A
System for Collective Social Intelligence. Encyclopedia of Social Network Analysis and Mining, 2014.
• D. Cameron, G. A. Smith, R. Daniulaityte, A. P. Sheth, D. Dave, L. Chen, G. Anand, R. Carlson, K. Z. Watkins, R. Falck. PREDOSE: A
Semantic Web Platform for Drug Abuse Epidemiology Using Social Media. Journal of Biomedical Informatics: Special Issue on
Biomedical Information through the Implementation of Social Media Environments. 2013. PMID: 23892295.
• Lu Chen, Wenbo Wang, Amit P. Sheth. Are Twitter Users Equal in Predicting Elections? A Study of User Groups in Predicting 2012
U.S. Republican Presidential Primaries. Proceedings of the 4th International Conference on Social Informatics (SocInfo) 2012.
(Acceptance rate: 35%)
• Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit P. Sheth. Harnessing Twitter "Big Data" for Automatic Emotion
Identification. Proceedings of the 4th ASE/IEEE International Conference on Social Computing (SocialCom), 2012.
• Lu Chen, Wenbo Wang, Meenakshi Nagarajan, Shaojun Wang, Amit Sheth. Extracting Diverse Sentiment Expressions with Target-
dependent Polarity from Twitter. Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (ICWSM),
2012. (Acceptance rate: 20%)
• Wenbo Wang, Lu Chen, Ming Tan, Shaojun Wang, Amit Sheth. Discovering Fine-grained Sentiment in Suicide Notes. Biomedical
Informatics Insights (BII), 2012.
• R. Daniulaityte, R. Carlson, R. Falck, D. Cameron, S. Perera, L. Chen, and A. Sheth. "I Just Wanted to Tell You That Loperamide WILL
WORK": A Web-Based Study of Extra-Medical Use of Loperamide. Journal of Drug and Alcohol Dependence, 2012.
• R. Daniulaityte, R. Carlson, R. Falck, D. Cameron, S. Udayanga, L. Chen, A. Sheth. A Web-based Study of Self-treatment of Opioid
Withdrawal Symptoms with Loperamide. The College on Problems of Drug Dependence (CPDD), 2012.
62. Acknowledgement
Prof. Amit Sheth
(Advisor)
Dr. Ingmar Weber
(QCRI)
Prof. T.K.Prasad Dr. Justin Martineau
(SRA)
Prof. Keke Chen
Dissertation Committee
Co-authors and Collaborators
Dr. Shaojun Wang
Computer Science
Dr. Meena Nagarajan
(IBM Watson)
Prof. Adam Okulicz-Kozaryn
(Rutgers-Camden)
Dr. Wenbo Wang
(GoDaddy)
Dr. Doreen Cheng
(SRA)
Prof. Raminta Daniulaityte Dr. Delroy Cameron
(Apple)
Dr. Ming Tan
(IBM Watson)
Prof. Valerie Shalin
64. Acknowledgement
This dissertation is based upon work supported by the National
Science Foundation under Grant:
• IIS-1111182 “SoCS: Collaborative Research: Social Media
Enhanced Organizational Sensemaking in Emergency Response”
and
• CNS-1513721 “Context-Aware Harassment Detection on Social
Media.”
Speaker notes:
Note that the precision may be worse than the true quality obtainable using a larger corpus, since the gold standards are generated from a subset of tweets.