Social media provides a natural platform for the dynamic emergence of citizen (as) sensor communities, where citizens share information, express opinions, and engage in discussions. Often such an online Citizen Sensor Community (CSC) has stated or implied goals related to the workflows of organizational actors with defined roles and responsibilities. For example, a community of crisis response volunteers may inform the prioritization of responses to resource needs (e.g., medical) to assist the managers of crisis response organizations. However, in a CSC, organizational actors face information overload, which makes it hard to find reliable information providers and actionable information from citizens. This threatens the awareness and articulation of workflows needed for cooperation between citizens and organizational actors. CSCs supported by Web 2.0 social media platforms offer new opportunities and pose new challenges. This work addresses issues of ambiguity in interpreting unconstrained natural language (e.g., ‘wanna help’ appearing in both types of messages, asking for and offering help during crises), sparsity of user and group behaviors (e.g., expression of specific intent), and diversity of user demographics (e.g., medical or technical professional) for interpreting user-generated data of citizen sensors. Interdisciplinary research involving social and computer sciences is essential to address these socio-technical issues in CSCs and to give organizational actors better access to user-generated data at a higher level of information abstraction. This study presents a novel web information processing framework focused on actors and actions in cooperation, called Identify-Match-Engage (IME), which fuses top-down and bottom-up computing approaches to design a cooperative web information system between citizens and organizational actors. It includes (a) identification of action-related seeking-offering intent behaviors from short, unstructured text documents using a classification model based on both declarative and statistical knowledge, (b) matching of intentions about seeking and offering, and (c) engagement models of users and groups in CSCs to prioritize whom to engage, by modeling context with social theories using features of users, their generated content, and their dynamic network connections in the user interaction networks. The results show that fusing top-down knowledge-driven and bottom-up data-driven approaches improves modeling efficiency over conventional bottom-up approaches alone for modeling intent and engagement. Applications of this work include use of the engagement interface tool during recent crises to enable efficient citizen engagement for spreading critical information about prioritized needs, so that citizens donate only the required supplies. The engagement interface application also won the United Nations ICT agency ITU's Young Innovator 2014 award.
Hemant Purohit PhD Defense: Mining Citizen Sensor Communities for Cooperation with Organizations
1. Mining Citizen Sensor Communities to Improve Cooperation with Organizational Actors
June 23 2015
PhD Defense
Hemant Purohit (Advisor: Prof. Amit Sheth)
Kno.e.sis, Dept. of CSE, Wright State University, USA
2. @hemant_pt
Outline
— Citizen Sensor Communities & Organizations
— Cooperative System Design Challenges
— Contributions
— Problem 1. Conversation Classification using Offline Theories
— Problem 2. Intent Classification
— Problem 3. Engagement Modeling
— Applications
— Limitations & Future Work
3. @hemant_pt
Citizen Sensors: Access to Human Observations & Interactions
Uni-directional communication (TO people)
Bi-directional (BY people, TO people): Web 2.0 media
Unstructured, Unconstrained Language Data
• Ambiguity
• Sparsity
• Diversity
• Scalability
4. @hemant_pt
Goal: Data to Decision Making
Noisy Citizen Sensor Data → Organizational Decision Making
SOCIAL SCIENCE
• Experts on Organizations
• Small-scale Data
COMPUTER SCIENCE
• Experts on Mining
• Large-scale Data
Scope of My Research
5. @hemant_pt
CITIZEN SENSOR COMMUNITIES
1. No Structured Roles
2. No Defined Tasks
✓ But “GENERATE” Massive Data
ORGANIZATIONS
1. Structured Roles
2. Defined Tasks
✓ COLLECT Data
✓ Process, & Make Decisions
COOPERATIVE SYSTEM
“Can you help us?”
“Sure! How to help?”
7. @hemant_pt
Challenges: Articulation, Awareness (Malone & Crowston 1990; Schmidt & Bannon 1992)
DESIGN PROBLEM: COOPERATIVE SYSTEM between ORGANIZATIONS and CITIZEN SENSOR COMMUNITIES
DATA PROBLEM: ENGAGEMENT MODELING, INTENT MINING
Org. Actor Q1. Who to engage first?
Org. Actor Q2. What are resource needs & availabilities?
8. @hemant_pt
Research Questions
— Can general theories of offline conversation be applied in the online context?
— Can we model intentions to inform organizational tasks using knowledge-guided features?
— Can we find reliable groups to engage by modeling collective group divergence using a content-based measure?
9. @hemant_pt
Thesis: Statement
Prior knowledge, and interplay of features of users, their content, and network efficiently model Intent & Engagement for cooperation of citizen sensor communities.
Scope of Concepts
• Intent: aim of action, e.g., offering help
• Engagement: involvement in activity, e.g., participating in discussion
10. @hemant_pt
Contributions
1. Operationalized computing in cooperative system design
— by accommodating articulation in Intent Mining, and
— enriching awareness by Engagement Modeling
2. Improved computation of online social data
— by incorporating features from offline social theoretical knowledge
3. Improved performance of intent classification
— by fusing top-down & bottom-up data representations
4. Improved explanation of group engagement
— by modeling content divergence to complement existing structural measures
12. @hemant_pt
Outline
— Citizen Sensor Communities & Organizations
— Cooperative System Design Challenges
— Awareness: tackle via Engagement Modeling
— Articulation: tackle via Intent Mining
— Contributions
— Problem 1. Conversation Classification using Offline Theories
— Problem 2. Intent Classification
— Problem 3. Engagement Modeling
— Applications
— Limitations & Future Work
13. @hemant_pt
Problem 1. Conversation Classification
— Functions of Reply, Retweet, and Mention reflect conversation
User1. Analyzing #Conversations on Twitter. Using platform provided functions #REPLY, #RT, and #Mention.
…
User2. I kinda feel one might need more than just the platform fn -- @User1 u can think #Psycholinguistics, dude!
R1. Can general theories of conversation be applied in the online context?
14. @hemant_pt
Problem 1. Conversation Classification
— Functions of Reply, Retweet, and Mention reflect conversation
— Task: Given a set S of messages mi, classify a sample {mi} for {RP, None}, {RT, None}, {MN, None}, where
— Ground-truth corpora
— RP = { mi | has_Reply_function (mi) = True }
— RT = { mi | has_Retweet_function (mi) = True }
— MN = { mi | has_Mention_function (mi) = True }
— None = S – (RP ∪ RT ∪ MN)
— Sample {mi} size = 3, based on average Reply conversation size
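The corpus construction above can be sketched in Python as follows. This is only an illustration: the three `has_*_function` predicates are hypothetical surface heuristics (a leading `@user`, an `RT @user` marker, any other `@user`), whereas the slides do not say how the platform functions were detected; platform metadata would be the more reliable source. The `samples` helper groups messages into samples of size 3, as stated on the slide.

```python
import re

USER = re.compile(r"@\w+")

# Hypothetical heuristics standing in for the platform-provided functions.
def has_reply_function(msg: str) -> bool:
    return msg.startswith("@")

def has_retweet_function(msg: str) -> bool:
    return re.search(r"(^|\s)RT @\w+", msg) is not None

def has_mention_function(msg: str) -> bool:
    # Assumption: a mention is any @user that is neither a reply nor a retweet.
    return (USER.search(msg) is not None
            and not has_reply_function(msg)
            and not has_retweet_function(msg))

def build_corpora(messages):
    """Split S into RP, RT, MN and None = S - (RP | RT | MN)."""
    rp = {m for m in messages if has_reply_function(m)}
    rt = {m for m in messages if has_retweet_function(m)}
    mn = {m for m in messages if has_mention_function(m)}
    none = set(messages) - (rp | rt | mn)
    return {"RP": rp, "RT": rt, "MN": mn, "None": none}

def samples(corpus, size=3):
    """Group a corpus into non-overlapping samples {mi} of the given size."""
    items = sorted(corpus)
    return [items[i:i + size] for i in range(0, len(items) - size + 1, size)]
```

Each binary task, e.g. {RP, None}, would then draw its positive samples from `RP` and its negative samples from `None`.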
15. @hemant_pt
Conversation Classification: Offline Theories
— Psycholinguistics Indicators [Clark & Gibbs, 1986, Chafe 1987, etc.]
— Determiners (‘the’ vs. ‘a/an’)
— Dialogue Management (e.g., ‘thanks’, ‘anyway’), etc.
— Drawback
— Offline analysis focused on positive conversation instances
— Hypotheses
— Offline theoretic features are discriminative
— Such features correlate with information density
16. @hemant_pt
Conversation Classification: Feature Examples
CATEGORY Hj | Hj SET
H1 - Determiners | the
H3 - Subject pronouns | she, he, we, they
H9 - Dialogue management indicators | thanks, yes, ok, sorry, hi, hello, bye, anyway, how about, so, what do you mean, please, {could, would, should, can, will} followed by pronoun
H11 - Hedge words | kinda, sorta
• Feature_Hj (mi) = term-frequency ( Hj-set, mi )
• Normalized
• Total 14 feature categories
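A minimal Python sketch of Feature_Hj for the four categories listed above. Two things are assumptions here: the slide says only "Normalized", so dividing by message length in tokens is one plausible reading, and the multi-word H9 indicators (e.g., "how about", "what do you mean", modal followed by pronoun) are left out to keep the example short.

```python
import re

# Word sets copied from the slide; multi-word H9 indicators omitted for brevity.
H_SETS = {
    "H1_determiners": {"the"},
    "H3_subject_pronouns": {"she", "he", "we", "they"},
    "H9_dialogue_management": {"thanks", "yes", "ok", "sorry", "hi", "hello",
                               "bye", "anyway", "so", "please"},
    "H11_hedges": {"kinda", "sorta"},
}

def tokenize(msg):
    return re.findall(r"[a-z']+", msg.lower())

def feature_hj(hj_set, msg):
    """Term frequency of the Hj-set in message mi, normalized by token count
    (the normalization scheme is an assumption)."""
    tokens = tokenize(msg)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in hj_set) / len(tokens)

def feature_vector(msg):
    return {name: feature_hj(words, msg) for name, words in H_SETS.items()}
```

For instance, `feature_vector("I kinda feel the update helps, thanks")` gives one hit each for determiners, dialogue management, and hedges over seven tokens, and zero for subject pronouns.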
17. @hemant_pt
Conversation Classification: Results
— Dataset
— Tweets from 3 Disasters, and 3 Non-Disaster events
— Varying set size (3.8K – 609K), time periods
— Classifier:
— Decision Tree
— Evaluation: 10-fold Cross Validation
— Accuracy: 62% - 78% [Lowest for {Mention, None}]
— AUC range: 0.63 - 0.84
Purohit, Hampton, Shalin, Sheth & Flach. In Journal of Computers in Human Behavior, 2013
18. @hemant_pt
Conversation Classification: Discriminative Features
— Consistent top features across classifiers
— Pronouns (e.g., you, he)
— Dialogue management (e.g., thanks)
— Determiners (e.g., the)
— Word counts
— Positively correlated with RP, RT, MN
— Correlation Coefficient up to 0.69
19. @hemant_pt
Conversation Classification: Psycholinguistic Analysis
— LIWC: Tool for deeper content analysis [Pennebaker, 2001]
— Gives a measure per psychological category
— Categories of interest
— Social Interaction
— Sensed Experience
— Communication
— Analyzed output sets in confusion matrices (True Positive, False Negative, False Positive, True Negative)
➢ Higher values for positively classified conversations
➢ Suggests higher information for cooperative intent
Purohit, Hampton, Shalin, Sheth & Flach. In Journal of Computers in Human Behavior, 2013
20. @hemant_pt
Conversation Classification: Lessons
1. Offline theoretic features of conversations exist in the online environment
➢ Can be applied for computing social data
2. Such features correlate with information density in content
- Reflection of conversation for an intent
23. @hemant_pt
Short-text Document Intent
— Intent: Aim of action
DOCUMENT | INTENT
Text REDCROSS to 90999 to donate 10$ to help the victims of hurricane sandy | SEEKING HELP
Anyone know where the nearest #RedCross is? I wanna give blood today to help the victims of hurricane Sandy | OFFERING HELP
Would like to urge all citizens to make the proper preparations for Hurricane #Sandy - prep is key - http://t.co/LyCSprbk has valuable info! | ADVISING
24. @hemant_pt
How to identify relevant intent from ambiguous, unconstrained natural language text?
Relevant intent → Articulation of organizational tasks (e.g., Seeking vs. Offering resources)
25. @hemant_pt
Intent Classification: Problem Formulation
— Given a set of user-generated text documents, identify existing intents
— Variety of interpretations
— Problem statement: a multi-class classification task
approximate f: S → C, where
C = {c1, c2 … cK}
is a set of predefined K intent classes, and
S = {m1, m2 … mN}
is a set of N short text documents
Focus - Cooperation-assistive intent classes, C = {Seeking, Offering, None}
26. @hemant_pt
Intent Classification: Related Work
TEXT CLASSIFICATION TYPE | FOCUS | EXAMPLE
Topic | predominant subject matter | sports or entertainment
Sentiment/Emotion/Opinion | focus on present state of emotional affairs | negative or positive; happy emotion
Intent | focus on action, hence, future state of affairs | offer to help after floods
e.g., I am going to watch the awesome Fast and Furious movie!! #Excited
27. @hemant_pt
Intent Classification: Related Work
DATA TYPE | APPROACH FOCUS | LIMITED APPLICABILITY
Formal text on Webpages/blogs (Kröll and Strohmaier 2009, -15; Raslan et al. 2013, -14) | Knowledge Acquisition: via Rules, Clustering | Lack of large corpora with proper grammatical structure; poor quality text hard to parse for dependencies
Commercial Reviews, marketplace (Hollerit et al. 2013, Wu et al. 2011, Ramanand et al. 2010, Carlos & Yalamanchi 2012, Nagarajan et al. 2009) | Classification: via Rules, Lexical template based, Pattern | More generalized intents (e.g., ‘help’ broader than ‘sell’); patterns more implicit to capture than for buying/selling
Search Queries (Broder 2002, Downey et al. 2008, Case 2012, Wu et al. 2010, Strohmaier & Kröll 2012) | User Profiling: Query Classification | Lack of large query logs, click graphs; existence of social conversation
28. @hemant_pt
Intent Classification: Challenges
— Unconstrained Natural Language in small space
— Ambiguity in interpretation
— Sparsity of low ‘signal-to-noise’: Imbalanced classes
— 1% signals (Seeking/Offering) in 4.9 million tweets #Sandy
— Hard-to-predict problem:
— commercial intent, F-1 score 65% on Twitter [Hollerit et al. 2013]
@Zuora wants to help @Network4Good with Hurricane Relief. Text SANDY to 80888 & donate $10 to @redcross @AmeriCares & @SalvationArmyUS #help
*Blue: offering intent, *Red: seeking intent
29. @hemant_pt
Intent Classification: Types & Features
Intent
Binary
Crisis Domain:
- [Varga et al. 2013] Problem vs. Aid (Japanese)
- Features: Syntactic, Noun-Verb templates, etc.
Commercial Domain:
- [Hollerit et al. 2013] Buy vs. Sell intent
- Features: N-grams, Part-of-Speech
Multiclass
Commercial Domain:
- Not on Twitter
31. @hemant_pt
Intent Classification Top-Down: Binary Classifier - Prior Knowledge
— Conceptual Dependency Theory [Schank, 1972]
— Make meaning independent from the actual words in input
— e.g., Class in an Ontology abstracts similar instances
— Verb Lexicon [Hollerit et al. 2013]
— Relevant Levin’s Verb categories [Levin, 1993]
— e.g., give, send, etc.
— Syntactic Pattern
— Auxiliary & modals: e.g., ‘be’, ‘do’, ‘could’, etc. [Ramanand et al. 2010]
— Word order: Verb-Subject positions, etc.
Purohit, Hampton, Bhatt, Shalin, Sheth & Flach. In Journal of CSCW, 2014
41. @hemant_pt
Intent Classification Hybrid: Multiclass Classifier - Feature Creation
1. (T) Bag of Tokens - TOKENIZER(mi , min, max)
2. (DK) Declarative Knowledge Patterns
— Domain expert guidance
— Psycholinguistics syntactic & semantic rules
— Expand by WordNet and Levin Verbs
e.g., (how = yes) ^ (Modal-Set 'can' = yes) ^ (Pronouns except 'you' = yes) ^ (Levin Verb-Set 'give' = yes)
Pj = Feature_Pj (mi) = 1 if Pj exists in mi , else 0
3. (SK) Social Knowledge Indicators
— Offline conversation indicators studied in Problem 1
e.g., Hj = Dialogue Management, Hj-set = {Thanks, anyway,..}
Feature_Hj (mi) = term-frequency ( Hj-set, mi )
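A hedged Python sketch of the (T) and (DK) features on this slide. `tokenizer` reads TOKENIZER(mi, min, max) as word n-gram extraction for n from min to max, which is an assumption about the intended meaning. `feature_pj` evaluates the example conjunction as a binary feature; the three word sets are illustrative placeholders only, not the actual modal, pronoun, or Levin verb sets used in the thesis.

```python
import re

# Illustrative placeholder sets (assumed), not the lexicons from the thesis.
MODAL_SET_CAN = {"can", "could"}
PRONOUNS_EXCEPT_YOU = {"i", "we", "he", "she", "they"}
VERB_SET_GIVE = {"give", "send"}

def words(msg):
    return re.findall(r"[a-z0-9_#@']+", msg.lower())

def tokenizer(msg, n_min, n_max):
    """(T) Bag of tokens: all word n-grams with n_min <= n <= n_max."""
    w = words(msg)
    return [" ".join(w[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def feature_pj(msg):
    """(DK) Binary feature: 1 if every clause of the pattern holds in mi, else 0."""
    w = set(words(msg))
    return int("how" in w
               and bool(w & MODAL_SET_CAN)
               and bool(w & PRONOUNS_EXCEPT_YOU)
               and bool(w & VERB_SET_GIVE))
```

With these placeholder sets, "How can I give blood today?" satisfies the pattern, while "How can you give blood?" does not, because 'you' is excluded from the pronoun clause.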
42. @hemant_pt
Intent Classification Hybrid: Multiclass Classifier - Feature Creation
4. (CTK) Contrast Knowledge Patterns
INPUT: corpus {mi} cleaned and abstracted, min. support, X
For each class Cj
— Find contrasting pattern using sequential pattern mining
OUTPUT: contrast patterns set {P} for each class Cj
e.g., unique sequential patterns:
SEEKING: help .* victim .* _url_ .*
OFFERING: anyon .* know .* cloth .*
5. (CPK) Contrast Patterns: on Part-of-Speech tags of {mi}
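To illustrate how such sequential patterns could be turned into binary features, here is a small Python sketch. It makes two simplifying assumptions: URLs are abstracted to `_url_` with a regular expression, and stemming (suggested by the token `anyon`) is approximated by prefix matching, since the actual cleaning and abstraction pipeline is not shown in the slides.

```python
import re

CONTRAST_PATTERNS = {
    "SEEKING": "help .* victim .* _url_ .*",
    "OFFERING": "anyon .* know .* cloth .*",
}

def abstract(msg):
    """Lowercase, replace URLs with _url_, and split into tokens (assumed cleaning)."""
    text = re.sub(r"https?://\S+", "_url_", msg.lower())
    return re.findall(r"_url_|[a-z0-9']+", text)

def parse_pattern(pattern):
    """Keep the ordered tokens of a pattern, dropping the '.*' gaps."""
    return [tok for tok in pattern.split() if tok != ".*"]

def matches(pattern, tokens):
    """True if the pattern tokens occur in order, with arbitrary gaps.
    Prefix matching is a stand-in for stemming."""
    it = iter(tokens)
    return all(any(tok.startswith(p) for tok in it) for p in parse_pattern(pattern))

def pattern_features(msg):
    tokens = abstract(msg)
    return {label: int(matches(p, tokens)) for label, p in CONTRAST_PATTERNS.items()}
```

Applied to "Does anyone know how to donate clothes to hurricane #Sandy victims?", only the OFFERING pattern fires.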
43. @hemant_pt
Intent Classification Hybrid: Multiclass Classifier - Feature Creation
Finding CTK: Contrast Knowledge Patterns
For each class Cj
1. Tokenize the cleaned, abstracted text of {mi}
2. Mine Sequential Patterns: SPADE Algorithm
— Output: sequences of token sets, {P’}
3. Reduce to minimal sequences {P}
4. Compute growth rate & contrast strength for P with all other Ck
$$\mathrm{gr}(P, C_j, C_k) = \frac{\mathrm{support}(P, C_j)}{\mathrm{support}(P, C_k)} \quad (1)$$
$$\text{Contrast-Growth}(P, C_j, C_k) = \frac{1}{|C_j| - 1} \sum_{C_k,\, k \neq j} \frac{\mathrm{gr}(P, C_j, C_k)}{1 + \mathrm{gr}(P, C_j, C_k)} \quad (2)$$
$$\text{Contrast-Strength}(P, C_j) = \mathrm{support}(P, C_j) \cdot \text{Contrast-Growth}(P, C_j, C_k) \quad (3)$$
5. Top-K ranked {P} by contrast strength
OUTPUT: contrast patterns set {P} for each class Cj
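Steps 4 and 5 can be sketched in Python as below, assuming the candidate patterns have already been mined (the SPADE step is not reproduced). Two readings are assumptions: support is taken as the fraction of a class's documents containing the pattern as an ordered subsequence, and the normalizer in equation (2) is read as the number of other classes. The term gr/(1 + gr) is computed in the algebraically equal form s_j/(s_j + s_k), which avoids dividing by zero when a pattern never occurs in class Ck.

```python
def contains(pattern, tokens):
    """True if pattern occurs in tokens as an ordered subsequence with gaps."""
    it = iter(tokens)
    return all(any(tok == p for tok in it) for p in pattern)

def support(pattern, docs):
    """Fraction of documents that contain the pattern (assumed definition)."""
    if not docs:
        return 0.0
    return sum(contains(pattern, d) for d in docs) / len(docs)

def contrast_growth(pattern, j, corpora):
    """Eq. (2): mean of gr/(1+gr) over all classes k != j (needs >= 2 classes)."""
    s_j = support(pattern, corpora[j])
    others = [k for k in corpora if k != j]
    total = 0.0
    for k in others:
        s_k = support(pattern, corpora[k])
        # gr/(1+gr) with gr = s_j/s_k equals s_j/(s_j+s_k).
        total += s_j / (s_j + s_k) if (s_j + s_k) > 0 else 0.0
    return total / len(others)

def contrast_strength(pattern, j, corpora):
    """Eq. (3): support in class j weighted by contrast growth."""
    return support(pattern, corpora[j]) * contrast_growth(pattern, j, corpora)

def top_k(patterns, j, corpora, k):
    """Step 5: keep the k patterns with the highest contrast strength for class j."""
    return sorted(patterns, key=lambda p: contrast_strength(p, j, corpora),
                  reverse=True)[:k]
```

Here `corpora` is a hypothetical dict mapping each class name to a list of tokenized documents; a pattern that is frequent in class j and rare everywhere else scores close to 1.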
44. @hemant_pt
Binarization Frameworks for Multiclass Classifier: 1 vs. All
CORPUS: set of short text documents, S
FEATURES: knowledge-driven features (X^T, y)
Models M_1, M_2, …, M_K trained on (X1^T, y1), (X2^T, y2), …, (XK^T, yK), giving P(c1), P(c2), …, P(cK)
Subset Xj^T ⊂ S such that Xj^T includes all the labeled instances of class Cj for model M_j
(In 1 vs. 1 framework: K*(K-1)/2 classifiers, for each Cj,Ck pair)
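The two binarization schemes can be illustrated with a short Python sketch. It shows only the label bookkeeping: how a multiclass label vector becomes K binary targets for models M_1 … M_K, how many class pairs a 1 vs. 1 scheme needs, and how the per-class probabilities P(cj) are combined. Taking the arg-max of the probabilities is an assumption about the combination rule, and the base learners themselves are not shown.

```python
from itertools import combinations

CLASSES = ["Seeking", "Offering", "None"]

def one_vs_all_labels(y, classes):
    """For each class Cj, a binary target yj: 1 for Cj, 0 for all other classes."""
    return {c: [1 if label == c else 0 for label in y] for c in classes}

def one_vs_one_pairs(classes):
    """All K*(K-1)/2 unordered class pairs (Cj, Ck)."""
    return list(combinations(classes, 2))

def predict(prob_by_class):
    """Combine the K binary models by picking the class with the highest P(cj)
    (assumed combination rule)."""
    return max(prob_by_class, key=prob_by_class.get)
```

For K = 3 intent classes, 1 vs. All trains three binary models and 1 vs. 1 also trains three (3 * 2 / 2); the counts diverge as K grows.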
45. @hemant_pt
Intent Classification Hybrid: Multiclass Classifier - Experiments
— Datasets
— Dataset-1: Hurricane Sandy, Oct 27 – Nov 7, 2012
— Dataset-2: Philippines Typhoon, Nov 7 – Nov 17, 2013
— Parameters
— Base Learner M_j: Random Forest, 10 trees with 100 features
— bi-, tri-gram for (T)
— K=100% & min. support 10% for CTK, 50% for CPK
48. @hemant_pt
Lessons
1. Top-down & Bottom-up hybrid approach improves data representation for learning (complementary) intent classes
— 50% of the top 1% discriminative features were knowledge-driven
2. Offline theoretic social conversation (SK) features (the, thanks, etc.), often removed for text classification, are valuable for intent.
3. There is a varying effect of knowledge types (SK vs. DK vs. CTK/CPK) in different types of real-world event datasets
➢ Culturally-sensitive psycholinguistics knowledge in future
51. @hemant_pt
Problem 3. Group Engagement Model
— Engagement: degree of involvement in discussion
— Reliable groups: stay focused and collectively behave to diverge on topics
How can organizations find reliable groups to engage for action?
Purohit, Ruan, Fuhry, Parthasarathy, & Sheth. ICWSM 2014
52. @hemant_pt
Problem 3. Group Engagement Model
— Engagement: degree of involvement in discussion
— Reliable groups: stay focused and collectively behave to diverge on topics
— Why & How do groups collectively evolve over time?
1. Define a group from interaction network, g
2. Define Divergence of g: content based in contrast to structure
3. Predict change in the divergence between time slices
— Features of g based on theories of social identity, & cohesion
Purohit, Ruan, Fuhry, Parthasarathy, & Sheth. ICWSM 2014
53. @hemant_pt
Group Engagement Model: Integrated Approach Unlike Prior Work
People (User): Participant of the discussion
AND
Content (Text): Topic of Interest
AND
Network (Community): Group around topic
KEY POINT: capture User Node Diversity
Sources: tupper-lake.com/.../uploads/Community.jpg
http://www.iconarchive.com/show/people-icons-by-aha-soft/user-icon.html
54. @hemant_pt
Group Engagement Model: Discussion Divergence
— Candidate Group: Detect in interaction network
— Group Discussion Divergence: Jensen-Shannon Divergence of topic distribution on group members’ tweets
where, H(*) = Shannon Entropy
Bt = Latent topic distribution of each tweet t in all members’ tweets |Tg| ,
Bg = mean topic distribution of group g, such that:
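The divergence equation itself did not survive in this transcript. Under the standard generalized Jensen-Shannon definition with uniform weights, it would read D(g) = H(Bg) - (1/|Tg|) Σ_t H(Bt), with Bg = (1/|Tg|) Σ_t Bt. That reading is an assumption based on the symbols defined above, not a reproduction of the slide. A minimal Python sketch under this assumption:

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits; zero-probability topics contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def mean_distribution(dists):
    """Bg: element-wise mean of the per-tweet topic distributions Bt."""
    n = len(dists)
    return [sum(d[i] for d in dists) / n for i in range(len(dists[0]))]

def discussion_divergence(tweet_topic_dists):
    """Assumed form: H(Bg) minus the mean of H(Bt) over all tweets in Tg."""
    b_g = mean_distribution(tweet_topic_dists)
    mean_h = sum(entropy(b) for b in tweet_topic_dists) / len(tweet_topic_dists)
    return entropy(b_g) - mean_h
```

With two topics, a group whose tweets all share one topic distribution has divergence 0, while a group split between two disjoint topics reaches the maximum of 1 bit.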
55. @hemant_pt
Lessons
1. A content divergence based measure helps explain why groups collectively diverge
— Less diverging groups write more social & future-action related content
2. Emerging events such as disasters have higher correlation with social identity-driven features
➢ Role of social context
57. @hemant_pt
Application-1: Filter Content for Disaster Response
DISASTER Event
CITIZEN Sensors
Intent-Classifiers as a Service
RESPONSE Organizations
[SEEKING] Me and @CeceVancePR are coordinating a clothing/food drive for families affected by Hurricane Sandy. If you would like to donate, DM us
[OFFERING] Does anyone know how to donate clothes to hurricane #Sandy victims?
61. @hemant_pt
Articulation, Awareness
COOPERATIVE SYSTEM between ORGANIZATIONS and CITIZEN SENSOR COMMUNITIES
ENGAGEMENT MODELING, INTENT MINING
Org. Actor Q1. Who to engage first?
Org. Actor Q2. What are resource needs & availabilities?
62. @hemant_pt
Limitations & Future Work
— Cooperative System
— CSCW Application specific to domain of crisis
➢ How to create a full What-Where-When-Who knowledge base
— Intent Mining
— Non-cooperation-assistive intent classes and the temporal drift of intent not considered
➢ How to mine actor-level intent beyond document level
— Group Engagement
— Reliable prioritized groups based on Correlation, not Causality
— Interplay of Offline and Online interactions beyond the scope
➢ How to incorporate intent in the group divergence
— Bipartite Intent Graph Matching
— Reducing time complexity of Seeking vs. Offering matching
63. @hemant_pt
Conclusion
Prior knowledge, and interplay of features of users, their content, and network efficiently model Intent & Engagement for cooperation between citizen sensors and organizations in the online social communities.
64. @hemant_pt
Thanks to the Committee Members
[Left to Right] Prof. Amit Sheth (advisor, WSU), Prof. Guozhu Dong (WSU), Prof. Srinivasan Parthasarathy (OSU), Prof. TK Prasad (WSU), Dr. Patrick Meier (QCRI), Prof. Valerie Shalin (WSU)
Computer Science | Social Science
65. @hemant_pt
Acknowledgement, Thanks and Questions
— NSF SoCS grant IIS-1111182 to support this work
— Interdisciplinary Mentors especially Prof. John Flach (WSU), Drs. Carlos Castillo (QCRI), Fernando Diaz (Microsoft), Meena Nagarajan (IBM)
— Kno.e.sis team especially Andrew Hampton from Psychology dept. and Shreyansh and Tanvi from CSE at Wright State, as well as Yiye Ruan (now Google) & David Fuhry at the Data Mining Lab, Ohio State University
— Colleagues: Digital Volunteers from the CrisisMappers network, StandBy Task Force, InCrisisRelief.org, info4Disasters, Humanity Road, Ushahidi, etc. and the subject matter experts at UN FPA
66. @hemant_pt
Other works
Ambiguity (Interpretation)
• Short-Text Document Intent Mining [FM’14, JCSCW’14]
• Actor-Intent Mining Complexity [In preparation]
Sparsity (behaviors)
• Mutual Influence in Sparse Friendship Network [AAAI ICWSM’12]
• User Summarization with Sparse Profile Metadata [ASE SocialInfo’12]
Diversity (users)
• Modeling Group Using Diverse Social Identity & Cohesion [AAAI ICWSM’14]
• Modeling Diverse User-Engagement [SOME WWW’11, ACM WebSci’12]
Scalability
• Matching intent as task of Information Retrieval [FM’14]
• Knowledge-aware Bi-partite Matching [In preparation]