A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research
1. A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research
Presenter: Saeedeh Shekarpour
Authors: Mohammadreza Rezvan, Saeedeh Shekarpour, Lakshika Balasuriya, Krishnaprasad Thirunarayan, Valerie L. Shalin, Amit Sheth
2. Motivation
● Social media is used extensively by people of various age groups
● Participants in such media experience insults, bullying, harassment, etc.
● 66% of internet users who have experienced harassment encountered it on a social media app (18% of cases are serious)
● It poses challenges to social engagement and trust, resulting in emotional distress, privacy
concerns and threats to physical safety
6. Lexicon Creation
● We compiled our lexicon from online resources.
● Our lexicon is categorized into six types of harassment.
Category              Count
Sexual                453
Appearance-related    15
Intellectual          34
Racial                168
Political             23
Generic               44
7. Corpus Development
● We chose Twitter (313 million monthly active users posting over 500 million tweets per day)
● We used our lexicon as seed terms for collecting tweets from Twitter (December 18th, 2016 to January 10th, 2017).
● We collected 10,000 tweets for each contextual type, for a total of 50,000 tweets.
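The slides do not show how seed terms were matched against tweets, so the following is only a minimal sketch of one way to do it, assuming the lexicon is available as a mapping from contextual type to a list of terms and the tweets as plain strings. The function names, the placeholder terms, and the whole-word matching rule are illustrative assumptions, not the authors' pipeline.

```python
import re


def build_seed_pattern(seed_terms):
    """Compile one case-insensitive regex that matches any seed term as a whole word or phrase."""
    # Longest terms first so multi-word phrases win over their own prefixes.
    ordered = sorted({t.lower() for t in seed_terms}, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    # The lookarounds reject matches embedded inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def collect_by_type(tweets, lexicon, per_type_limit):
    """Group tweets under every contextual type whose seed terms they contain."""
    patterns = {ctype: build_seed_pattern(terms) for ctype, terms in lexicon.items()}
    collected = {ctype: [] for ctype in lexicon}
    for text in tweets:
        for ctype, pattern in patterns.items():
            # Stop adding to a type once its quota (e.g. 10,000) is reached.
            if len(collected[ctype]) < per_type_limit and pattern.search(text):
                collected[ctype].append(text)
    return collected


if __name__ == "__main__":
    # Placeholder terms only; the real lexicon entries are not reproduced here.
    lexicon = {"Racial": ["term_a", "term b"], "Political": ["term_c"]}
    tweets = ["You are such a term_a", "nothing here", "TERM_C everywhere"]
    for ctype, matched in collect_by_type(tweets, lexicon, per_type_limit=10).items():
        print(ctype, matched)
```

A tweet that contains terms from several types is kept under each of them in this sketch; whether the original collection allowed such overlap is not stated.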
8. Annotation
● The presence of profane words does not guarantee that a tweet is harassing
● Human judges annotated the corpus to discriminate harassing tweets from non-harassing
tweets and improve the corpus.
Contextual Type       Annotated Tweets    Yes      No
Sexual                3,855               230      3,619
Racial                4,976               701      4,275
Appearance-related    4,828               678      4,150
Intellectual          4,867               811      4,056
Political             5,663               699      4,964
Combined              24,189              3,119    21,070
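To make the class balance in the table above easier to read, the share of tweets labeled "Yes" can be computed per contextual type. This is a small arithmetic sketch; the counts are copied from the table, while the function and variable names are illustrative.

```python
# (annotated, yes) counts copied from the annotation table.
ANNOTATION_COUNTS = {
    "Sexual": (3855, 230),
    "Racial": (4976, 701),
    "Appearance-related": (4828, 678),
    "Intellectual": (4867, 811),
    "Political": (5663, 699),
}


def harassing_share(annotated, yes):
    """Fraction of annotated tweets that judges labeled as harassing."""
    if annotated <= 0:
        raise ValueError("annotated must be positive")
    return yes / annotated


def share_by_type(counts):
    """Map each contextual type to its harassing share."""
    return {ctype: harassing_share(total, yes) for ctype, (total, yes) in counts.items()}


if __name__ == "__main__":
    for ctype, share in share_by_type(ANNOTATION_COUNTS).items():
        print(f"{ctype}: {share:.1%}")
```

In every type the harassing tweets are a clear minority, which matters for anyone training a classifier on this corpus.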
9. Agreement Rate
● The corpus contains only tweets that received at least two “yes” or two “no” labels
● Cohen’s kappa coefficient measures the quality of our annotation
Harassment Type       Agreement Rate
Sexual                0.70
Racial                0.84
Appearance-related    1.00
Intellectual          0.80
Political             0.69
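The two ideas on this slide, keeping only tweets with at least two matching labels and scoring agreement with Cohen's kappa, can be sketched as follows. This is an illustrative implementation rather than the authors' code: it assumes labels are the strings "yes" and "no", and because the slides do not say how kappa was aggregated across judges, it computes kappa for a single pair of raters.

```python
from collections import Counter


def consensus_label(labels, min_votes=2):
    """Return "yes" or "no" if exactly one of them reached min_votes, else None."""
    counts = Counter(labels)
    winners = [lab for lab in ("yes", "no") if counts[lab] >= min_votes]
    return winners[0] if len(winners) == 1 else None


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # p_o: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    print(consensus_label(["yes", "no", "yes"]))
    print(cohen_kappa(["yes", "yes", "no", "no"], ["yes", "no", "no", "no"]))
```

A kappa of 1.00, as reported for the appearance-related type, means the raters never disagreed; values near 0.70 indicate agreement well above chance but with a noticeable number of disputed tweets.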
10. Comparison
● Golbeck corpus: provides generic annotation, i.e., (i) harassing and (ii) non-harassing.
● This corpus contains 20,428 non-redundant annotated tweets of which only 5,277 are labeled
as harassing.
Contextual Type       # of Tweets
Sexual                380
Racial                4,148
Appearance-related    145
Intellectual          381
Political             163
Non-harassing         41
Total                 5,277
11. Conclusion and Future Research
● We developed a quality corpus for harassment research that accounts for the contextual type of harassment
● This dataset is publicly available at:
https://github.com/Mrezvan94/Harassment-Corpus
● Learning harassing language from the contextual types
● A schema for discriminating among various definitions of offensive language/hate speech
12. Acknowledgment
This project is supported by National Science Foundation (NSF) award CNS 1513721: Context-Aware Harassment Detection on Social Media.
13. Last but not least
We wish for a Web of
❖ TRUST
❖ SYMPATHY
❖ LOVE
The appearance-related context shows the highest agreement rate, whereas the political and sexual contexts show the lowest, indicating that tweets in these contexts are more ambiguous and therefore harder to judge.