A quality type aware annotated corpus and lexicon for harassment research

Saeedeh Shekarpour

Presented at the Web Science 2018 conference, Amsterdam, May 2018

A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research
Presenter: Saeedeh Shekarpour
Authors: Mohammadreza Rezvan, Saeedeh Shekarpour, Lakshika Balasuriya, Krishnaprasad Thirunarayan, Valerie L. Shalin, Amit Sheth

Motivation
● Social media is used extensively by people of various age groups
● Participants in such media experience insults, bullying, harassment, etc.
● 66% of internet users who have experienced harassment experienced it on a social media app (18% are serious)
● Harassment poses challenges to social engagement and trust, resulting in emotional distress, privacy concerns, and threats to physical safety

Our Contribution (considering type/context)

Goal
Developing a quality corpus that is type-aware

Steps
➔ Lexicon Creation
➔ Corpus Development
➔ Annotation
➔ Agreement Rate

Lexicon Creation
● We compiled our lexicon from online resources.
● Our lexicon is categorized into six types of harassment.
Category              Count
Sexual                  453
Appearance-related       15
Intellectual             34
Racial                  168
Political                23
Generic                  44
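
One way to hold a typed lexicon in memory is sketched below. This is an illustration only, not the authors' implementation: the entries are placeholders (the real lexicon terms are not reproduced here), the names LEXICON, build_index and category_sizes are hypothetical, and the idea that a term may appear under more than one type is an assumption.

```python
from collections import defaultdict

# Placeholder entries standing in for real lexicon terms (hypothetical).
LEXICON = {
    "sexual": {"term_s1", "term_s2"},
    "racial": {"term_r1"},
    "generic": {"term_g1", "term_s2"},  # assumption: a term may carry several types
}


def build_index(lexicon):
    """Invert category -> terms into term -> set of categories."""
    index = defaultdict(set)
    for category, terms in lexicon.items():
        for term in terms:
            index[term.lower()].add(category)
    return dict(index)


def category_sizes(lexicon):
    """Number of terms per category, comparable to the Count column."""
    return {category: len(terms) for category, terms in lexicon.items()}


INDEX = build_index(LEXICON)

if __name__ == "__main__":
    print(category_sizes(LEXICON))
    print(sorted(INDEX["term_s2"]))
```

The inverted index makes it cheap to ask which types a given term belongs to, which is the lookup needed when tagging a tweet by type.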

Corpus Development
● We chose Twitter (313 million monthly active users posting over 500 million tweets per day)
● We used our lexicon as seed terms for collecting tweets from Twitter (December 18, 2016 to January 10, 2017).
● We collected 10,000 tweets for each contextual type, for a total of 50,000 tweets.
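
The seed-term collection step can be sketched as a filter over a stream of tweet texts. This is a minimal sketch under stated assumptions: the slides do not say how matching was done, so whole-word, case-insensitive matching is an assumption here, and collect_by_type, the seed words, and the sample tweets are all hypothetical. No Twitter API call is shown. Each type is assumed to have at least one seed term.

```python
import re


def collect_by_type(tweets, seed_terms, quota):
    """Group tweet texts by the contextual type whose seed terms they contain.

    Stops adding to a type once it holds `quota` tweets.
    """
    # One case-insensitive whole-word pattern per type (matching rule is an assumption).
    patterns = {
        ctype: re.compile(
            r"\b(?:" + "|".join(re.escape(t) for t in sorted(terms)) + r")\b",
            re.IGNORECASE,
        )
        for ctype, terms in seed_terms.items()
    }
    collected = {ctype: [] for ctype in seed_terms}
    for text in tweets:
        for ctype, pattern in patterns.items():
            if len(collected[ctype]) < quota and pattern.search(text):
                collected[ctype].append(text)
    return collected


# Hypothetical seed words and tweets, for illustration only.
SEEDS = {"political": {"seedword"}, "racial": {"otherword"}}
SAMPLE = [
    "A seedword here",
    "nothing relevant",
    "SEEDWORD again",
    "seedword a third time",
    "otherword appears",
]
RESULT = collect_by_type(SAMPLE, SEEDS, quota=2)
```

With quota=10000 and five contextual types, a filter of this shape would yield the 50,000-tweet pool described above.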

Annotation
● Containing profane words does not guarantee that a tweet is harassing
● Human judges annotated the corpus to discriminate harassing tweets from non-harassing
tweets and improve the corpus.
Contextual Type       Annotated Tweets     Yes       No
Sexual                           3,855     230    3,619
Racial                           4,976     701    4,275
Appearance-related               4,828     678    4,150
Intellectual                     4,867     811    4,056
Political                        5,663     699    4,964
Combined                        24,189   3,119   21,070
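
To make the point about profane words concrete, the share of tweets judged harassing can be computed per type from the Yes and No columns above. The sketch below uses only those two columns; the names ANNOTATION_COUNTS, harassing_rate and RATES are hypothetical.

```python
# (yes, no) label counts per contextual type, copied from the table above.
ANNOTATION_COUNTS = {
    "sexual": (230, 3619),
    "racial": (701, 4275),
    "appearance-related": (678, 4150),
    "intellectual": (811, 4056),
    "political": (699, 4964),
}


def harassing_rate(yes, no):
    """Fraction of labeled tweets that the judges marked as harassing."""
    return yes / (yes + no)


RATES = {
    ctype: harassing_rate(yes, no)
    for ctype, (yes, no) in ANNOTATION_COUNTS.items()
}

if __name__ == "__main__":
    # Print types from highest to lowest harassing rate.
    for ctype, rate in sorted(RATES.items(), key=lambda item: item[1], reverse=True):
        print(f"{ctype:20s} {rate:.1%}")
```

Under these counts, fewer than one in five lexicon-matched tweets is judged harassing for every type, which is why lexicon matching alone is a weak signal.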

Agreement Rate
● The corpus only contains tweets that receive at least two “yes” or two “no” labels
● Cohen’s kappa coefficient measures the quality of our annotation
Harassment Type       Agreement Rate
Sexual                          0.70
Racial                          0.84
Appearance-related              1.00
Intellectual                    0.80
Political                       0.69
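
For readers unfamiliar with the measure, Cohen's kappa for two raters is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The sketch below implements that textbook formula together with the "at least two matching labels" filter. The slides do not state how many judges labeled each tweet or how kappa was aggregated across them, so the three-vote example and the function names are assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items where the two raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def majority_label(votes):
    """Return the label given by at least two judges, or None if there is none."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None


# Hypothetical labels, for illustration only.
RATER_A = ["yes", "yes", "no", "no"]
RATER_B = ["yes", "no", "no", "no"]
KAPPA = cohens_kappa(RATER_A, RATER_B)
```

In this toy example the raters agree on 3 of 4 items (p_o = 0.75) and chance agreement is 0.5, giving kappa = 0.5.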

Comparison
● Golbeck corpus: provides generic annotation, i.e., (i) harassing and (ii) non-harassing.
● This corpus contains 20,428 non-redundant annotated tweets, of which only 5,277 are labeled as harassing.
Contextual Type       # of Tweets
Sexual                        380
Racial                      4,148
Appearance-related            145
Intellectual                  381
Political                     163
Non-harassing                  41
Total                       5,277

Conclusion and Future Research
● Developed a quality corpus for harassment research that considers contextual type
● This dataset is publicly available at:
https://github.com/Mrezvan94/Harassment-Corpus
● Learning harassing language from the contextual types
● A schema for discriminating among various definitions of offensive language/hate speech

Acknowledgment
This project is supported by National Science Foundation (NSF) award CNS 1513721: Context-Aware Harassment Detection on Social Media.

Last but not least
We wish for a Web of
❖ TRUST
❖ SYMPATHY
❖ LOVE

Editor's Notes

The appearance-related context shows the highest agreement rate, whereas the political and sexual contexts have the lowest, indicating that these contexts are more challenging to judge (ambiguity is higher).
