Deep learning based Domain-specific text
generation for online harassment detection
Master’s Thesis Defense
Abhishek Nalamothu
Kno.e.sis Center
Department of Computer Science and
Engineering
Wright State University, Dayton, Ohio
June 25th, 2019
Advisor:
Dr. Amit P. Sheth
Committee Members:
Dr. KeKe Chen,
Dr. Valerie Shalin
Mentor:
Dr. Shreyansh Bhatt
Outline
• Online harassment
• Data related problems to detect harassment
• Ways to solve
- Text generation
• Problems with state of the art text generation models
• Solution
• Evaluation and results
• Conclusion and future work
7%
4-in-10 62%
Young folks use social networking
sites as a tool
Unfiltered Anonymous
- 2005
76% - 2017
Online Harassment[1]
Automatic online Harassment detection
In our examination…
1. Abusive and hate speech tweets [2]
43 K tweets
3.5 K labeled as harassment (8.1%)
39.5 K labeled as Normal (91.9%)
2. Scarcity of positive labeled data
Harassment detection
Machine learning
Well balanced training data is important
1. Collecting more positive labeled data.
2. Active learning
3. Data augmentation by generating synthetic minority-class samples that are
diverse among themselves and different from the training data
Ways to address …
Text Augmentation, Text Generation, Data Balance
Text Generation
Text generation is one of the most attractive problems in the NLP community
Text generation = modeling the conditional probability of the next word in a sequence
P(w_n | w_1, w_2, w_3, w_4, ..., w_{n-1})
Example: He, is, not, that, retard
Deep learning models have become the state of the art
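A minimal sketch of the next-word idea, not the thesis model: it approximates P(w_n | w_1, ..., w_{n-1}) with a bigram model that conditions only on the previous word. The toy corpus, the `<s>`/`</s>` boundary tokens, and the function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Estimate P(next | previous) from bigram counts (first-order approximation)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {word: c / sum(nxts.values()) for word, c in nxts.items()}
        for prev, nxts in counts.items()
    }

def generate(model, max_len=10):
    """Greedy generation: repeatedly pick the most probable next word."""
    token, out = "<s>", []
    for _ in range(max_len):
        token = max(model[token], key=model[token].get)
        if token == "</s>":
            break
        out.append(token)
    return out

# Toy corpus, purely illustrative.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram(corpus)
sentence = generate(model)
```

An RNN or LSTM replaces the count table with a learned function of the whole prefix, which is what lets it condition on more than one previous word.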
State-of-the-art Text Generation Models
Recurrent neural networks (RNNs)[3,4]
Problem: learning long-term dependencies
Example : I grew up in “France”,…….. I speak “French” fluently.
RNNs with LSTM [5,6,7]
Uses memory cell
Problem: Exposure Bias
The objective is to maximize the
likelihood of true token in the training
sequence
Auto Regressive Models
Generative Adversarial Nets
Generator : Learns to confuse
Discriminator by generating high quality
data
Discriminator : Learns to distinguish
whether a given data instance is real or not
State-of-the-art Text Generation Models
Sequence GANs [8,9,10]
Rewarder : High rewards to training data
Low rewards to generated data
Generator : Aims to get high rewards
Rank GANs [11,12,13]
Ranker : High rank to training data
Low rank to generated data
Generator : Aims to get high rank
SeqGAN & Rank GAN problems
Sparse Rewards
Example : I’m not sexist but it is female face / I’m face not sexist
Labels on the example: potential to generate meaningful sentence / not meaningful
Doesn’t help the model to learn until meaningful part of the sentence
Mode Collapse
Generator generates a limited diversity of samples, or
even the same sample, regardless of the input.
Inverse Reinforcement learning
Based GANs [14]
Solution To Reward Sparsity : Rewards at token level
Solution to mode collapse : Entropy
Rewarder objective (similar to previous implementations)
High rewards to training data
Low rewards to generated data
Generator Objective
The objective of the text generator is to maximize the expected reward plus an entropy term.
Where,
τ : generated sentence
qθ(τ) : generator
Rφ(τ) : rewarder
Maximize the expected rewards of the generated texts: Meaning
Maximize entropy: Diversity
Inverse Reinforcement learning
Based GANs
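The equation on this slide did not survive extraction. Going only by the wording above (expected reward plus entropy, with τ the generated sentence, q_θ the generator, and the rewarder R with its own parameters, written φ here), one way to write the generator objective is sketched below. The name J and the exact form are assumptions and should be checked against [14].

```latex
J(\theta) = \mathbb{E}_{\tau \sim q_\theta(\tau)}\left[ R_\phi(\tau) \right] + H\left( q_\theta(\tau) \right),
\qquad
H\left( q_\theta(\tau) \right) = -\,\mathbb{E}_{\tau \sim q_\theta(\tau)}\left[ \log q_\theta(\tau) \right]
```

The first term pushes generated sentences toward high reward (meaning); the entropy term keeps the generator's distribution spread out (diversity).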
Generator Objective problem
• After a certain number of iterations, the generated sentences start to lose their meaning
Where,
τ : generated sentence
qθ(τ) : generator
Rφ(τ) : rewarder
Maximize the expected rewards of the generated texts: Meaning
Maximize entropy: Diversity
Inverse Reinforcement learning
Based GANs
Proposed solution
Incorporated a term that uses domain-specific knowledge to preserve meaning.
• The objective of the text generator is to maximize the following:
The expected reward
The sentence-level cosine similarity between train data and generated data
An entropy term
Maximize the expected rewards of the generated texts: Meaning
Maximize entropy: Diversity
Cosine similarity term: uses domain-specific knowledge to preserve meaning
IRL With Domain-Specific Knowledge
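A minimal sketch of a sentence-level cosine similarity, under stated assumptions: each sentence is represented by the mean of its word vectors, and the two-dimensional `toy_embeddings` are invented for illustration. The slides do not say how sentence vectors are built or how this term is weighted in the objective, so this is not the thesis implementation.

```python
import math

def sentence_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding (assumed pooling)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical 2-d embeddings; real ones would come from the skip-gram model.
toy_embeddings = {
    "good": [1.0, 0.0],
    "great": [0.9, 0.1],
    "day": [0.0, 1.0],
    "bad": [-1.0, 0.0],
}
train_vec = sentence_vector("good day".split(), toy_embeddings)
close_vec = sentence_vector("great day".split(), toy_embeddings)
far_vec = sentence_vector("bad".split(), toy_embeddings)

sim_close = cosine(train_vec, close_vec)
sim_far = cosine(train_vec, far_vec)
```

A generated sentence that stays near the training sentences in embedding space scores higher, which is the sense in which the extra term can preserve meaning while the entropy term encourages diversity.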
Improvements in learning
Generator objective plot: relatively better meaningful sentences; improves the balance between diversity and meaning
Rewarder objective plot: relatively better rewards
Data & Experimental setup
Dataset - 43000 tweets
1. RT @Rambobiggs: Holy hell these people are disgusting https://t.co/or6RE4DRPd
2. can someone sum this up before i call this guy retarded https://t.co/yuQVEUc
3. RT @ashllyd: SICK OF BITCHES ON THE INTERNET 🐍🙅👉https://t.co/BkyqCFx64G
@UKBloggers1 @FemaleBloggerRT @TheGirlGangHQ #fbloggers #fblchat
4. RT @Stafaa__: I hate them hoe ass braids 😂🙌🏾 https://t.co/fr5gMyp4rJ
5. RT @ayevonnn: bruh i fucking hate people like this 😤 https://t.co/dceEXQhnhq
6. @SenWarren @SenJeffMerkley Yes because u idiots never run out of nonsense…
7. RT @notwaving: I hate it when I'm trying to board a bus and there's already an asshole
on it. https://t.co/Qps29bAaoA
Harassment (Positive Labeled) Tweets: 3500
Training Data: 3150 (90%)
Test Data: 350 (10%)
1. @discordapp bro i cant fucking wait for the Video Screensharing!!! 😭😂😭😂
2. Forgetting to pack a sports bra for the gym is the fucking worst 👿
3. @zezrie @Jezebel It's like it's a fucking crime to have a vagina in this country now! 😭😡👎🏾👎🏾
4. the lyrics is perfect, the melody is so fucking sexy and the guitar OMFG congrats @ShawnMendes
#JFCShawnMendes
5. RT @IIXXIV_: My face be oily as hell when I wake up n I hate it.
6. BLOODY HELL YES I MISS HIM 😭😭 https://t.co/k2v8u3bPDq
7. Sprint is killing its best 50 percent off deal https://t.co/8GvX6DHF5n https://t.co/wWcCgx1fBe ……
https://t.co/YnihEdvIOO https://t.co/Yry1YK0P0d
Normal (Negative Labeled) Tweets: 39600
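The 90%/10% split of the 3500 harassment tweets can be sketched as follows. Shuffling with a fixed seed is an assumption; the slides give only the resulting sizes (3150 and 350), not how the split was drawn.

```python
import random

def split_train_test(items, test_fraction=0.10, seed=0):
    """Shuffle a copy of the items and hold out a fraction as test data."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for the 3500 positive-labeled tweets.
tweet_ids = range(3500)
train, test = split_train_test(tweet_ids)
```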
Tweet-Preprocessor
1. RT @Rambobiggs: Holy hell these people are disgusting https://t.co/or6RE4DRPd
2. can someone sum this up before i call this guy retarded https://t.co/yuQVEUc
3. RT @ashllyd: SICK OF BITCHES ON THE INTERNET 🐍🙅👉https://t.co/BkyqCFx64G @UKBloggers1
@FemaleBloggerRT @TheGirlGangHQ #fbloggers #fblchat
4. @discordapp bro i cant fucking wait for the Video Screensharing!!! 😭😂😭😂
5. BLOODY HELL YES I MISS HIM 😭😭 https://t.co/k2v8u3bPDq
Removed: URL, Mention, Hashtag, Reserved Words, Emoji (text is also lowercased)
1. holy hell these people are disgusting
2. can someone sum this up before i call this guy retarded
3. sick of bitches on the internet
4. bro i cant fucking wait for the video screensharing
5. bloody hell yes i miss him
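The cleaning step above can be approximated with a few regular expressions. This is a stand-in sketch, not the API of the preprocessing tool named on the slide: the patterns, the reserved-word list (`RT`, `FAV`), and the function name `clean_tweet` are assumptions chosen to reproduce the before/after examples.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RESERVED = re.compile(r"\b(?:RT|FAV)\b")   # assumed reserved words
NON_TEXT = re.compile(r"[^a-z0-9\s']")     # drops emoji and punctuation

def clean_tweet(text):
    """Strip URLs, mentions, hashtags, reserved words, and emoji; lowercase the rest."""
    for pattern in (URL, MENTION, HASHTAG, RESERVED):
        text = pattern.sub(" ", text)
    text = NON_TEXT.sub(" ", text.lower())
    return " ".join(text.split())

cleaned = clean_tweet(
    "RT @Rambobiggs: Holy hell these people are disgusting https://t.co/or6RE4DRPd"
)
```

Reserved words are removed before lowercasing so that the case-sensitive `RT` pattern does not touch ordinary words.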
Domain Specific Data Examples:
ugh crippling self doubt is the worst
the number 1 angel shows are gonna be fucking wiiiiild
bro i cant fucking wait for the video screensharing
Andy carroll at the vodafone stadium tonight n cannot fucking wait
yesssssssssss screams oh my bloody hell that would be my two fav youtuber collabing
Domain-specific data (Tweets) → Skip-gram Model → Word embedding
Abusive and hate speech tweets [2] as domain-specific data
Size: 43 K tweets
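A skip-gram model is trained to predict the words surrounding each center word. The sketch below only builds the (center, context) training pairs for one preprocessed tweet; the window size and the function name are assumptions, and the embedding training itself is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# One of the cleaned example tweets from the preprocessing slide.
pairs = skipgram_pairs("bloody hell yes i miss him".split(), window=1)
```

With `window=1` each interior word contributes two pairs and each end word one, so the six-token example yields ten pairs.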
Tweets generation
Input: Harassment Tweets (Positive Labeled), Training data : 3150, Vocabulary : ~ 6200
Input: Embeddings from Domain-Specific data (closely related), 43000
Model: Inverse Reinforcement Learning based GANs with Domain-Specific Knowledge (Our Model)
Output: Generated Tweets, 600 (~ 20% of training data)
Generated Tweets
@ user trump is one ugly hoe​
@ user sick i hate the metro​
@ user fuck yes i would have slapped the fuck out her​
@ user islamic state says us being run by an idiot is getting disgusting​
@ user he was just a puppet idiot on a string​
@ user yo momma had them ugly ass heels not me​
@ user this idiot called it devoutness instead of wormers​
@ user islamic state says us being run by an idiot is fucked up​
@ user pissed off because im sick and i cannot sleep because of it​
@ user you had this worst fucking part and chest like signboard​
Diversity
Tweet from Train data: akademiks is fucking ugly lmfaooooo
Generated tweet: akademiks is fucking disgusting
Tweets from Train data:
fuck sake do not let this slip you retard
i hate when a manly looking bitch say she want a nigga like shorty you is a nigga
do not you hate when a bitch say you cannot fight like come here bitch lemme see where yoo hands
i hate bitches that do not know how to mind they business
i hate bitches that say this
i hate bitches that always wanna argue
i hate bitches that wear heels but cannot walk in em
Generated tweet: let this bitch say i hate bitches
Example 1
Example 2
Diversity
Tweets from Train data:
i can not wait to be done with these disgusting bitches
i be on my ugly ass niggah line
Generated tweet: done with these ugly ass niggah
Tweets from Train data:
two beautiful and sexy best woman hot body sensual hot scene ********* ******* ********
bad bitches do not take days off
some ugly bitches on tonights comedine with me show
yessss people underestimate this they dont wanna gag and shit but do it slurp on that dick
bitch do your fucking homework
you are out of your fucking mind your disgusting lying reckless fucking mind
roman beating undertaker is one thing but that was his fucking retirement match
Generated tweet: two beautiful bitches and shit in your fucking retirement world
Example 3
Example 4
Harassment detection pipeline setup
Augmentation improves detection task by 10%
Evaluation metrics
BiLingual Evaluation Understudy (BLEU)
• BLEU [15] was originally a metric for evaluating the quality of machine-translated text.
• Here BLEU is used to compute the quality of the generated set.
Example
Candidate: the the cat
Reference 1: the cat is on the mat
Calculation:
The higher the BLEU score, the better the quality of the generator.
Problem :
The objective of our model is to generate data that is diverse with respect to the train data set.
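The worked calculation is missing from the slide. As a hedged illustration, the sketch below computes the clipped (modified) n-gram precision from the BLEU paper [15] for the candidate and reference above. It covers only the precision part; full BLEU also combines several n-gram orders and applies a brevity penalty. The function names are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "the the cat".split()
references = ["the cat is on the mat".split()]
p1 = modified_precision(candidate, references, 1)  # 3/3 = 1.0
p2 = modified_precision(candidate, references, 2)  # 1/2 = 0.5
```

The reference contains "the" twice, so both occurrences in the candidate are counted and the unigram precision is 1.0; only the bigram "the cat" matches, giving 0.5.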
Evaluation metric | Measures | Reference set | Evaluated set | Desired direction
BLEUdiv | Train-diversity | Train data | Generated data | Lower
Self-BLEU | Self-diversity | Generated sentence | Remaining generated sentences | Lower
Normalized perplexity | Quality | Domain specific data | Generated data | Higher
Cosine Similarity | Quality | Train data | Generated data | Higher
Evaluation metrics
Normalized perplexity:
Perplexity
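The formulas on this slide were lost in extraction. For orientation only, the standard perplexity of a language model over a sequence of N tokens is given below. How the thesis normalizes it is not recoverable from the slide and is not reproduced here.

```latex
\mathrm{PP}(w_1, \dots, w_N) = P(w_1, \dots, w_N)^{-1/N}
= \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1}) \right)
```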
Results
Future work
Better knowledge embedding
Style based generation
References
[1] Perrin, Andrew. "Social media usage: 2005–2015. 2015." URL
http://www.pewinternet.org/2015/10/08/social-networking-usage-2005-2015 (2017).
[2] Founta, Antigoni Maria, et al. "Large scale crowdsourcing and characterization of twitter abusive
behavior." Twelfth International AAAI Conference on Web and Social Media. 2018.
[3] Mikolov, Tomáš, et al. "Recurrent neural network based language model." Eleventh annual conference of the
international speech communication association. 2010.
[4] Mikolov, Tomáš, et al. "Extensions of recurrent neural network language model." 2011 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011.
[5] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997):
1735-1780.
[6] Graves, Alex. "Generating sequences with recurrent neural networks." arXiv preprint arXiv:1308.0850
(2013).
[7] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks."
Advances in neural information processing systems. 2014.
[8] Yu, Lantao, et al. "Seqgan: Sequence generative adversarial nets with policy gradient." Thirty-First AAAI
Conference on Artificial Intelligence. 2017.
[9] Shrivastava, Ashish, et al. "Learning from simulated and unsupervised images through adversarial
training." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[10] Li, Jiwei, et al. "Adversarial learning for neural dialogue generation." arXiv preprint
arXiv:1701.06547 (2017).
[11] Lin, Kevin, et al. "Adversarial ranking for language generation." Advances in Neural Information
Processing Systems. 2017.
[12] Guo, Jiaxian, et al. "Long text generation via adversarial training with leaked
information." Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
[13] Yang, Zichao, et al. "Unsupervised text style transfer using language models as
discriminators." Advances in Neural Information Processing Systems. 2018.
[14] Z. Shi, X. Chen, X. Qiu, and X. Huang, “Towards diverse text generation with inverse
reinforcement learning,” arXiv preprint arXiv:1804.11258, 2018.
[15] Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for
automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA., pages 311–318.
ACL.
Acknowledgement
Committee members:
Dr. Amit Sheth (Advisor)
Dr. Valerie Shalin
Dr. Keke Chen
Dr. Shreyansh Bhatt
Dr. Krishnaprasad Thirunarayan
Co-author:
Thank you
Questions ?

  • 22. Domain Specific Data Examples: ugh crippling self doubt is the worst the number 1 angel shows are gonna be fucking wiiiiild bro i cant fucking wait for the video screensharing Andy carroll at the vodafone stadium tonight n cannot fucking wait yesssssssssss screams oh my bloody hell that would be my two fav youtuber collabing Domain-specific data (Tweets) Skip-gram Model Word embedding Abusive and hate speech tweets [2] as domain-specific data Size: 43 K tweets
  • 23. Tweets generation Inverse Reinforcement Learning based GANs with Domain-Specific Knowledge (Our Model) Harassment Tweets (Positive Labeled) Training data : 3150 Vocabulary : ~ 6200 Domain-Specific data (closely related) 43000 Embeddings Generated Tweets 600 (~ 20% of training data)
  • 24. Generated Tweets @ user trump is one ugly hoe​ @ user sick i hate the metro​ @ user fuck yes i would have slapped the fuck out her​ @ user islamic state says us being run by an idiot is getting disgusting​ @ user he was just a puppet idiot on a string​ @ user yo momma had them ugly ass heels not me​ @ user this idiot called it devoutness instead of wormers​ @ user islamic state says us being run by an idiot is fucked up​ @ user pissed off because im sick and i cannot sleep because of it​ @ user you had this worst fucking part and chest like signboard​
  • 25. Diversity Tweet from Train data: akademiks is fucking ugly lmfaooooo Generated tweet: akademiks is fucking disgusting Tweets from Train data: fuck sake do not let this slip you retard i hate when a manly looking bitch say she want a nigga like shorty you is a nigga do not you hate when a bitch say you cannot fight like come here bitch lemme see where yoo hands i hate bitches that do not know how to mind they business i hate bitches that say this i hate bitches that always wanna argue i hate bitches that wear heels but cannot walk in em Generated tweet: let this bitch say i hate bitches Example 1 Example 2
  • 26. Diversity Tweets from Train data: i can not wait to be done with these disgusting bitches i be on my ugly ass niggah line Generated tweet: done with these ugly ass niggah Tweets from Train data: two beautiful and sexy best woman hot body sensual hot scene ********* ******* ******** bad bitches do not take days off some ugly bitches on tonights comedine with me show yessss people underestimate this they dont wanna gag and shit but do it slurp on that dick bitch do your fucking homework you are out of your fucking mind your disgusting lying reckless fucking mind roman beating undertaker is one thing but that was his fucking retirement match Generated tweet: two beautiful bitches and shit in your fucking retirement world Example 3 Example 4
  • 27. Harassment detection pipeline setup Augmentation improves detection task by 10%
  • 29. BiLingual Evaluation Understudy (BLEU) • BLEU [15] was originally a metric for evaluating the quality of machine-translated text. • Here BLEU is used to compute the quality of the generated set. Example calculation: Candidate: the the cat. Reference 1: the cat is on the mat. The higher the BLEU score, the better the quality of the generator. Problem: the objective of our model is to generate data that is diverse from the training data set.
  • 30. Evaluation metrics
      Evaluation metric | Measures | Reference set | Evaluated set | Desired direction
      BLEUdiv | Train-diversity | Train data | Generated data | Lower
      Self-BLEU | Self-diversity | Generated sentence | Remaining generated sentences | Lower
      Normalized perplexity | Quality | Domain specific data | Generated data | Higher
      Cosine Similarity | Quality | Train data | Generated data | Higher
  • 33. Future work Better knowledge embedding Style based generation
  • 34. References [1] Perrin, Andrew. "Social media usage: 2005–2015. 2015." URL http://www.pewinternet.org/2015/10/08/social-networking-usage-2005-2015 (2017). [2] Founta, Antigoni Maria, et al. "Large scale crowdsourcing and characterization of twitter abusive behavior." Twelfth International AAAI Conference on Web and Social Media. 2018. [3] Mikolov, Tomáš, et al. "Recurrent neural network based language model." Eleventh annual conference of the international speech communication association. 2010. [4] Mikolov, Tomáš, et al. "Extensions of recurrent neural network language model." 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011. [5] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780. [6] Graves, Alex. "Generating sequences with recurrent neural networks." arXiv preprint arXiv:1308.0850 (2013). [7] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014. [8] Yu, Lantao, et al. "Seqgan: Sequence generative adversarial nets with policy gradient." Thirty-First AAAI Conference on Artificial Intelligence. 2017.
  • 35. References [9] Shrivastava, Ashish, et al. "Learning from simulated and unsupervised images through adversarial training." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. [10] Li, Jiwei, et al. "Adversarial learning for neural dialogue generation." arXiv preprint arXiv:1701.06547 (2017). [11] Lin, Kevin, et al. "Adversarial ranking for language generation." Advances in Neural Information Processing Systems. 2017. [12] Guo, Jiaxian, et al. "Long text generation via adversarial training with leaked information." Thirty-Second AAAI Conference on Artificial Intelligence. 2018. [13] Yang, Zichao, et al. "Unsupervised text style transfer using language models as discriminators." Advances in Neural Information Processing Systems. 2018. [14] Z. Shi, X. Chen, X. Qiu, and X. Huang, “Towards diverse text generation with inverse reinforcement learning,” arXiv preprint arXiv:1804.11258, 2018. [15] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pages 311–318. ACL.
  • 36. Acknowledgement Committee members: Dr. Amit Sheth (Advisor) Dr. Valerie Shalin Dr. Keke Chen Dr. Shreyansh Bhatt Dr. Krishnaprasad Thirunarayan co-author:
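
The cleaning step on slide 21 (removing URLs, mentions, hashtags, reserved words, and emoji, then lowercasing) can be approximated with standard-library regular expressions. This is a minimal sketch and not the Tweet-Preprocessor tool named on the slide: the function name `clean_tweet` and the exact patterns are assumptions made for illustration, and stripping punctuation is inferred only from the cleaned examples shown on the slide.

```python
import re

# Hypothetical patterns for illustration; the tool named on the slide may differ.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+:?")           # "@user" and the "@user:" form after RT
HASHTAG_RE = re.compile(r"#\w+")
RESERVED_RE = re.compile(r"\b(?:RT|FAV)\b")  # reserved words, matched before lowercasing
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")     # drops emoji and punctuation after lowercasing


def clean_tweet(text):
    """Return a lowercased tweet without URLs, mentions, hashtags, reserved words, or emoji."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, RESERVED_RE):
        text = pattern.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text.lower())
    return " ".join(text.split())            # collapse the leftover whitespace


print(clean_tweet("BLOODY HELL YES I MISS HIM 😭😭 https://t.co/k2v8u3bPDq"))
# -> bloody hell yes i miss him
```

Because apostrophes are treated as punctuation here, a contraction such as "I'm" would be split into two tokens; a real pipeline might prefer to handle contractions before this step.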

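The BLEU example on slide 29 (candidate "the the cat", reference "the cat is on the mat") can be worked through in code. The sketch below computes only the clipped (modified) n-gram precision that BLEU [15] is built on; full BLEU additionally combines several n-gram orders and applies a brevity penalty. The function names `ngrams` and `modified_precision` are chosen here for illustration and do not come from the slides.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram count is capped by
    the largest count of that n-gram in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "the the cat".split()
references = ["the cat is on the mat".split()]
print(modified_precision(candidate, references, 1))  # 1.0: "the" occurs twice in the reference
print(modified_precision(candidate, references, 2))  # 0.5: only "the cat" is matched
```

A high score against the training data means the generated text closely copies it, which is why the metrics slide lists lower BLEU against the training set as the desired direction for train-diversity.
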
Editor's Notes

  1. According to a report by the Pew Research Center, 7% of American adults used social networking sites in 2005; by 2017 that share had grown to 76% [1]. Roughly four in ten Americans have personally experienced online harassment, and 62% consider it a major problem [1]. Young people in particular use social networking sites as a tool to maintain their relationships with friends and classmates. The unfiltered and sometimes anonymous exchange of content on social media increases the likelihood of online harassment.
  2. Hence, automatic online harassment detection is vital for online social networks.
  3. Machine learning plays a crucial role in building an automatic online harassment detection model. Well-balanced training data supports building a good machine learning model for classification. In our examination of the abusive and hate speech tweets data set, we found that the proportion of harassment-labeled tweets to non-harassment-labeled tweets is small. This dataset contains about 43K tweets. Among those, about 3.5K tweets (8.1%) are labeled as harassment and about 39.5K tweets (91.9%) are labeled as non-harassment. This scarcity of positive labeled data in the training set biases deep learning methods towards the majority class, which causes a high misclassification rate for the minority class. This is one of the most challenging aspects of developing an automatic online harassment detection system.
  4. These are some of the ways to address the scarcity of positive labeled data. The obvious way is to collect more positive labeled data. Another way is active learning, which involves a human in the loop: we first train our model on the available labeled data, then classify unlabeled data using this trained model, and then have humans verify the labels that were assigned with low confidence. A third way is data augmentation by generating self-diverse and train-diverse synthetic samples of the minority class. In this research we explore whether we can create augmented text data by generating synthetic samples to solve the positive labeled data scarcity problem. Before we get into the details of text generation, I want to briefly restate our problem and the method we chose to address it. Our problem is the scarcity of positive labeled data in the training set, which leads to a poor classification model. To address this problem, we chose self-diverse and train-diverse text generation.
  5. Text generation: Text generation is one of the most attractive problems in the NLP community. A text generation model is a language model that learns the conditional probability of the next word given the preceding words in a sequence. Deep learning methods have become the state-of-the-art techniques for text generation.
  6. The following are the state-of-the-art techniques for text generation. Recurrent neural networks (RNNs): the objective of the RNN is to maximize the likelihood of the true token in the training sequence given the previously observed tokens. RNNs have a long-term dependency learning problem. For example: I grew up in “France”,.. I speak “French” fluently. This kind of dependency in sequence data is called a long-term dependency because the distance between the relevant information, “France”, and the point where it is needed to make a prediction, “French”, is very wide. RNNs with LSTM solve this problem using a memory cell, but this model suffers from exposure bias. During training the model predicts the next token given the true previous tokens, whereas during generation it predicts the next token given its own previously generated tokens, which it has not seen before. So, errors accumulate along the generation process. This is called exposure bias. The generative adversarial net (GAN) proposed by Goodfellow et al. (2014) is a promising framework for alleviating this problem.
  7. This framework has two modules, a generator and a discriminator. The objective of the discriminator is to learn to distinguish whether a given data instance is real or not, and the generator learns to confuse the discriminator by generating high-quality data. In Sequence GANs, we treat the discriminator as a rewarder. The objective of the generator is to get more rewards by generating meaningful sentences. The objective of the rewarder is to give high rewards to training data and low rewards to generated data. In Rank GANs, we treat the discriminator as a ranker. The objective of the generator is to get a high rank by generating meaningful sentences. The objective of the ranker is to give a high rank to training data and a low rank to generated data. In spite of their success, SeqGAN and Rank GAN still face two challenges.
  8. The first problem is sparse rewards and the second problem is mode collapse. Sparse rewards: the learning process becomes inefficient when a single reward is given to the entire sequence. For example: I’m not sexist but it is female face I’m face not sexist. In this generated sequence, the part “I’m not sexist but it is female face” has the potential to become a meaningful sentence, but the entire sequence is not meaningful. By giving a single reward to the entire sequence, which in this case would be low, we discourage the generator from learning the part of the sequence that has the potential to become a meaningful sentence. Mode collapse: because of mode collapse, the generator generates a limited diversity of samples, or even the same sample, regardless of the input. The latest work, inverse reinforcement learning based GANs, tackles the above two problems. We consider this as our baseline model.
  9. By giving rewards at the token level, this method addresses the sparse reward problem. By maximizing entropy at the generator, the generator produces diverse text, which prevents the mode collapse problem. Here, the rewarder objective is similar to that of the previous GANs: it aims to give high rewards to training data and low rewards to generated data.
  10. The objective of the generator is to maximize the rewards of the generated sentences plus entropy. The problem with this objective is that, after a certain number of iterations, the generated sentences start to lose their meaning because entropy is being maximized. To solve this problem, we incorporated a term that uses domain-specific knowledge to preserve the meaning.
  11. The problem with this objective is that, after a certain number of iterations, the generated sentences start to lose their meaning because entropy is being maximized. In principle, increasing entropy and rewards can give you a better text generator. However, you need more data to do this. If you do not have good enough data, it becomes hard for the generator to learn how to balance reward maximization and exploration. Hence, it can produce less meaningful text if the generator puts more weight on the entropy, or it can produce near-identical text if you put less weight on the exploration. Hence, we need an external data source or knowledge source to preserve meaning so that, even after putting more weight on the exploration part (entropy), we don't compromise on the meaning of the text. In this dissertation, we extend the optimization to incorporate a term that preserves the meaning so as to generate diverse and meaningful text.
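
The two generator objectives described in notes 10 and 11 can be written out roughly as follows. This is a hedged reconstruction from the verbal description only, since the slide formulas did not survive in the transcript. The symbols for the generated sentence τ, the generator q_θ, and the rewarder R come from slides 12 and 13; the names J_IRL and J_DSK, the entropy symbol H, the similarity function sim_cos, the training-set symbol D_train, and the unweighted sum of the three terms are assumptions made here for illustration.

```latex
% Baseline generator objective (IRL-based GAN [14]):
% expected reward (meaning) plus entropy (diversity).
J_{\mathrm{IRL}}(\theta)
  = \mathbb{E}_{\tau \sim q_\theta(\tau)}\big[ R_\phi(\tau) \big]
  + H\big(q_\theta(\tau)\big),
\qquad
H\big(q_\theta(\tau)\big) = -\,\mathbb{E}_{\tau \sim q_\theta(\tau)}\big[ \log q_\theta(\tau) \big]

% Extended objective: adds a sentence-level cosine similarity term between
% generated sentences and the training data (assumed form and weighting).
J_{\mathrm{DSK}}(\theta)
  = \mathbb{E}_{\tau \sim q_\theta(\tau)}\big[ R_\phi(\tau) \big]
  + \mathbb{E}_{\tau \sim q_\theta(\tau)}\big[ \mathrm{sim}_{\cos}(\tau, \mathcal{D}_{\mathrm{train}}) \big]
  + H\big(q_\theta(\tau)\big)
```

Read this way, the similarity term acts as a counterweight to the entropy term: a sample that drifts far from the training sentences in embedding space lowers the objective even if it raises diversity.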