Deep learning based Domain-specific text generation for online harassment detection
1. Deep learning based Domain-specific text
generation for online harassment detection
Master’s Thesis Defense
Abhishek Nalamothu
Kno.e.sis Center
Department of Computer Science and
Engineering
Wright State University, Dayton, Ohio
June 25th, 2019
Advisor:
Dr. Amit P. Sheth
Committee Members:
Dr. KeKe Chen,
Dr. Valerie Shalin
Mentor:
Dr. Shreyansh Bhatt
2. Outline
• Online harassment
• Data related problems to detect harassment
• Ways to solve
- Text generation
• Problems with state of the art text generation models
• Solution
• Evaluation and results
• Conclusion and future work
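3. Online Harassment [1]
American adults using social networking sites: 7% (2005) → 76% (2017)
4-in-10 Americans have personally experienced online harassment; 62% consider it a major problem
Young folks use social networking sites as a tool
Unfiltered, Anonymous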
5. In our examination…
1. Abusive and hate speech tweets [2]
43 K tweets
3.5 K labeled as harassment (8.1%)
39.5 K labeled as Normal (91.9%)
2. Scarcity of positive labeled data
Harassment detection
Machine learning
Well balanced training data is important
6. Ways to address …
1. Collecting more positive labeled data.
2. Active learning
3. Data augmentation by generating self-diverse and different synthetic samples of the minority class
Text Augmentation | Text Generation | Data Balance
7. Text Generation
Text generation is one of the most attractive problems in the NLP community
Text generation = modeling the conditional probability of the next word in a sequence:
$P(w_n \mid w_1, w_2, \ldots, w_{n-1})$
Example: He, is, not, that, retard
Deep learning models have become the state of the art
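The conditional probability above can be illustrated with a much simpler model than the deep networks discussed in this talk. The following is a minimal sketch, assuming a bigram simplification in which the next word depends only on the single previous word rather than on the full history; the function name `train_bigram_model` and the toy corpus are hypothetical and not part of the thesis.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Estimate P(next | previous) from tokenized sentences by counting."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        # Normalize the counts into a probability distribution over next words.
        model[prev] = {word: c / total for word, c in nxt_counts.items()}
    return model


# Toy corpus (hypothetical, for illustration only).
corpus = [
    ["i", "hate", "mondays"],
    ["i", "hate", "traffic"],
    ["i", "love", "coffee"],
]
model = train_bigram_model(corpus)
print(model["i"])  # distribution over the words that follow "i"
```

A neural language model replaces the count table with a learned function of the whole prefix, but the quantity being estimated is the same.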
8. State-of-the-art Text Generation Models
Recurrent neural networks (RNNs)[3,4]
problem: Long term dependency learning
Example : I grew up in “France”,…….. I speak “French” fluently.
RNNs with LSTM [5,6,7]
Uses memory cell
Problem: Exposure Bias
The objective is to maximize the
likelihood of true token in the training
sequence
Auto Regressive Models
9. Generative Adversarial Nets
Generator : Learns to confuse
Discriminator by generating high quality
data
Discriminator : Learns to distinguish
whether a given data instance is real or not
State-of-the-art Text Generation Models
Sequence GANs [8,9,10]
Rewarder : High rewards to training data
Low rewards to generated data
Generator : Aims to get high rewards
Rank GANs [11,12,13]
Ranker : High rank to training data
Low rank to generated data
Generator : Aims to get high rank
10. SeqGAN & Rank GAN problems
Sparse Rewards
Example: I’m not sexist but it is female face I’m face not sexist
“I’m not sexist but it is female face”: potential to generate meaningful sentence
“I’m face not sexist”: not meaningful
Doesn’t help the model to learn until meaningful part of the sentence
Mode Collapse
Generator generates a limited diversity of samples, or even the same sample, regardless of the input.
11. Inverse Reinforcement learning
Based GANs [14]
Solution To Reward Sparsity : Rewards at token level
Solution to mode collapse : Entropy
Rewarder objective (similar to previous implementations)
High rewards to training data
Low rewards to generated data
12. Inverse Reinforcement Learning Based GANs: Generator Objective
The objective of the text generator is to maximize the expected reward plus an entropy term.
Where,
τ : generated sentence
q_θ(τ) : generator
R_φ(τ) : rewarder
Maximize the expected rewards of the generated texts → Meaning
Maximize entropy → Diversity
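The formula itself did not survive on this slide. Read literally, the description (expected reward of generated sentences plus an entropy term, with τ the generated sentence, q_θ the generator and R_φ the rewarder) corresponds to an objective of the following form. This is a reconstruction from the slide text only; the name $\mathcal{L}(\theta)$ and the entropy notation $H$ are assumptions.

```latex
\mathcal{L}(\theta) \;=\; \mathbb{E}_{\tau \sim q_\theta(\tau)}\bigl[ R_\phi(\tau) \bigr] \;+\; H\bigl( q_\theta(\tau) \bigr)
```

The first term pushes the generator towards sentences the rewarder scores highly (meaning), and the second term pushes it towards a spread-out distribution over sentences (diversity).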
13. Inverse Reinforcement Learning Based GANs: Generator Objective problem
• After a certain number of iterations, the generated sentences start to lose their meaning
Where,
τ : generated sentence
q_θ(τ) : generator
R_φ(τ) : rewarder
Maximize the expected rewards of the generated texts → Meaning
Maximize entropy → Diversity
15. IRL With Domain-Specific Knowledge
Incorporated a term which uses domain-specific knowledge to preserve the meaning.
• The objective of the text generator is to maximize the following:
The expected reward (maximize the expected rewards of the generated texts → Meaning)
The sentence-level cosine similarity between train data and generated data (uses domain-specific knowledge to preserve meaning)
An entropy (maximize entropy → Diversity)
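The sentence-level cosine similarity term can be sketched as follows. The slides do not say how a sentence vector is built from word embeddings, so averaging the word vectors is an assumption made here for illustration; the function names and the two-dimensional toy embeddings are hypothetical.

```python
import math


def sentence_vector(tokens, embeddings):
    """Average the word vectors of the tokens that have an embedding (assumed pooling)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical 2-d embeddings; real skip-gram vectors would have many more dimensions.
toy_embeddings = {
    "ugly": [1.0, 0.0],
    "disgusting": [0.9, 0.1],
    "coffee": [0.0, 1.0],
}
train_vec = sentence_vector(["ugly"], toy_embeddings)
close_vec = sentence_vector(["disgusting"], toy_embeddings)
far_vec = sentence_vector(["coffee"], toy_embeddings)

print(cosine_similarity(train_vec, close_vec))  # near 1: meaning preserved
print(cosine_similarity(train_vec, far_vec))    # 0: unrelated in this toy space
```

A generated sentence that swaps a word for a close neighbour in the embedding space keeps a high similarity to the training sentence, which is the behaviour the added term is meant to reward.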
20. Dataset - 43000 tweets
1. RT @Rambobiggs: Holy hell these people are disgusting https://t.co/or6RE4DRPd
2. can someone sum this up before i call this guy retarded https://t.co/yuQVEUc
3. RT @ashllyd: SICK OF BITCHES ON THE INTERNET 🐍🙅👉https://t.co/BkyqCFx64G
@UKBloggers1 @FemaleBloggerRT @TheGirlGangHQ #fbloggers #fblchat
4. RT @Stafaa__: I hate them hoe ass braids 😂🙌🏾 https://t.co/fr5gMyp4rJ
5. RT @ayevonnn: bruh i fucking hate people like this 😤 https://t.co/dceEXQhnhq
6. @SenWarren @SenJeffMerkley Yes because u idiots never run out of nonsense…
7. RT @notwaving: I hate it when I'm trying to board a bus and there's already an asshole
on it. https://t.co/Qps29bAaoA
Harassment (Positive Labeled) Tweets: 3500
Training Data: 3150 (90%)
Test Data: 350 (10%)
1. @discordapp bro i cant fucking wait for the Video Screensharing!!! 😭😂😭😂
2. Forgetting to pack a sports bra for the gym is the fucking worst 👿
3. @zezrie @Jezebel It's like it's a fucking crime to have a vagina in this country now! 😭😡👎🏾👎🏾
4. the lyrics is perfect, the melody is so fucking sexy and the guitar OMFG congrats @ShawnMendes
#JFCShawnMendes
5. RT @IIXXIV_: My face be oily as hell when I wake up n I hate it.
6. BLOODY HELL YES I MISS HIM 😭😭 https://t.co/k2v8u3bPDq
7. Sprint is killing its best 50 percent off deal https://t.co/8GvX6DHF5n https://t.co/wWcCgx1fBe ……
https://t.co/YnihEdvIOO https://t.co/Yry1YK0P0d
Normal (Negative Labeled) Tweets: 39600
21. Tweet-Preprocessor
1. RT @Rambobiggs: Holy hell these people are disgusting https://t.co/or6RE4DRPd
2. can someone sum this up before i call this guy retarded https://t.co/yuQVEUc
3. RT @ashllyd: SICK OF BITCHES ON THE INTERNET 🐍🙅👉https://t.co/BkyqCFx64G @UKBloggers1
@FemaleBloggerRT @TheGirlGangHQ #fbloggers #fblchat
4. @discordapp bro i cant fucking wait for the Video Screensharing!!! 😭😂😭😂
5. BLOODY HELL YES I MISS HIM 😭😭 https://t.co/k2v8u3bPDq
URL, Mention, Hashtag, Reserved Words, Emoji
1. holy hell these people are disgusting
2. can someone sum this up before i call this guy retarded
3. sick of bitches on the internet
4. bro i cant fucking wait for the video screensharing
5. bloody hell yes i miss him
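The cleaning step shown on this slide can be approximated with standard-library regular expressions. This is a rough sketch and not the Tweet-Preprocessor package used in the thesis: the patterns, the function name `clean_tweet`, and the choice to lowercase and drop all remaining punctuation are assumptions inferred from the before and after examples above.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RESERVED_RE = re.compile(r"\b(?:RT|FAV)\b")   # Twitter reserved words
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")     # emoji, punctuation, other symbols


def clean_tweet(text):
    """Strip URLs, mentions, hashtags, reserved words and emoji, then lowercase."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = RESERVED_RE.sub(" ", text)
    text = text.lower()
    text = NON_ALNUM_RE.sub(" ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(text.split())


raw = "RT @Rambobiggs: Holy hell these people are disgusting https://t.co/or6RE4DRPd"
print(clean_tweet(raw))
```

One side effect of this simplification is that apostrophes are replaced by spaces, so a word such as "I'm" becomes two tokens; a real pipeline may want to handle contractions separately.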
22. Domain-Specific Data
Abusive and hate speech tweets [2] as domain-specific data
Size: 43 K tweets
Domain-specific data examples:
ugh crippling self doubt is the worst
the number 1 angel shows are gonna be fucking wiiiiild
bro i cant fucking wait for the video screensharing
Andy carroll at the vodafone stadium tonight n cannot fucking wait
yesssssssssss screams oh my bloody hell that would be my two fav youtuber collabing
Domain-specific data (Tweets) → Skip-gram Model → Word embedding
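A skip-gram model learns word embeddings by predicting the words that surround each word. The sketch below shows only the first step, building the (center, context) training pairs from a cleaned tweet; the function name `skipgram_pairs` and the window size are illustrative assumptions, and the training of the embedding itself is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every context word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["bloody", "hell", "yes", "i", "miss", "him"]
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Words that appear in similar contexts across the 43 K tweets produce similar training pairs, which is why their learned vectors end up close together.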
23. Tweets generation
Inverse Reinforcement
Learning based GANs
with Domain-Specific
Knowledge
(Our Model)
Harassment Tweets
(Positive Labeled)
Training data : 3150
Vocabulary : ~ 6200
Domain-Specific data
(closely related) 43000
Embeddings
Generated
Tweets
600
(~ 20% of
training data)
24. Generated Tweets
@ user trump is one ugly hoe
@ user sick i hate the metro
@ user fuck yes i would have slapped the fuck out her
@ user islamic state says us being run by an idiot is getting disgusting
@ user he was just a puppet idiot on a string
@ user yo momma had them ugly ass heels not me
@ user this idiot called it devoutness instead of wormers
@ user islamic state says us being run by an idiot is fucked up
@ user pissed off because im sick and i cannot sleep because of it
@ user you had this worst fucking part and chest like signboard
25. Diversity
Tweet from Train data: akademiks is fucking ugly lmfaooooo
Generated tweet: akademiks is fucking disgusting
Tweets from Train data:
fuck sake do not let this slip you retard
i hate when a manly looking bitch say she want a nigga like shorty you is a nigga
do not you hate when a bitch say you cannot fight like come here bitch lemme see where yoo hands
i hate bitches that do not know how to mind they business
i hate bitches that say this
i hate bitches that always wanna argue
i hate bitches that wear heels but cannot walk in em
Generated tweet: let this bitch say i hate bitches
Example 1
Example 2
26. Diversity
Tweets from Train data:
i can not wait to be done with these disgusting bitches
i be on my ugly ass niggah line
Generated tweet: done with these ugly ass niggah
Tweets from Train data:
two beautiful and sexy best woman hot body sensual hot scene ********* ******* ********
bad bitches do not take days off
some ugly bitches on tonights comedine with me show
yessss people underestimate this they dont wanna gag and shit but do it slurp on that dick
bitch do your fucking homework
you are out of your fucking mind your disgusting lying reckless fucking mind
roman beating undertaker is one thing but that was his fucking retirement match
Generated tweet: two beautiful bitches and shit in your fucking retirement world
Example 3
Example 4
29. BiLingual Evaluation Understudy (BLEU)
• BLEU [15] was originally a metric for evaluating the quality of machine-translated text.
• Here, BLEU is used to compute the quality of the generated set.
Example:
Candidate: the the cat
Reference 1: the cat is on the mat
The higher the BLEU score, the better the quality of the generator.
Problem:
The objective of our model is to generate data which is diverse from the train data set.
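The calculation for the candidate and reference on this slide can be worked through in code. The sketch below computes only the clipped (modified) n-gram precision at the core of BLEU [15]; the full metric also combines several n-gram orders with a geometric mean and applies a brevity penalty, which are left out here. The function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as often
    as it appears in the reference that contains it most often."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "the the cat".split()
references = ["the cat is on the mat".split()]
p1 = modified_precision(candidate, references, 1)
p2 = modified_precision(candidate, references, 2)
print(p1, p2)  # 1.0 0.5
```

All three candidate unigrams are covered by the reference ("the" appears twice there), so the unigram precision is 3/3. Of the two candidate bigrams, only "the cat" occurs in the reference, giving 1/2.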
30. Evaluation metrics
Evaluation metric | Measures | Reference set | Evaluated set | Desired direction
BLEUdiv | Train-diversity | Train data | Generated data | Lower
Self-BLEU | Self-diversity | Generated sentence | Remaining generated sentences | Lower
Normalized perplexity | Quality | Domain specific data | Generated data | Higher
Cosine Similarity | Quality | Train data | Generated data | Higher
34. References
[1] Perrin, Andrew. "Social media usage: 2005–2015. 2015." URL http://www.pewinternet.org/2015/10/08/social-networking-usage-2005-2015 (2017).
[2] Founta, Antigoni Maria, et al. "Large scale crowdsourcing and characterization of twitter abusive
behavior." Twelfth International AAAI Conference on Web and Social Media. 2018.
[3] Mikolov, Tomáš, et al. "Recurrent neural network based language model." Eleventh annual conference of the
international speech communication association. 2010.
[4] Mikolov, Tomáš, et al. "Extensions of recurrent neural network language model." 2011 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011.
[5] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997):
1735-1780.
[6] Graves, Alex. "Generating sequences with recurrent neural networks." arXiv preprint arXiv:1308.0850
(2013).
[7] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks."
Advances in neural information processing systems. 2014.
[8] Yu, Lantao, et al. "Seqgan: Sequence generative adversarial nets with policy gradient." Thirty-First AAAI
Conference on Artificial Intelligence. 2017.
35. References
[9] Shrivastava, Ashish, et al. "Learning from simulated and unsupervised images through adversarial
training." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[10] Li, Jiwei, et al. "Adversarial learning for neural dialogue generation." arXiv preprint
arXiv:1701.06547 (2017).
[11] Lin, Kevin, et al. "Adversarial ranking for language generation." Advances in Neural Information
Processing Systems. 2017.
[12] Guo, Jiaxian, et al. "Long text generation via adversarial training with leaked
information." Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
[13] Yang, Zichao, et al. "Unsupervised text style transfer using language models as
discriminators." Advances in Neural Information Processing Systems. 2018.
[14] Shi, Zhan, et al. "Towards diverse text generation with inverse
reinforcement learning." arXiv preprint arXiv:1804.11258 (2018).
[15] Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for
automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA., pages 311–318.
ACL.
According to a report by the Pew Research Center, 7% of American adults used social networking sites in 2005; by 2017 that share had risen to 76% [1].
Roughly four-in-ten Americans have personally experienced online harassment, and 62% consider it a major problem [1].
Young people in particular use social networking sites as a tool to maintain their relationships with friends and classmates.
The unfiltered and sometimes anonymous exchange of content on social media increases the likelihood of online harassment.
Hence, automatic online harassment detection is vital for online social networks.
Machine learning plays a crucial role in building an automatic online harassment detection model. Well-balanced training data supports building a good machine learning model for classification.
In our examination of the abusive and hate speech tweets data set [2], we found that the proportion of harassment-labeled tweets to non-harassment-labeled tweets is small. This dataset contains about 43K tweets. Among those, about 3.5K tweets (8.1%) are labeled as harassment and about 39.5K tweets (91.9%) are labeled as non-harassment.
This scarcity of positive labeled data in the training set biases deep learning methods towards the majority class, which causes high misclassification for the minority class.
This is one of the most challenging aspects of developing an automatic online harassment detection system.
These are some of the ways to address this positive labeled data scarcity problem.
The obvious way is to collect more positive labeled data.
Another way is active learning, which involves a human in the loop. In active learning, we first train our model on the available labeled data. Then we classify unlabeled data using this trained model. Then we ask humans to verify the labels that were assigned with low confidence.
A third way is data augmentation by generating self-diverse and train-diverse synthetic samples of the minority class.
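The active-learning option above hinges on picking out the low-confidence predictions for human review. A minimal sketch of that selection step follows, assuming a binary classifier that outputs a harassment probability per tweet; the function name, the threshold, and the example scores are hypothetical.

```python
def select_for_review(predictions, threshold=0.2):
    """Return ids of items whose predicted probability lies within `threshold` of 0.5,
    i.e. the predictions the classifier is least sure about."""
    return [item_id for item_id, prob in predictions if abs(prob - 0.5) <= threshold]


# (tweet id, predicted probability of harassment); made-up values for illustration.
predictions = [("t1", 0.97), ("t2", 0.55), ("t3", 0.08), ("t4", 0.35)]
to_review = select_for_review(predictions)
print(to_review)  # ['t2', 't4']
```

The confidently classified tweets (t1 and t3) keep their model-assigned labels, while the uncertain ones go to a human annotator before being added to the training set.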
In this research we want to explore whether we can create augmented text data by generating synthetic samples to solve the positive labeled data scarcity problem.
Before we get into the details of text generation, I want to briefly restate what our problem is and what method we chose to address it.
Our problem is the scarcity of positive labeled data in the train set, which leads to a poor classification model.
To address this problem, we chose a self-diverse and train-diverse text generation method.
Text generation:
Text generation is one of the most attractive problems in the NLP community. A text generation model is a language model which learns the conditional probability of the next word given the preceding words in a sequence.
Deep learning methods have become the state-of-the-art techniques for text generation.
The following are the state-of-the-art techniques for text generation.
Recurrent neural networks (RNNs)
The objective of the RNN is to maximize the likelihood of the true token in the training sequence given the previously observed tokens.
These RNNs have a long-term dependency learning problem.
For example: I grew up in “France”, … I speak “French” fluently.
This kind of dependency in sequence data is called a long-term dependency because the distance between the relevant information, “France”, and the point where it is needed to make a prediction, “French”, is very large.
RNNs with LSTM solve this problem using a memory cell.
But this model also suffers from exposure bias.
During training, the model predicts the next token given the true previous tokens. During generation, it must instead predict the next token given its own previously generated tokens, which it has not seen during training. So errors accumulate along the generation process. This is called exposure bias.
The generative adversarial net (GAN) proposed by Goodfellow et al. (2014) is a promising framework for alleviating this problem.
This framework has two modules, a generator and a discriminator.
The objective of the discriminator is to learn to distinguish whether a given data instance is real or not, and the generator learns to confuse the discriminator by generating high-quality data.
In Sequence GANs, we treat the discriminator as a rewarder.
The objective of the generator is to get more rewards by generating meaningful sentences.
The objective of the rewarder is to give high rewards to train data and low rewards to generated data.
In Rank GANs, we treat the discriminator as a ranker.
The objective of the generator is to get a high rank by generating meaningful sentences.
The objective of the ranker is to give a high rank to train data and a low rank to generated data.
In spite of their success, SeqGAN and Rank GAN still face two challenges.
The first problem is sparse rewards and the second problem is mode collapse.
The learning process becomes inefficient when a single reward is given to the entire sequence.
For example: I’m not sexist but it is female face I’m face not sexist
In this generated sequence, the part “I’m not sexist but it is female face” has the potential to become a meaningful sentence, but the entire sequence is not meaningful. By giving a single reward to the entire sequence, which in this case would be low, we discourage the generator from learning the part of the sequence that has the potential to become a meaningful sentence.
Mode collapse
Because of mode collapse, the generator generates a limited diversity of samples, or even the same sample, regardless of the input.
The latest work, inverse reinforcement learning based GANs [14], tackles the above two problems. We consider this as our baseline model.
By giving rewards at the token level, this method addresses the sparse reward problem.
By maximizing entropy at the generator, the generator generates diverse text, which prevents the mode collapse problem.
Here, the rewarder objective is similar to that of the previous GANs: it aims to give high rewards to train data and low rewards to generated data.
And the objective of the generator is to maximize the rewards of the generated sentences plus entropy.
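To see why an entropy term encourages diversity, it helps to compare the entropy of a peaked next-token distribution with that of a spread-out one. The sketch below uses the standard Shannon entropy of a categorical distribution; the two example distributions are made up for illustration and are not outputs of the thesis model.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# A generator that almost always picks the same token (mode-collapse-like behaviour).
peaked = [0.97, 0.01, 0.01, 0.01]
# A generator that spreads probability evenly over four tokens.
uniform = [0.25, 0.25, 0.25, 0.25]

print(entropy(peaked))
print(entropy(uniform))
```

The uniform distribution has the larger entropy, so adding entropy to the objective rewards the generator for keeping several continuations plausible instead of collapsing onto one.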
But the problem with this objective is that, after a certain number of iterations, the generated sentences start to lose their meaning because of the maximized entropy.
To solve this problem, we incorporated a term which uses domain-specific knowledge to preserve the meaning.
The intuition is that increasing entropy and rewards can give you a better text generator. However, you need more data to do this. If you do not have enough good data, it becomes hard for the generator to learn how to balance reward maximization and exploration. Hence, it can produce less meaningful text if the generator puts more weight on the entropy, or it can produce near-duplicate text if less weight is put on exploration. Hence, we need an external data source or knowledge source to preserve meaning, so that even after putting more weight on the exploration part (entropy), we don't compromise on the meaning of the text.
In this dissertation, we extend the optimization to incorporate a term that preserves the meaning so as to generate diverse and meaningful text.