Studied and implemented an algorithm, proposed in a research paper, to discover spam accounts and spam tweets on Twitter.
Technologies used:
1) Java
2) Twitter API
3) MySQL database
3. Introduction
▸ Emerging Social Networks.
▹ Popularity of Facebook and Twitter
▹ 1550 Million active FB users.
▹ 320 Million active Twitter Users.
▹ Global Reach
▹ Multi-platform
▸ Social Network’s Revenue Model
▹ Advertising
▹ 85% of Twitter’s revenue comes from advertising
4. Motivation
▸ The Problem
▹ Social networks like Twitter provide a legal way of publicizing content.
▹ Some companies resort to illegal methods such as spam accounts.
▹ Huge Revenue Loss to Twitter
10,000,00 $/yr: millions of dollars per year. That’s a lot of money!
5. Motivation
▸ Existing Solution
▹ Twitter’s spam detection algorithm
focuses on criteria such as:
▹ harmful links
▹ aggressive following behavior
▹ posting to trending topics
▹ posting duplicated tweets
▹ low profile activity
▸ Drawbacks
▹ Spammers have evolved.
▹ Twitter’s existing algorithm no longer detects spam from these evolved spammers.
6. Proposed Algorithm
▸ Emphasis on interactions between accounts rather than on individual accounts.
▸ Finding patterns among existing spam tweets.
▸ Detecting spam accounts based on their tweets, and spam tweets based on their accounts.
8. Proposed Algorithm
Stage 1: Identify Tweets with Malicious Links.
1. Collect tweets and user info.
2. Follow the links in each tweet.
3. Check whether the link is flagged by Twitter or by a URL-shortening service (goo.gl or bit.ly).
4. If it is flagged, mark the tweet as spam; otherwise leave it unlabelled.
Leverage Twitter’s database of malicious links.
9. Proposed Algorithm
Stage 2: Mining Spam Patterns.
1. Strip off all URLs, @user mentions and #hashtags.
2. Strip off all digits (0-9) and special characters such as *, !, @, #.
3. Create a hash of each stripped tweet.
4. Compare the hash with the hashes of other tweets.
Find Pattern
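The Stage 2 steps above can be sketched in Java as follows. This is a minimal illustration, not the project's actual code: the class name `TweetFingerprint`, the specific regular expressions, and the lowercasing and whitespace collapsing are assumptions added here so that tweets differing only in links, mentions, hashtags, digits or punctuation produce the same hash.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper: normalizes a tweet and hashes the result.
final class TweetFingerprint {

    // Steps 1-2: strip URLs, @mentions and #hashtags, then digits and special characters.
    static String normalize(String tweet) {
        String text = tweet.replaceAll("https?://\\S+", " "); // URLs (must go before punctuation)
        text = text.replaceAll("[@#]\\w+", " ");               // @user mentions and #hashtags
        text = text.replaceAll("[^A-Za-z\\s]", " ");           // digits and special characters
        // Assumption: collapse whitespace and ignore case so spacing/casing does not change the hash.
        return text.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // Step 3: MD5 hash of the stripped text, as a 32-character hex string.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    static String fingerprint(String tweet) {
        return md5Hex(normalize(tweet));
    }
}
```

Two tweets that share a template but carry different links and mentions get the same fingerprint, which is what step 4 compares.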
10. Proposed Algorithm
Stage 3: Spam Likelihood Estimation.
1. Iterate through users and assign each a spam score based on the user’s tweets.
2. Iterate through tweets and assign each a spam score based on the user who posted it.
Calculate Spam Score
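The slides do not give the scoring formula, so the following Java sketch is only one plausible reading of the two steps above. It assumes a user's score is the fraction of that user's tweets already labelled spam by Stages 1 and 2, and that an unlabelled tweet inherits its author's score. The names `SpamScorer` and `Tweet` are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical scorer; the formulas here are assumptions, not taken from the paper.
final class SpamScorer {

    // One tweet: its id, its author, and whether Stage 1 or 2 labelled it spam.
    static final class Tweet {
        final String id;
        final String userId;
        final boolean labelledSpam;

        Tweet(String id, String userId, boolean labelledSpam) {
            this.id = id;
            this.userId = userId;
            this.labelledSpam = labelledSpam;
        }
    }

    // Step 1: score each user by the fraction of their tweets labelled spam.
    static Map<String, Double> userScores(List<Tweet> tweets) {
        Map<String, int[]> counts = new HashMap<>(); // userId -> {spam count, total count}
        for (Tweet t : tweets) {
            int[] c = counts.computeIfAbsent(t.userId, k -> new int[2]);
            if (t.labelledSpam) {
                c[0]++;
            }
            c[1]++;
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            scores.put(e.getKey(), (double) e.getValue()[0] / e.getValue()[1]);
        }
        return scores;
    }

    // Step 2: labelled tweets score 1.0; unlabelled tweets inherit their author's score.
    static Map<String, Double> tweetScores(List<Tweet> tweets, Map<String, Double> userScores) {
        Map<String, Double> scores = new HashMap<>();
        for (Tweet t : tweets) {
            scores.put(t.id, t.labelledSpam ? 1.0 : userScores.getOrDefault(t.userId, 0.0));
        }
        return scores;
    }
}
```

Under these assumptions, an unlabelled tweet from an account whose other tweets are mostly spam receives a high score, which is how tweets missed by the first two stages can still surface.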
16. Implementation Details
▸ Stage 1 Implementation
▹ jsoup (Java HTML parser) used to fetch each link and read the warning page shown by the shortener:
● t.co - "Warning: this link may be unsafe"
● goo.gl - "The site ahead contains malware"
● bit.ly - "STOP - there might be a problem with the requested link"
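The check on the fetched page can be sketched as a simple phrase match. This Java example assumes the page text has already been retrieved (for instance with jsoup, which is omitted here to keep the sketch dependency-free) and looks for the three warning phrases listed above. The class name `ShortenerWarningCheck` is hypothetical, and the exact wording of the real warning pages may differ or change over time.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical checker: matches page text against the warning phrases from the slide.
final class ShortenerWarningCheck {

    // Shortener -> lowercase warning phrase (phrases taken from the slide, not verified live).
    private static final Map<String, String> WARNINGS = new LinkedHashMap<>();
    static {
        WARNINGS.put("t.co", "this link may be unsafe");
        WARNINGS.put("goo.gl", "the site ahead contains malware");
        WARNINGS.put("bit.ly", "there might be a problem with the requested link");
    }

    // Returns the shortener whose warning appears in the page text, or null if none does.
    static String flaggedBy(String pageText) {
        // Lowercase and collapse whitespace so line breaks in the page do not hide a match.
        String text = pageText.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        for (Map.Entry<String, String> e : WARNINGS.entrySet()) {
            if (text.contains(e.getValue())) {
                return e.getKey();
            }
        }
        return null;
    }

    static boolean isFlagged(String pageText) {
        return flaggedBy(pageText) != null;
    }
}
```

A tweet whose link leads to a page for which `isFlagged` returns true would be marked as spam in Stage 1.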
18. Implementation Details
▸ Stage 2 Implementation
▹ Regular expressions to strip off #hashtags, @user mentions, URLs, special characters and numbers.
▹ Used the MD5 algorithm to generate a hash of each stripped tweet.
▹ Tweets with the same hash value were marked as spam.
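Marking tweets that share a hash amounts to grouping tweet ids by hash value and keeping the groups with more than one member. The Java sketch below assumes the hashes have already been computed and are passed in as a map from tweet id to hash; the class name `DuplicateHashMarker` and this input shape are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: finds tweets whose hash is shared with at least one other tweet.
final class DuplicateHashMarker {

    static Set<String> markSpam(Map<String, String> hashByTweetId) {
        // Group tweet ids by their hash value.
        Map<String, List<String>> idsByHash = new HashMap<>();
        for (Map.Entry<String, String> e : hashByTweetId.entrySet()) {
            idsByHash.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        // Every tweet in a group of two or more is marked as spam.
        Set<String> spam = new HashSet<>();
        for (List<String> ids : idsByHash.values()) {
            if (ids.size() > 1) {
                spam.addAll(ids);
            }
        }
        return spam;
    }
}
```

One edge case worth handling in practice: tweets that consist only of links, mentions or hashtags all strip down to empty text and therefore share one hash, so a real implementation may want to treat the empty case separately.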
20. Implementation Details
▸ Stage 3 Stats
▹ Spam tweets that the first two stages had not labelled were discovered.
▹ Users who tweet more spam were assigned a higher spam score.
▹ Tweets posted by such accounts were also assigned a higher spam score.
21. Drawbacks
▸ Not effective at detecting human-controlled spam accounts.
Advantages
▸ Detects bot-controlled spam accounts.
▸ Easily detects spam campaigns.
▸ Spam tweets with different user mentions and links are also detected.
▸ Excessive retweets to unrelated topics are also treated as spam.
22. Conclusion and Future Work
▸ The cross-account pattern matching method is highly effective.
▸ Older methods that examine accounts individually no longer work well.
▸ For future work
▹ Clustering tweets to understand which topics spammers use the most
▹ Providing a real-time spam discovery solution by implementing machine learning.