Enhancing Twitter spam discovery using cross account pattern matching.

ENHANCING TWITTER
SPAM DETECTION USING
CROSS ACCOUNT
PATTERN MATCHING.
By Ambarish Pande

Contents
▸ Introduction
▸ Motivation
▸ Proposed Algorithm
▸ Implementation Details
▸ Advantages and Drawbacks
▸ Conclusion and Future work

Introduction
▸ Emerging Social Networks.
▹ Popularity of Facebook and Twitter
▹ 1,550 million active Facebook users.
▹ 320 million active Twitter users.
▹ Global Reach
▹ Multi-platform
▸ Social Network’s Revenue Model
▹ Advertising
▹ 85% of Twitter’s Revenue comes
from advertising

Motivation
▸ The Problem
▹ Social networks like Twitter provide
a legal way of publicizing content.
▹ Some companies go for illegal
methods like Spam Accounts.
▹ Huge revenue loss to Twitter:
10,000,00 $/yr. Millions of dollars per year. That’s a lot of money!

Motivation
▸ Existing Solution
▹ Twitter’s spam detection algorithm
focuses on criteria such as:
▹ harmful links
▹ aggressive following behavior
▹ posting to trending topics
▹ posting duplicated tweets
▹ low profile activity
▸ Drawbacks
▹ Spammers have evolved.
▹ Twitter’s existing algorithm no longer
catches much of their spam.

Proposed Algorithm
▸ Emphasis on interaction between
accounts and not on individual
accounts.
▸ Finding patterns shared with existing spam
tweets.
▸ Detecting spam accounts based on
tweets and spam tweets based on
accounts.

Flow chart to detect spam
1. Identify Tweets with Malicious Links
2. Mining Spam Patterns
3. Spam Likelihood Estimation

Proposed Algorithm
Stage 1: Identify Tweets with Malicious Links.
1. Collect tweets and user info.
2. Follow the links in each tweet.
3. Check whether the link is flagged by Twitter or any
other URL shortening service (goo.gl or
bit.ly).
4. If yes, mark the tweet as spam; otherwise leave it unmarked.
Leverage Twitter’s database of malicious
links.

Proposed Algorithm
Stage 2: Mining Spam Patterns.
1. Strip off all URLS, @user mentions and
#hashtags.
2. Strip off all digits (0-9) and special
characters such as *, !, @, #.
3. Create a hash for each stripped off tweet.
4. Compare the hash with hashes of other
tweets.
Find Pattern
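The four steps above can be sketched in a few lines. This is a minimal Python illustration, not the presenter's code: the regular expressions, the lowercasing step, and the function names `normalize` and `tweet_hash` are assumptions made for the example. MD5 is used because the implementation slides name it.

```python
import hashlib
import re

# Illustrative patterns (assumptions, not the presenter's exact expressions).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # also drops non-ASCII letters


def normalize(tweet: str) -> str:
    """Reduce a tweet to its bare wording (steps 1 and 2)."""
    text = tweet.lower()  # lowercasing is an extra assumption
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)  # digits and special characters
    return " ".join(text.split())        # collapse runs of whitespace


def tweet_hash(tweet: str) -> str:
    """MD5 hex digest of the normalized tweet (step 3)."""
    return hashlib.md5(normalize(tweet).encode("utf-8")).hexdigest()


# Step 4: two tweets that differ only in link, mention, hashtag and digits
# end up with the same hash.
a = "Win a FREE phone now!!! http://bit.ly/abc123 @alice #giveaway"
b = "win a free phone now http://goo.gl/xyz @bob #contest 2016"
same_pattern = tweet_hash(a) == tweet_hash(b)
```

Hashing the stripped text means only exact matches after normalization are grouped. Tweets that vary even one word fall into different groups.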

Proposed Algorithm
Stage 3: Spam Likelihood Estimation.
1. Iterate through users and assign spam
scores based on the user’s tweets.
2. Iterate through tweets and assign spam
scores based on the users who posted them.
Calculate Spam Score
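The two passes above reinforce each other, which a small sketch can make concrete. The slides' actual scoring formula is not reproduced in this text, so the averaging rule below is a stand-in assumption. The function name `estimate_spam_scores`, the `(user, pattern, flagged)` input shape, and the fixed number of rounds are likewise hypothetical.

```python
from collections import defaultdict


def estimate_spam_scores(tweets, rounds=5):
    """tweets: list of (user, pattern, flagged) tuples.

    Returns (user_scores, pattern_scores), dicts with values in [0, 1].
    The averaging rule is an assumed stand-in, not the slides' formula.
    """
    users_of = defaultdict(set)      # pattern -> users who posted it
    patterns_of = defaultdict(list)  # user -> patterns posted (with repeats)
    pattern_score = {}
    for user, pattern, flagged in tweets:
        users_of[pattern].add(user)
        patterns_of[user].append(pattern)
        # Seed: 1.0 if any tweet with this pattern was flagged earlier.
        pattern_score[pattern] = max(pattern_score.get(pattern, 0.0),
                                     float(flagged))
    seed = dict(pattern_score)

    user_score = {}
    for _ in range(rounds):
        # Pass 1: a user's score is the mean score of the tweets they posted.
        for user, patterns in patterns_of.items():
            user_score[user] = (sum(pattern_score[p] for p in patterns)
                                / len(patterns))
        # Pass 2: a tweet's score is the mean score of the users who posted
        # it, never dropping below its seed label.
        for pattern, users in users_of.items():
            mean = sum(user_score[u] for u in users) / len(users)
            pattern_score[pattern] = max(seed[pattern], mean)
    return user_score, pattern_score


sample = [
    ("bot1", "p1", True), ("bot1", "p2", False),
    ("bot2", "p1", True), ("bot2", "p2", False),
    ("human", "p3", False),
]
users, patterns = estimate_spam_scores(sample)
```

In this toy input, pattern `p2` was never flagged, yet its score rises because only accounts that also post flagged tweets use it. The unrelated account keeps a score of zero.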

Proposed Algorithm
Stage 3: Spam Likelihood Estimation.
Here comes the MATH

Implementation Details
▸ Data Collection
▹ Twitter Java API: Twitter4J
▹ Registering an app with Twitter.

▸ Data Storage
▹ MySQL database.

379,867 tweets
3,129 users
▸ The Twitter API rate-limits the number of
requests.
▸ 180 requests / 15 min
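A limit of 180 requests per 15 minutes averages out to one request every 5 seconds (900 s / 180). A collector can stay under it with a small client-side limiter like the sketch below. The class name `WindowRateLimiter` and its methods are hypothetical. Treating the limit as a sliding window is a conservative assumption, not a description of how Twitter counts requests.

```python
import time
from collections import deque


class WindowRateLimiter:
    """Allow at most `limit` calls in any `window` seconds (sliding window)."""

    def __init__(self, limit=180, window=15 * 60, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock    # injectable so the limiter can be tested
        self.calls = deque()  # timestamps of recent calls, oldest first

    def wait_time(self):
        """Seconds to wait before the next call is allowed (0.0 if none)."""
        now = self.clock()
        # Forget calls that have left the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.limit:
            return 0.0
        return self.window - (now - self.calls[0])

    def record(self):
        """Register that a request was just sent."""
        self.calls.append(self.clock())
```

A collection loop would call `wait_time()`, sleep for the returned number of seconds, send the request, and then call `record()`.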

▸ Stage 1 Implementation
▹ jsoup - HTML parser for Java, used to read the
warning page each service shows for a flagged link
● t.co - "Warning: this link may
be unsafe"
● goo.gl - "The site ahead
contains malware"
● bit.ly - "STOP - there might be
a problem with the
requested link"
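The check itself reduces to looking for one of these warning phrases in the text of the page a short link resolves to. The sketch below covers only that text check and takes page text that has already been fetched. The function name `flagged_by` is hypothetical, and the phrases are the ones quoted above, which real pages may word differently.

```python
# Warning phrases as quoted on the slide (lowercased); real pages may differ.
WARNING_PHRASES = {
    "t.co": "this link may be unsafe",
    "goo.gl": "the site ahead contains malware",
    "bit.ly": "there might be a problem with the requested link",
}


def flagged_by(page_text: str):
    """Return the shortener whose warning appears in page_text, else None."""
    # Normalize case and whitespace so line breaks in the page do not matter.
    text = " ".join(page_text.lower().split())
    for service, phrase in WARNING_PHRASES.items():
        if phrase in text:
            return service
    return None
```

A tweet whose link yields a non-`None` result would be marked as spam in Stage 1.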

▸ Stage 1 Stats
▹ After implementing the first stage of the
algorithm

▸ Stage 2 Implementation
▹ Regular Expressions to Strip Off
#hashtags, @user mentions, URLs,
special characters and numbers
▹ Used the MD5 algorithm to hash each
stripped tweet.
▹ Tweets with the same hash value were
marked as spam.

▸ Stage 2 stats
▹ 13,015 duplicate hashes were found
▹ They covered 70,728 tweets
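Counts of this kind can be derived from the list of per-tweet hashes. The sketch below assumes that a "duplicate hash" is a distinct hash value occurring more than once and that "covered" counts every tweet carrying such a hash. That reading and the function name `duplicate_stats` are assumptions.

```python
from collections import Counter


def duplicate_stats(hashes):
    """Return (number of duplicated hashes, number of tweets they cover)."""
    counts = Counter(hashes)
    duplicated = {h: n for h, n in counts.items() if n > 1}
    return len(duplicated), sum(duplicated.values())


# Hash "a" occurs three times and "b" twice: 2 duplicated hashes, 5 tweets.
stats = duplicate_stats(["a", "b", "a", "c", "b", "a"])
```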

▸ Stage 3 Stats
▹ Spam tweets that the first two stages
had not labelled were
uncovered.
▹ Users who tweet more spam were
assigned a high spam score.
▹ Tweets posted by such
accounts were also assigned a higher spam
score.

Drawbacks
▸ Weak at detecting human-controlled
spam accounts.
Advantages
▸ Detects bot-controlled spam accounts.
▸ Easily detects spam campaigns.
▸ Spam tweets with different user mentions
and links are also detected.
▸ Excessive retweets on unrelated topics are
also treated as spam.

Conclusion and Future Work
▸ Cross-account pattern matching is effective:
it flagged spam tweets that the first two stages had not labelled.
▸ Old methods that judge accounts individually no longer suffice.
▸ For future work
▹ Clustering of tweets to understand
which topics spammers use the most
▹ Providing a real-time spam discovery
solution using machine
learning.

References
[1] Publication:
http://dl.ifip.org/db/conf/im/im2015m/137446.pdf
