Automatic Detection of Click Fraud in Online Advertisements
by
Abhishek Agarwal, M.S.
A Thesis
In
COMPUTER SCIENCE
Submitted to the Graduate Faculty
of Texas Tech University in
Partial Fulfillment of
the Requirements for
the Degree of
MASTER OF SCIENCE
Approved
Dr. Rattikorn Hewett
Chair of Committee
Dr. Sunho Lim
Dr. Eunseog Youn
Peggy Gordon Miller
Dean of the Graduate School
August, 2012
Texas Tech University, Abhishek Agarwal, August 2012
ACKNOWLEDGMENTS
I would like to thank Dr. Rattikorn Hewett for her guidance throughout my Master's
research. Her in-depth knowledge of the subject and her focus on clarity and quality of work
have helped me learn skills that will serve me for the rest of my career. Her guidance on the
research has been invaluable and has helped me cope with the challenges I faced throughout the
course of this work.
TABLE OF CONTENTS
Acknowledgments
Abstract
List of Tables
List of Figures
Motivation
    Contributions
Background Work
Preliminaries
    Terms
    Problem Statement
    Assumptions
    Mathematical Theory of Evidence
        Mass Functions
        Combination Rule
Proposed Dempster Shafer Theory for Click Fraud Detection
    The Core Element of Dempster Shafer Theory
    Mass functions for Click Fraud Detection
        Evidence 1: Number of clicks on the ad
        Evidence 2: Time spent in browsing
        Evidence 3: Ad-Visit after non-ad visit
        Evidence 4: Time of Click
        Evidence 5: Place of origin of click
        Evidence 6: Creating of membership
        Evidence 7: Adding a product in shopping cart
Data Set & Illustration
    Data Description
    Example of belief computation using mass function and combination
Evaluation
    Case Study 1
    Case Study 2
Discussion & Conclusions
Bibliography
ABSTRACT
Advances in Internet technology, together with its growing accessibility and availability, have
driven rapid growth in the number of Internet users over the last decade. This has made online
advertising a popular channel for many companies to market their products and services. Today,
online advertising is one of the most important sources of revenue for many large
enterprises. In online advertising, an advertiser pays a broker (e.g., Google, Yahoo), who
normally operates a search engine, to post its advertisement on any
appropriate publisher site. The publisher earns revenue from the broker for each click on the
advertisement posted on its site, while the advertiser is charged for that click. Thus, an
excessive number of clicks can quickly exhaust the advertising budget of a rival company and
drive it out of the competition for advertising space. At the same time, each click adds revenue
to the publisher. This motivates click fraud: malicious acts that create fraudulent clicks
with the intent to increase revenue or drive away competitors, without real interest in the
products or services being advertised. Identifying click fraud is difficult because
of the dynamic nature of click behavior, some of which is generated by humans and
some by automated software called bots. Previous work has attempted to
identify click fraud using various techniques, but these techniques tend to be limited by the
types of data they use, the way they process the data, or assumptions that do not always hold.
This thesis presents an approach to automatically detecting click fraud in online advertising.
The approach uses a mathematical theory of evidence to estimate the likelihood that a click
is fraudulent or genuine, using web log data of a user's activities on the advertiser's
website. One advantage of the proposed approach is that the likelihood can be
computed for each incoming click, giving an online computation of belief that
fits well with the dynamic behavior of users. The thesis describes the approach and evaluates
its validity using two real-world case studies. We believe the approach is general in that it
can be applied to any scenario in which such web log data are available.
LIST OF TABLES
4.1 Fraud certification rules
5.1 Sample log data
5.2 Input from server log
5.3 Coefficient values
5.4 Mass function beliefs for illustrated example
6.1 Computed belief values for Case Study 1
6.2 Computed belief values for first IP
6.3 Computed belief values for second IP
6.4 Computed belief values for third IP
LIST OF FIGURES
1.1 % change of revenue for advertising media (GeekWire, 2012)
1.2 Google's revenue source distribution in 2011 (Google Earnings Report, 2011)
1.3 Scenario before click fraud occurred
1.4 Scenario after click fraud occurred
4.1 Click fraud detection framework using D-S theory
5.1 Legends for timeline diagram
5.2 Timeline diagram sample data in Table 5.1
5.3 Timeline diagram for Table 5.2
5.4 Combined belief of fraud for input in Figure 5.3
6.1 Timeline input for Case Study 1
6.2 Belief of fraud from mass function 1
6.3 Belief of ~fraud from mass function 2
6.4 Belief of ~fraud from mass function 3
6.5 Belief of fraud from mass function 4
6.6 Belief of fraud from mass function 5
6.7 Belief of ~fraud from mass function 6
6.8 Belief of ~fraud from mass function 7
6.9 Combined belief of fraud for Case Study 1
6.10 Timeline diagram for Case Study 2
6.11 Combined belief values for Case Study 2
CHAPTER I
MOTIVATION
The Internet has seen tremendous growth in the last decade and according to current
statistics from the World Bank, nearly 32% of the world population currently uses the
Internet. This has made online advertising not only lucrative but also an important medium for
businesses to reach out to a large consumer base (Jansen, 2007). Figure 1.1 below shows that
while most other media of advertisement are losing market share, online advertisements are
growing tremendously.
Figure 1.1 % change of revenue for advertising media (GeekWire, 2012)
Not only do online ads benefit advertisers, they are also a rich source of revenue for
publishers who display ads on their websites and for brokers such as Google, Yahoo, MSN and
Ask.com, who provide the technical platform for online advertisements. Thus, online ads drive the
Internet economy and are the lifeblood of its survival and growth. Figure 1.2
below shows that in 2011, 97% of Google's revenue came from online ads alone.
Figure 1.2 Google's revenue source distribution in 2011 (Google Earnings Report, 2011)
Online advertising is, however, not free of problems, and click fraud is a major one
that can impede its growth. Click fraud is a type of crime in online advertising in which a
user clicks on an ad not out of genuine interest in what the advertiser has to offer but with
the intent either to generate illegal revenue from clicks (for the publisher that hosts the
advertisement) or to cause monetary loss to the advertiser. It hurts advertisers and
may deter them from investing in online ads.
Many advertising mechanisms exist, including the pay-per-click (PPC) scheme, which
accounts for about 57 percent of all Internet ads and roughly US$16 billion in
revenue in 2010 (Tuzhilin, 2006; IAB and PwC, 2010). A popular example of a PPC scheme is
Google Adsense. In PPC, brokers like Google place targeted ads in dedicated ad spaces on
publisher websites. Brokers get paid by advertisers for every click on an ad, and they share
the income generated this way with the publishers. While PPC is a great model for online
advertising, it suffers the most from click fraud (Tuzhilin, 2006). Most of
the publishers in PPC programs are small-time blog owners, and they are the source of the
majority of the click fraud. Competitors of an advertiser can also commit click fraud in order to
reduce competition, which may indirectly benefit their business. To commit click fraud, publishers or
competitors can click on the ad themselves, ask friends to do it, use an Internet bot script
that repeatedly clicks on the ads, or hire people to do it for them (Kshetri, 2010). Such clicks
are of no value to advertisers because the clicker has no intent to buy their product or service,
use information, or carry out any transaction useful to the advertiser's business (Jansen, 2007).
Brokers, too, have an incentive not to filter out all click fraud, as doing so would
reduce their revenue. They can contribute to click fraud by passively letting it happen
and not taking adequate measures to stop it. Lesser-known brokers have a greater
incentive to do so (Kshetri, 2010). Multiple lawsuits filed by advertisers against
Google and Yahoo for not taking adequate steps to curb click fraud indicate
brokers' inability or unwillingness in this regard. Figure 1.3 below shows a scenario before
click fraud, when the advertiser's money reserve (advertising budget) is full. The publisher,
broker and competitors have not generated any illegal revenue from click fraud.
Figure 1.3 Scenario before click fraud occurred
Figure 1.4 below shows the scenario after click fraud, which has completely depleted the
advertiser's budget and increased the illegal profit of the broker, publisher and competitor.
Figure 1.4 Scenario after click fraud occurred
Reputable brokers like Google actively try to contain click fraud by filtering out
fraudulent clicks and permanently blocking publishers who are found to be involved (Tuzhilin,
2006; Kshetri, 2010). They use a user's search activities and the data they collect
from the publisher to find patterns in the user's behavior. The idea is to estimate the user's
intention behind a click in order to rate the click as genuine or fraudulent. However, they may
not have access to data about the user's actions on the advertiser's website, where the user is
taken following the click. This is because the advertiser may choose to share limited or no
data with the broker due to its own privacy concerns (Tuzhilin, 2006).
Brokers provide aggregate statistics to advertisers and do not share details of which
clicks they found fraudulent, in order to avoid exposing their detection mechanisms to
fraudsters. Advertisers are thus not adequately informed, and there is a strong case for
advertisers to have their own click fraud detection system in place. This way advertisers
can protect themselves not only from fraudulent publishers and competitors but also from
brokers who either fail to detect fraud or willingly let it occur. Such a system can help them
estimate the extent of fraud in their ad campaign and pay brokers for genuine clicks
only. It is important to note that brokers have access to much larger sources of
information than advertisers. Advertisers must be able to detect click fraud with
the limited data they have about users' actions on their website.
Click fraud identification is a difficult problem. Fraud mechanisms evolve and
change continually over time. Fraud can be carried out by both humans and software bots,
each with distinctive characteristic behaviors. It is difficult to track users by their IP
addresses because IP addresses are generally dynamic: the IP address of a given user may change
at any time. A software bot can also use several different IP addresses to carry out click
attacks. Finally, the advertiser has access only to data from its own server, which gives very
limited information about a user's behavior.
Contributions
This thesis presents an approach to automatically detecting click fraud at the ad-site.
Advertisers can use the proposed approach to detect click fraud against their own ads. Our approach
employs the mathematical theory of evidence called Dempster-Shafer (DS) Theory (Shafer,
1976; Denoeux, 1995; Dong et al., 2010; Sentz et al., 2002) for evidence-based reasoning to
estimate the likelihood of a click being fraudulent based on evidence gathered from the
weblog data available to the advertiser. The proposed approach can also be useful to brokers
for computing correct charges to their clients, if the data are available to them. Our approach is
based on a widely used theory that allows the estimate of the likelihood to be updated as
each incoming click arrives; that is, it offers an online computation. Thus, after each
click from a given IP address we can estimate our belief that the click is fraudulent or
not. In summary, the contributions of this thesis include: (1) an approach for automatically
detecting click fraud, (2) a framework for reasoning about click fraud that
integrates relevant information extracted from weblog data with evidence-based reasoning
to update the click fraud analysis in real time, and (3) the core elements of the proposed
approach, which consist of a set of evidences required for detecting click fraud. These evidences
are formulated as functions called mass functions, which are used in the DS theory.
The rest of this thesis is organized as follows. Chapter II presents background work
on click fraud identification. Chapter III gives preliminaries, including terms and relevant
concepts, the problem formulation and its assumptions, and the Dempster-Shafer Theory along
with its fundamental elements. Chapter IV presents our approach to the problem and the
details of the core contribution of formulating mass functions for the click fraud identification
problem. Chapter V explains the data set used for the approach and gives an illustrative
example. Chapter VI evaluates the proposed approach with experiments on synthetic data
generated for two case studies. Chapter VII gives concluding remarks and possible extensions
for future work.
CHAPTER II
BACKGROUND WORK
Many different types of solutions have been proposed to counter click fraud.
Tuzhilin (2006) suggested a model in which advertisers pay for a click only if it leads to a
conversion event such as a purchase. Such a model is economically unviable for
publishers and so is not available to advertisers. Another method proposed by Tuzhilin (2006) is
the use of data mining models based on past data to classify clicks as fraud or ~fraud (not
fraud). Such a solution may suffer from high inaccuracy as fraud mechanisms evolve and
change over time, because it assumes that past clicking behavior is indicative of future
behavior. It also requires a large number of past clicks that can be reliably labeled as valid
or invalid. It is a batch process, not an online one. Moreover, such datasets are at the
disposal of brokers only, and other involved parties such as advertisers cannot use them. The
author clearly states these limitations.
Haddadi (2010) discusses the use of bluff ads for detecting sources of click fraud such as
trained bots or a poorly trained human workforce employed to carry out fraud. The display text
of these ads is unrelated to the context of the user to whom they are displayed. For example, a
user in Australia should not normally be shown an ad for a special offer on pizza in New York
City. A click by that user is unnatural and indicates that the user is a bot or a
human involved in fraud. However, careful humans and sophisticated bots can still beat it.
This is also a 'broker-centric' model: it can be implemented only by brokers, and advertisers
need to trust brokers completely.
Recently, Antoniou et al. (2011) proposed a burst detection algorithm that detects a high
frequency of user activity in short time periods in order to identify various types of click
fraud, including voting click fraud, fraud related to blog post popularity, search engine
retaliation and advertising click fraud. While this is a good general solution for all the types
of click fraud mentioned, it does not cater to the nuances of advertisement click fraud, as simple
detection of bursts may not be enough to differentiate between valid and invalid clicks. More
factors/evidences need to be taken into consideration before we can conclusively label a
click as fraudulent. Walgampaya et al. (2011) proposed a method to detect bot scripts
involved in click fraud using Bayesian classifiers.
The methods above are either individually insufficient to combat click fraud
or require broker involvement of some kind. The involvement may take the form
of policy changes by brokers or the sharing of data at their disposal, and brokers have been
unwilling to do either. As a result, these methods cannot be used by advertisers to actively
detect fraud at their site.
Kantardzic et al. (2010) proposed a real-time click fraud detection and prevention
system. It uses D-S Theory for multilevel data fusion of evidences from different sources such as
IP address, referrer and country. However, it relies on data from both the client (advertiser)
and the server (broker). An advertiser does not have access to the broker's data, and hence this
system can be used by brokers only. Our approach equips advertisers with a fraud detection
system that uses only the data at their disposal. The evidences that Kantardzic et al. extract
from server data to formulate mass functions are very basic, whereas some of our rules are
sophisticated and, to the best of our knowledge, novel. We do not maintain any historical
databases, and we exploit the observation from Antoniou et al. (2011) that fraud happens in
bursts. Our approach is simple, yet our set of rules is comprehensive, making it difficult for
fraudsters to carry out viable attacks on the advertiser. For example, rules 1, 2, 4 and 5 make
it difficult for a bot to generate clicks without detection.
CHAPTER III
PRELIMINARIES
This chapter outlines the foundation for the proposed method of click fraud detection
and the assumptions we have made.
Terms
We now define terms used in this thesis.
• Advertiser is a seller with an e-commerce website who pays for its ads to be displayed on
other sites. These ads may create more traffic and revenue for the advertiser, since a user
who clicks on one of these ads is directed to the advertiser's site.
• Ad-site is the advertiser's website. A user on the Internet can visit the ad-site by several
means, such as using an Internet search, typing the URL of the advertiser into the browser,
bookmarking the ad-site and clicking the bookmark later, or clicking on the ad on a publisher site.
• Ad-visit is a visit by a user to the ad-site made by clicking an ad. Non-ad visit is a user
visit by any means other than clicking an ad.
• Session is a continuous period of time during which a visitor navigates within the
advertiser's site. In other words, it is the duration for which a user maintains an active HTTP
connection with the server. In a session the user can be browsing, reading, watching videos,
filling out forms, registering for membership, adding products to a shopping cart, purchasing
products, etc.
• Publishers are the websites that host ads for the advertisers and get paid for the clicks
on those ads. Common examples are blogs and news sites.
• Broker is an intermediary between advertiser and publisher. Brokers provide the technical
platform for online advertisements. They are mostly Internet search engine companies such as
Google, Yahoo, AOL and Ask.com, and they use their search technology to serve targeted ads
on publisher sites based on website content, geographical location, etc.
• Pay Per Click (PPC) is an online advertising model in which publishers display ads on
their websites and get paid for each click on those ads. Google runs a PPC program called
Adsense.
• Gclid is a unique ID that is attached to the server log entry for every click made
on Google ads. It helps identify unique visitors to a good approximation, as Google
uses various parameters to make this unique identification.
Problem Statement
Given weblog data at the site of the advertiser over a period of time, find all
occurrences of click fraud. For every such occurrence, identify its source by the corresponding
IP address. The advertiser's web server log data contains, for every click, information such as
IP address, date and time, Gclid (defined above), the requested page and the referrer.
Assumptions
Because the IP address associated with each user is dynamic, solving the above
problem in practice requires the following assumptions.
1) IP address assignment changes over time, and a user may be assigned different IP addresses
while he/she is surfing the Internet. A user (either human or bot) may try to carry out
fraudulent clicks using as many different IPs as possible in order to avoid detection.
Therefore it is not feasible to use data from an IP over a long duration. Instead we use a short
time window W. In this work, W is set to 30 minutes, during which we
assume that the IP address of a user will not change. This duration is typical and
reasonable, though it differs from that used in other existing work. The probability that a user
with a particular IP clicks on an ad and that the same IP is then assigned to another user who
also clicks on the same ad within the window is negligibly low. Our approach is,
however, not limited to this window size, and one can pick a size that suits one's needs.
2) A fraudster has an incentive to click on an ad multiple times but no intention of making
an actual purchase of a product or service. Fraudsters make money by clicking on
ads but would have to spend money to make purchases, which runs against their end
goal. Thus, if a user makes a purchase at the ad-site, we assume that the user is not
involved in fraud. In some circumstances, however (for example, to confuse detection
systems), a fraudster may make a purchase. Such an action stops helping the fraudster as
soon as he moves out of the time window W.
3) Fraudulent clicks separated by large time gaps do not deliver any
substantial monetary gain to the fraudster. The number of clicks has to be large,
with short gaps between them; therefore, a burst of clicks may indicate click fraud
(Antoniou et al., 2011).
4) Since HTTP is a stateless protocol, it is difficult to estimate the session
duration accurately. We sum the time differences between consecutive HTTP requests by the user to
get the total session time, but there is no way to compute the exact time spent by
the user viewing the last page, since there is no request after it. We therefore
assume that 30 seconds are spent on the last page. Our approach is, however, not
limited by this assumption, and any other duration can be assumed for the last page view.
5) We modeled our approach around Google's Adsense, as it is the most widely accepted Pay
Per Click program. We use gclid, a unique ID attached by Google to the web server logs of
advertisers for every click made on their ads. It follows Google's definition of
unique visits. Google claims that it uses various parameters to assign unique gclids and that
third-party click fraud detection engines that use the gclid are more accurate than others. So we
take data filtered by the broker (Google) and apply our own approach for further filtering.
Our approach can, however, be modeled around any other PPC program; one way to
identify the clicks made on advertisements is to create unique landing
pages. This way, by looking at server logs, we can separate out visits made from ads.
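The session-time estimate of Assumption 4 and the window W of Assumption 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the use of `datetime` timestamps, and the choice to treat W as a window ending at the current request are our assumptions for the example and are not prescribed by the approach.

```python
from datetime import datetime, timedelta

# Assumption 4: time credited to the final page view (any other value may be used).
LAST_PAGE_SECONDS = 30


def session_duration(request_times, last_page_seconds=LAST_PAGE_SECONDS):
    """Estimate the session length in seconds from request timestamps."""
    if not request_times:
        return 0.0
    ordered = sorted(request_times)
    # Sum the gaps between consecutive HTTP requests.
    gaps = sum(
        (later - earlier).total_seconds()
        for earlier, later in zip(ordered, ordered[1:])
    )
    # The last page has no following request, so add the assumed duration.
    return gaps + last_page_seconds


def within_window(request_times, now, window=timedelta(minutes=30)):
    """Keep only requests inside the window W that ends at `now` (hypothetical helper)."""
    return [t for t in request_times if timedelta(0) <= now - t <= window]
```

Three requests at 10:00:00, 10:00:20 and 10:01:00 give gaps of 20 and 40 seconds, so the estimated session time is 60 seconds plus the assumed 30 seconds for the last page.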
Mathematical Theory of Evidence
Efforts to identify click fraud have mostly concentrated on identifying a particular
characteristic of user behavior, which is quite different from our approach. To provide the
theoretical background of our approach, we describe the mathematical theory of evidence, also
known as the Dempster-Shafer (D-S) Theory (Shafer, 1976; Denoeux, 1995; Dong et al.,
2010; Sentz et al., 2002). It is related to traditional probability and set theory but is not the
same. The D-S theory allows probability to be assigned to a set of atomic elements rather than
only to an atomic element, and it can be used to represent not only the likelihood of occurrence
of an event but also the uncertainty associated with it.
Using the D-S Theory, evidence that comes from multiple sources with varying
levels of certainty can be effectively combined online. Its ease of use, combined with its wide
and successful application in many areas, makes it an ideal candidate for click
fraud detection, which requires a complex model with several evidences.
In our problem domain a user can either be a fraud or not a fraud (~fraud). So we
have a finite set of hypotheses (atomic elements) in the problem domain, U = {fraud, ~fraud}.
The power set of U is the set {{fraud}, {~fraud}, U, ∅}. Each of the four elements of the
power set is assigned a belief between 0 and 1. {fraud} represents the belief that the user is a
fraud; {~fraud} represents the belief that the user is not a fraud; U represents the belief that
the user is either fraud or ~fraud without committing to one of them, and thus it represents the
uncertainty; ∅ is the empty (null) set and represents a contradiction, so its belief is always 0.
DS-Theory assigns belief to all the elements of this power set of U rather than only to the
mutually exclusive events of U. The sum of all belief values over the power set of U is 1.
Mass Functions
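A degree of belief is represented by a function called a mass function m, which
provides a probability assignment to any $A \subseteq U$, where $m(\emptyset) = 0$ and
$m(\text{fraud}) + m(\sim\text{fraud}) + m(U) = 1$:
\begin{align*}
m(\emptyset) &= 0 \\
m(\text{fraud}) &\in [0, 1] \\
m(\sim\text{fraud}) &\in [0, 1] \\
m(U) &\in [0, 1] \\
\sum_{X \subseteq U} m(X) &= 1
\end{align*}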
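As an illustration of these constraints, the following Python sketch represents a mass function as a dictionary keyed by subsets of U and checks that it is well formed. The representation (frozensets as keys) and the name `is_valid_mass` are assumptions made for this example only.

```python
import math

# The frame U = {fraud, ~fraud} and its subsets, represented as frozensets.
FRAUD = frozenset({"fraud"})
NOT_FRAUD = frozenset({"~fraud"})
U = FRAUD | NOT_FRAUD          # the whole frame, which carries the uncertainty
EMPTY = frozenset()


def is_valid_mass(m, tol=1e-9):
    """Check m(empty) = 0, every mass in [0, 1], and total mass equal to 1."""
    if m.get(EMPTY, 0.0) != 0.0:
        return False
    # Only elements of the power set of U may carry mass.
    if any(focal not in (FRAUD, NOT_FRAUD, U, EMPTY) for focal in m):
        return False
    if any(not 0.0 <= value <= 1.0 for value in m.values()):
        return False
    return math.isclose(sum(m.values()), 1.0, abs_tol=tol)
```

For instance, `{FRAUD: 0.6, U: 0.4}` is accepted, while a function that puts any mass on the empty set is rejected.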
The mass m(A) represents the belief committed exactly to A. For example, U = {fraud, ~fraud}
represents the hypothesis that a suspect may be either fraud or ~fraud. A situation in which
m({fraud, ~fraud}) = 1 occurs when an evidence provides no certainty at all, and this
cannot be adequately represented in traditional probability theory. A belief mass is
therefore different from a probability. As we see above, the probabilities are assigned to
sets rather than to mutually exclusive singletons (Shafer, 1976; Sentz et al., 2002). When the
probabilities are assigned only to mutually exclusive events, i.e. either fraud or ~fraud, so that
m(U) is always 0, DS-Theory reduces to probability theory. For every mass
function, there are associated functions of belief and plausibility. The degree of belief in A,
bel(A), and the plausibility of A, pl(A), are defined respectively as:
\[ \mathrm{bel}(A) = \sum_{X \subseteq A} m(X) \]
\[ \mathrm{pl}(A) = 1 - \mathrm{bel}(\sim A) = \sum_{X \cap A \neq \emptyset} m(X) \]
For example, bel({fraud}) = m({fraud}) + m(∅) = m({fraud}). In general, bel(A) =
m(A) for any singleton set A ⊆ U, and in such a case the computation of bel is greatly simplified.
However, bel(A) is not necessarily the same as m(A) when A is not a singleton set. The functions
m, bel and pl can be derived from one another. Belief and probability are thus different
measures. In this thesis, we use the terms likelihood and belief synonymously.
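The two definitions can be illustrated with a short Python sketch over the frame U = {fraud, ~fraud}. The dictionary representation and the numeric masses below are hypothetical values chosen for the example, not values from our data.

```python
FRAUD = frozenset({"fraud"})
NOT_FRAUD = frozenset({"~fraud"})
U = FRAUD | NOT_FRAUD


def bel(m, a):
    """Belief in `a`: total mass of the non-empty subsets of `a`."""
    return sum(value for x, value in m.items() if x and x <= a)


def pl(m, a):
    """Plausibility of `a`: total mass of the sets that intersect `a`."""
    return sum(value for x, value in m.items() if x & a)


# Hypothetical mass function used only for illustration.
m = {FRAUD: 0.5, NOT_FRAUD: 0.25, U: 0.25}
```

With these masses, bel({fraud}) equals m({fraud}) = 0.5, while pl({fraud}) = 0.5 + 0.25 = 0.75, which agrees with 1 - bel({~fraud}). The gap between the two values is the mass left on U, that is, the uncertainty.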
For our approach we use multiple pieces of evidence, each of which contributes to either a
belief or a disbelief that a user is a fraud, depending on the nature of the evidence and its
quantified value (Dong et al., 2010). For example, if a user clicks many times on an ad, it
becomes evidence that the user is a fraud. Each piece of evidence can support either fraud or
~fraud for a user, but not both. If a piece of evidence supports fraud, the rest of the belief from that
evidence is committed only to the universal set U, which quantifies the uncertainty. If
evidence i supports that the user is fraud, then the mass functions for the evidence are defined
as follows:
mi(fraud) = α*f
mi(~fraud) = 0
mi(U) = 1 - α*f
where α, 0 < α < 1, is an empirically derived value that signifies the strength of the evidence
in supporting that the user is fraud, and f, 0 ≤ f ≤ 1, is a function that is used to quantify the evidence.
If evidence i supports that the user is ~fraud, then the mass functions for the evidence
are defined as follows:
mi(fraud) = 0
mi(~fraud) = β*g
mi(U) = 1 - β*g
where β, 0 < β < 1, is an empirically derived value that signifies the strength of the evidence in
supporting that the user is ~fraud, and g, 0 ≤ g ≤ 1, is a function that is used to quantify the evidence.
Combination Rule
Since we have multiple mass functions, we need a way to combine them. Mass
functions can be combined using various rules, including the popular Dempster’s Rule of
Combination, which is a generalization of the Bayes rule. For X, A, B ⊆ U, the combination
of mass functions m1 and m2, denoted by m1 ⊕ m2 (or m1,2), is defined as follows:
$$m_{1,2}(X) = (m_1 \oplus m_2)(X) = \frac{\sum_{A \cap B = X} m_1(A)\, m_2(B)}{1 - K}, \quad X \neq \emptyset$$
where
$$K = \sum_{A \cap B = \emptyset} m_1(A)\, m_2(B)$$
and (m1 ⊕ m2)(∅) = 0.
The combination rule can be applied in pairs repeatedly to obtain a combination of
multiple mass functions. The rule strongly emphasizes the agreement between multiple
sources of evidence and ignores the disagreement by the use of the normalization factor 1 − K.
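As a hedged illustration of Dempster's rule in its general form, the sketch below combines two mass functions stored as dictionaries keyed by subsets of U. The function name `dempster_combine`, the data layout and the input values are assumptions for illustration, not the thesis implementation.

```python
from itertools import product

F = frozenset({"fraud"})
N = frozenset({"~fraud"})
U = F | N


def dempster_combine(m1, m2):
    """Combine two mass functions given as {frozenset: mass} dicts."""
    combined = {}
    conflict = 0.0  # K: product mass that falls on the empty intersection
    for (a, va), (b, vb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + va * vb
        else:
            conflict += va * vb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    # Normalize by 1 - K so that the combined masses sum to 1.
    return {x: v / (1.0 - conflict) for x, v in combined.items()}


# Hypothetical inputs: one source supports fraud, the other ~fraud.
m1 = {F: 0.6, U: 0.4}
m2 = {N: 0.5, U: 0.5}
m12 = dempster_combine(m1, m2)
print(m12)
```

Here the conflicting mass K = 0.6 × 0.5 is discarded and the remaining mass is rescaled, which is how the rule ignores disagreement between the sources.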
CHAPTER IV
PROPOSED DEMPSTER SHAFER THEORY FOR CLICK FRAUD DETECTION
We propose an approach that advertisers can use to detect fraud in real time
using only the data available to them, without any data from the broker, which can be either impossible
to acquire or very limited. This section describes our approach in detail and
the mass functions that have been developed to compute the belief of fraud.
The Core Element of Dempster Shafer Theory
Figure 4.1 below shows the framework elements of click fraud detection using our
approach. A user's clicking activity is captured by the advertiser's web server logs. The server
logs are updated in real time as users request pages from the server, and the click fraud
detection system reads this data as soon as it is logged. For the latest click that the system is
processing, it finds the IP address and reads all the log data from that IP in the window W.
This data is pre-processed to extract meaningful
evidence, which is then formulated into various mass functions. Each mass function computes a
belief of fraud which is unique and can conflict with the beliefs from other mass functions.
These beliefs are combined using Dempster's combination rule. The combined belief is
categorized into fraud, ~fraud or suspicious by using a set of threshold values. This process is
repeated for every new user click.
Figure 4.1 Click fraud detection framework using D-S theory
Mass functions for Click Fraud Detection
Using the user behavior recorded in the weblogs at the advertiser's site as evidence to
reason about click fraud, we formulate a mass function for each piece of core evidence.
The evidence comes from various factors such as the number of clicks on the ad, the time
spent browsing the advertiser's site, etc. The mass functions are used to compute a belief value for
the click being fraud or not fraud (~fraud). The belief values from different pieces of evidence are
combined as each of them occurs in the data. A mass function contributes to either a belief or a
disbelief that a user is a fraud, depending on its nature and its quantified value. The following
gives detailed formulae of the mass functions based on each piece of evidence. The values αi and βi for
evidence i represent the strength of the evidence in the mass function formulation (mi). In
practice these values will be empirically derived.
Evidence 1: Number of clicks on the ad
If the number of clicks on the ad from an IP in the time window W (30 minutes) is
high, then the likelihood of the user being a fraud is high. Fraudsters have a natural incentive
to click the ads many times in a short period of time (short bursts):
the more they click, the more illegal revenue they generate for themselves. The Basic Mass
Assignment (BMA) for this evidence will always support a belief of fraud whose value
depends on the number of clicks.
Let n be the number of clicks in the window W.
Likelihood of the fraud = 1 – 1/n
m1( fraud) = α1 (1-1/n) (1)
m1 (~fraud) = 0 (2)
m1 (U) = 1 - m1 ( fraud ) = 1 – α1 (1-1/n) (3)
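Equations (1) to (3) can be sketched in a few lines of Python. The function name and the tuple layout (m(fraud), m(~fraud), m(U)) are assumptions for illustration; the value α1 = 0.8 is the coefficient the thesis reports later in Table 5.3.

```python
def mass_clicks(n, alpha1):
    """Return (m(fraud), m(~fraud), m(U)) for n ad clicks in window W."""
    if n < 1:
        raise ValueError("at least one click is needed to evaluate this evidence")
    fraud = alpha1 * (1.0 - 1.0 / n)  # equation (1)
    return fraud, 0.0, 1.0 - fraud    # equations (2) and (3)


# The belief of fraud starts at 0 and saturates towards alpha1 as n grows.
for n in (1, 2, 6, 20):
    print(n, mass_clicks(n, alpha1=0.8))
```

Because 1 − 1/n rises steeply for small n, most of the increase in belief happens within the first few clicks.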
Evidence 2: Time spent in browsing
If the time spent by the user at the ad-site is high, then he/she is less likely to be a
fraud. A genuine user will click the ad due to a real interest in the advertiser's content (advertised
product, service or website content) and is likely to spend more time exploring the ad-site
than a fraudster. Fraudsters are less likely to do so since they are not interested in the product
and since leaving quickly lets them make more clicks in a given time. The BMA for this rule will always
support a belief of ~fraud whose value depends on the time spent at the ad-site. As a user
continues to spend more time at the ad-site, the belief that he/she is ~fraud will increase.
Let t be the time spent by the user in all visits in the time window W (30 minutes), where 0 < t
<= 30 minutes. The likelihood of ~fraud increases as t increases.
m2 (fraud) = 0 (4)
m2 (~ fraud) = β2 *(t/W) (5)
m2 (U) = 1 - m2 (~ fraud ) = 1 – β2* (t/W) (6)
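A minimal sketch of equations (4) to (6) follows, assuming that time is measured in seconds so that W = 1800. The function name and tuple layout are illustrative assumptions; β2 = 0.99 is the coefficient the thesis reports later in Table 5.3.

```python
W_SECONDS = 30 * 60  # window W of 30 minutes, expressed in seconds


def mass_time_spent(t_seconds, beta2, window=W_SECONDS):
    """Return (m(fraud), m(~fraud), m(U)) for total browsing time t."""
    if not 0 < t_seconds <= window:
        raise ValueError("t must satisfy 0 < t <= W")
    not_fraud = beta2 * (t_seconds / window)  # equation (5)
    return 0.0, not_fraud, 1.0 - not_fraud    # equations (4) and (6)


print(mass_time_spent(180, beta2=0.99))
```

With 180 seconds of browsing the belief of ~fraud is only 0.99 × 0.1, so a user must stay for most of the window before this evidence carries much weight.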
Evidence 3: Ad-Visit after non-ad visit
If a user clicks on an ad after a non-ad visit, then he/she is likely to be a fraud. Once a user
makes a non-ad visit to the ad-site, it implies that the user knows how to reach the site without
clicking on the ad. Clicking on an ad after that seems unnecessary and indicates a
likelihood of fraud. The BMA for this rule can support a belief of either fraud or
~fraud.
Let x be the likelihood of fraud. If the user has visited only via ads then x=0.1 (little
likelihood of fraud). If the user has visited via an ad after visiting normally then x=1.0 (high
likelihood of fraud). Thus the mass functions when the evidence supports fraud are as
follows:
m3 (fraud) = α3 *(x) (7)
m3 (~ fraud) = 0 (8)
m3 (U) = 1 - m3 ( fraud ) = 1 - α3*(x) (9)
Let y=1.0 be the likelihood of ~fraud if the user does not have an ad-visit after a non-ad visit.
The mass functions if the evidence supports ~fraud are as follows:
m3 (fraud) = 0 (10)
m3 (~ fraud) = β3 *(y) (11)
m3 (U) = 1 - m3 ( ~fraud ) = 1 – β3 *(y) (12)
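Equations (7) to (12) can be sketched as below. The thesis does not spell out exactly when this evidence switches from fraud to ~fraud, so the sketch assumes one reading: the evidence is judged at the latest visit, an ad-visit supports fraud (x = 1.0 if any earlier visit was a non-ad visit, otherwise x = 0.1), and a non-ad visit supports ~fraud with y = 1.0. The function name and the boolean list input are also assumptions.

```python
def mass_ad_after_non_ad(visits, alpha3, beta3):
    """visits: booleans in time order, True for an ad-visit.

    Returns (m(fraud), m(~fraud), m(U)) judged at the latest visit
    (an assumed reading of the rule, see the lead-in).
    """
    if not visits:
        raise ValueError("at least one visit is required")
    if visits[-1]:  # latest visit came through the ad
        earlier_non_ad = any(not v for v in visits[:-1])
        x = 1.0 if earlier_non_ad else 0.1
        fraud = alpha3 * x
        return fraud, 0.0, 1.0 - fraud
    y = 1.0  # latest visit is a non-ad visit
    not_fraud = beta3 * y
    return 0.0, not_fraud, 1.0 - not_fraud


print(mass_ad_after_non_ad([True, True, True], 0.6, 0.2))
print(mass_ad_after_non_ad([True, True, True, False], 0.6, 0.2))
print(mass_ad_after_non_ad([True, True, True, False, True], 0.6, 0.2))
```

The coefficients 0.6 and 0.2 are the α3 and β3 values the thesis lists later in Table 5.3.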
Evidence 4: Time of Click
If the click occurred in the most suspicious time (or most active period of fraud
activity) then the user is likely to be a fraud. Fraudsters are generally known to be active
during certain hours of the day and a click at such hours can be indicative of fraudulent
activity. We follow Universal Time to determine this and not any particular time zone. If a
click happens at that certain time slot of suspicion then the click is likely to be a fraud
otherwise ~fraud. The BMA for this rule will support a belief of fraud if the time of click lies
in the suspicious time range. Otherwise it will support a belief of ~fraud.
Let Tstart and Tend be the start and end of the suspicious time range, t be the time of click.
Let x=1.0 be the likelihood of fraud if t lies between Tstart and Tend. The mass functions when
the evidence supports fraud are as follows:
m4 (fraud) = α4*(x) (13)
m4 (~ fraud) = 0 (14)
m4 (U) = 1 - m4 ( fraud ) = 1 – α4*(x) (15)
Let y=1.0 be the likelihood of ~fraud if t does not lie between Tstart and Tend. The mass
functions when the evidence supports ~fraud are as follows:
m4 (fraud) = 0 (16)
m4 (~fraud) = β4*(y) (17)
m4 (U) = 1 - m4 (~ fraud ) = 1 – β4*(y) (18)
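A sketch of equations (13) to (18) is given below. The thesis does not state the actual values of Tstart and Tend, so the range 01:00 to 04:00 Universal Time used here is purely hypothetical, and the sketch assumes the range does not wrap past midnight. The coefficients are the α4 and β4 values the thesis lists later in Table 5.3.

```python
from datetime import datetime, time


def mass_time_of_click(click_utc, t_start, t_end, alpha4, beta4):
    """Return (m(fraud), m(~fraud), m(U)) from the time of the click.

    Assumes t_start <= t_end, i.e. the range does not cross midnight.
    """
    if t_start <= click_utc.time() <= t_end:
        fraud = alpha4 * 1.0  # x = 1.0 inside the suspicious range
        return fraud, 0.0, 1.0 - fraud
    not_fraud = beta4 * 1.0   # y = 1.0 outside the suspicious range
    return 0.0, not_fraud, 1.0 - not_fraud


T_START, T_END = time(1, 0), time(4, 0)  # hypothetical suspicious range

night = mass_time_of_click(datetime(2012, 3, 5, 2, 23), T_START, T_END, 0.2, 0.01)
noon = mass_time_of_click(datetime(2012, 3, 5, 12, 0), T_START, T_END, 0.2, 0.01)
print(night, noon)
```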
Evidence 5: Place of origin of click
If the click originated from a location (country, state or city) where the advertiser has
no business, then the user is likely to be a fraud. Ads are often targeted at the audience of a
particular region where the advertisers have a reach or the rights to sell their products. This is
especially true for small and medium sized businesses that are restricted to a country or city.
Even large advertisers mostly advertise to a local clientele; for example, a car company that sells
in many countries has different ads based on the different models it sells in each country.
If a click originates from a location outside of the advertiser's region of business, then it is likely
to be fraud, as the user will get no value from such a click. It is also notable that in some
countries the laws against cyber fraud are very weak, and fraudsters use this fact to
their advantage. To avoid prosecution, fraudsters use IP addresses originating from these countries, through bots or
by hiring people at low cost (many of whom do not realize that their act is causing huge losses to
advertisers), to carry out the fraud (Kshetri, 2010). As
a result such clicks have high suspicion associated with them. This rule has the ability to limit
a range of fraudulent attacks which depend on using IP addresses from varied geographical
locations (these include the use of both humans and bots). The BMA for this rule supports a
belief of fraud if the click originated from a region outside of the advertiser's business and a
belief of ~fraud otherwise.
Let x=1.0 be the likelihood of fraud if the click originated from a region outside of the
advertiser's business. The mass functions when the evidence supports fraud are as follows:
m5 (fraud) = α5 *(x) (19)
m5 (~ fraud) = 0 (20)
m5 (U) = 1 - m5 ( fraud ) = 1 - α5*(x) (21)
Let y=1.0 be the likelihood of ~fraud if the click originated from a region within the
advertiser's business. The mass functions when the evidence supports ~fraud are as follows:
m5 (fraud) = 0 (22)
m5 (~fraud) = β5*(y) (23)
m5 (U) = 1 - m5 (~ fraud ) = 1 - β5*(y) (24)
Evidence 6: Creation of membership
If the user creates a membership account (registers as a member) with the advertiser, then
he/she is less likely to be a fraud. A genuine user may or may not create such an account.
Fraudsters, however, are less likely to register themselves at the ad-site or create a membership
account, as they have no incentive to do so and because it requires them to spend some
time and give out some information such as email, address, etc. The BMA for this rule supports a
belief of ~fraud if a membership account was created, and otherwise supports a negligible belief of
fraud.
Let x=1 be the likelihood of fraud if a membership account is not created. The mass functions when the
evidence supports fraud are as follows:
m6 (fraud) = α6* (x) (25)
m6 (~fraud) = 0 (26)
m6 (U) = 1 - m6 ( fraud ) = 1 - α6 *(x) (27)
Let y=1 be the likelihood of ~fraud if a membership account is created. The mass functions
when the evidence supports ~fraud are as follows:
m6 (fraud) = 0 (28)
m6 (~ fraud) = β6 *(y) (29)
m6 (U) = 1 - m6 ( ~fraud ) = 1 - β6 *(y) (30)
Evidence 7: Adding a product in shopping cart
If the user adds a product to his/her shopping cart, then he/she is less likely to be a fraud.
Due to a lack of genuine interest in the advertiser's products or services, a fraudster is less
likely to use a shopping cart. Using a shopping cart requires the user to spend time, for which a
fraudster has no incentive. The BMA for this rule supports a belief of ~fraud if a product was
added to a cart, and otherwise supports a negligible belief of fraud.
Let x=1.0 be the likelihood of fraud if the user does not add any product to his/her shopping cart. The
mass functions when the evidence supports fraud are as follows:
m7 (fraud) = α7* (x) (31)
m7 (~fraud) = 0 (32)
m7 (U) = 1 - m7 ( fraud ) = 1 – α7 *(x) (33)
Let y=1.0 be the likelihood of ~fraud if the user adds a product to his shopping cart. The mass
functions when the evidence supports ~fraud are as follows:
m7 (fraud) = 0 (34)
m7 (~ fraud) = β7*(y) (35)
m7 (U) = 1 - m7 ( ~fraud ) = 1 - β7*(y) (36)
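Evidences 5, 6 and 7 share the same two-branch shape: a yes/no observation selects either the fraud branch (α, x) or the ~fraud branch (β, y). The sketch below captures that shape in one helper. The helper name and the example observations are hypothetical; the coefficients are those the thesis lists later in Table 5.3.

```python
def binary_mass(supports_fraud, alpha, beta, x=1.0, y=1.0):
    """Return (m(fraud), m(~fraud), m(U)) for a yes/no piece of evidence."""
    if supports_fraud:
        fraud = alpha * x
        return fraud, 0.0, 1.0 - fraud
    not_fraud = beta * y
    return 0.0, not_fraud, 1.0 - not_fraud


# Hypothetical observations for one click.
outside_region = True       # evidence 5: click came from outside the business region
created_membership = False  # evidence 6: no membership account was created
added_to_cart = True        # evidence 7: a product was added to the cart

m5 = binary_mass(outside_region, alpha=0.4, beta=0.1)
m6 = binary_mass(not created_membership, alpha=0.02, beta=0.25)
m7 = binary_mass(not added_to_cart, alpha=0.01, beta=0.2)
print(m5, m6, m7)
```

Note the asymmetry for evidences 6 and 7: the absence of a membership or cart action supports fraud only weakly (small α), while its presence supports ~fraud more strongly (larger β).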
Individually, the pieces of evidence are not sufficient to determine the likelihood of a user
being fraud or ~fraud. Each may give a different or contradicting belief of fraud
depending on its nature, but in combination they provide a much more accurate estimate.
Thus, the likelihood of a click being fraudulent is estimated by combining the beliefs obtained
from the corresponding mass functions for each piece of supporting evidence. To define the rule
for combining mass functions, let m1 and m2 be two distinct mass functions of a
particular click. Dempster's rule of combination can be applied as shown below. For
readability, we omit the index i and write f, ~f and U for {fi}, {~fi} and Ui, respectively.
m1,2(f) = (m1(f)m2(f) + m1(f)m2(U) + m1(U)m2(f)) / (1 − K)
m1,2(~f) = (m1(~f)m2(~f) + m1(~f)m2(U) + m1(U)m2(~f)) / (1 − K)
m1,2(U) = (m1(U)m2(U)) / (1 − K),
where K = m1(f)m2(~f) + m1(~f)m2(f).
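The three formulas above translate directly into code when each mass function is stored as a triple (m(f), m(~f), m(U)). The following is a minimal sketch under that assumed representation; the function name and the input values are illustrative only.

```python
def combine(m1, m2):
    """Dempster's rule on the frame {f, ~f}; masses are (f, ~f, U) triples."""
    f1, n1, u1 = m1
    f2, n2, u2 = m2
    k = f1 * n2 + n1 * f2  # conflict between the two sources
    if k >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    norm = 1.0 - k
    return (
        (f1 * f2 + f1 * u2 + u1 * f2) / norm,
        (n1 * n2 + n1 * u2 + u1 * n2) / norm,
        (u1 * u2) / norm,
    )


# Hypothetical inputs: the first source supports fraud, the second ~fraud.
result = combine((0.5, 0.0, 0.5), (0.0, 0.4, 0.6))
print(result)
```

Restricting the frame to two hypotheses keeps the rule to three products per output, which suits a system that re-evaluates the belief on every click.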
This combination rule can be applied repeatedly pair-wise until all pieces of
evidence have been incorporated into the computation of the likelihood of fraud. Our
proposed approach certifies the clicks based on their likelihood of being
fraudulent, using the beliefs combined from all of the evidence. Table 4.1 below gives the
thresholds that we have empirically derived from our experiments and tests.
Table 4.1 Fraud certification rules
Category      Lower   Upper
Not Fraud     0       0.499
Suspicious    0.5     0.649
Fraud         0.65    1
A combined belief of fraud < 0.5 indicates ~fraud. A combined belief of fraud >= 0.65
indicates fraud and all values in between indicate a suspicion.
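The certification rule can be sketched as a simple threshold function. The function name and the label strings are assumptions for illustration; the default thresholds 0.5 and 0.65 are the ones given in Table 4.1.

```python
def certify(belief_fraud, suspicious_from=0.5, fraud_from=0.65):
    """Map a combined belief of fraud to a certification label."""
    if not 0.0 <= belief_fraud <= 1.0:
        raise ValueError("belief must lie in [0, 1]")
    if belief_fraud >= fraud_from:
        return "fraud"
    if belief_fraud >= suspicious_from:
        return "suspicious"
    return "not fraud"


for b in (0.43, 0.5, 0.649, 0.65, 0.84):
    print(b, certify(b))
```

Comparing against the two boundaries directly avoids the small gaps between the rounded Lower and Upper columns of the table (for example, between 0.499 and 0.5).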
CHAPTER V
DATA SET & ILLUSTRATION
In this section we give a detailed explanation of the data that we use in our approach.
We also show an illustrated example using our data set with our approach.
Data Description
Click data is not publicly available. Real weblog data from a web server is the
property of the owner of the server and is not made public due to the owner's privacy
concerns. Moreover, such data needs to be cleaned to extract it in a relevant format. This is a
time consuming process and is not a focus of our research. For these reasons we use synthetic
data for our research. Furthermore, we can manipulate synthetic data and add patterns of fraud
for evaluating different click fraud scenarios.
The data represent weblogs from the advertiser's web server. For our experiments and
evaluations we synthesize log data in combined log format (CLF). We pre-process the raw
logs and extract the following information from them for each user in real time: the IP address of
the remote computer requesting the web page; the time and date of the request; the page that was
requested; and the Gclid number. The region from which the click originated can be
extracted from the IP address by using one of the many geolocation services which map an
IP address to a place using a geolocation database. Table 5.1 below shows sample data extracted
from the server logs.
Table 5.1 Sample log data
IP Address    Click No  Gclid No  Time of click   Requested Page  Referrer
172.16.276.3  1         1001      3/5/2012 1:50   index.htm       adsite.htm
172.16.276.3  2         1002      3/5/2012 1:56   index.htm       adsite.htm
172.16.276.3  3         1002      3/5/2012 1:59   page1.htm       index.htm
172.16.276.3  4         1002      3/5/2012 2:01   page2.htm       page1.htm
172.16.276.3  5         null      3/5/2012 2:05   index.htm       google.com
172.16.276.3  6         null      3/5/2012 2:08   page1.htm       index.htm
172.16.276.3  7         null      3/5/2012 2:10   page2.htm       page1.htm
172.16.276.3  8         null      3/5/2012 2:14   index.htm       null
172.16.276.3  9         null      3/5/2012 2:16   page1.htm       index.htm
172.16.276.3  10        null      3/5/2012 2:17   page2.htm       page1.htm
Each row of Table 5.1 above represents an HTTP request made by the user to the
advertiser's web server. Whenever a user requests content from the advertiser, an HTTP
request is generated. Below are some observations which describe the data in
Table 5.1.
• Every row represents a click by the user requesting content from the ad-site.
• All the clicks in the table above are by the same user since the IP address is the same for
all rows of the log.
• Index.htm is the landing page. Every time index.htm is the requested page, it implies a
new visit. Table 5.1 has 4 unique visits.
• A non-null Gclid number implies an ad-visit. Click numbers 1 through 4 belong to ad-visits
since they have a valid Gclid number attached.
• Two different Gclid numbers above imply two different ad-visits. The first click, with
Gclid number 1001, implies an ad-visit. Since there is only 1 row with Gclid number 1001,
it implies that the user did not make any other page requests after landing on the ad-site
during the first ad-visit. The second click, with Gclid number 1002, is also an ad-visit.
However, in this visit the user also requested page1.htm and page2.htm (click numbers 3
and 4).
• Each row with a null Gclid number belongs to a non-ad visit. Click numbers 5 through 10
correspond to two non-ad visits.
• Click number 5 corresponds to the first non-ad visit and the third visit overall. The visitor was
referred to the ad-site by Google search since google.com is the referrer. After landing, the
user requested two more pages in the same visit, page1.htm and page2.htm.
• Click number 8 corresponds to the second non-ad visit and the fourth visit overall. A null referrer
implies that the user may have typed the ad-site's URL in his/her browser or had previously
bookmarked the site and clicked on the bookmark. After landing, the user requested two
more pages in the same visit, page1.htm and page2.htm.
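The observations above amount to a small grouping procedure: a request for the landing page starts a new visit, and a non-null Gclid marks that visit as an ad-visit. The sketch below applies it to the rows of Table 5.1. The function name, the tuple layout and the use of `None` for a null Gclid are assumptions for illustration.

```python
LANDING_PAGE = "index.htm"

# (click_no, gclid, requested_page) rows copied from Table 5.1
ROWS = [
    (1, "1001", "index.htm"),
    (2, "1002", "index.htm"),
    (3, "1002", "page1.htm"),
    (4, "1002", "page2.htm"),
    (5, None, "index.htm"),
    (6, None, "page1.htm"),
    (7, None, "page2.htm"),
    (8, None, "index.htm"),
    (9, None, "page1.htm"),
    (10, None, "page2.htm"),
]


def group_visits(rows):
    """Split time-ordered log rows of one IP into visits."""
    visits = []
    for click_no, gclid, page in rows:
        # A request for the landing page opens a new visit.
        if page == LANDING_PAGE or not visits:
            visits.append({"is_ad": gclid is not None, "clicks": []})
        visits[-1]["clicks"].append(click_no)
    return visits


for visit in group_visits(ROWS):
    print("ad-visit" if visit["is_ad"] else "non-ad visit", visit["clicks"])
```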
We will use a timeline diagram to help illustrate our inputs (like Table 5.1) for the rest
of the thesis. Figure 5.1 shows the legends for the diagram and Figure 5.2 shows a timeline
diagram corresponding to the input from Table 5.1.
Figure 5.1 Legends for timeline diagram
Figure 5.2 Timeline diagram sample data in Table 5.1
A timeline diagram is a visual representation of a user's clicking data from the server
weblogs. Just by looking at Figure 5.2 we can make certain observations. The user has
made 4 unique visits. The first two visits were ad-visits and the last two were non-ad visits.
The width of the session blocks indicates session durations. The first visit was a very short
session in which the user did not request any pages after landing. In all the other visits the
user requested two other pages and the session durations are longer. The start and end times of
every session are also given. Lastly, we can see that the user neither logged in as a member in
any of the sessions nor used a shopping cart.
Example of belief computation using mass function and combination
In this example we analyze and compute the belief of a user being fraud or ~fraud
using our approach. The purpose is to explain the approach and the computations involved
with a simple example. Table 5.2 below gives a sample input.
Table 5.2 Input from server log
IP Address    Click No  Gclid No  Time of click   Requested Page  Referrer
172.16.276.3  1         1001      3/5/2012 1:56   index.htm       adsite.htm
172.16.276.3  2         1002      3/5/2012 2:01   index.htm       adsite.htm
172.16.276.3  3         1003      3/5/2012 2:07   index.htm       adsite.htm
172.16.276.3  4         1004      3/5/2012 2:13   index.htm       adsite.htm
172.16.276.3  5         1005      3/5/2012 2:18   index.htm       adsite.htm
172.16.276.3  6         1006      3/5/2012 2:23   index.htm       adsite.htm
From Table 5.2 above we can easily conclude that the user made six ad-visits. The
user did not request any page of ad-site other than index.htm. Figure 5.3 below shows the
timeline diagram for the data corresponding to Table 5.2.
Figure 5.3 Timeline diagram for Table 5.2
As soon as a row corresponding to a user activity is logged, the system reads it
immediately and computes the mass beliefs for each piece of evidence, which are then
combined to get an overall belief score using Dempster's combination rule. For Table 5.2
above, six belief values will be computed, one for every click. Thus the belief about
the user is updated with every user click.
The evidence combination process combines the beliefs from each piece of possibly conflicting evidence
and gives a belief score for each of a user's clicks. To demonstrate our approach we will work out
the calculation of belief values at the 6th click. Please note that we use the α and β values from
Table 5.3. These values have been derived empirically from our experiments and will be used
in all our computations.
Table 5.3 Coefficient values
Evidence No   α      β
1             0.8    -
2             -      0.99
3             0.6    0.2
4             0.2    0.01
5             0.4    0.1
6             0.02   0.25
7             0.01   0.2
Evidence 1 always supports a belief of fraud and therefore at the 6th click on the ad the mass
function values are:
m1 (fraud) = 0.8 * (1 - 1/6) = 0.667
m1 (~fraud) = 0
m1 (U) = 1 - m1 (fraud) = 1 - 0.8 * (1 - 1/6) = 0.333
Evidence 2 always supports a belief of ~fraud. The user spends 30 seconds in each visit since
he/she does not open any other page, and therefore the total time spent is 180 seconds. The
window size W is 1800 seconds. Therefore the mass function values are:
m2 (fraud) = 0
m2 (~fraud) = 0.99 * (180/1800) = 0.099
m2 (U) = 1 - m2 (~fraud) = 1 - 0.99 * (180/1800) = 0.901
Evidence 3 supports a little belief of fraud since there was no non-ad visit by the user.
Therefore the mass function values are:
m3 (fraud) = 0.6 * (0.1) = 0.06
m3 (~fraud) = 0
m3 (U) = 1 - m3 (fraud) = 1 - 0.6 * (0.1) = 0.94
Evidence 4 supports a belief of fraud since the 6th click occurs at a suspicious time (2:23 AM).
Therefore the mass function values are:
m4 (fraud) = 0.2 * (1) = 0.2
m4 (~fraud) = 0
m4 (U) = 1 - m4 (fraud) = 1 - 0.2 * (1) = 0.8
Evidence 5 supports a belief of fraud since we assume that the IP originates from a region
outside the area of business of the advertiser. Therefore the mass function values are:
m5 (fraud) = 0.4 * (1) = 0.4
m5 (~fraud) = 0
m5 (U) = 1 - m5 (fraud) = 1 - 0.4 * (1) = 0.6
Evidences 6 and 7 support a little belief of fraud since no product was added to a shopping cart and
no membership account was created. Therefore the mass function values are:
m6 (fraud) = 0.02 * (1) = 0.02
m6 (~fraud) = 0
m6 (U) = 1 - m6 (fraud) = 1 - 0.02 * (1) = 0.98
m7 (fraud) = 0.01 * (1) = 0.01
m7 (~fraud) = 0
m7 (U) = 1 - m7 (fraud) = 1 - 0.01 * (1) = 0.99
From Table 5.4 below we can observe that each mass function gives a varying degree
of belief values and these can be conflicting.
Table 5.4 Mass function beliefs for illustrated example
Mass function   belief(fraud)   belief(~fraud)
m1              0.667           0
m2              0               0.099
m3              0.06            0
m4              0.2             0
m5              0.4             0
m6              0.02            0
m7              0.01            0
Now we can apply Dempster’s rule of combination to get the combined belief
about the user from the mass beliefs in Table 5.4.
K = m1(f)m2(~f) + m1(~f)m2(f) = 0.066
1-K = 0.934
m1,2(f) = (m1(f)m2(f) + m1(f)m2(U) + m1(U)m2(f)) / (1-K) = 0.643
m1,2(~f) = (m1(~f)m2(~f) + m1(~f)m2(U) + m1(U)m2(~f)) / (1-K) = 0.035
m1,2(U) = m1(U)m2(U) / (1-K) = 0.322
m1,2 is the combined mass belief from mass functions 1 and 2. Next we combine this with
the mass function for evidence 3 to get the combined mass belief m1,2,3:
K = m1,2(f)m3(~f) + m1,2(~f)m3(f) = 0.0021
1-K = 0.998
m1,2,3(f) = (m1,2(f)m3(f) + m1,2(f)m3(U) + m1,2(U)m3(f)) / (1-K) = 0.664
m1,2,3(~f) = (m1,2(~f)m3(~f) + m1,2(~f)m3(U) + m1,2(U)m3(~f)) / (1-K) = 0.0333
m1,2,3(U) = m1,2(U)m3(U) / (1-K) = 0.303
The above belief combination repeats until no more evidence needs to be considered.
Thus, the belief in the hypothesis that click 6 is fraudulent is calculated cumulatively.
Following this procedure we obtain the combined belief of all mass beliefs,
m1,2,3….7:
m1,2,3….7(f) = 0.840
m1,2,3….7(~f) = 0.016
m1,2,3….7(U ) = 0.144
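The whole worked example can be reproduced with a short script that folds the seven mass functions together pair-wise. This is a sketch, not the thesis implementation: the `combine` helper and the (f, ~f, U) triple layout are assumptions, while the input masses are the ones computed above for the 6th click.

```python
from functools import reduce


def combine(m1, m2):
    """Dempster's rule on the frame {f, ~f}; masses are (f, ~f, U) triples."""
    f1, n1, u1 = m1
    f2, n2, u2 = m2
    norm = 1.0 - (f1 * n2 + n1 * f2)  # 1 - K
    return (
        (f1 * f2 + f1 * u2 + u1 * f2) / norm,
        (n1 * n2 + n1 * u2 + u1 * n2) / norm,
        (u1 * u2) / norm,
    )


m1_fraud = 0.8 * (1 - 1 / 6)         # evidence 1 at the 6th click
m2_not_fraud = 0.99 * (180 / 1800)   # evidence 2 with 180 s in the window

masses = [
    (m1_fraud, 0.0, 1 - m1_fraud),
    (0.0, m2_not_fraud, 1 - m2_not_fraud),
    (0.06, 0.0, 0.94),  # evidence 3
    (0.2, 0.0, 0.8),    # evidence 4
    (0.4, 0.0, 0.6),    # evidence 5
    (0.02, 0.0, 0.98),  # evidence 6
    (0.01, 0.0, 0.99),  # evidence 7
]

m12 = combine(masses[0], masses[1])
fraud, not_fraud, unknown = reduce(combine, masses)
print(m12)
print(fraud, not_fraud, unknown)
```

Run with unrounded intermediate values, the script agrees with the reported combined beliefs to within about 0.001; the small differences come from rounding in the hand calculation.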
As we can see, the belief(fraud) of 0.84 is above the threshold for
fraud (0.65) given in Table 4.1 and so the user is certified as fraud. Figure 5.4 gives a
graphical representation of the combined belief of fraud over all 6 clicks made by the user
(in this example we have worked out the mass value computation for the 6th click only, but the
figure plots the mass values computed for all clicks from the 1st through the 6th). We can
observe how the combined belief changes as more clicks are made.
Figure 5.4 Combined belief of fraud for input in Figure 5.3
CHAPTER VI
EVALUATION
In this section we present two case studies (scenarios), each of which corresponds to a
different type of click fraud attack. In case study 1 we present a scenario where a human user
is trying to perform click fraud and uses different click patterns in order to avoid detection. In
case study 2 we present a scenario where a software bot is used to perform click fraud and it
tries to make detection difficult by using multiple IP addresses. In both cases we present
our output and show that our approach is able to successfully detect click fraud. We will
discuss the generality of our solution in Chapter VII.
Case Study 1
We present a scenario where a human user is trying to commit click fraud and avoid
detection by giving the impression of a regular user. Figure 6.1 below shows the user activity
for the test case.
Figure 6.1 Timeline input for Case Study 1
A fraudster needs to repeatedly click on the ad in order to make a substantial profit. In
this case the fraudster clicks the ad seven times (leading to seven ad-visits). The fraudster also
enters the ad-site via a regular search (non-ad visit) to give a stronger impression of a regular
user. He/she spends time on the site after landing (with random session durations) and carries
out activities like opening 32 links in the ad-site after landing, creating a membership account
and adding a product to his/her shopping cart.
Below we describe the belief computed from every mass function and the combined
belief in Figures 6.2 through 6.9. We have plotted the belief value against time (in the range of
window W). Please note that some of the functions support fraud and ~fraud at different
times depending on the input, and thus they can yield both types of belief. In
these cases we show only the belief of fraud for clarity. Also note that whenever a
function supports a belief in ~fraud, its belief in fraud becomes 0, and vice versa.
Figure 6.2 below shows the belief computed from Mass Function 1 (Number of clicks
on the ad), according to which if the number of clicks on the ad from an IP in the time window
W (30 minutes) is high, then the likelihood of the user being a fraud is high. Mass Function 1
supports only a belief of fraud, and the belief at the first click on the ad is 0. The belief
increases as more clicks are made on the ad. The increase is faster in the first five clicks due
to the nature of the function. It is notable that the belief of fraud does not increase in the fourth
visit as it is a non-ad visit. This function does not consider any user activity apart from
the number of clicks on the ad. Therefore user activities like a non-ad visit (fourth visit), adding
products to the shopping cart, etc. do not affect the belief of this mass function.
Figure 6.2 Belief of fraud from mass function 1
Figure 6.3 below shows the belief computed from Mass Function 2 (Time spent in
browsing) according to which if the time spent by the user at the ad-site is high then he/she is
less likely to be a fraud. This function supports only the belief of ~fraud. In this case study the
user spent time in every session and this is reflected in an increasing belief of ~fraud. This
belief clearly contradicts the belief from Mass Function 1 which supports a belief of fraud.
The fraudster has spent a considerable time browsing the ad-site during every visit to give an
impression of a genuine user. As we can see below the user has a high belief of ~fraud at the
end.
Figure 6.3 Belief of ~fraud from mass function 2
Figure 6.4 below shows the belief computed from Mass Function 3 (Ad-visit after
non-ad visit), according to which if a user clicks on an ad after a non-ad visit, he/she is likely
to be a fraud. Once a user makes a non-ad visit to the ad-site, it implies that the user knows
how to reach the site without clicking on the ad. The first three visits are all ad-visits and
therefore the function supports a little belief of fraud. The fourth visit is a non-ad visit and
therefore the function does not support fraud (the belief becomes 0). But the fifth visit is an
ad-visit (after a non-ad visit). The function computes a high belief of fraud because of this, and we
see that the belief of fraud spikes up to 0.6.
Figure 6.4 Belief of fraud from mass function 3
Figure 6.5 below shows the belief computed from Mass Function 4 (Time of click),
according to which if the click occurred in the most suspicious time (or most active period of
fraud activity) then the user is likely to be a fraud. The first three visits are not during the
most suspicious time for fraud; therefore the function does not support a belief of fraud.
During the fourth visit the session enters the suspicious time and therefore the function
supports fraud. The curve below shows this increased belief.
Figure 6.5 Belief of fraud from mass function 4
Figure 6.6 below shows the belief computed from Mass Function 5 (Place of origin of
click), according to which if the click originated from a location (country, state or city) where
the advertiser has no business then the user is likely to be a fraud. For this case study we
assume that the IP address of the user is from a region outside of the advertiser's region of
business. A click from such an IP is not natural and the advertiser will not benefit from it. The
function therefore supports a belief of fraud throughout and this value does not change at any
time.
Figure 6.6 Belief of fraud from mass function 5
Figure 6.7 below shows the belief computed from Mass Function 6 (Creation of
membership), according to which if the user creates a membership account (registers as a
member) with the advertiser, he/she is less likely to be a fraud. The user does not create any
membership or registration with the advertiser during the first three visits. However, during
the fourth visit the user does create one, and therefore the belief of this mass function in
~fraud changes from 0 to 0.25.
Figure 6.7 Belief of ~fraud from mass function 6
Figure 6.8 below shows the belief computed from Mass Function 7 (Adding a product
to shopping cart), according to which if the user adds a product to his/her shopping cart, he/she is
less likely to be a fraud. The user does not use the shopping cart during the first three visits.
However, during the fourth visit the user does add a product to it, and therefore the belief of
this mass function in ~fraud increases from 0 to 0.2.
Figure 6.8 Belief of ~fraud from mass function 7
The system combines the mass beliefs and computes a combined belief for each
click. Table 6.1 below shows the computed values of belief, plausibility and
deduction for every click.
Table 6.1 Computed belief values for Case Study 1
click no belief(fraud) plausibility(fraud) belief(~fraud) plausibility(~fraud) Deduction
1 0.45 0.99 0.015 0.55 not fraud
2 0.44 0.98 0.022 0.56 not fraud
3 0.44 0.97 0.027 0.56 not fraud
4 0.44 0.96 0.036 0.56 not fraud
5 0.43 0.96 0.043 0.57 not fraud
6 0.43 0.95 0.049 0.57 not fraud
7 0.65 0.96 0.036 0.35 suspect
8 0.64 0.96 0.041 0.36 suspect
9 0.64 0.95 0.052 0.36 suspect
10 0.63 0.94 0.06 0.37 suspect
11 0.63 0.93 0.068 0.37 suspect
12 0.7 0.94 0.059 0.3 fraud
13 0.69 0.93 0.067 0.31 fraud
14 0.69 0.93 0.072 0.31 fraud
15 0.69 0.92 0.078 0.31 fraud
16 0.68 0.91 0.092 0.32 fraud
17 0.67 0.9 0.1 0.33 fraud
18 0.59 0.81 0.19 0.41 suspect
19 0.51 0.7 0.3 0.49 suspect
20 0.5 0.69 0.31 0.5 suspect
21 0.43 0.6 0.4 0.57 not fraud
22 0.42 0.59 0.41 0.58 not fraud
23 0.42 0.58 0.42 0.58 not fraud
24 0.8 0.87 0.13 0.2 fraud
25 0.8 0.86 0.14 0.2 fraud
26 0.79 0.86 0.14 0.21 fraud
27 0.79 0.85 0.15 0.21 fraud
28 0.8 0.86 0.14 0.2 fraud
29 0.79 0.85 0.15 0.21 fraud
30 0.78 0.84 0.16 0.22 fraud
31 0.78 0.84 0.16 0.22 fraud
32 0.79 0.84 0.16 0.21 fraud
33 0.78 0.84 0.16 0.22 fraud
34 0.78 0.83 0.17 0.22 fraud
35 0.77 0.82 0.18 0.23 fraud
36 0.76 0.81 0.19 0.24 fraud
37 0.77 0.81 0.19 0.23 fraud
38 0.76 0.81 0.19 0.24 fraud
39 0.75 0.8 0.2 0.25 fraud
40 0.74 0.79 0.21 0.26 fraud
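The fraud and ~fraud columns of Table 6.1 are tied together by the standard Dempster-Shafer identity Pl(A) = 1 - Bel(~A). The following minimal Python sketch checks that identity on a few rows copied from the table. The function names and the rounding tolerance of 0.01 are assumptions made for illustration (the table prints only two significant digits), not part of the thesis implementation.

```python
# Rows copied from Table 6.1:
# (click no, belief(fraud), plausibility(fraud), belief(~fraud), plausibility(~fraud))
ROWS = [
    (1, 0.45, 0.99, 0.015, 0.55),
    (7, 0.65, 0.96, 0.036, 0.35),
    (21, 0.43, 0.60, 0.40, 0.57),
    (40, 0.74, 0.79, 0.21, 0.26),
]


def duality_holds(bel_a, pl_not_a, tol=0.01):
    """Return True if Pl(~A) = 1 - Bel(A) within a rounding tolerance.

    The tolerance is an assumption that absorbs the rounding of the
    printed table values.
    """
    return abs((1.0 - bel_a) - pl_not_a) <= tol


def check_rows(rows):
    """Check the identity in both directions for every row."""
    results = []
    for _click, bel_f, pl_f, bel_nf, pl_nf in rows:
        results.append(duality_holds(bel_f, pl_nf) and duality_holds(bel_nf, pl_f))
    return results


if __name__ == "__main__":
    for row, ok in zip(ROWS, check_rows(ROWS)):
        print(f"click {row[0]}: identity holds = {ok}")
```

The gap between belief(fraud) and plausibility(fraud) in each row is the mass that the combined evidence leaves uncommitted between fraud and ~fraud.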
Figure 6.9 below shows the combined belief of fraud obtained by combining the
beliefs from all the mass functions using Dempster's combination rule. It is interesting to note
that, taken individually, the beliefs from the mass functions vary and contradict one another.
Upon combination, however, they give a correct belief that changes to reflect the changes in
the user's activity.
Figure 6.9 Combined belief of fraud for Case Study 1
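As an illustration of how Dempster's combination rule merges two sources of evidence, the sketch below combines the ~fraud masses reported above for Mass Functions 6 and 7 (0.25 and 0.2). It is a sketch under stated assumptions rather than the thesis implementation: the dictionary representation, the names FRAUD, NOT_FRAUD and THETA, and the choice to assign all remaining mass to the whole frame (a simple support function) are assumptions, and the third mass function in the usage lines is purely hypothetical.

```python
from itertools import product

FRAUD = frozenset({"fraud"})
NOT_FRAUD = frozenset({"not_fraud"})
THETA = FRAUD | NOT_FRAUD  # whole frame of discernment: uncommitted mass


def combine(m1, m2):
    """Dempster's rule for two mass functions given as {frozenset: mass}."""
    combined = {}
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + x * y
        else:
            conflict += x * y  # product mass that falls on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    # Normalise by the non-conflicting mass.
    return {s: v / (1.0 - conflict) for s, v in combined.items()}


def belief(m, hypothesis):
    """Sum of masses of all focal sets contained in the hypothesis."""
    return sum(v for s, v in m.items() if s <= hypothesis)


def plausibility(m, hypothesis):
    """Sum of masses of all focal sets that intersect the hypothesis."""
    return sum(v for s, v in m.items() if s & hypothesis)


# Masses in ~fraud taken from the descriptions of Mass Functions 6 and 7;
# putting the remainder on THETA is an assumption.
m6 = {NOT_FRAUD: 0.25, THETA: 0.75}
m7 = {NOT_FRAUD: 0.20, THETA: 0.80}

if __name__ == "__main__":
    m67 = combine(m6, m7)
    print("belief(~fraud):", belief(m67, NOT_FRAUD))
    print("plausibility(fraud):", plausibility(m67, FRAUD))

    # Hypothetical conflicting source, for illustration only.
    m_hyp = {FRAUD: 0.5, THETA: 0.5}
    print(combine(m6, m_hyp))
```

Because m6 and m7 both support ~fraud, they never conflict and the normalisation term is 1. The hypothetical source supports fraud, so part of the product mass lands on the empty set and the remaining masses are rescaled.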
Initially the combined belief of fraud is low, and according to the threshold values in
Table 4.1 it indicates ~fraud. As the user clicks on the ad again (second visit), the belief of
fraud increases and the user moves from ~fraud to suspicious. In the third ad-visit the belief of
fraud increases further and indicates fraud. But when the user makes a non-ad visit (fourth visit),
creates a membership and uses the shopping cart, the belief drops back to ~fraud. Had the user
stopped clicking on the ad at this point, he/she would have been considered ~fraud. However,
when the user clicks on the ad again and makes an ad-visit (fifth visit), the belief increases to
support fraud. The belief spikes to a high value during the fifth visit because
this is an ad-visit after a non-ad visit. At the end the belief that the user is a fraud remains high,
and this is certified as a case of fraud. The time of click and the location of the IP also
contribute to the suspicion.
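The deduction step described above, which maps a combined belief of fraud to ~fraud, suspicious or fraud, can be sketched as a simple threshold test. The actual thresholds are those of Table 4.1, which is not reproduced in this chapter. The two cut-off values below are placeholders chosen only so that the sample rows from Table 6.1 receive the same labels, and the assumption that the deduction depends on belief(fraud) alone is likewise ours.

```python
# Placeholder cut-offs (assumptions); the thesis takes the real values from Table 4.1.
SUSPECT_THRESHOLD = 0.50
FRAUD_THRESHOLD = 0.66


def deduce(belief_fraud, suspect_at=SUSPECT_THRESHOLD, fraud_at=FRAUD_THRESHOLD):
    """Map a combined belief of fraud to one of the three deductions."""
    if belief_fraud >= fraud_at:
        return "fraud"
    if belief_fraud >= suspect_at:
        return "suspect"
    return "not fraud"


# belief(fraud) for clicks 1, 7, 12, 21 and 24 of Table 6.1
SAMPLE_BELIEFS = {1: 0.45, 7: 0.65, 12: 0.70, 21: 0.43, 24: 0.80}

if __name__ == "__main__":
    for click_no, bel in SAMPLE_BELIEFS.items():
        print(f"click {click_no}: belief(fraud)={bel:.2f} -> {deduce(bel)}")
```

Keeping the thresholds as parameters lets an advertiser widen or narrow the suspicious band without touching the belief computation.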
Case Study 2
This case study presents a scenario where a software bot is used to commit click fraud
by using different IP addresses at different times. The use of multiple IP addresses can make
detection difficult. In most approaches to click fraud detection, including ours, n different IPs
are considered n unique users. Walgampaya et al. (2011) suggest a specialized approach
to identify bot attacks. For clarity, let us now assume that each IP belongs to a
different user. Figure 6.10 below shows the activity from three different IP addresses (users)
in a timeline diagram. We use a different color scheme in this timeline diagram to
represent visits by the three different IPs and omit the time range of each session to avoid
clutter.
Figure 6.10 Timeline diagram for Case Study 2
From each IP, two ad-visits are made, of which the first has a short session and the
second has a longer session. The first two IPs are outside the advertiser's region of
business, but the third IP originates from the advertiser's area of business. The last four visits lie
in a suspicious time range.
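Since each IP is evaluated as a separate user, the incoming clicks must first be partitioned by IP address before any belief is computed. A minimal sketch of that partitioning follows. The record layout (the keys "ip" and "visit") and the sample addresses, which come from the reserved documentation ranges, are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict


def group_clicks_by_ip(clicks):
    """Treat every distinct IP as a separate user, preserving click order."""
    by_ip = defaultdict(list)
    for click in clicks:
        by_ip[click["ip"]].append(click)
    return dict(by_ip)


# Hypothetical log records: three IPs, two ad-visits each.
SAMPLE_CLICKS = [
    {"ip": "192.0.2.10", "visit": 1},
    {"ip": "198.51.100.7", "visit": 1},
    {"ip": "192.0.2.10", "visit": 2},
    {"ip": "203.0.113.5", "visit": 1},
    {"ip": "198.51.100.7", "visit": 2},
    {"ip": "203.0.113.5", "visit": 2},
]

if __name__ == "__main__":
    for ip, user_clicks in group_clicks_by_ip(SAMPLE_CLICKS).items():
        print(ip, [c["visit"] for c in user_clicks])
```

Each resulting list can then be passed to the belief computation independently, which is why a single bot spread over n addresses appears as n users.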
The system computes mass beliefs and a combined belief for each click from
every IP. Tables 6.2, 6.3 and 6.4 below show the computed values of belief, plausibility and
deduction for the first, second and third IPs respectively.
Table 6.2 Computed belief values for first IP
click no belief(fraud) plausibility(fraud) belief(~fraud) plausibility(~fraud) Deduction
1 0.45 0.99 0.015 0.55 not fraud
2 0.66 0.99 0.014 0.34 fraud
3 0.72 0.98 0.025 0.28 fraud
Table 6.3 Computed belief values for second IP
click no belief(fraud) plausibility(fraud) belief(~fraud) plausibility(~fraud) Deduction
1 0.45 0.99 0.015 0.55 not fraud
2 0.66 0.98 0.02 0.34 fraud
3 0.53 0.94 0.061 0.47 suspect
Table 6.4 Computed belief values for third IP
click no belief(fraud) plausibility(fraud) belief(~fraud) plausibility(~fraud) Deduction
1 0.078 0.89 0.11 0.92 not fraud
2 0.73 0.99 0.0089 0.27 fraud
3 0.51 0.9 0.095 0.49 suspect
Figure 6.11 below shows the computed values of belief of fraud for all visits by the
bot using the three IPs.
Figure 6.11 Combined belief values for Case Study 2
From Figure 6.11 and Tables 6.2 to 6.4 above, we can observe that our system
detects the users with the first two IPs as fraud and the user with the third IP as suspicious, even
though just two clicks occurred from each IP. The third IP was not outside the
advertiser's region of business, and hence the system concluded that it was suspicious. The
above clicks from three different IPs could be from one single bot. We evaluate them as three
different users and yet detect the fraud.
CHAPTER VII
DISCUSSION & CONCLUSIONS
This thesis proposes an approach for click fraud identification that can be used by the
advertising community to address its click fraud problems. Our approach is fundamentally
different from existing methods. First, we focus on the type of clicking activity that can
create real value for the fraudster and attempt to detect it. For this we take raw weblog data
and derive meaningful evidence for our mass function formulation. Second, the approach can
perform on-line computation to detect fraudulent clicks. Such computation adapts well to
real-time systems, which is a key advantage. Third, the approach is relatively simple and fast
because it requires only the incoming data at the advertiser's disposal. It neither requires the
advertiser to maintain and update large historical databases of various evidence nor
necessitates the learning of any patterns. This makes the approach practical for
advertisers. Fourth, the resulting beliefs also indicate the gray area of suspicious activity,
which can alert the advertiser to irregular or abnormal traffic. This is useful against click
fraud attacks that may be hard to catch but still fall in the suspicious category. Finally, the
approach extracts evidence from limited server data and can be extended easily
by adding new mass functions to represent additional evidence.
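The extensibility claim can be illustrated with a small sketch in which every piece of evidence is a function that maps a click record to a mass function, and the detector folds all of them together with Dempster's rule as each click arrives. This is an assumed design, not the thesis code: the function names, the click keys ("created_membership", "added_to_cart") and the dictionary representation are hypothetical, while the masses 0.25 and 0.2 are the ~fraud values reported for Mass Functions 6 and 7 in Case Study 1.

```python
from functools import reduce
from itertools import product

NOT_FRAUD = frozenset({"not_fraud"})
THETA = frozenset({"fraud", "not_fraud"})
VACUOUS = {THETA: 1.0}  # no evidence either way


def combine(m1, m2):
    """Dempster's rule for mass functions given as {frozenset: mass}."""
    combined, conflict = {}, 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + x * y
        else:
            conflict += x * y
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    return {s: v / (1.0 - conflict) for s, v in combined.items()}


def membership_evidence(click):
    """Support ~fraud with mass 0.25 once a membership has been created."""
    if click.get("created_membership"):
        return {NOT_FRAUD: 0.25, THETA: 0.75}
    return VACUOUS


def cart_evidence(click):
    """Support ~fraud with mass 0.2 once a product is in the shopping cart."""
    if click.get("added_to_cart"):
        return {NOT_FRAUD: 0.20, THETA: 0.80}
    return VACUOUS


def evaluate(click, evidence_functions):
    """Combine the masses produced by every registered evidence function."""
    masses = [f(click) for f in evidence_functions]
    return reduce(combine, masses, VACUOUS)


if __name__ == "__main__":
    evidence = [membership_evidence]
    evidence.append(cart_evidence)  # extending the system is one line
    click = {"created_membership": True, "added_to_cart": True}
    print(evaluate(click, evidence))
```

Only the current click record is consulted, so no historical database is needed, and adding a new kind of evidence does not require changes to the combination step.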
Our experiments on the two case studies show that the proposed approach works
correctly. Although we have not experimented on all possible scenarios of click fraud
behavior, we believe that our approach will work effectively in general for the
following reasons. First, the technique allows the combination of a set of evidence that can
contribute to click fraud detection. Second, the set of evidence considered in this thesis is, in
the worst case, near complete. Finally, if the set is not complete, the technique can be easily
extended by adding new evidence to the proposed click fraud detection system.
Future work includes more experiments to gain an understanding of the characteristics of
the proposed approach: for example, which novel click attacks the approach fails
to identify and, if such attacks are found, which other sources of data and evidence can be
used to detect them. Future work also requires experiments to see whether our approach works for
specialized bot attacks, which can be highly sophisticated and evolve continuously. These
topics are part of our ongoing and future research.
BIBLIOGRAPHY
D. Antoniou, M. Paschou, E. Sakkopoulos, E. Sourla, G. Tzimas, A. Tsakalidis, E. Viennas,
“Exposing click-fraud using a burst detection algorithm”, in Proceedings of ISCC on
Computers and Communications, IEEE Symposium, Jun 2011, pp. 1111-1116.
A. Tuzhilin, “The Lane's Gifts vs. Google Report”, 2006.
M. Kantardzic, C. Walgampaya, B. Wenerstrom, O. Lozitskiy, S. Higgins and D. King,
“Improving Click Fraud Detection by Real Time Data Fusion”, in Proceedings of the
ISSPIT on Signal Processing and Information Technology, IEEE International
Symposium, Dec. 2008, pp. 69-74.
G. Shafer, “A Mathematical Theory of Evidence”, Princeton University Press, 1976.
T. Denoeux, “A K-nearest Neighbour Classification Rule based on Dempster-Shafer
Theory”, IEEE Transactions on Systems, Man and Cybernetics, 25 (1995) 804-813.
F. Dong, S. M. Shatz, H. Xu, “Reasoning Under Uncertainty For Shill Detection In Online
Auctions using Dempster Shafer Theory”, International Journal of Software Engineering
and Knowledge Engineering, 2010, pp. 943-973.
K. Sentz, S. Ferson, “Combination of Evidence in Dempster-Shafer Theory”,
SAND 2002-0835, April 2002.
N. Kshetri, “The Economics of Click Fraud”, Security and Privacy, IEEE, May-June 2010,
pp. 45-53.
H. Haddadi, “Fighting Online Click-Fraud Using Bluff Ads”, ACM SIGCOMM Computer
Communication Review, v.40 n.2, April 2010 [doi>10.1145/1764873.1764877]
V. Anupam, A. Mayer, K. Nissim, B. Pinkas, and M. K. Reiter, “On the Security of
pay-per-click and other web advertising schemes”, Computer Networks, 31(11-16): 1999,
1091-1100.
M. Kantardzic, C. Walgampaya, and H. Jamali, “Click fraud prevention in pay-per-click
model: Learning through multimodel evidence fusion”, in Proceedings of ICMWI of
Machine and Web Intelligence, 2010, pp. 20-27.
C. Walgampaya, and M. Kantardzic, “Cracking the Smart ClickBot”, in Proceedings of Web
Systems Evolution on 13th IEEE Symposium, 2011, pp. 125-134.
B. J. Jansen, “Click Fraud”, IEEE Computer, vol. 40, no. 7, Jul 2007, pp. 85-86.
X. Li, Y. Liu, and D. Zeng, “Publisher click fraud in the pay-per-click advertising market:
Incentives and consequences”, in Proceeding of Intelligence and Security Informatics of
IEEE International Conference, 2011, pp. 207-209.
S. Majumdar, D. Kulkarni, and C. V. Ravishankar, “Addressing Click Fraud in Content
Delivery Systems”, in Proceedings of INFOCOM 2007 of 26th IEEE International
Conference, May 2007, pp. 240-248.
A. Metwally, D. Agarwal, A. Abbadi, and Q. Zheng, “On Hit Inflation and Detection in
Streams of Web Advertising Networks”, in Proceedings of Distributed Computing
Systems on ICDCS, Jun 2007, pp. 52-52.
IAB, and PwC, “IAB Internet Advertising Revenue Report, 2010”, First Half-Year Results,
New York, U.S., 2011.
GeekWire Magazine, “Newspapers take it on the chin as online ad revenue falls into the
hands of a few tech giants”, Mar 2012,
http://www.geekwire.com/2012/newspapers-chin-online-ad-revenue-falls-hands-tech-giants/
Google Earnings Report, “Google Announces Second Quarter 2011 Financial Results”, Jul
2011, http://investor.google.com/earnings/2011/Q2_google_earnings.html

English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 

Recently uploaded (20)

Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 

Automatic detection of click fraud in online advertisements

  • 5. ABSTRACT Increasing advancement, access and availability of Internet technology have intensified the growth of Internet users over the last decade. This has made online advertising a popular venue for many companies to market their products and services. Today, online advertising is one of the most important sources of revenue for many large enterprises. In online advertising, an advertiser pays a broker (e.g., Google, Yahoo), who normally has a search engine, to post its advertisement on any appropriate publisher site. The publisher earns revenue from the broker for each click on the advertisement posted on its site, while the advertiser is charged for that click. Thus, an excessive number of clicks can quickly dry up the funds of a rival company and drive it out of the competing advertisement. At the same time, each click adds revenue to the publisher. This motivates click fraud, which refers to malicious acts that create fraudulent clicks with the intent to increase revenue or drive away competitors, without real interest in the products or services being advertised. Identifying click fraud is a difficult problem because of the dynamic nature of click behaviors, some of which are generated by humans and some by automated software called bots. There has been previous work attempting to identify click fraud using various techniques, but these tend to be limited by the types of data they use, the way the data are processed, or assumptions that are not always achievable. This thesis presents an approach to automatically detecting click fraud in online advertising. The approach uses a mathematical theory of evidence to estimate the likelihood that a click is fraudulent or genuine, using web log data of a user's activities on the advertiser's website. 
One advantage of the proposed approach is the fact that the likelihood can be computed for each incoming click and thus it gives an online computation of the belief that fits well with the dynamic behaviors of users. The thesis describes the approach and evaluates its validity using two real-world case studies. We believe the approach is general in that it can be applied to any scenario.
  • 6. LIST OF TABLES
    4.1 Fraud certification rules
    5.1 Sample log data
    5.2 Input from server log
    5.3 Coefficient values
    5.4 Mass function beliefs for illustrated example
    6.1 Computed belief values for Case Study 1
    6.2 Computed belief values for first IP
    6.3 Computed belief values for second IP
    6.4 Computed belief values for third IP
  • 7. LIST OF FIGURES
    1.1 % change of revenue for advertising media (GeekWire, 2012)
    1.2 Google's revenue source distribution in 2011 (Google Earnings Report, 2011)
    1.3 Scenario before click fraud occurred
    1.4 Scenario after click fraud occurred
    4.1 Click fraud detection framework using D-S theory
    5.1 Legends for timeline diagram
    5.2 Timeline diagram sample data in Table 5.1
    5.3 Timeline diagram for Table 5.2
    5.4 Combined belief of fraud for input in Figure 5.3
    6.1 Timeline input for Case Study 1
    6.2 Belief of fraud from mass function 1
    6.3 Belief of ~fraud from mass function 2
    6.4 Belief of ~fraud from mass function 3
    6.5 Belief of fraud from mass function 4
    6.6 Belief of fraud from mass function 5
    6.7 Belief of ~fraud from mass function 6
    6.8 Belief of ~fraud from mass function 7
    6.9 Combined belief of fraud for Case Study 1
    6.10 Timeline diagram for Case Study 2
    6.11 Combined belief values for Case Study 2
  • 8. CHAPTER I MOTIVATION The Internet has seen tremendous growth in the last decade and, according to current statistics from the World Bank, nearly 32% of the world population currently uses the Internet. This has made online advertising not only lucrative but also an important medium for businesses to reach out to a large consumer base (Jansen, 2007). Figure 1.1 below shows that while most other advertising media are losing market share, online advertisements are growing tremendously. [Figure 1.1: % change of revenue for advertising media (GeekWire, 2012)] Not only do online ads benefit advertisers, they are also a rich source of revenue for publishers who display ads on their websites and for brokers like Google, Yahoo, MSN and Ask.com, who provide the technical platform for online advertisements. Thus, online ads drive the Internet economy and are the lifeblood of its survival and growth. Figure 1.2 below shows that in 2011, 97% of Google's revenue came from online ads alone.
  • 9. [Figure 1.2: Google's revenue source distribution in 2011 (Google Earnings Report, 2011)] Online advertising is, however, not free of issues, and click fraud is a major problem that can impact its growth. Click fraud is a type of crime in online advertising in which a user clicks on an ad not out of genuine interest in what the advertiser has to offer but with the intent either to generate illegal revenue from clicks (for the publisher that hosts the advertisement) or to cause monetary loss to the advertiser. It hurts advertisers and may deter them from investing in online ads. Many advertising mechanisms exist, including the pay-per-click (PPC) scheme, which accounts for about 57 percent of all Internet ads, with about US$16 billion in revenue in 2010 (Tuzhilin, 2006; IAB and PwC, 2010). A popular example of a PPC scheme is Google AdSense. In PPC, brokers like Google place targeted ads in dedicated ad spaces on publisher websites. Brokers get paid by advertisers for every click on an ad, and they share the income generated this way with the publishers. While PPC is a great model for online advertising, it suffers the most from the problem of click fraud (Tuzhilin, 2006). Most of the publishers in PPC programs are small-time blog owners, and they are the source of the majority of click fraud. Competitors of an advertiser can also commit click fraud in order to reduce competition, which may indirectly benefit their business. To commit click fraud, publishers or
  • 10. competitors can click on the ad themselves, ask friends to do it, use an Internet bot script that repeatedly clicks on the ads, or hire people to do it for them (Kshetri, 2010). Such clicks are of no value to the advertisers, as the clicker has no intent to buy their product or service, use information or carry out any transaction useful to the advertiser's business (Jansen, 2007). The brokers too have an incentive not to filter out all click fraud, as doing so would reduce their revenues. They can contribute to click fraud by passively letting the fraud happen and not taking adequate measures to stop it. The lesser known brokers have a greater incentive to do so (Kshetri, 2010). Multiple lawsuits filed by various advertisers against Google and Yahoo for not taking adequate steps to curb click fraud are an indication of brokers' inability or unwillingness in this regard. Figure 1.3 below shows a scenario before click fraud, when the advertiser's money reserve (advertising budget) is full. The publisher, broker and competitors have not generated any illegal revenue from click fraud. [Figure 1.3: Scenario before click fraud occurred] Figure 1.4 below shows the scenario after click fraud, which caused the advertiser's budget to deplete completely and the broker's, publisher's and competitor's illegal profits to increase.
  • 11. [Figure 1.4: Scenario after click fraud occurred] Reputable brokers like Google actively try to contain click fraud by filtering out fraudulent clicks and permanently blocking publishers who are found to be involved (Tuzhilin, 2006; Kshetri, 2010). They have access to a user's search activities and to the data they collect from the publisher, which they use to find patterns in the user's behavior. The idea is to estimate a user's intention behind the click in order to rate the click as genuine or fraudulent. However, they may not have access to data about a user's actions on the advertiser's website, where the user is taken following the click. This is because the advertiser may choose to share limited or no data at all with the broker due to its own privacy concerns (Tuzhilin, 2006). Brokers provide aggregate statistics to advertisers and do not share details on which clicks they found fraudulent, in order to avoid exposing their detection mechanisms to fraudsters. Thus advertisers are not adequately informed, and there is a strong case for advertisers to have their own click fraud detection system in place. This way advertisers can protect themselves not only from fraudulent publishers and competitors but also from brokers who either fail to detect fraud or willingly let it occur. Such a system can help them estimate the extent of fraud in their ad campaign and pay the brokers for genuine clicks only. It is important to note here that brokers have access to much larger sources of information than advertisers. Advertisers must be able to detect click fraud with the limited data they have about users' actions on their website.
  • 12. Click fraud identification is a difficult problem to solve. Fraud mechanisms evolve and continually change over time. The fraud can be carried out both by humans and by software bots with distinctive characteristic behaviors. It is difficult to track users by their IP addresses, as IPs are generally dynamic in that the IP address of the same user may change at any time. A software bot can also use several different IP addresses at a time to carry out click attacks. Finally, the advertiser has access only to data from its own server, which gives very limited information about a user's behavior. Contributions This thesis presents an approach to automatically detecting click fraud at the ad-site. Advertisers can use the proposed approach to detect click fraud against their own ads. Our approach employs the mathematical theory of evidence called Dempster-Shafer (DS) Theory (Shafer, 1976; Denoeux, 1995; Dong et al., 2010; Sentz et al., 2002) for evidence-based reasoning, to estimate the likelihood of a click being fraudulent based on the evidence gathered from the weblog data available to the advertiser. The proposed approach can also be useful to brokers for computing correct charges to their clients, if the data are available to them. Our approach is based on a widely used theory that allows the estimate of the likelihood to be computed as each incoming click arrives. That is, it offers an online computation. Thus, after each click from a given IP, we can estimate our belief as to whether the click is suspected to be fraudulent or not. 
In summary, the contributions of this thesis include: (1) an approach for automatically detecting or identifying click fraud, (2) a framework for reasoning about click fraud that integrates relevant information extracted from weblog data with evidence-based reasoning to update the click fraud analysis in real time, and (3) the core elements of the proposed approach, which consist of a set of evidences required for detecting click fraud. These evidences will be formulated in terms of functions, called mass functions, used in the DS theory. The rest of this thesis is organized as follows: Chapter II presents background work on click fraud identification. Chapter III gives preliminaries including terms and relevant concepts, the problem formulation and its assumptions, and the Dempster-Shafer Theory along with its fundamental elements. Chapter IV presents our approach to the problem and the details of the core contribution on formulating mass functions for click fraud identification
  • 13. problem. Chapter V explains the data set used for the approach and gives an illustrative example. Chapter VI evaluates the proposed approach with experiments on synthetic data generated on two case studies. Chapter VII gives concluding remarks and possible extensions for future work.
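The online belief update described under Contributions can be illustrated with Dempster's rule of combination, the standard way of merging two mass functions in D-S theory. The following is only a minimal sketch, not the implementation used in this thesis: the two-element frame {fraud, ~fraud}, the function name `combine`, and all mass values are assumptions chosen for illustration.

```python
from itertools import product

# Focal elements over the assumed frame {fraud, ~fraud}.
FRAUD = frozenset({"fraud"})
NOT_FRAUD = frozenset({"~fraud"})
EITHER = FRAUD | NOT_FRAUD  # the whole frame, i.e. "cannot tell"


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function is a dict mapping a frozenset (focal element)
    to its mass; the masses in each dict are expected to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + x * y
        else:
            # Mass that the two sources assign to contradictory conclusions.
            conflict += x * y
    if conflict >= 1.0:
        raise ValueError("total conflict: the evidence cannot be combined")
    # Renormalize so that the remaining masses sum to 1.
    return {s: v / (1.0 - conflict) for s, v in combined.items()}


# Hypothetical masses from two pieces of evidence about a single click.
m_clicks = {FRAUD: 0.6, EITHER: 0.4}      # e.g. many clicks on the ad
m_browse = {NOT_FRAUD: 0.3, EITHER: 0.7}  # e.g. some time spent browsing

belief = combine(m_clicks, m_browse)
for focal, mass in belief.items():
    print(sorted(focal), round(mass, 4))
```

Because the rule is associative, the result of combining the evidence seen so far can itself be combined with the masses derived from the next click, which is what makes a per-click, online update possible.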
  • 14. CHAPTER II BACKGROUND WORK Many different types of solutions have been proposed to counter click fraud. (Tuzhilin, 2006) suggested a model in which advertisers pay for a click only if it leads to a conversion event such as a purchase. Such a model is economically unviable for publishers and so is not available to advertisers. Another method proposed (Tuzhilin, 2006) is the use of data mining models based on past data to classify clicks as fraud or ~fraud (not fraud). Such a solution may suffer from high inaccuracy, as fraud mechanisms evolve and change over time. It assumes that past clicking behavior is indicative of future behavior. It also requires a large number of past clicks that can be reliably classified as valid or invalid. This is a batch process and not an online one. Moreover, such datasets are at the disposal of brokers only, and other involved parties such as advertisers cannot use them. The author clearly states these limitations. (Haddadi, 2010) discusses the use of bluff ads for detecting sources of click fraud, such as bots or a poorly trained human workforce employed to carry out fraud. The display text of these ads is unrelated to the context of the user to whom they are displayed. For example, a user in Australia should not ideally be shown an ad for a special offer on pizza in New York City. A click by the user is unnatural in this case and indicates that the user is a bot or a human involved in fraud. However, careful humans and sophisticated bots can still beat it. This is also a 'broker-centric' model: it can be implemented only by brokers, and advertisers need to trust brokers completely in this. 
Recently (Antoniou et al., 2011) proposed a burst detection algorithm to detect high frequency of user activity in short time periods to detect various types of click frauds including voting click fraud, frauds related to blog post popularity, search engine retaliation and advertising click fraud. While this is a good general solution for all types of click frauds mentioned, it does not cater to the nuances of advertisement click fraud, as a simple detection of bursts may not be enough to differentiate between valid and invalid clicks. More
  • 15. factors/evidences need to be taken into consideration before we can conclusively label a click as fraudulent. (Walgampaya et al., 2011) proposed a method to detect bot scripts involved in click fraud using Bayesian classifiers. The methods above are either not sufficient to combat the problem of click fraud individually or require broker involvement of some kind. The involvement may be in the form of policy changes by brokers or sharing of the data at their disposal, and brokers have been unwilling to do either. As a result, these methods cannot be used by advertisers to actively detect fraud at their site. (Kantardzic et al., 2010) proposed a real-time click fraud detection and prevention system. It uses D-S Theory for multilevel data fusion of evidences from different sources such as IP address, referrer and country. However, it relies on data from both the client (advertiser) and the server (broker). An advertiser does not have access to the broker's data, and hence this system can be used by brokers only. Our approach equips advertisers with a fraud detection system that uses only the data at their disposal. The evidences that they extract from server data to formulate mass functions are very basic, whereas some of our rules are sophisticated and, to the best of our knowledge, novel. We do not maintain any historical databases, and we exploit the observation from (Antoniou et al., 2011) that fraud happens in bursts. Our approach is simple, yet our set of rules is powerful and comprehensive, making it difficult for fraudsters to carry out any viable attack on the advertiser. For example, rules 1, 2, 4 and 5 make it difficult for a bot to generate clicks without detection.
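The idea that fraud happens in bursts can be made concrete with a simple sliding-window counter. The sketch below is a generic illustration only and is not the algorithm of (Antoniou et al., 2011) nor the method of this thesis: the function name `find_bursts`, the one-minute window, the threshold of five clicks and the timestamps are all hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


def find_bursts(timestamps, window=timedelta(minutes=1), threshold=5):
    """Return the timestamps at which at least `threshold` events
    (including the current one) fall within the trailing `window`."""
    recent = deque()  # events still inside the trailing window
    flagged = []
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop events that are older than the window.
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold:
            flagged.append(ts)
    return flagged


# Hypothetical click times from one IP: six clicks 10 seconds apart,
# followed by a single click half an hour later.
start = datetime(2012, 8, 1, 12, 0, 0)
clicks = [start + timedelta(seconds=10 * i) for i in range(6)]
clicks.append(start + timedelta(minutes=30))

bursts = find_bursts(clicks)
print(bursts)
```

As the background discussion notes, a burst on its own is weak evidence, so a counter like this would at most supply one input among several.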
CHAPTER III
PRELIMINARIES

This section outlines the foundation for the proposed method of click fraud detection and the assumptions we have made.

Terms

We now define the terms used in this thesis.

• Advertiser is a seller with an e-commerce website who pays for ads to be displayed on other sites. These ads may create more traffic and revenue for the advertiser, since a user who clicks on an ad is directed to the advertiser's site.
• Ad-site is the advertiser's website. A user can visit the ad-site by several means, such as using an Internet search, typing the advertiser's URL into the browser, bookmarking the site and clicking the bookmark later, or clicking on an ad on a publisher site.
• Ad-visit is a visit of a user to the ad-site made by clicking an ad. A non-ad visit is a visit made by any means other than clicking an ad.
• Session is a continuous period of time during which a visitor navigates within the advertiser's site. In other words, it is the duration for which a user maintains an active HTTP connection with the server. In a session the user may be browsing, reading, watching videos, filling out forms, registering for membership, adding products to a shopping cart, purchasing products, etc.
• Publishers are the websites that host ads for advertisers and get paid for the clicks on those ads. Common examples are blogs and news sites.
• Broker is an intermediary between advertiser and publisher that provides the technical platform for online advertisements. Brokers are mostly Internet search engine companies such as Google, Yahoo, AOL and Ask.com, and they use their search technology to serve targeted ads on publisher sites based on website content, geographical location, etc.
• Pay Per Click (PPC) is an online advertising model in which publishers display ads on their websites and get paid for each click on those ads. Google runs a PPC program called Adsense.
• Gclid is a unique ID that is attached to the server log for every click made on a Google ad. It helps identify unique visitors to the best approximation, as Google uses various parameters to make this identification.

Problem Statement

Given weblog data collected at the advertiser's site over a period of time, find all occurrences of click fraud. For every such occurrence, identify its owner by the corresponding IP address. The advertiser's web server log contains information such as the IP address, date and time, Gclid number (defined above), requested page and referrer for every click.

Assumptions

Due to the dynamic nature of the IP addresses associated with each user, it is necessary to make the following assumptions in order to solve the above problem in practice.

1) IP address assignment changes over time, and a user may be assigned different IP addresses while surfing the Internet. A user (either human or bot) may try to carry out fraudulent clicks using as many different IPs as possible in order to avoid detection. Therefore it is not feasible to use long-duration data for an IP. Instead we use a short time window W. In this work, W is set to 30 minutes, during which we assume that the IP address of a user does not change. This duration is typical and reasonable, though it is quite different from what is used in other existing work. The probability that a user with a particular IP clicks on an ad and that the same IP is then assigned to another user who also clicks on the same ad within the window is negligibly low. Our approach is, however, not limited to this window size, and one can pick a size that suits the application.
2) A fraudster has an incentive to click on an ad multiple times but no intention of actually purchasing a product or service. Fraudsters make money by clicking on ads, whereas making purchases costs them money, which works against their end goal. Thus, if a user makes a purchase at the ad-site, we assume that the user is not involved in fraud. In some circumstances (for example, in order to confuse detection systems) a fraudster may make a purchase. However, such an action no longer helps the fraudster once he moves out of the time window W.

3) Fraudulent clicks with large time gaps between consecutive clicks do not deliver any substantial monetary gain to the fraudster. The number of clicks has to be large, with short gaps between them, and therefore a burst of clicks may indicate click fraud (Antoniou et al., 2011).

4) Since HTTP is a stateless protocol, it is difficult to estimate the session duration accurately. We sum the time differences between consecutive HTTP requests by the user to get the total session time, but there is no way to compute the exact time spent viewing the last page, since no request follows it. We therefore assume that 30 seconds are spent on the last page. Our approach is not limited by this assumption, and any other duration can be assumed for the last page view.

5) We modeled our approach around Google's Adsense, as it is the most widely accepted Pay Per Click program. We use the gclid, a unique ID attached by Google to the advertiser's web server logs for every click made on the advertiser's ads. It follows Google's definition of unique visits. Google claims that it uses various parameters to assign unique gclids, and third-party click fraud (CF) detection engines that use the gclid are more accurate than others. We therefore take data already filtered by the broker (Google) and apply our own approach for further filtration. However, our approach can be modeled around any other PPC program. In that case, clicks made on advertisements could be identified by creating unique landing pages for the ads, so that visits made from ads can be separated out by looking at the server logs.
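Assumptions 1 and 4 translate into a small amount of arithmetic on request timestamps. The following is a minimal sketch, assuming the timestamps of one user's requests are already available as Python `datetime` objects; the function names `session_seconds` and `within_window` are illustrative and do not come from the thesis.

```python
from datetime import datetime, timedelta

LAST_PAGE_SECONDS = 30  # assumed dwell time on the final page (Assumption 4)


def session_seconds(request_times, last_page_seconds=LAST_PAGE_SECONDS):
    """Estimate one session's duration from its request timestamps.

    The gaps between consecutive requests are summed, and a fixed
    allowance is added for the last page, which has no following request.
    """
    if not request_times:
        return 0.0
    ordered = sorted(request_times)
    gaps = sum((b - a).total_seconds() for a, b in zip(ordered, ordered[1:]))
    return gaps + last_page_seconds


def within_window(request_times, now, window=timedelta(minutes=30)):
    """Keep only the requests inside the window W that ends at `now`."""
    return [t for t in request_times if now - window <= t <= now]
```

For a hypothetical session with requests at 0, 3 and 5 minutes, the estimate is 180 + 120 + 30 = 330 seconds. Both the 30-second allowance and the 30-minute window are parameters, in line with the remark that the approach is not tied to these particular values.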
Mathematical Theory of Evidence

Efforts to identify click fraud have mostly concentrated on identifying a single characteristic of user behavior, which is quite different from our approach. To provide the theoretical background of our approach, we describe the mathematical theory of evidence, also known as the Dempster-Shafer (D-S) Theory (Shafer, 1976; Denoeux, 1995; Dong et al., 2010; Sentz et al., 2002). It is related to traditional probability and set theory but is not the same. The D-S Theory allows probability to be assigned to a set of atomic elements rather than to a single atomic element, and it can represent not only the likelihood of occurrence of an event but also the uncertainty associated with it. Using the D-S Theory, evidence coming from multiple sources with varying levels of certainty can be effectively combined online. Its ease of use, combined with its wide and successful application in many areas, makes it an ideal candidate for click fraud detection, which requires a complex model with several evidences.

In our problem domain a user can either be a fraud or not a fraud (~fraud). We therefore have a finite set of hypotheses (atomic elements) U = {fraud, ~fraud}. The power set of U is {∅, {fraud}, {~fraud}, U}. Each of the four elements of the power set is assigned a belief between 0 and 1. {fraud} carries the belief that the user is a fraud; {~fraud} carries the belief that the user is not a fraud; U = {fraud, ~fraud} carries the belief that cannot be committed to either hypothesis and thus represents the uncertainty; ∅ is the empty (null) set, which represents a contradiction, so its belief is always 0. The D-S Theory assigns belief to all the elements of the power set of U rather than only to the mutually exclusive events of U. The sum of all belief values over the power set of U is 1.

Mass Functions

A degree of belief is represented by a belief function called the mass function m, which provides a probability assignment to any A ⊆ U, where m(∅) = 0 and m(fraud) + m(~fraud) + m(U) = 1:

$$m(\emptyset) = 0, \qquad m(\text{fraud}),\; m({\sim}\text{fraud}),\; m(U) \in [0, 1], \qquad \sum_{A \subseteq U} m(A) = 1$$
The mass m(A) represents the belief committed exactly to A. For example, m(U) with U = {fraud, ~fraud} represents the belief that a user may be either fraud or ~fraud, without favoring either hypothesis. A situation in which m({fraud, ~fraud}) = 1 occurs when a piece of evidence provides no certainty at all, and this cannot be adequately represented with traditional probability theory. A belief mass is therefore different from a probability. As seen above, masses are assigned to sets rather than to mutually exclusive singletons (Shafer, 1976; Sentz et al., 2002). When masses are assigned only to mutually exclusive events, i.e. either fraud or ~fraud, such that m(U) is always 0, the D-S Theory becomes the same as probability theory.

For every mass function there are associated functions of belief and plausibility. The degree of belief in A, bel(A), and the plausibility of A, pl(A), are defined respectively as

$$\mathrm{bel}(A) = \sum_{X \subseteq A} m(X)$$

$$\mathrm{pl}(A) = 1 - \mathrm{bel}({\sim}A) = \sum_{X \cap A \neq \emptyset} m(X)$$

For example, bel({fraud}) = m({fraud}) + m(∅) = m({fraud}). In general, bel(A) = m(A) for any singleton set A ⊆ U, and in such a case the computation of bel is greatly simplified. However, bel(A) is not necessarily the same as m(A) when A is not a singleton set. The functions m, bel and pl can be derived from one another. Belief and probability are nevertheless different measures. In this thesis, we use the terms likelihood and belief synonymously.

For our approach we use multiple evidences, each of which contributes to either a belief or a disbelief that a user is a fraud, depending on the nature of the evidence and its quantified value (Dong et al., 2010). For example, if a user clicks many times on an ad, this becomes evidence that the user is a fraud. Each evidence can support either fraud or ~fraud for a user, but not both. If an evidence supports fraud, the rest of the belief from that evidence can be committed only to the universal set U, which quantifies the uncertainty.
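The definitions of bel and pl can be checked on a small example. The sketch below assumes a mass function is stored as a Python dictionary keyed by `frozenset` focal elements; this representation and the numeric values are illustrative choices, not taken from the thesis.

```python
FRAUD = frozenset({"fraud"})
NOT_FRAUD = frozenset({"~fraud"})
U = FRAUD | NOT_FRAUD  # the universal set {fraud, ~fraud}


def bel(mass, a):
    """Belief: total mass of the non-empty subsets of `a`."""
    return sum(v for x, v in mass.items() if x and x <= a)


def pl(mass, a):
    """Plausibility: total mass of the sets that intersect `a`."""
    return sum(v for x, v in mass.items() if x & a)


# Hypothetical mass function: 0.6 on fraud, nothing on ~fraud, 0.4 uncommitted.
m = {FRAUD: 0.6, NOT_FRAUD: 0.0, U: 0.4}

print(bel(m, FRAUD), pl(m, FRAUD))          # belief and plausibility of fraud
print(bel(m, NOT_FRAUD), pl(m, NOT_FRAUD))  # belief and plausibility of ~fraud
```

Here bel({fraud}) equals m({fraud}) because {fraud} is a singleton, while pl({fraud}) also counts the uncommitted mass on U, so that pl({fraud}) = 1 - bel({~fraud}).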
If evidence i supports that the user is fraud, then the mass functions for the evidence are defined as follows:

mi(fraud) = α * f
mi(~fraud) = 0
mi(U) = 1 - α * f

where 0 < α < 1 is an empirically derived value that signifies the strength of the evidence in supporting that the user is fraud, and f, with 0 ≤ f ≤ 1, is a function used to quantify the evidence.

If evidence i supports that the user is ~fraud, then the mass functions for the evidence are defined as follows:

mi(fraud) = 0
mi(~fraud) = β * g
mi(U) = 1 - β * g

where 0 < β < 1 is an empirically derived value that signifies the strength of the evidence in supporting that the user is ~fraud, and g, with 0 ≤ g ≤ 1, is a function used to quantify the evidence.

Combination Rule

Since we have multiple mass functions, we need a way to combine them. Mass functions can be combined using various rules, including the popular Dempster's Rule of Combination, which is a generalization of Bayes' rule. For X, A, B ⊆ U, the combination of mass functions m1 and m2, denoted by m1 ⊕ m2 (or m1,2), is defined as follows:

$$m_{1,2}(X) = (m_1 \oplus m_2)(X) = \frac{\sum_{A \cap B = X} m_1(A)\, m_2(B)}{1 - K}, \qquad X \neq \emptyset$$

where

$$K = \sum_{A \cap B = \emptyset} m_1(A)\, m_2(B) \qquad \text{and} \qquad (m_1 \oplus m_2)(\emptyset) = 0.$$

The combination rule can be applied in pairs repeatedly to obtain a combination of multiple mass functions. The rule strongly emphasizes the agreement between multiple sources of evidence and ignores the disagreement through the use of the normalization factor 1 - K.
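The combination rule above can be written directly as a loop over pairs of focal elements. The following is a minimal sketch, assuming mass functions are dictionaries keyed by `frozenset`; the function name `dempster_combine` and the two example mass functions are illustrative only.

```python
from itertools import product

FRAUD = frozenset({"fraud"})
NOT_FRAUD = frozenset({"~fraud"})
U = FRAUD | NOT_FRAUD


def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule."""
    combined = {}
    conflict = 0.0  # K: mass of the pairs whose intersection is empty
    for (a, va), (b, vb) in product(m1.items(), m2.items()):
        x = a & b
        if x:
            combined[x] = combined.get(x, 0.0) + va * vb
        else:
            conflict += va * vb
    if conflict >= 1.0:
        raise ValueError("total conflict: the rule is undefined")
    # Normalize by 1 - K so that the combined masses sum to 1.
    return {x: v / (1.0 - conflict) for x, v in combined.items()}


# Two hypothetical pieces of evidence, one for fraud and one for ~fraud.
e1 = {FRAUD: 0.5, U: 0.5}
e2 = {NOT_FRAUD: 0.4, U: 0.6}
m12 = dempster_combine(e1, e2)
```

In this example the conflicting mass is K = 0.5 * 0.4 = 0.2, so the remaining products are divided by 0.8. Because the result is again a mass function, the same function can be applied repeatedly to fold in further evidence.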
CHAPTER IV
PROPOSED DEMPSTER SHAFER THEORY FOR CLICK FRAUD DETECTION

We propose an approach that advertisers can use to detect fraud in real time using only the data available to them, without any data from the broker, which is either impossible to acquire or very limited when available. This section describes our approach in detail, along with the mass functions that have been developed to compute the belief of fraud.

The Core Element of Dempster Shafer Theory

Figure 4.1 below shows the framework elements of click fraud detection using our approach. A user's clicking activity is captured by the advertiser's web server logs. The server logs are updated in real time as users request pages from the server, and the click fraud detection system reads this data as soon as it is logged. For the latest click that the system is processing, it finds the IP address and reads all the log data from that IP within the window W. This data is pre-processed to extract meaningful evidences, which are then formulated into various mass functions. Each mass function computes a belief of fraud which is unique and can conflict with the beliefs from other mass functions. These beliefs are combined using Dempster's combination rule. The combined belief is categorized as fraud, ~fraud or suspicious using a set of threshold values. This process is repeated for every new user click.

Figure 4.1 Click fraud detection framework using D-S theory

Mass functions for Click Fraud Detection

Using the user behavior recorded in the weblogs at the advertiser's site as evidence to reason about click fraud, we formulate a mass function for each core evidence. These evidences are contributed by various factors such as the number of clicks on the ad, the time spent browsing the advertiser's site, etc. The mass functions are used to compute a belief value for the click being fraud or not fraud (~fraud). The belief values from different evidences are combined as each of them occurs in the data. A mass function contributes to either a belief or a disbelief that a user is a fraud, depending on its nature and its quantified value. The following gives the detailed formulae of the mass functions for each evidence. The values αi and βi for evidence i represent the strength of the evidence in the formulation of the mass function mi. In practice these values are derived empirically.

Evidence 1: Number of clicks on the ad

If the number of clicks on the ad from an IP in the time window W (30 minutes) is high, then the likelihood of the user being a fraud is high. Fraudsters have a natural incentive to make more money by clicking the ads many times in a short period of time (short bursts). The more they click, the more illegal revenue they generate for themselves. The Basic Mass Assignment (BMA) for this evidence always supports a belief of fraud, whose value depends on the number of clicks. Let n be the number of clicks in the window W. The likelihood of fraud is 1 - 1/n, which gives:

m1(fraud) = α1 * (1 - 1/n)   (1)
m1(~fraud) = 0   (2)
m1(U) = 1 - m1(fraud) = 1 - α1 * (1 - 1/n)   (3)

Evidence 2: Time spent in browsing

If the time spent by the user at the ad-site is high, then he/she is less likely to be a fraud. A genuine user clicks the ad out of a real interest in the advertiser's content (advertised product, service or website content) and is likely to spend more time exploring the ad-site than a fraudster. Fraudsters are less likely to do so, both because they are not interested in the product and because leaving quickly lets them make more clicks in a given time. The BMA for this rule always supports a belief of ~fraud, whose value depends on the time spent at the ad-site. As a user continues to spend more time at the ad-site, the belief that he is ~fraud increases. Let t be the time spent by the user in all visits in the time window W (30 minutes), where 0 < t ≤ 30 minutes. The likelihood of ~fraud increases as t increases.

m2(fraud) = 0   (4)
m2(~fraud) = β2 * (t/W)   (5)
m2(U) = 1 - m2(~fraud) = 1 - β2 * (t/W)   (6)

Evidence 3: Ad-visit after non-ad visit

If a user clicks on an ad after a non-ad visit, then he is likely to be a fraud. Once a user has made a non-ad visit to the ad-site, the user evidently knows how to reach the site without clicking on the ad. Clicking on an ad after that seems unnecessary and indicates a likelihood of fraud. The BMA for this rule can support a belief of either fraud or ~fraud. Let x be the likelihood of fraud. If the user has visited only via ads, then x = 0.1 (little likelihood of fraud). If the user has visited via ads after visiting normally, then x = 1.0 (high likelihood of fraud). Thus the mass functions when the evidence supports fraud are as follows:
m3(fraud) = α3 * x   (7)
m3(~fraud) = 0   (8)
m3(U) = 1 - m3(fraud) = 1 - α3 * x   (9)

Let y = 1.0 be the likelihood of ~fraud if the user does not have an ad-visit after a non-ad visit. The mass functions if the evidence supports ~fraud are as follows:

m3(fraud) = 0   (10)
m3(~fraud) = β3 * y   (11)
m3(U) = 1 - m3(~fraud) = 1 - β3 * y   (12)

Evidence 4: Time of Click

If the click occurred during the most suspicious time (the most active period of fraud activity), then the user is likely to be a fraud. Fraudsters are generally known to be active during certain hours of the day, and a click at such hours can be indicative of fraudulent activity. We use Universal Time to determine this rather than any particular time zone. If a click happens within the suspicious time slot, then the click is likely to be fraud, otherwise ~fraud. The BMA for this rule supports a belief of fraud if the time of click lies in the suspicious time range. Otherwise it supports a belief of ~fraud. Let Tstart and Tend be the start and end of the suspicious time range, and let t be the time of click. Let x = 1.0 be the likelihood of fraud if t lies between Tstart and Tend. The mass functions when the evidence supports fraud are as follows:

m4(fraud) = α4 * x   (13)
m4(~fraud) = 0   (14)
m4(U) = 1 - m4(fraud) = 1 - α4 * x   (15)

Let y = 1.0 be the likelihood of ~fraud if t does not lie between Tstart and Tend. The mass functions when the evidence supports ~fraud are as follows:
m4(fraud) = 0   (16)
m4(~fraud) = β4 * y   (17)
m4(U) = 1 - m4(~fraud) = 1 - β4 * y   (18)

Evidence 5: Place of origin of click

If the click originated from a location (country, state or city) where the advertiser has no business, then the user is likely to be a fraud. Ads are often targeted at the audience of a particular region where the advertiser has a reach or the rights to sell its products. This is especially true for small and medium-sized businesses that are restricted to a country or city. Even large advertisers mostly advertise to a local clientele; a car company that sells in many countries, for example, runs different ads depending on the models it sells in each country. If a click originates from a location outside the advertiser's region of business, then it is likely to be fraud, as the user gets no value from such a click.

It is also notable that in some countries the laws against cyber fraud are very weak, and fraudsters use this fact to their advantage. To avoid prosecution, fraudsters use IP addresses originating from these countries, either through bots or by hiring people at low cost (many of whom do not realize that their act is causing huge losses to advertisers) to carry out the fraud (Kshetri, 2010). As a result, such clicks have high suspicion associated with them. This rule can limit a range of fraudulent attacks that depend on using IP addresses from varied geographical locations (these include the use of both humans and bots).

The BMA for this rule supports a belief of fraud if the click originated from a region outside the advertiser's area of business, and a belief of ~fraud otherwise. Let x = 1.0 be the likelihood of fraud if the click originated from a region outside the advertiser's area of business. The mass functions when the evidence supports fraud are as follows:

m5(fraud) = α5 * x   (19)
m5(~fraud) = 0   (20)
m5(U) = 1 - m5(fraud) = 1 - α5 * x   (21)
Let y = 1.0 be the likelihood of ~fraud if the click originated from a region within the advertiser's area of business. The mass functions when the evidence supports ~fraud are as follows:

m5(fraud) = 0   (22)
m5(~fraud) = β5 * y   (23)
m5(U) = 1 - m5(~fraud) = 1 - β5 * y   (24)

Evidence 6: Creating of membership

If the user creates a membership account (registers as a member) with the advertiser, then he/she is less likely to be a fraud. A genuine user may or may not create such an account. Fraudsters, however, are less likely to register at the ad-site, as they have no incentive to do so and because it requires them to spend some time and give out information such as an email address or a postal address. The BMA for this rule supports a belief of ~fraud if a membership account was created, and otherwise supports a negligible belief of fraud. Let x = 1 be the likelihood of fraud if a membership account is not created. The mass functions when the evidence supports fraud are as follows:

m6(fraud) = α6 * x   (25)
m6(~fraud) = 0   (26)
m6(U) = 1 - m6(fraud) = 1 - α6 * x   (27)

Let y = 1 be the likelihood of ~fraud if a membership account is created. The mass functions when the evidence supports ~fraud are as follows:

m6(fraud) = 0   (28)
m6(~fraud) = β6 * y   (29)
m6(U) = 1 - m6(~fraud) = 1 - β6 * y   (30)
Evidence 7: Adding a product in shopping cart

If the user adds a product to his shopping cart, then he/she is less likely to be a fraud. Due to a lack of genuine interest in the advertiser's products or services, a fraudster is less likely to use a shopping cart. Using a shopping cart requires the user to spend time, for which a fraudster has no incentive. The BMA for this rule supports a belief of ~fraud if a product was added to a cart, and otherwise supports a negligible belief of fraud. Let x = 1.0 be the likelihood of fraud if the user does not add any product to his shopping cart. The mass functions when the evidence supports fraud are as follows:

m7(fraud) = α7 * x   (31)
m7(~fraud) = 0   (32)
m7(U) = 1 - m7(fraud) = 1 - α7 * x   (33)

Let y = 1.0 be the likelihood of ~fraud if the user adds a product to his shopping cart. The mass functions when the evidence supports ~fraud are as follows:

m7(fraud) = 0   (34)
m7(~fraud) = β7 * y   (35)
m7(U) = 1 - m7(~fraud) = 1 - β7 * y   (36)

Individually, the evidences are not sufficient to determine the likelihood of a user being fraud or ~fraud. Each evidence may give a different or even contradicting belief of fraud, depending on its nature. Upon combination, however, they provide a highly accurate estimate. Thus, the likelihood of a click being fraudulent is estimated by combining the beliefs obtained from the mass functions of each of the supporting evidences.

To define the rule for combining mass functions, let m1 and m2 be two distinct mass functions for a particular click. Dempster's rule of combination can be applied as shown below. For readability, we write f, ~f and U for {fraud}, {~fraud} and the universal set, respectively.

m1,2(f) = (m1(f)m2(f) + m1(f)m2(U) + m1(U)m2(f)) / (1 - K)
m1,2(~f) = (m1(~f)m2(~f) + m1(~f)m2(U) + m1(U)m2(~f)) / (1 - K)
m1,2(U) = (m1(U)m2(U)) / (1 - K)

where K = m1(f)m2(~f) + m1(~f)m2(f).

This combination rule can be applied repeatedly, pair-wise, until all evidences have been incorporated into the computation of the combined belief. Our proposed approach certifies clicks according to their likelihood of being fraudulent, using the beliefs combined from all of the evidences. Table 4.1 below lists the thresholds that we have empirically derived from our experiments and tests.

Table 4.1 Fraud certification rules

| Category | Lower | Upper |
|---|---|---|
| Not Fraud | 0 | 0.499 |
| Suspicious | 0.5 | 0.649 |
| Fraud | 0.65 | 1 |

A combined belief of fraud < 0.5 indicates ~fraud. A combined belief of fraud >= 0.65 indicates fraud, and all values in between indicate suspicion.
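The per-evidence mass functions and the certification thresholds of Table 4.1 are simple enough to sketch in a few lines. The code below is an illustration under stated assumptions: masses are represented as (fraud, ~fraud, U) triples, the function names are hypothetical, and the α and β coefficients are passed in by the caller because they are derived empirically.

```python
def supports_fraud(alpha, f):
    """Mass triple (fraud, ~fraud, U) for evidence that supports fraud."""
    return (alpha * f, 0.0, 1.0 - alpha * f)


def supports_not_fraud(beta, g):
    """Mass triple (fraud, ~fraud, U) for evidence that supports ~fraud."""
    return (0.0, beta * g, 1.0 - beta * g)


def evidence1(n_clicks, alpha1):
    """Evidence 1: number of ad clicks in the window, Equations (1)-(3)."""
    return supports_fraud(alpha1, 1.0 - 1.0 / n_clicks)


def evidence2(seconds_on_site, beta2, window_seconds=1800):
    """Evidence 2: time spent browsing, Equations (4)-(6)."""
    return supports_not_fraud(beta2, seconds_on_site / window_seconds)


def certify(belief_fraud, suspicious_from=0.5, fraud_from=0.65):
    """Map a combined belief of fraud to a label using Table 4.1."""
    if belief_fraud >= fraud_from:
        return "fraud"
    if belief_fraud >= suspicious_from:
        return "suspicious"
    return "~fraud"
```

The binary evidences 3 through 7 fit the same two helper functions, with x or y taking the place of f or g. Keeping the thresholds as parameters reflects that the values in Table 4.1 were obtained empirically and may need adjusting for other data.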
CHAPTER V
DATA SET & ILLUSTRATION

In this section we give a detailed explanation of the data that we use in our approach. We also work through an illustrative example that applies our approach to our data set.

Data Description

Click data is not publicly available. Any real weblog data from a web server is the property of the owner of the server and is not made public due to the owner's privacy concerns. Moreover, such data needs to be cleaned in order to extract the data in a relevant format. This is a time-consuming process and is not a focus of our research. For these reasons we use synthetic data for our research. Furthermore, we can manipulate synthetic data and add patterns of fraud in order to evaluate different click fraud scenarios.

The data represents the weblog from the advertiser's web server. For our experiments and evaluations we synthesize log data in combined log format (CLF). We pre-process the raw logs and extract the following information from them for each user in real time: the IP address of the remote computer requesting the web page; the time and date of the request; the page that was requested; and the Gclid number. The region from which the click originated can easily be obtained from the IP address by using one of the many geolocation services that map an IP address to a place using a geolocation database. Table 5.1 below shows sample data extracted from the server logs.
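The pre-processing step described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the sample log line is invented, the field layout is the usual combined log format, and the gclid is assumed to arrive as a `gclid` query parameter on the requested URL. The names `LOG_PATTERN` and `parse_line` are hypothetical.

```python
import re
from datetime import datetime
from urllib.parse import parse_qs, urlsplit

# One combined-log-format entry: host, identity, user, [time], "request",
# status, size, "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Extract IP, time, requested page, gclid and referrer from one line."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # not a combined-log-format line
    url = urlsplit(match.group("url"))
    referrer = match.group("referrer")
    return {
        "ip": match.group("ip"),
        "time": datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "page": url.path,
        # Assumption: the click ID is carried as a URL query parameter.
        "gclid": parse_qs(url.query).get("gclid", [None])[0],
        "referrer": None if referrer in ("", "-") else referrer,
    }


# Invented sample line, for illustration only.
SAMPLE = ('172.16.0.3 - - [05/Mar/2012:01:56:00 +0000] '
          '"GET /index.htm?gclid=1001 HTTP/1.1" 200 5120 '
          '"http://publisher.example/adsite.htm" "Mozilla/5.0"')
record = parse_line(SAMPLE)
```

Mapping the extracted IP address to a region would need an external geolocation database and is left out of this sketch.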
Table 5.1 Sample log data

| IP Address | Click No | Gclid No | Time of click | Requested Page | Referrer |
|---|---|---|---|---|---|
| 172.16.276.3 | 1 | 1001 | 3/5/2012 1:50 | index.htm | adsite.htm |
| 172.16.276.3 | 2 | 1002 | 3/5/2012 1:56 | index.htm | adsite.htm |
| 172.16.276.3 | 3 | 1002 | 3/5/2012 1:59 | page1.htm | index.htm |
| 172.16.276.3 | 4 | 1002 | 3/5/2012 2:01 | page2.htm | page1.htm |
| 172.16.276.3 | 5 | null | 3/5/2012 2:05 | index.htm | google.com |
| 172.16.276.3 | 6 | null | 3/5/2012 2:08 | page1.htm | index.htm |
| 172.16.276.3 | 7 | null | 3/5/2012 2:10 | page2.htm | page1.htm |
| 172.16.276.3 | 8 | null | 3/5/2012 2:14 | index.htm | null |
| 172.16.276.3 | 9 | null | 3/5/2012 2:16 | page1.htm | index.htm |
| 172.16.276.3 | 10 | null | 3/5/2012 2:17 | page2.htm | page1.htm |

Each row of Table 5.1 represents an HTTP request made by the user to the advertiser's web server. Whenever a user requests content from the advertiser, an HTTP request is generated. Below are some observations that describe the data in Table 5.1.

• Every row represents a click by the user requesting content from the ad-site.
• All the clicks in the table are by the same user, since the IP address is the same for all rows of the log.
• Index.htm is the landing page. Every time index.htm is the requested page, it implies a new visit. Table 5.1 has 4 unique visits.
• A non-null Gclid number implies an ad-visit. Click numbers 1 through 4 belong to ad-visits, since they have a valid Gclid number attached.
• The two different Gclid numbers imply two different ad-visits. The first click, with Gclid number 1001, is an ad-visit. Since there is only one row with Gclid number 1001, the user did not make any other page requests after landing on the ad-site during the first ad-visit. The second click, with Gclid number 1002, is also an ad-visit. In this visit, however, the user also requested page1.htm and page2.htm (click numbers 3 and 4).
• Each row with a null Gclid number implies a non-ad visit. Click numbers 5 through 10 correspond to two non-ad visits.
• Click number 5 corresponds to the first non-ad visit and the third visit overall. The visitor was referred to the ad-site by a Google search, since google.com is the referrer. After landing, the user requested two more pages in the same visit, page1.htm and page2.htm.
• Click number 8 corresponds to the second non-ad visit and the fourth visit overall. A null referrer implies that the user may have typed the ad-site's URL into the browser or clicked on a previously saved bookmark. After landing, the user requested two more pages in the same visit, page1.htm and page2.htm.
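The observations above amount to a simple procedure for splitting log rows into visits. A minimal sketch follows, assuming the rows of one IP are already in time order; the `Row` tuple and the function names are illustrative, and the IP address and timestamp columns are omitted for brevity.

```python
from collections import namedtuple

Row = namedtuple("Row", "click gclid page referrer")

LANDING_PAGE = "index.htm"

# Rows of Table 5.1; a null Gclid or referrer is represented by None.
ROWS = [
    Row(1, "1001", "index.htm", "adsite.htm"),
    Row(2, "1002", "index.htm", "adsite.htm"),
    Row(3, "1002", "page1.htm", "index.htm"),
    Row(4, "1002", "page2.htm", "page1.htm"),
    Row(5, None, "index.htm", "google.com"),
    Row(6, None, "page1.htm", "index.htm"),
    Row(7, None, "page2.htm", "page1.htm"),
    Row(8, None, "index.htm", None),
    Row(9, None, "page1.htm", "index.htm"),
    Row(10, None, "page2.htm", "page1.htm"),
]


def split_visits(rows, landing_page=LANDING_PAGE):
    """Start a new visit every time the landing page is requested."""
    visits = []
    for row in rows:
        if row.page == landing_page or not visits:
            visits.append([])
        visits[-1].append(row)
    return visits


def is_ad_visit(visit):
    """A visit is an ad-visit when its landing request carries a Gclid."""
    return visit[0].gclid is not None


visits = split_visits(ROWS)
ad_visits = [v for v in visits if is_ad_visit(v)]
non_ad_visits = [v for v in visits if not is_ad_visit(v)]
```

Applied to Table 5.1, this yields four visits, of which the first two are ad-visits and the last two are non-ad visits, matching the observations listed above.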
We will use a timeline diagram to help illustrate our inputs (like Table 5.1) for the rest of the thesis. Figure 5.1 shows the legend for the diagram, and Figure 5.2 shows the timeline diagram corresponding to the input from Table 5.1.

Figure 5.1 Legends for timeline diagram

Figure 5.2 Timeline diagram for the sample data in Table 5.1

A timeline diagram is a visual representation of a user's clicking data from the server weblogs. Just by looking at Figure 5.2 we can easily make certain observations. The user has made 4 unique visits. The first two visits were ad-visits and the last two were non-ad visits. The width of the session blocks indicates the session durations. The first visit was a very short session in which the user did not request any pages after landing. In all the other visits the user requested two other pages, and the session durations are longer. The start and end times of every session are also given. Lastly, we can see that the user neither logged in as a member in any of the sessions nor used a shopping cart.
Example of belief computation using mass function and combination

In this example we analyze and compute the belief of a user being fraud or ~fraud using our approach. The purpose is to explain the approach and the computations involved by means of a simple example. The sample input is given in Table 5.2 below.

Table 5.2 Input from server log

| IP Address | Click No | Gclid No | Time of click | Requested Page | Referrer |
|---|---|---|---|---|---|
| 172.16.276.3 | 1 | 1001 | 3/5/2012 1:56 | index.htm | adsite.htm |
| 172.16.276.3 | 2 | 1002 | 3/5/2012 2:01 | index.htm | adsite.htm |
| 172.16.276.3 | 3 | 1003 | 3/5/2012 2:07 | index.htm | adsite.htm |
| 172.16.276.3 | 4 | 1004 | 3/5/2012 2:13 | index.htm | adsite.htm |
| 172.16.276.3 | 5 | 1005 | 3/5/2012 2:18 | index.htm | adsite.htm |
| 172.16.276.3 | 6 | 1006 | 3/5/2012 2:23 | index.htm | adsite.htm |

From Table 5.2 we can easily conclude that the user made six ad-visits. The user did not request any page of the ad-site other than index.htm. Figure 5.3 below shows the timeline diagram for the data in Table 5.2.

Figure 5.3 Timeline diagram for Table 5.2

As soon as a row corresponding to a user activity is logged, the system reads it and computes the mass beliefs for each piece of evidence, which are then combined using Dempster's combination rule to get an overall belief score. For Table 5.2, six belief values are computed, one for every click. Thus the belief about the user is updated with every user click. The evidence combination process combines the beliefs from each, possibly conflicting, evidence and gives a belief score for each of the user's clicks. To demonstrate our approach we work out the calculation of the belief values at the 6th click. Please note that we use the α and β values from Table 5.3. These values have been derived empirically from our experiments and are used in all our computations.

Table 5.3 Coefficient values

| Evidence No | α | β |
|---|---|---|
| 1 | 0.8 | - |
| 2 | - | 0.99 |
| 3 | 0.6 | 0.2 |
| 4 | 0.2 | 0.01 |
| 5 | 0.4 | 0.1 |
| 6 | 0.02 | 0.25 |
| 7 | 0.01 | 0.2 |

Evidence 1 always supports a belief of fraud, and therefore at the 6th click on the ad the mass function values are:

m1(fraud) = 0.8 * (1 - 1/6) = 0.667
m1(~fraud) = 0
m1(U) = 1 - m1(fraud) = 1 - 0.8 * (1 - 1/6) = 0.333

Evidence 2 always supports a belief of ~fraud. The user spends 30 seconds in each visit, since he does not open any other page, and therefore the total time spent is 180 seconds. The window size W is 1800 seconds. Therefore the mass function values are:

m2(~fraud) = 0.99 * (180/1800) = 0.099
m2(fraud) = 0
m2(U) = 1 - m2(~fraud) = 1 - 0.99 * (180/1800) = 0.901

Evidence 3 supports a small belief of fraud, since there was no non-ad visit by the user. Therefore the mass function values are:

m3(fraud) = 0.6 * 0.1 = 0.06
m3(~fraud) = 0
m3(U) = 1 - m3(fraud) = 1 - 0.6 * 0.1 = 0.94

Evidence 4 supports a belief of fraud, since the 6th click occurs at a suspicious time (2:23 AM). Therefore the mass function values are:

m4(fraud) = 0.2 * 1 = 0.2
m4(~fraud) = 0
m4(U) = 1 - m4(fraud) = 1 - 0.2 * 1 = 0.8

Evidence 5 supports a belief of fraud, since we assume that the IP originates from a region outside the advertiser's area of business. Therefore the mass function values are:

m5(fraud) = 0.4 * 1 = 0.4
m5(~fraud) = 0
m5(U) = 1 - m5(fraud) = 1 - 0.4 * 1 = 0.6

Evidences 6 and 7 support a small belief of fraud, since no membership account was used and no product was added to a shopping cart. Therefore the mass function values are:

m6(fraud) = 0.02 * 1 = 0.02
  • 38. m6(~fraud) = 0
    m6(U) = 1 - m6(fraud) = 1 - 0.02 * (1) = 0.98
    m7(fraud) = 0.01 * (1) = 0.01
    m7(~fraud) = 0
    m7(U) = 1 - m7(fraud) = 1 - 0.01 * (1) = 0.99
    From Table 5.4 below we can observe that the mass functions give varying degrees of belief and that these can conflict.
    Table 5.4 Mass function beliefs for illustrated example
          belief(fraud)    belief(~fraud)
    m1    0.667            0
    m2    0                0.099
    m3    0.06             0
    m4    0.2              0
    m5    0.4              0
    m6    0.02             0
    m7    0.01             0
    Now we can apply Dempster's rule of combination to the mass beliefs in Table 5.4 to get the combined belief about the user. Writing f for fraud and ~f for ~fraud:
    K = m1(f)m2(~f) + m1(~f)m2(f) = 0.066
    1 - K = 0.934
    m1,2(f) = [m1(f)m2(f) + m1(f)m2(U) + m1(U)m2(f)] / (1 - K) = 0.643
    m1,2(~f) = [m1(~f)m2(~f) + m1(~f)m2(U) + m1(U)m2(~f)] / (1 - K) = 0.035
    m1,2(U) = m1(U)m2(U) / (1 - K) = 0.321
  • 39. m1,2 is the combined mass belief from mass functions 1 and 2. Next we combine this with mass function 3 to get the combined mass belief m1,2,3:
    K = m1,2(f)m3(~f) + m1,2(~f)m3(f) = 0.0021
    1 - K = 0.998
    m1,2,3(f) = [m1,2(f)m3(f) + m1,2(f)m3(U) + m1,2(U)m3(f)] / (1 - K) = 0.664
    m1,2,3(~f) = [m1,2(~f)m3(~f) + m1,2(~f)m3(U) + m1,2(U)m3(~f)] / (1 - K) = 0.0333
    m1,2,3(U) = m1,2(U)m3(U) / (1 - K) = 0.303
    This combination step repeats until no more evidence remains to be considered. Thus, the belief in the hypothesis that click 6 is fraudulent is calculated cumulatively. Following the procedure we obtain the combined belief of all mass beliefs, m1,2,3,...,7:
    m1,2,3,...,7(f) = 0.840
    m1,2,3,...,7(~f) = 0.016
    m1,2,3,...,7(U) = 0.144
    The belief(fraud) of 0.84 is above the threshold for fraud (0.65) given in Table 4.1, and so the user is classified as a fraud. Figure 5.4 gives a graphical representation of the combined belief of fraud over all 6 clicks made by the user (in this example we have worked out the mass value computation for the 6th click only, but the figure plots the mass values computed for all clicks from the 1st through the 6th). We can observe how the combined belief changes as more clicks are made.
  • 40. Figure 5.4 Combined belief of fraud for input in Figure 5.3
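The worked combination for the 6th click can be reproduced with a short script. The following is a minimal sketch, not the thesis implementation: the function names and the `(fraud, ~fraud, U)` tuple layout are illustrative assumptions, while the input values are the ones from Table 5.4.

```python
from functools import reduce


def combine(m_a, m_b):
    """Combine two mass functions over {fraud, ~fraud} with Dempster's rule.

    Each mass function is a (fraud, not_fraud, uncertain) tuple summing to 1.
    """
    f_a, n_a, u_a = m_a
    f_b, n_b, u_b = m_b
    conflict = f_a * n_b + n_a * f_b  # K: mass on contradictory pairs
    norm = 1.0 - conflict
    if norm <= 0.0:
        raise ValueError("total conflict: the evidence cannot be combined")
    fraud = (f_a * f_b + f_a * u_b + u_a * f_b) / norm
    not_fraud = (n_a * n_b + n_a * u_b + u_a * n_b) / norm
    uncertain = (u_a * u_b) / norm
    return fraud, not_fraud, uncertain


def supports_fraud(value):
    """Mass function that puts `value` on fraud and the rest on U."""
    return (value, 0.0, 1.0 - value)


def supports_not_fraud(value):
    """Mass function that puts `value` on ~fraud and the rest on U."""
    return (0.0, value, 1.0 - value)


# Mass values at the 6th click, as listed in Table 5.4.
sixth_click = [
    supports_fraud(0.8 * (1 - 1 / 6)),        # m1: number of clicks
    supports_not_fraud(0.99 * (180 / 1800)),  # m2: time spent browsing
    supports_fraud(0.6 * 0.1),                # m3: no non-ad visit yet
    supports_fraud(0.2),                      # m4: suspicious time of click
    supports_fraud(0.4),                      # m5: IP outside business region
    supports_fraud(0.02),                     # m6: no membership created
    supports_fraud(0.01),                     # m7: nothing added to cart
]

# Fold the seven mass functions together one at a time.
combined = reduce(combine, sixth_click)
print("m(fraud)=%.3f  m(~fraud)=%.3f  m(U)=%.3f" % combined)
```

Because Dempster's rule is commutative and associative, the order in which the seven mass functions are folded does not change the result, which comes out at approximately 0.84 for fraud, matching the value above.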
  • 41. CHAPTER VI
    EVALUATION
    In this chapter we present two case studies (scenarios), each of which corresponds to a different type of click fraud attack. In Case Study 1 a human user tries to perform click fraud and uses different click patterns in order to avoid detection. In Case Study 2 a software bot is used to perform click fraud and tries to make detection difficult by using multiple IP addresses. In both cases we present our output and show that our approach successfully detects the click fraud. We discuss the generality of our solution in Chapter VII.
    Case Study 1
    We present a scenario where a human user is trying to commit click fraud and to avoid detection by giving the impression of a regular user. Figure 6.1 below shows the user activity for the test case.
    Figure 6.1 Timeline input for Case Study 1
    A fraudster needs to click on the ad repeatedly in order to make a substantial profit. In this case the fraudster clicks the ad seven times (leading to seven ad-visits). The fraudster also
  • 42. enters the ad-site via a regular search (non-ad visit) to give a stronger impression of a regular user. He/she spends time on the site after landing (with random session durations) and carries out activities such as opening 32 links in the ad-site, creating a membership account and adding a product to the shopping cart. Below we describe the belief computed from every mass function, and the combined belief, in Figures 6.2 through 6.9. We have plotted the belief value against time (over the range of window W). Note that some of the functions can support either fraud or ~fraud, depending on the input, and thus can hold both types of belief at different times. In these cases we show only the belief of fraud for clarity. Also note that whenever a function supports a belief in ~fraud, its belief in fraud becomes 0, and vice versa.
  • 43. Figure 6.2 below shows the belief computed from Mass Function 1 (Number of clicks on the ad), according to which if the number of clicks on the ad from an IP in the time window W (30 minutes) is high, then the likelihood of the user being a fraud is high. Mass Function 1 supports only a belief of fraud, and the belief at the first click on the ad is 0. The belief increases as more clicks are made on the ad. The increase is faster in the first five clicks due to the nature of the function. Notably, the belief of fraud does not increase in the fourth visit, as it is a non-ad visit. This function does not consider any user activity other than the number of clicks on the ad. Therefore user activities such as a non-ad visit (fourth visit), adding products to the shopping cart etc. do not affect the belief of this mass function.
    Figure 6.2 Belief of fraud from mass function 1
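The shape of Mass Function 1 can be illustrated with a few lines of code. This is a sketch under one assumption: that the form α(1 - 1/n) used in the worked example for the 6th click holds for any click count n. The function name is illustrative, and α is the evidence 1 coefficient from Table 5.3.

```python
ALPHA_1 = 0.8  # alpha for evidence 1 (Table 5.3)


def mass_clicks(n_clicks, alpha=ALPHA_1):
    """Mass Function 1 after the n-th ad click in the window.

    Returns (fraud, ~fraud, uncertain). The closed form is assumed from
    the worked example at the 6th click: alpha * (1 - 1/n).
    """
    if n_clicks < 1:
        raise ValueError("n_clicks must be at least 1")
    fraud = alpha * (1 - 1 / n_clicks)
    return fraud, 0.0, 1.0 - fraud


# Belief of fraud after each of the seven ad clicks in Case Study 1.
for n in range(1, 8):
    print(n, round(mass_clicks(n)[0], 3))
```

Under this form the belief starts at 0 for the first click, rises quickly over the first few clicks, and then flattens as it approaches α, which is consistent with the faster early increase described above.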
  • 44. Figure 6.3 below shows the belief computed from Mass Function 2 (Time spent in browsing), according to which if the time spent by the user at the ad-site is high then he/she is less likely to be a fraud. This function supports only the belief of ~fraud. In this case study the user spent time in every session, and this is reflected in an increasing belief of ~fraud. This belief clearly contradicts the belief from Mass Function 1, which supports a belief of fraud. The fraudster has spent considerable time browsing the ad-site during every visit to give the impression of a genuine user. As we can see below, the user has a high belief of ~fraud at the end.
    Figure 6.3 Belief of ~fraud from mass function 2
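Mass Function 2 can be sketched in the same way. The form β(t/W) is taken from the worked example (0.99 * 180/1800); capping the ratio at 1 when the accumulated time exceeds the window is an assumption added here so that the mass never exceeds β, and the function name is illustrative.

```python
BETA_2 = 0.99          # beta for evidence 2 (Table 5.3)
WINDOW_SECONDS = 1800  # window size W from the worked example


def mass_time_spent(total_seconds, beta=BETA_2, window=WINDOW_SECONDS):
    """Mass Function 2 given the total browsing time in the window.

    Returns (fraud, ~fraud, uncertain).
    """
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    # Assumption: the time ratio is capped at 1 so the mass stays <= beta.
    fraction = min(total_seconds / window, 1.0)
    not_fraud = beta * fraction
    return 0.0, not_fraud, 1.0 - not_fraud


# Six visits of 30 seconds each, as in the worked example for the 6th click.
print(mass_time_spent(6 * 30))
```

The longer the user browses within the window, the larger the belief of ~fraud, which is why a patient human fraudster can drive this evidence against the click-count evidence.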
  • 45. Figure 6.4 below shows the belief computed from Mass Function 3 (Ad-visit after non-ad visit), according to which if a user clicks on an ad after a non-ad visit, he/she is likely to be a fraud. Once a user makes a non-ad visit to the ad-site, it implies that the user knows how to reach the site without clicking on the ad. The first three visits are all ad-visits and therefore the function supports a small belief of fraud. The fourth visit is a non-ad visit and therefore the function does not support fraud (the belief becomes 0). But the fifth visit is an ad-visit (after a non-ad visit). The function computes a high belief of fraud because of this, and we see that the belief of fraud spikes up to 0.6.
    Figure 6.4 Belief of fraud from mass function 3
  • 46. Figure 6.5 below shows the belief computed from Mass Function 4 (Time of click), according to which if the click occurred in the most suspicious time (the most active period of fraud activity) then the user is likely to be a fraud. The first three visits are not during the most suspicious time for fraud, and therefore the function does not support a belief of fraud. During the fourth visit the session enters the suspicious time, and therefore the function supports fraud. The curve below shows this increased belief.
    Figure 6.5 Belief of fraud from mass function 4
  • 47. Figure 6.6 below shows the belief computed from Mass Function 5 (Place of origin of click), according to which if the click originated from a location (country, state or city) where the advertiser has no business then the user is likely to be a fraud. For this case study we assume that the IP address of the user is from a region outside the advertiser's region of business. A click from such an IP is not natural and the advertiser will not benefit from it. The function therefore supports a belief of fraud throughout, and this value does not change at any time.
    Figure 6.6 Belief of fraud from mass function 5
  • 48. Figure 6.7 below shows the belief computed from Mass Function 6 (Creation of membership), according to which if the user creates a membership account (registers as a member) with the advertiser, he/she is less likely to be a fraud. The user does not create any membership or registration with the advertiser during the first three visits. However, during the fourth visit the user does create one, and therefore the belief of ~fraud from this mass function changes from 0 to 0.25.
    Figure 6.7 Belief of ~fraud from mass function 6
  • 49. Figure 6.8 below shows the belief computed from Mass Function 7 (Adding a product to shopping cart), according to which if the user adds a product to the shopping cart, he/she is less likely to be a fraud. The user does not use the shopping cart during the first three visits. However, during the fourth visit the user does add a product to it, and therefore the belief of ~fraud from this mass function increases from 0 to 0.2.
    Figure 6.8 Belief of ~fraud from mass function 7
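Mass Functions 6 and 7 both behave like a switch: before the action is observed they put a small mass α on fraud, and once it is observed they put β on ~fraud. The sketch below captures that pattern with a single helper. The helper name and tuple layout are illustrative assumptions; the coefficients are the evidence 6 and 7 values from Table 5.3.

```python
def mass_binary(event_seen, alpha, beta):
    """Mass function for an action that either has or has not occurred.

    Returns (fraud, ~fraud, uncertain): alpha on fraud while the action
    has not been seen, beta on ~fraud once it has.
    """
    if event_seen:
        return 0.0, beta, 1.0 - beta
    return alpha, 0.0, 1.0 - alpha


# Evidence 6 (membership created) and evidence 7 (product added to cart).
before_membership = mass_binary(False, alpha=0.02, beta=0.25)
after_membership = mass_binary(True, alpha=0.02, beta=0.25)
before_cart = mass_binary(False, alpha=0.01, beta=0.2)
after_cart = mass_binary(True, alpha=0.01, beta=0.2)

print(before_membership, after_membership)
print(before_cart, after_cart)
```

The small α values mean that the absence of these actions barely moves the combined belief, while the larger β values let a registration or a cart addition count noticeably in the user's favour.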
  • 50. The system combines the mass beliefs and computes a combined belief corresponding to each click. Table 6.1 below shows the computed values of belief, plausibility and deduction for every click.
    Table 6.1 Computed belief values for Case Study 1
    click no   belief(fraud)   plausibility(fraud)   belief(~fraud)   plausibility(~fraud)   Deduction
    1          0.45            0.99                  0.015            0.55                   not fraud
    2          0.44            0.98                  0.022            0.56                   not fraud
    3          0.44            0.97                  0.027            0.56                   not fraud
    4          0.44            0.96                  0.036            0.56                   not fraud
    5          0.43            0.96                  0.043            0.57                   not fraud
    6          0.43            0.95                  0.049            0.57                   not fraud
    7          0.65            0.96                  0.036            0.35                   suspect
    8          0.64            0.96                  0.041            0.36                   suspect
    9          0.64            0.95                  0.052            0.36                   suspect
    10         0.63            0.94                  0.06             0.37                   suspect
    11         0.63            0.93                  0.068            0.37                   suspect
    12         0.7             0.94                  0.059            0.3                    fraud
    13         0.69            0.93                  0.067            0.31                   fraud
    14         0.69            0.93                  0.072            0.31                   fraud
    15         0.69            0.92                  0.078            0.31                   fraud
    16         0.68            0.91                  0.092            0.32                   fraud
    17         0.67            0.9                   0.1              0.33                   fraud
    18         0.59            0.81                  0.19             0.41                   suspect
    19         0.51            0.7                   0.3              0.49                   suspect
    20         0.5             0.69                  0.31             0.5                    suspect
    21         0.43            0.6                   0.4              0.57                   not fraud
    22         0.42            0.59                  0.41             0.58                   not fraud
    23         0.42            0.58                  0.42             0.58                   not fraud
    24         0.8             0.87                  0.13             0.2                    fraud
    25         0.8             0.86                  0.14             0.2                    fraud
    26         0.79            0.86                  0.14             0.21                   fraud
    27         0.79            0.85                  0.15             0.21                   fraud
    28         0.8             0.86                  0.14             0.2                    fraud
    29         0.79            0.85                  0.15             0.21                   fraud
    30         0.78            0.84                  0.16             0.22                   fraud
    31         0.78            0.84                  0.16             0.22                   fraud
    32         0.79            0.84                  0.16             0.21                   fraud
    33         0.78            0.84                  0.16             0.22                   fraud
    34         0.78            0.83                  0.17             0.22                   fraud
    35         0.77            0.82                  0.18             0.23                   fraud
    36         0.76            0.81                  0.19             0.24                   fraud
    37         0.77            0.81                  0.19             0.23                   fraud
    38         0.76            0.81                  0.19             0.24                   fraud
    39         0.75            0.8                   0.2              0.25                   fraud
    40         0.74            0.79                  0.21             0.26                   fraud
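The plausibility and deduction columns of Table 6.1 follow from the two belief columns. On the two-element frame {fraud, ~fraud}, the plausibility of a hypothesis is one minus the belief in its complement. The sketch below shows this, together with a threshold-based deduction. The fraud threshold of 0.65 is the one quoted from Table 4.1; the lower cut-off of 0.5 is an assumption chosen only because it is consistent with the rows of Table 6.1, since Table 4.1 is not reproduced here.

```python
FRAUD_THRESHOLD = 0.65   # threshold for fraud quoted from Table 4.1
SUSPECT_THRESHOLD = 0.5  # ASSUMPTION: lower cut-off inferred from Table 6.1


def plausibility(belief_fraud, belief_not_fraud):
    """Return (plausibility(fraud), plausibility(~fraud)).

    On a two-element frame, Pl(A) = 1 - Bel(not A).
    """
    return 1.0 - belief_not_fraud, 1.0 - belief_fraud


def deduce(belief_fraud):
    """Map a combined belief of fraud to a deduction label."""
    if belief_fraud > FRAUD_THRESHOLD:
        return "fraud"
    if belief_fraud >= SUSPECT_THRESHOLD:
        return "suspect"
    return "not fraud"


# Rows 1, 7 and 12 of Table 6.1 as (belief(fraud), belief(~fraud)).
for bel_f, bel_n in [(0.45, 0.015), (0.65, 0.036), (0.70, 0.059)]:
    pl_f, pl_n = plausibility(bel_f, bel_n)
    print(bel_f, round(pl_f, 3), bel_n, round(pl_n, 3), deduce(bel_f))
```

The gap between belief and plausibility of fraud is the mass left on U, so a wide gap (as in the early clicks) signals that most of the evidence is still uncommitted.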
  • 51. Figure 6.9 below shows the combined belief of fraud obtained by combining the beliefs from all the mass functions using Dempster's combination rule. It is interesting to note that individually the beliefs from the mass functions vary and contradict one another. However, upon combination they give a belief that correctly tracks the changes in the user's activity.
    Figure 6.9 Combined belief of fraud for Case Study 1
    Initially the combined belief of fraud is low and, according to the threshold values in Table 4.1, it indicates ~fraud. As the user clicks on the ad again (second visit), the belief of fraud increases and the user moves from ~fraud to suspicious. In the third ad-visit the belief of fraud increases further and indicates fraud. But as the user makes a non-ad visit (fourth visit), creates a membership and uses the shopping cart, the belief drops back to ~fraud. Had the user stopped clicking on the ad at this point, he/she would have been considered ~fraud. However, when the user clicks on the ad again and makes an ad-visit (fifth visit), the belief increases to
  • 52. support fraud. The belief spikes to a high value during the fifth visit because this is an ad-visit after a non-ad visit. At the end the belief of fraud for the user remains high, and this is classified as a case of fraud. The time of click and the location of the IP also contribute to the suspicion.
    Case Study 2
    This case study presents a scenario where a software bot is used to commit click fraud by using different IP addresses at different times. Use of multiple IP addresses can make detection difficult. In most approaches to click fraud detection, including ours, n different IPs are considered to be n unique users. Walgampaya et al. (2011) suggest a specialized approach to identify bot attacks. For clarity, let us now consider that each IP belongs to a different user. Figure 6.10 below shows the activity from three different IP addresses (users) in a timeline diagram. We use different colors in this timeline diagram to represent visits by the three different IPs and do not show the time range of each session, to avoid clutter.
    Figure 6.10 Timeline diagram for Case Study 2
  • 53. Using each IP, two ad-visits are made, of which the first has a short session and the second has a longer session. The first two IPs are outside the advertiser's region of business, but the third IP originates from the advertiser's area of business. The last four visits lie in a suspicious time range. The system computes mass beliefs and a combined belief corresponding to each click from every IP. Tables 6.2, 6.3 and 6.4 below show the computed values of belief, plausibility and deduction for the first, second and third IPs respectively.
    Table 6.2 Computed belief values for first IP
    click no   belief(fraud)   plausibility(fraud)   belief(~fraud)   plausibility(~fraud)   Deduction
    1          0.45            0.99                  0.015            0.55                   not fraud
    2          0.66            0.99                  0.014            0.34                   fraud
    3          0.72            0.98                  0.025            0.28                   fraud
    Table 6.3 Computed belief values for second IP
    click no   belief(fraud)   plausibility(fraud)   belief(~fraud)   plausibility(~fraud)   Deduction
    1          0.45            0.99                  0.015            0.55                   not fraud
    2          0.66            0.98                  0.02             0.34                   fraud
    3          0.53            0.94                  0.061            0.47                   suspect
    Table 6.4 Computed belief values for third IP
    click no   belief(fraud)   plausibility(fraud)   belief(~fraud)   plausibility(~fraud)   Deduction
    1          0.078           0.89                  0.11             0.92                   not fraud
    2          0.73            0.99                  0.0089           0.27                   fraud
    3          0.51            0.9                   0.095            0.49                   suspect
  • 54. Figure 6.11 below shows the computed values of belief of fraud for all visits by the bot using the three IPs.
    Figure 6.11 Combined belief values for Case Study 2
    From Figure 6.11 and Tables 6.2 to 6.4 above we can observe that our system detects the users with the first two IPs as fraud and the user with the third IP as suspicious, even though only two clicks occurred from each IP. The third IP was not outside the advertiser's region of business, and hence the system classified it only as suspicious. The above clicks from three different IPs could be from one single bot. We evaluate them as three different users and yet detect the fraud.
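Treating each IP as a separate user means that every piece of evidence is accumulated per IP. The sketch below illustrates this for the click-count evidence only, assuming the α(1 - 1/n) form from the worked example in Chapter V. The function name and the IP addresses are hypothetical (the addresses come from a range reserved for documentation); the actual addresses used in the case study are not given.

```python
from collections import defaultdict

ALPHA_1 = 0.8  # alpha for evidence 1 (Table 5.3)


def click_count_belief_by_ip(ad_clicks):
    """Track Mass Function 1's belief of fraud separately for each IP.

    `ad_clicks` is a sequence of IP addresses, one entry per ad click, in
    time order. Returns a dict mapping each IP to the list of beliefs
    computed after each of its clicks.
    """
    counts = defaultdict(int)
    history = defaultdict(list)
    for ip in ad_clicks:
        counts[ip] += 1
        history[ip].append(ALPHA_1 * (1 - 1 / counts[ip]))
    return dict(history)


# Hypothetical interleaving of two ad clicks from each of three IPs.
clicks = [
    "192.0.2.1", "192.0.2.2", "192.0.2.1",
    "192.0.2.3", "192.0.2.2", "192.0.2.3",
]
for ip, beliefs in click_count_belief_by_ip(clicks).items():
    print(ip, [round(b, 2) for b in beliefs])
```

A bot that rotates IPs keeps the click-count evidence low for each address, which is why the remaining evidence (session length, time of click, place of origin) has to carry the detection in this case study.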
  • 55. CHAPTER VII
    DISCUSSION & CONCLUSIONS
    This thesis proposes an approach for click fraud identification that can be used by the advertising community to address click fraud. Our approach is fundamentally different from existing methods. First, we focus on the type of clicking activity that can create real value for the fraudster and attempt to detect that. For this we take raw weblog data and derive meaningful evidence for our mass function formulation. Second, the approach supports on-line computation to detect fraudulent clicks. Such computation adapts well to real-time systems, and this is a key advantage. Third, the approach is relatively simple and fast because it requires only the incoming data at the advertiser's disposal. It neither requires the advertiser to maintain and update large historical databases of various evidences nor necessitates learning of any patterns. This makes the approach practical for advertisers. Fourth, the resulting beliefs also indicate the gray area of suspicious activity, which can alert the advertiser to irregular or abnormal traffic. This is useful against click fraud attacks that are hard to catch but still fall into the suspicious category. Finally, the approach extracts evidence from limited server data and can be extended easily by adding new mass functions to represent additional evidence. Our experiments on the two case studies show that the proposed approach detects the fraud in both scenarios. Although we have not experimented on all possible scenarios of click fraud behavior, we believe that our approach will work effectively in general for the following reasons. First, the technique allows the combination of any set of evidences that can contribute to click fraud detection. Second, the set of evidences considered in this thesis is, in the worst case, near complete.
Finally, if the set is not complete, the technique can be easily extended by adding new evidences to the proposed click fraud detection system. Future work includes more experiments to understand the characteristics of the proposed approach, for example, which novel click attacks the approach fails to identify and, if any are found, which other sources of data and evidence can be used to detect them. Future work also requires experiments to see if our approach works for
  • 56. specialized bot attacks, which can be highly sophisticated and evolve continuously. These are among our ongoing and future research topics.
  • 57. BIBLIOGRAPHY
    D. Antoniou, M. Paschou, E. Sakkopoulos, E. Sourla, G. Tzimas, A. Tsakalidis, E. Viennas, “Exposing click-fraud using a burst detection algorithm”, in Proceedings of the IEEE Symposium on Computers and Communications (ISCC), Jun 2011, pp. 1111-1116.
    A. Tuzhilin, “The Lane's Gifts vs. Google Report”, 2006.
    M. Kantardzic, C. Walgampaya, B. Wenerstorm, O. Lozitskiy, S. Higgins and D. Kings, “Improving Click Fraud Detection by Real Time Data Fusion”, in Proceedings of the IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Dec 2008, pp. 69-74.
    G. Shafer, “A Mathematical Theory of Evidence”, Princeton University Press, 1976.
    T. Denoeux, “A K-nearest Neighbour Classification Rule based on Dempster-Shafer Theory”, IEEE Transactions on Systems, Man and Cybernetics, vol. 25, 1995, pp. 804-813.
    F. Dong, S. M. Shatz, H. Xu, “Reasoning Under Uncertainty For Shill Detection In Online Auctions using Dempster Shafer Theory”, International Journal of Software Engineering and Knowledge Engineering, 2010, pp. 943-973.
    K. Sentz, S. Ferson, “Combination of Evidence in Dempster-Shafer Theory”, SAND2002-0835, Apr 2002.
    N. Kshetri, “The Economics of Click Fraud”, IEEE Security and Privacy, May-June 2010, pp. 45-53.
    H. Haddadi, “Fighting Online Click-Fraud Using Bluff Ads”, ACM SIGCOMM Computer Communication Review, vol. 40, no. 2, Apr 2010, doi:10.1145/1764873.1764877.
    V. Anupam, A. Mayer, K. Nissim, B. Pinkas, and M. K. Reiter, “On the Security of pay-per-click and other web advertising schemes”, Computer Networks, vol. 31, no. 11-16, 1999, pp. 1091-1100.
    M. Kantardzic, C. Walgampaya, and H. Jamali, “Click fraud prevention in pay-per-click model: Learning through multimodel evidence fusion”, in Proceedings of the International Conference on Machine and Web Intelligence (ICMWI), 2010, pp. 20-27.
  • 58. C. Walgampaya and M. Kantardzic, “Cracking the Smart ClickBot”, in Proceedings of the 13th IEEE Symposium on Web Systems Evolution, 2011, pp. 125-134.
    B. J. Jansen, “Click Fraud”, IEEE Computer, vol. 40, no. 7, Jul 2007, pp. 85-86.
    X. Li, Y. Liu, and D. Zeng, “Publisher click fraud in the pay-per-click advertising market: Incentives and consequences”, in Proceedings of the IEEE International Conference on Intelligence and Security Informatics, 2011, pp. 207-209.
    S. Majumdar, D. Kulkarni, and C. V. Ravishankar, “Addressing Click Fraud in Content Delivery Systems”, in Proceedings of the 26th IEEE International Conference on Computer Communications (INFOCOM 2007), May 2007, pp. 240-248.
    A. Metwally, D. Agarwal, A. Abbadi, and Q. Zheng, “On Hit Inflation and Detection in Streams of Web Advertising Networks”, in Proceedings of the International Conference on Distributed Computing Systems (ICDCS), Jun 2007, pp. 52-52.
    IAB and PwC, “IAB Internet Advertising Revenue Report, 2010”, First Half-Year Results, New York, U.S., 2011.
    GeekWire Magazine, “Newspapers take it on the chin as online ad revenue falls into the hands of a few tech giants”, Mar 2012, http://www.geekwire.com/2012/newspapers-chin-online-ad-revenue-falls-hands-tech-giants/
    Google Earnings Report, “Google Announces Second Quarter 2011 Financial Results”, Jul 2011, http://investor.google.com/earnings/2011/Q2_google_earnings.html