Anomaly detection (or outlier analysis) is the identification of items, events, or observations that do not conform to an expected pattern or to other items in a dataset. It is used in applications such as intrusion detection, fraud detection, fault detection, and process monitoring in domains including energy, healthcare, and finance.
In this workshop, we will cover the core techniques in anomaly detection and recent advances in deep learning in this field.
Through case studies, we will discuss how anomaly detection techniques can be applied to various business problems. We will also demonstrate examples using R, Python, Keras, and TensorFlow to reinforce concepts in anomaly detection and best practices for analyzing and reviewing results.
What you will learn:
Anomaly Detection: An introduction
Graphical and Exploratory analysis techniques
Statistical techniques in Anomaly Detection
Machine learning methods for Outlier analysis
Evaluating performance in Anomaly detection techniques
Detecting anomalies in time series data
Case study 1: Anomalies in Freddie Mac mortgage data
Case study 2: Auto-encoder based Anomaly Detection for Credit risk with Keras and Tensorflow
Anomaly detection Workshop slides
1. Location:
Qcon.ai Conference
San Francisco
April 15th 2019
Anomaly Detection
Techniques and Best Practices
2019 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
www.QuantUniversity.com
sri@quantuniversity.com
2. 2
• Introduction
• Applications of Anomaly Detection
• Break – 10.30-10.45am
• Aspects of Anomaly Detection
• Lunch Break : 12.00-1.00pm
• Techniques- Deep Dive
• Break – 2.30-2.45pm
• Labs and Examples
Agenda
4. - Analytics Advisory services
- Custom training programs
- Architecture assessments, advice and audits
5. • Founder of QuantUniversity LLC. and
www.analyticscertificate.com
• Advisory and Consultancy for Financial Analytics
• Prior Experience at MathWorks, Citigroup and
Endeca and 25+ financial services and energy
customers
• Regular Columnist for the Wilmott Magazine
• Author of forthcoming book
“Financial Modeling: A case study approach”
published by Wiley
• Chartered Financial Analyst and Certified Analytics
Professional
• Analytics Faculty in the Babson College MBA
program and at Northeastern University, Boston
Sri Krishnamurthy
Founder and CEO
5
6. 6
Quantitative Analytics and Big Data Analytics Bootcamps
• Analytics Certificate program
• Fintech Certificate program
• Deep Learning & AI boot camp
• Natural Language Processing
workshop
• Machine Learning for Finance
• Machine Learning for Healthcare
applications
See www.analyticscertificate.com for
current and future offerings
10. What is anomaly detection?
• Anomalies or outliers are data points within the datasets
that appear to deviate markedly from expected outputs.
• An outlier is an observation which deviates so much from the other
observations as to arouse suspicions that it was generated by a
different mechanism1
• Anomaly detection refers to the problem of finding
patterns in data that don’t conform to expected behavior
10
1. D. Hawkins. Identification of Outliers, Chapman and Hall, 1980.
11. 11
• Outliers are data points that are considered out of the ordinary or
abnormal. This includes noise.
• Anomalies are a special kind of outlier that carries significant/
critical/actionable information which could be of interest to
analysts.
Anomaly vs Outliers
[Figure: two clusters, labeled 1 and 2. All points not in clusters 1 & 2 are outliers; point B is an anomaly (both X and Y are large).]
12. 12
• Note that it is the analyst’s judgement that determines
what is considered just an outlier versus an anomaly.
• Most outlier detection methods generate outputs that are:
▫ Real-valued outlier scores: quantify the tendency of a
data point to be an outlier by assigning a score or
probability to it.
▫ Binary labels: the result of using a threshold to convert
outlier scores to binary labels, inlier or outlier.
Outlierness
13. 13
• Fraud Detection
▫ Credit card fraud detection
– By owner or by operation
▫ Mobile phone fraud/anomaly detection
– Calling behavior, volume etc.
▫ Insurance claim fraud detection
– Medical malpractice
– Auto insurance
▫ Insider trading detection
• E-commerce
▫ Pricing issues
▫ Network issues
Applications of Anomaly Detection
14. 14
• Intrusion detection:
▫ Detect malicious activity in computer systems
▫ This could be host-based or network-based
• Medical anomalies
Examples of Anomaly Detection
15. 15
• Manufacturing and sensors:
▫ Fault detection
▫ Heat, fire sensors
• Text data
▫ Novel topics, events
▫ Plagiarism
Examples of Anomaly Detection
16. 16
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
17. 17
• By definition, anomaly detection deals with identifying patterns and
points that are not considered normal. This implies that we must
first have a model to define what is normal in our datasets.
• If our model doesn’t capture the nuances of “normal” behavior in
our datasets, our anomaly detection algorithms won’t fare well.
• If our model captures all nuances in our data, we would have overfit
the model and wouldn’t be able to identify anomalies properly
• If our model is too generic, then most points would show up as
anomalies.
• Let’s illustrate this
1. Importance of defining what is normal
18. 18
• Consider the points below. Our goal is to build a “normal” model
and an anomaly detection model for the following data set.
• y = [2, 5, 7, 8, 11, 5, 3, 4, 5.8, 8]
• Choose a “normal model”.
• Determine the standard deviation and mark any point that is
outside of the 3σ limit as an anomaly
Model assumption
19. 19
• Consider three possible “normal” models
▫ Line A connects all the points : Overfit : No anomalies
▫ Line B is a linear regression line : Poor fit : Many points listed as
anomalies
▫ Line C is a polynomial fit of degree 4 : Good fit. One point shown as an
anomaly as it is outside of the 3σ band.
This illustrates why the choice of
the “normal” model is critical (see the sketch below)
Model assumption
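A minimal Python sketch of this comparison, using the data from the previous slide. The two candidate models and the 3σ residual rule follow the slides; the variable names and printout are illustrative:

```python
# Compare two candidate "normal" models (linear vs. degree-4 polynomial)
# and flag points whose residuals fall outside the 3-sigma band.
import numpy as np

y = np.array([2, 5, 7, 8, 11, 5, 3, 4, 5.8, 8])
x = np.arange(len(y))

for degree, label in [(1, "Line B: linear fit"), (4, "Line C: degree-4 fit")]:
    coeffs = np.polyfit(x, y, degree)       # least-squares polynomial fit
    residuals = y - np.polyval(coeffs, x)   # deviation from the "normal" model
    sigma = residuals.std()
    anomalies = np.where(np.abs(residuals) > 3 * sigma)[0]
    print(f"{label}: anomalous indices {anomalies}")
```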
20. 20
• We could assume that the points may come from a statistical
distribution. For example from a Gaussian distribution
• We could use an algorithm like k-nearest neighbor to define points
that are close and identify outliers that are at a large distance from
most of the points
• We could use a clustering algorithm to assign membership to clusters
Determining the “normal” model
21. 21
• We will illustrate once more why the choice of the “normal” model
is important.
• We said earlier that the “outlierness” can be quantified as a score.
• One popular score is Z-score.
• A z-score can be calculated from the following formula.
z= (X - μ) / σ
• where z is the z-score, X is the value of the element, μ is the
population mean, and σ is the standard deviation.
• It computes the number of standard deviations by which a data
point deviates from the mean and is used as a proxy to determine outliers (see the sketch below).
• However, this works only if the data comes from a Gaussian distribution
Importance of defining what is normal
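As a quick illustration, a hedged Python sketch of the z-score rule; the synthetic data and the injected outlier are illustrative, and the 3σ threshold follows the slide:

```python
# Z-score outlier check, assuming the data is roughly Gaussian.
import numpy as np

def zscore_outliers(x, threshold=3.0):
    z = (x - x.mean()) / x.std()        # z = (X - mu) / sigma
    return np.where(np.abs(z) > threshold)[0]

data = np.random.default_rng(0).normal(loc=0, scale=1, size=1000)
data[10] = 8.0                           # inject an obvious outlier
print(zscore_outliers(data))             # index 10 should be among those flagged
```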
22. 22
• In the first case, 99.9% of the data is within the 3σ limits. The Z-
score test works well here to detect outliers.
• In the second case, the distribution isn’t normal and is from a Zipf
distribution. Here a Z-score test isn’t valid
Importance of defining what is normal
23. 23
• It is important for the analyst to choose the right representation for
the normal model.
• In the first case, clustering is useless but a linear regression is a
good representation.
• In the second case, clustering is more appropriate.
Importance of defining what is normal
24. 24
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
25. 25
• Defining the normal region that covers all normal points in the
dataset
• Identifying anomalies masquerading as normal data points or not
being able to uncover anomalies due to a weak choice of the
“normal model”
▫ Example : DOS attack vs DDOS attack
2.0 Challenges when dealing with Anomaly Detection
problems
26. 26
• The evolving “normal behavior” in data
▫ Example:
– $100+ credit card transactions for a student – Average 5 per month
– $100+ credit card transactions for a professional – Average 15 per month
• Data and application dependency
▫ Example :
– For AAPL, a +/- $5 fluctuation in a day is an anomaly.
– For a risky stock, up to a +/- $10 fluctuation may be normal and a $15 fluctuation
may be an anomaly
• Lack of labeled data makes it harder to detect anomalies
• Not being able to distinguish noise and anomalies
2.0 Challenges when dealing with Anomaly Detection
problems
27. 27
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
28. 28
• Data objects are usually described by a set of attributes (variables,
features or dimension)
• The term univariate is used when the data has one attribute, while
bivariate and multivariate refer to data with two and more than two
attributes, respectively.
• Attributes can be quantitative or qualitative based on their
characteristics.
3.0 Input data characteristics
29. Dataset, Variables and Observations
Dataset: A rectangular array with rows as observations and
columns as variables
Variable: A characteristic of members of a population (Age, State,
etc.)
Observation: The list of variable values for a member of the
population
30. Types of Data
— A variable is numerical if meaningful arithmetic can be
performed on it.
— Discrete vs. Continuous
— Cross-sectional vs. Longitudinal
— Otherwise, the variable is categorical.
— Binary vs Multivalued
— Ordinal vs Nominal
32. Categorical Variables
• Categorical variables can be coded numerically or left uncoded.
• A dummy variable is a 0–1 coded variable for a specific category. It is
coded as 1 for all observations in that category and 0 for all
observations not in that category.
• Categorizing a numerical variable as categorical is called binning
(putting the data into discrete bins) or discretizing.
33. Numerical Data
• Discrete :
▫ How many cities have you lived in?
▫ How many cars pass through a toll booth?
• Continuous:
▫ What’s your height?
▫ What’s the temperature outside?
▫ What was the interest rate in Dec 2004?
35. 35
• In Cross-sectional records, data instances are independent.
• Data instances could be related to each other (typically longitudinal)
• Types of data records that are typically related to each other
▫ Sequence records:
– Time-series data – temporal continuity
– Genome and protein sequences
▫ Spatial Data : Data is related to neighbors; Spatial continuity
– Traffic data
– Weather data
– Ecological data
▫ Graph data
– Social media data
Relationship between records
36. 36
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
37. 37
Anomalies can be classified into three major categories
1. Point Anomalies
– If an instance is anomalous compared with the rest of the instances, the anomaly
is considered a point anomaly
2. Contextual Anomalies
– If an instance is anomalous in a specific context, the anomaly would be
considered as a contextual anomaly
3. Collective Anomalies
– If a collection of related data records are anomalous with respect to the entire
data set, the anomaly is a collective anomaly
4.0 Anomaly Classification
38. 38
• In the figure, points o1 and o2 are considered point anomalies
• Examples:
▫ A 50% increase in daily stock price
▫ A credit card transaction attempt for $5000 (assuming you have never
had a single transaction for anything above $1000)
Point Anomalies
39. 39
• In the figure, temperature t2 is an anomaly
• Note that t1 is lower than t2 but contextually, t1 is expected and t2
isn’t when compared to records around it.
Contextual Anomalies
40. 40
• Multiple Buy Stock transactions and then a sequence of Sell transactions
around an earnings release date may be anomalous and may indicate
insider trading.
• Consider the sequence of network activities recorded
• Though ssh, buffer-overflow and ftp themselves are not anomalous
activities, a sequence of the three indicates a web-based attack
• Similarly, multiple http requests from an ip address may indicate a
crawler in action.
See https://www.bloomberg.com/graphics/2019-etf-tax-dodge-lets-
investors-save-big/
Collective Anomalies
41. 41
• In medicine, abnormal ECG pattern detection would involve looking
for collective anomalies like Premature Atrial Contraction
Collective Anomalies
http://www.fprmed.com/Pages/Cardio/PAC.html
42. 42
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
43. 43
• The goal of an outlier detection or anomaly detection algorithm is
to identify whether there are anomalies in the data. The outputs take
one of two forms:
▫ Scores : A number generated by the algorithm for each record.
Typically, the scores are sorted and a threshold is chosen to designate
anomalies
▫ Labels : Here the algorithm makes a binary decision on whether each
record is an anomaly or not
5. Outputs of Anomaly Detection algorithms
44. 44
• Based on whether the data is labeled or not, machine learning
algorithms can be used for anomaly detection
• If the historical data is labeled (anomaly/not), supervised techniques
can be used
• If the historical data isn’t labeled, unsupervised algorithms can be
used to figure out if the data is normal/anomalous
5. Outputs of Anomaly Detection algorithms
45. 45
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
46. 46
1. Extreme value analysis
2. Classification-based techniques
3. Statistical techniques
i. Parametric techniques
a. Gaussian model-based models
b. Regression-based models
c. Mixture models
ii. Non-Parametric techniques
a. Histogram-based models
b. Kernel-based models
A tour of Anomaly Detection Techniques
47. 47
4. Proximity-based models
i. Cluster analysis
ii. Nearest neighbor analysis
5. Information theoretic models
6. Meta-algorithms and ensemble techniques
i. Sequential ensembles
ii. Independent ensembles
A tour of Anomaly Detection Techniques
48. 48
• Assumption : Anomalies are extreme values in the data set
• Goal: Determine the statistical tails of the underlying distribution
• Data : Univariate cross-sectional data
• Examples:
▫ Z-score test for a dataset which is assumed to be Normal
▫ Grubbs’ test
▫ Using Box-plots to detect anomalies
1. Extreme value Analysis
49. 49
• Assumption : Labeled data is available
• Goal: Build a classifier that can distinguish between normal and
anomalous data
• Data : Multi-dimensional cross-sectional data ( Numeric/categorical)
• Examples:
▫ Rule-based (Decision trees)
▫ Neural networks
▫ SVM
2. Classification based techniques
50. 50
• Statistical techniques fit a statistical model to given data and
apply a statistical inference test to determine whether an unseen instance
belongs to this model or not.
• Based on the assumed statistical model that describes the data,
anomalies are data points that are assumed to have not been
generated by the model
3. Statistical techniques
51. 51
• Assumption : The underlying distribution is known and the
parameters for the distribution can be estimated
• Goal: Infer if a data point belongs to the distribution or not.
• Data : Depends on the technique
• Techniques:
▫ Gaussian techniques
– IQR test (box plots)
– The region between Q1 − 1.5*IQR and Q3 + 1.5*IQR contains about 99.3% of
observations for Gaussian data, and hence the 1.5*IQR boundary makes the
box-plot rule roughly equivalent to the 3σ technique (see the sketch below).
– Grubbs’ test (later)
– Chi-squared test (later)
3. Statistical techniques - Parametric
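A small Python sketch of the 1.5·IQR box-plot rule; the sample data and the injected outlier (40) are illustrative:

```python
# IQR (box-plot) rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
import numpy as np

x = np.array([2, 5, 7, 8, 11, 5, 3, 4, 5.8, 8, 40])  # 40 is an injected outlier
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(x[(x < lower) | (x > upper)])                   # -> [40.]
```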
52. 52
• Regression based techniques
– Here the goal is to model the data into a lower dimensional sub-space
using linear correlations.
– i.e. summarize data in to a model parameterized by the coefficients and
constant.
– Step 1: Build a regression model with various features
– Step 2: Review the residuals; large residual magnitudes indicate
anomalies.
3. Statistical techniques - Parametric
53. 53
• Mixture models
– Example: Gaussian mixture model
– Here, the data is characterized by a process that is a mixture of Gaussian clusters.
– The parameters are estimated using an EM algorithm
– The goal is to determine the probability of data points being in different clusters.
– Anomalies would have low probability values (see the sketch below)
3. Statistical techniques - Parametric
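A sketch of the mixture-model idea using scikit-learn's GaussianMixture (the workshop labs are in R/Keras; the component count, synthetic data, and 1% likelihood threshold here are illustrative choices):

```python
# Fit a 2-component Gaussian mixture via EM and treat low-likelihood
# points as anomalies.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),
               rng.normal(6, 1, (200, 2)),
               [[3, 12]]])                   # one point far from both clusters

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_density = gmm.score_samples(X)           # per-point log-likelihood
threshold = np.percentile(log_density, 1)    # flag the lowest 1%
print(np.where(log_density < threshold)[0])
```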
54. 54
• Assumption : The data’s distribution isn’t known a priori
• Goal: Infer if a data point belongs to the assumed normal model
• Data : Depends on the technique
• Techniques:
▫ Histogram
– Count/Frequency based: Create histogram. Bins with very few points
indicate anomalies
▫ Kernel-based
– Using density-estimation techniques, build kernel functions and estimate
the probability density function (pdf) for normal instances. Instances lying
in low-probability areas are termed anomalies (see the sketch below)
3. Statistical techniques - Nonparametric
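A kernel-density sketch with scikit-learn; the Gaussian kernel, bandwidth, and 1% density cutoff are illustrative assumptions:

```python
# Nonparametric density estimate: points in low-density regions are flagged.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 300), [9.0]]).reshape(-1, 1)

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)
log_pdf = kde.score_samples(X)                    # estimated log-density per point
print(np.where(log_pdf < np.percentile(log_pdf, 1))[0])
```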
56. 56
• Assumption : Anomalous points are isolated from the rest of the
data
• Goal: Segment the points/space with a goal of identifying
anomalies.
• Data : Typically, multi-dimensional cross-sectional data
• Methods:
▫ Clustering : Unsupervised techniques to group data into clusters using
distances/densities depending on the technique. Anomalies belong to
sparse clusters/ no clusters and are typically far off from “normal”
clusters.
▫ Examples : K-means
4.0 Proximity-based techniques
57. 57
▫ Nearest neighbor techniques: Here, it is assumed that normal points
occur in dense neighborhoods and anomalies are far from neighbors
▫ Here distances/relative densities are used to determine neighborhoods
▫ Examples :
– KNN algorithm : The anomaly score of a data instance is defined as its distance
to its kth nearest neighbor in a given data set.
– Local Outlier Factor scores : The anomaly score (LOF) is equal to the ratio
of average local density of the k-nearest neighbors of the instance and the
local density of the data instance itself.
4.0 Proximity-based techniques
58. 58
• Assumption : Anomalies induce irregularities in the information
content of the data set that increase the size of its information
summary (i.e., reduce its compressibility)
• Goal: Identify data that can’t be summarized into a lower
dimensional space efficiently
• Data : Typically, multi-dimensional cross-sectional data
• Example:
• The first line can be summarized as “AB” repeated 17 times
• With a C present in the second line, the second line can no longer be
succinctly summarized
5. Information theoretic models
59. 59
• Assumption : Using multiple algorithms would help increase the
robustness of the anomaly detection algorithm
• Goal: Use ensembles to enhance the quality of anomaly detection
• Data : Typically, multi-dimensional cross-sectional data
• Methods:
▫ Sequential ensembles : A given algorithm or a sequence of algorithms
is applied sequentially. Example: boosting methods for classification
▫ Independent ensembles : Different algorithms or different instances of
the same algorithm are run and the results combined to detect robust
outliers.
6. Meta-algorithms and ensemble techniques
60. 60
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
61. 61
• For Unsupervised cases, hard as data
isn’t labeled
• For Supervised learning, ROC curve
• The true-positive rate is also known
as sensitivity, or recall in machine
learning.
• The false-positive rate is also known
as the fall-out and can be calculated
as (1 - specificity).
• The ROC curve is thus the sensitivity
as a function of fall-out (see the sketch below).
7.0 Performance evaluation
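A short scikit-learn sketch of the ROC computation for a labeled (supervised) case; the labels and scores below are made up:

```python
# ROC curve from outlier scores vs. ground-truth labels.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1])               # 1 = anomaly
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.9, 0.25, 0.8, 0.4, 0.7, 0.95])

fpr, tpr, thresholds = roc_curve(y_true, scores)  # fall-out vs. sensitivity
print("AUC:", roc_auc_score(y_true, scores))
```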
63. 63
• The F-1 score considers both the precision p and the recall r of the test
to compute the score: F1 = 2pr / (p + r).
• The F1 score is the harmonic mean of precision and recall;
it reaches its best value at 1 (perfect precision and
recall) and its worst at 0 (see the sketch below)
F-1 score
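A quick check of the definition with scikit-learn, on toy labels:

```python
# F1 as the harmonic mean of precision and recall, computed two ways.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1, 1, 1]        # labels from some score threshold
p, r = precision_score(y_true, y_pred), recall_score(y_true, y_pred)
print(f1_score(y_true, y_pred), 2 * p * r / (p + r))  # identical by definition
```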
65. 65
1. Graphical approach
2. Statistical approach
3. Machine learning approach
4. Density based approach
5. Time series approach
Illustration of five methodologies for Anomaly Detection
67. Graphical approaches
• Graphical methods utilize extreme value analysis, in which outliers
correspond to the statistical tails of probability distributions.
• Statistical tails are most commonly used for one-dimensional
distributions, although the same concept can be applied to the
multidimensional case.
• It is important to understand that all extreme values are outliers,
but the reverse may not be true.
• For instance, in the one-dimensional dataset
{1,3,3,3,50,97,97,97,100}, the observation 50 equals the mean and isn’t
considered an extreme value, but since this observation is the
most isolated point, it should be considered an outlier.
67
68. Box plot
• A standardized way of displaying the
variation of data based on the five
number summary, which includes
minimum, first quartile, median, third
quartile, and maximum.
• This plot does not make any assumptions
about the underlying statistical distribution.
• Any data points falling outside the whiskers
(the plot’s “minimum” and “maximum”,
conventionally 1.5×IQR beyond the quartiles)
are considered outliers.
68
70. Scatter plot
• A mathematical diagram that uses Cartesian coordinates to plot ordered
pairs, typically showing the relationship between two random variables.
• An outlier is defined as a data point that doesn’t seem to fit with the rest of the
data points.
• In scatter plots, outliers with respect to either one of the two variables or their
joint distribution can be shown.
70
72. 72
• In statistics, a Q–Q plot is a probability plot, which is a graphical
method for comparing two probability distributions by plotting their
quantiles against each other.
• If the two distributions being compared are similar, the points in the
Q–Q plot will approximately lie on the line y = x (see the sketch below).
Q-Q plot
Source: Wikipedia
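A one-line sketch with SciPy; the synthetic sample and the normal reference distribution are illustrative assumptions:

```python
# Q-Q plot of a sample against the normal distribution.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

sample = np.random.default_rng(0).normal(size=200)
stats.probplot(sample, dist="norm", plot=plt)  # sample vs. normal quantiles
plt.show()                                     # points near y = x if normal
```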
73. Adjusted quantile plot
• This plot identifies possible multivariate outliers by calculating the Mahalanobis
distance of each point from the center of the data.
• The multi-dimensional Mahalanobis distance between vectors x and y in ℝⁿ can be
formulated as:
d(x, y) = √((x − y)ᵀ S⁻¹ (x − y))
where x and y are random vectors of the same distribution with the covariance
matrix S.
• An outlier is defined as a point with a distance larger than some pre-determined
value (see the sketch below).
73
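A NumPy sketch of the distance computation from the center of the data; the 3.5 cutoff is an illustrative pre-determined value, not from the slides:

```python
# Mahalanobis distance of each point from the sample mean.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=500)
mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))

diff = X - mu
d = np.sqrt(np.einsum("ij,jk,ik->i", diff, S_inv, diff))  # d(x, mu) per row
print(np.where(d > 3.5)[0])                                # illustrative cutoff
```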
74. Adjusted quantile plot
• Before applying this method, and many other parametric
multivariate methods, we first need to check whether the data is
multivariate normally distributed using multivariate
normality tests, such as Royston, Mardia, Chi-square,
univariate plots, etc.
• In R, we use the “mvoutlier” package, which implements the
graphical approaches discussed above.
74
75. Adjusted quantile plot
75
Min-Max normalization before diving into analysis
Multivariate normality test
Outlier Boolean vector identifies the
outliers
Alpha defines maximum thresholding proportion
See Graphical_Approach.R
78. Symbol plot
• This plot displays two-dimensional data using robust Mahalanobis distances based
on the minimum covariance determinant (MCD) estimator with adjustment.
• The Minimum Covariance Determinant (MCD) estimator looks for the subset of h
data points whose covariance matrix has the smallest determinant.
• The four ellipsoids drawn in the plot show the Mahalanobis distances corresponding
to the 25%, 50%, 75% and adjusted quantiles of the chi-square distribution.
78
79. Symbol plot
79
See Graphical_Approach.R
Parameter “quan” defines the proportion of observations
used for the minimum covariance determinant
estimation. The default is 0.5.
Alpha defines the proportion of observations used for
calculating the adjusted quantile.
80. Case study 1: Anomaly Detection With Freddie
Mac Data
2016 Copyright QuantUniversity LLC.
82. Hypothesis testing
• This method draws conclusions about a sample point by testing whether it
comes from the same distribution as the training data.
• Statistical tests, such as the t-test and the ANOVA table, can be used on multiple
subsets of the data.
• Here, the level of significance, i.e., the probability of incorrectly rejecting the
true null hypothesis, needs to be chosen.
• To apply this method in R, the “outliers” package, which implements these
statistical tests, is used.
82
83. Chi-square test
• The chi-square test performs a simple test for detecting outliers in univariate data
based on the chi-square distribution of the squared differences between the data
and the sample mean.
• In this test, the sample variance serves as the estimator of the population variance.
• Chi-square test helps us identify the lowest and highest values, since outliers
can exist in both tails of the data.
83
84. 84
When an analyst attempts to fit a statistical model to observed data, he or she may wonder how well the model actually
reflects the data. How "close" are the observed values to those which would be expected under the fitted model? One
statistical test that addresses this issue is the chi-square goodness of fit test.
This test is commonly used to test association of variables in two-way tables where the assumed model of independence is
evaluated against the observed data. In general, the chi-square test statistic is of the form
χ² = Σ (observed − expected)² / expected.
If the computed test statistic is large, then the observed and expected values are not close and the model is a poor fit to the
data (anomaly).
Chi-square test
86. Grubbs’ test
• Test for outliers for univariate data sets assumed to come from a normally
distributed population.
• Grubbs' test detects one outlier at a time. This outlier is expunged from the
dataset and the test is iterated until no outliers are detected.
• This test is defined for the following hypotheses:
H0: There are no outliers in the data set
H1: There is exactly one outlier in the data set
• The Grubbs' test statistic is the largest absolute deviation from the sample mean
in units of the sample standard deviation: G = max |Yᵢ − Ȳ| / s (see the sketch below).
86
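A hedged Python sketch of a single Grubbs iteration (the full procedure repeats it until nothing is flagged). The critical-value formula follows the NIST handbook; the sample data and alpha are illustrative:

```python
# One two-sided Grubbs' test iteration, assuming approximately normal data.
import numpy as np
from scipy import stats

def grubbs_outlier(x, alpha=0.05):
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)
    i = np.argmax(np.abs(x - mean))
    G = abs(x[i] - mean) / sd                        # Grubbs' statistic
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    G_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t**2 / (n - 2 + t**2))
    return (i, x[i]) if G > G_crit else None         # flagged point or None

print(grubbs_outlier(np.array([2.1, 2.3, 2.2, 2.0, 2.4, 9.0])))
```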
89. Scores
• Scores quantify the tendency of a data point to be an outlier by assigning it a
score or probability.
• The most commonly used scores are:
▫ Normal (z) score: (x − mean) / standard deviation
▫ t-Student score: z√(n − 2) / √(n − 1 − z²)
▫ Chi-square score: (x − mean)² / variance
▫ IQR score: based on Q3 − Q1; for points outside [Q1, Q3], the distance to the
nearest quartile divided by the IQR
• By using the “scores” function in R, p-values can be returned instead of scores.
89
91. Scores
91
See Statistical_Approach.R
By setting “prob” to a specific value, a logical vector
flags as outliers the data points whose probabilities
are greater than this cut-off value.
By setting “type” to IQR, all values lower than the first
quartile or greater than the third quartile are considered,
and the difference between each value and the nearest
quartile, divided by the IQR, is calculated.
92. 92
• Linear regression
• Piecewise/segmented regression
• Autoencoder-decoder
• Clustering-based approaches
93. Linear regression
• Linear regression investigates the linear relationships between variables to
predict one variable based on one or more other variables, and it can be
formulated as:
Y = β₀ + Σᵢ₌₁ⁿ βᵢXᵢ
where Y and the Xᵢ are random variables, the βᵢ are regression coefficients and β₀ is a
constant.
• In this model, the ordinary least squares estimator is usually used to minimize the
squared differences between the observed and fitted values of the dependent variable.
93
94. Piecewise/segmented regression
• A method in regression analysis, in which the independent variable is
partitioned into intervals to allow multiple linear models to be fitted to data for
different ranges.
• This model can be applied when there are ‘breakpoints’ and clearly
different linear relationships in the data with a sudden, sharp change in
directionality. Below is a simple segmented regression with one
breakpoint (two segments):
Y = β₀ + b₁X for X ≤ X₁
Y = β₁ + b₂X for X > X₁
where Y is the predicted value, X is the independent variable, β₀ and β₁ are
constant values, b₁ and b₂ are regression coefficients, and X₁ is the
breakpoint (more segments can be added for additional breakpoints).
94
96. Piecewise/segmented regression
• For this example, we use “segmented” package in R to first illustrate piecewise
regression for two dimensional data set, which has a breakpoint around z=0.5.
96
See Piecewise_Regression.R
“pmax” is used for parallel maximization to
create different values for y.
98. Piecewise/segmented regression
• Finally, the outliers can be detected for each segment by setting rules on the
residuals of the model (see the sketch below).
98
See Piecewise_Regression.R
Here, we set a rule on the residuals corresponding to z
less than 0.5; points whose residuals exceed the chosen
threshold are flagged as outliers.
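A Python sketch of the same idea (the workshop lab is Piecewise_Regression.R; here the breakpoint at z = 0.5, the noise level, and the 3σ residual rule are illustrative choices):

```python
# Segmented regression with a known breakpoint at z = 0.5:
# fit one line per segment, then flag large residuals in each segment.
import numpy as np

rng = np.random.default_rng(1)
z = rng.uniform(0, 1, 200)
y = np.where(z < 0.5, 2 * z, 4 - 2 * z) + rng.normal(0, 0.05, 200)
y[0] = 5.0                                      # inject an outlier

for mask in (z < 0.5, z >= 0.5):                # one linear model per segment
    slope, intercept = np.polyfit(z[mask], y[mask], 1)
    resid = y[mask] - (intercept + slope * z[mask])
    flagged = np.where(mask)[0][np.abs(resid) > 3 * resid.std()]
    print("segment outliers at indices:", flagged)
```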
100. 100
• Goal is to have the reconstruction x̂ approximate x
• Interesting applications such as
▫ Data compression
▫ Visualization
▫ Pre-train neural networks
Autoencoder
101. 101
Demo in Keras1
1. https://blog.keras.io/building-autoencoders-in-keras.html
2. https://keras.io/models/model/
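A minimal Keras sketch in the spirit of the blog post above; the architecture, epochs, synthetic data, and 99th-percentile threshold are illustrative choices, not the workshop's exact demo:

```python
# Dense autoencoder: score points by reconstruction error; high error
# suggests an anomaly.
import numpy as np
from tensorflow import keras

X = np.random.default_rng(0).normal(size=(1000, 20)).astype("float32")

inputs = keras.Input(shape=(20,))
encoded = keras.layers.Dense(4, activation="relu")(inputs)      # bottleneck
decoded = keras.layers.Dense(20, activation="linear")(encoded)
autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)      # learn x_hat ~ x

errors = np.mean((X - autoencoder.predict(X, verbose=0)) ** 2, axis=1)
print(np.where(errors > np.percentile(errors, 99))[0])          # top-1% errors
```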
102. 102
Principal Component Analysis
Principal component analysis (PCA) is a statistical
procedure that uses an orthogonal transformation to
convert a set of observations of possibly correlated
variables (entities each of which takes on various
numerical values) into a set of values of linearly
uncorrelated variables called principal components.
In outlier analysis, we perform principal component
analysis and compute p-values to test for outliers (see the sketch below).
https://en.wikipedia.org/wiki/Principal_component_analysis
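A scikit-learn sketch of a reconstruction-error variant (the slide mentions p-values; ranking by low-rank reconstruction error is a common, simpler alternative). The data, component count, and percentile cutoff are illustrative:

```python
# PCA anomaly scoring: project to a few components, reconstruct, and flag
# the points recovered least faithfully.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X[:, 1] = X[:, 0] * 2 + rng.normal(0, 0.1, 500)   # correlated structure
X[7] = 6.0                                        # break the correlation

pca = PCA(n_components=3).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))   # low-rank reconstruction
err = np.mean((X - X_hat) ** 2, axis=1)
print(np.where(err > np.percentile(err, 99))[0])  # likely includes index 7
```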
103. Clustering-based approaches
• These methods are suitable for unsupervised anomaly detection.
• They aim to partition the data into meaningful groups (clusters) based on the
similarities and relationships between the groups found in the data.
• Each data point is assigned a degree of membership for each of the clusters.
• Anomalies are those data points that:
▫ Do not fit into any clusters.
▫ Belong to a particular cluster but are far away from the cluster centroid.
▫ Form small or sparse clusters.
103
104. Clustering-based approaches
• These methods partition the data into k clusters by assigning each data point to
its closest cluster centroid, minimizing the within-cluster sum of squares
(WSS):
WSS = Σₖ₌₁ᴷ Σ_{i∈Cₖ} Σⱼ₌₁ᵖ (xᵢⱼ − μₖⱼ)²
where Cₖ is the set of observations in the kth cluster and μₖⱼ is the mean of the jth
variable of the cluster center of the kth cluster.
• Then, they select the top n points that are the farthest away from their nearest
cluster centers as outliers (see the sketch below).
104
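A scikit-learn sketch of this procedure (the workshop lab uses R's kmod package; the cluster count and top-n choice here are illustrative):

```python
# K-means outliers: cluster, then rank points by distance to their
# assigned centroid; the farthest n points are candidate outliers.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (150, 2)),
               rng.normal(8, 1, (150, 2)),
               [[4, 15]]])                        # far from both clusters

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
print(np.argsort(dist)[-3:])                      # top-3 farthest points
```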
106. Clustering-based approaches
• “Kmod” package in R is used to show the application of K-means model.
106
In this example, the number of clusters is chosen
using an elbow (bend) graph and then passed to the
kmod function.
See Clustering_Approach.R
112. Local Outlier Factor (LOF)
• The local outlier factor (LOF) algorithm first calculates the density of the local
neighborhood of each point.
• Then, for each object p, the LOF score is defined as the average ratio of the
density of p’s nearest neighbors to the density of p itself. The number
of nearest neighbors, k, is given by the user.
• Points with largest LOF scores are considered as outliers.
• In R, both “DMwR” and “Rlof” packages can be used for performing LOF model.
112
113. Local Outlier Factor (LOF)
• The LOF scores for outlying points will be high because they are computed as
ratios to the average neighborhood reachability distances.
• As a result, for data points distributed homogeneously within a cluster, the
LOF scores will be close to one.
• Over a range of values for k, the maximum LOF score can be used to determine
the scores associated with the local outliers (a Python sketch follows the R
examples below).
113
114. Local Outlier Factor (R)
• LOF returns a numeric vector of scores for each observation in the data set.
114
k is the number of neighbors used in the
calculation of the local outlier scores.
See Density_Approach.R
Outlier indexes
115. Local Outlier Factor (R)
115
Local outliers are shown in
red.
See Density_Approach.R
116. 116
Local Outlier Factor (R)
Histogram of regular observations vs outliers
See Density_Approach.R
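For comparison with the R output above, a scikit-learn LOF sketch; the synthetic data and k = 20 are illustrative user choices:

```python
# LOF via scikit-learn: scores near 1 indicate inliers, larger values
# indicate local outliers.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (200, 2)), [[3, 3]]])

lof = LocalOutlierFactor(n_neighbors=20)          # k chosen by the user
labels = lof.fit_predict(X)                       # -1 marks outliers
scores = -lof.negative_outlier_factor_            # sign-flipped LOF scores
print(np.where(labels == -1)[0], scores.max())
```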
118. Time-series method
• The time-series method is used to identify outliers only in univariate time-series
data.
• In order to apply this method, we use the “AnomalyDetection” package in R.
• This package was published by Twitter for detecting anomalies in time-series
data in the presence of seasonality and an underlying trend, using statistical
approaches.
• Since this package uses a specific algorithm to detect anomalies, we go over it
in detail in the next slide.
119. Anomaly detection, R package
• Twitter’s R package: https://github.com/twitter/AnomalyDetection
• Seasonal Hybrid ESD (S-H-ESD), which builds upon the Generalized ESD test, is
the underlying algorithm of this package.
• The algorithm employs time series decomposition and statistical metrics with
ESD test.
• Since time-series data exhibit a huge variety of patterns, time-series
decomposition, which is a statistical method, is used to decompose the data into
its four components.
• The four components are:
1. Trend: refers to the long term progression of the series
2. Cyclical: refers to variations in recognizable cycles
3. Seasonal: refers to seasonal variations or fluctuations
4. Irregular: describes random, irregular influences
120. 120
• The generalized ESD (extreme Studentized deviate) test (Rosner 1983) is used to detect
one or more outliers in a univariate data set that follows an approximately normal
distribution.
• The primary limitation of the Grubbs test is that the suspected number of outliers, k,
must be specified exactly; if k is not specified correctly, this can distort the conclusions
of the test. The generalized ESD test only requires an upper bound on the suspected
number of outliers (see the sketch below).
• https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm
Generalized ESD Test for Outliers
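A hedged Python sketch of the decompose-then-test idea using statsmodels' STL and a robust residual test. This is a simplified stand-in for illustration, not Twitter's exact S-H-ESD implementation; the weekly seasonality, injected spike, and 3.5 cutoff are illustrative:

```python
# Decompose the series with STL, then run a robust (median/MAD)
# deviation test on the remainder.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

rng = np.random.default_rng(0)
n = 365
t = np.arange(n)
y = 10 + 0.01 * t + 2 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.3, n)
y[100] += 5                                      # inject an anomaly
series = pd.Series(y, index=pd.date_range("2019-01-01", periods=n))

resid = STL(series, period=7).fit().resid        # trend/seasonal removed
mad = np.median(np.abs(resid - np.median(resid)))
robust_z = 0.6745 * (resid - np.median(resid)) / mad
print(np.where(np.abs(robust_z) > 3.5)[0])       # -> includes day 100
```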
123. 123
• Also try:
▫ https://github.com/omri374/taganomaly
▫ https://docs.microsoft.com/en-us/azure/machine-learning/studio-
module-reference/anomaly-detection
▫ https://www.youtube.com/watch?v=Ra8HhBLdzHE
Anomaly as a service
124. Summary
124
We have covered anomaly detection:
Introduction
• Definition of anomaly detection and its importance in domains such as energy, healthcare and finance
• Different types of anomaly detection methods: statistical, graphical and machine learning methods
Graphical approach
• Graphical methods consist of the box plot, scatter plot, adjusted quantile plot and symbol plot to demonstrate
outliers graphically
• The main assumption for applying graphical approaches is multivariate normality
• The Mahalanobis distance is mainly used for calculating the distance of a point from the center of a
multivariate distribution
Statistical approach
• Statistical hypothesis testing includes the Chi-square and Grubbs’ tests
• Statistical methods may use either scores or p-values as thresholds to detect outliers
Machine learning approach
• Both supervised and unsupervised learning methods can be used for outlier detection
• Piecewise or segmented regression can be used to identify outliers based on the residuals for each segment
• In the K-means clustering method, outliers are defined as points that don’t belong to any cluster, are far
away from the cluster centroids, or form sparse clusters
• In PCA and autoencoder-decoder methods, we look at points that weren’t reconstructed close to the original points
as anomalies
Density approach
• The local outlier factor algorithm is used to detect local outliers
• The relative density of a data point is compared to the density of its k nearest neighbors; k is mainly chosen by
the user
Time series methods
• Temporal outlier detection to detect anomalies which is robust, from a statistical standpoint, in the presence of
seasonality and an underlying trend.
127. Thank you!
Sri Krishnamurthy, CFA, CAP
Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
Contact
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.
127