Anomaly detection (or outlier analysis) is the identification of items, events, or observations that do not conform to an expected pattern or to other items in a dataset. It is used in applications such as intrusion detection, fraud detection, fault detection, and process monitoring in domains including energy, healthcare, and finance.
In this workshop, we will cover the core techniques in anomaly detection and recent advances in deep learning in this field.
Through case studies, we will discuss how anomaly detection techniques can be applied to various business problems. We will also demonstrate examples using R, Python, Keras, and TensorFlow to reinforce concepts in anomaly detection and best practices for analyzing and reviewing results.
What you will learn:
Anomaly Detection: An introduction
Graphical and Exploratory analysis techniques
Statistical techniques in Anomaly Detection
Machine learning methods for Outlier analysis
Evaluating performance in Anomaly detection techniques
Detecting anomalies in time series data
Case study 1: Anomalies in Freddie Mac mortgage data
Case study 2: Auto-encoder based Anomaly Detection for Credit risk with Keras and Tensorflow
Anomaly detection Workshop slides
1. Location:
Qcon.ai Conference
San Francisco
April 15th 2019
Anomaly Detection
Techniques and Best Practices
2019 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
www.QuantUniversity.com
sri@quantuniversity.com
2. 2
• Introduction
• Applications of Anomaly Detection
• Break – 10.30-10.45am
• Aspects of Anomaly Detection
• Lunch Break : 12.00-1.00pm
• Techniques- Deep Dive
• Break – 2.30-2.45pm
• Labs and Examples
Agenda
4. - Analytics Advisory services
- Custom training programs
- Architecture assessments, advice and audits
5. • Founder of QuantUniversity LLC. and
www.analyticscertificate.com
• Advisory and Consultancy for Financial Analytics
• Prior Experience at MathWorks, Citigroup and
Endeca and 25+ financial services and energy
customers
• Regular Columnist for the Wilmott Magazine
• Author of forthcoming book
“Financial Modeling: A case study approach”
published by Wiley
• Chartered Financial Analyst and Certified Analytics
Professional
• Analytics Faculty in the Babson College MBA
program and at Northeastern University, Boston
Sri Krishnamurthy
Founder and CEO
5
6. 6
Quantitative Analytics and Big Data Analytics Bootcamps
• Analytics Certificate program
• Fintech Certificate program
• Deep Learning & AI boot camp
• Natural Language Processing
workshop
• Machine Learning for Finance
• Machine Learning for Healthcare
applications
See www.analyticscertificate.com for
current and future offerings
10. What is anomaly detection?
• Anomalies or outliers are data points within the datasets
that appear to deviate markedly from expected outputs.
• An outlier is an observation which deviates so much from the other
observations as to arouse suspicions that it was generated by a
different mechanism1
• Anomaly detection refers to the problem of finding
patterns in data that don’t conform to expected behavior
10
1. D. Hawkins. Identification of Outliers, Chapman and Hall, 1980.
11. 11
• Outliers are data points that are considered out of the ordinary or
abnormal. This includes noise.
• Anomalies are a special kind of outlier that carries significant/
critical/actionable information which could be of interest to
analysts.
Anomaly vs Outliers
[Figure: two clusters, labeled 1 and 2. All points not in clusters 1 & 2 are outliers; point B is an anomaly (both X and Y are large).]
12. 12
• Note that it is the analyst’s judgement that determines
what is considered just an outlier versus an anomaly.
• Most outlier detection methods generate outputs that are:
▫ Real-valued outlier scores: quantify the tendency of a
data point to be an outlier by assigning a score or
probability to it.
▫ Binary labels: the result of using a threshold to convert
outlier scores to binary labels, inlier or outlier.
Outlierness
13. 13
• Fraud Detection
▫ Credit card fraud detection
– By owner or by operation
▫ Mobile phone fraud/anomaly detection
– Calling behavior, volume etc.
▫ Insurance claim fraud detection
– Medical malpractice
– Auto insurance
▫ Insider trading detection
• E-commerce
▫ Pricing issues
▫ Network issues
Applications of Anomaly Detection
14. 14
• Intrusion detection:
▫ Detect malicious activity in computer systems
▫ This could be host-based or network-based
• Medical anomalies
Examples of Anomaly Detection
15. 15
• Manufacturing and sensors:
▫ Fault detection
▫ Heat, fire sensors
• Text data
▫ Novel topics, events
▫ Plagiarism
Examples of Anomaly Detection
16. 16
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
17. 17
• By definition, anomaly detection deals with identifying patterns and
points that are not considered normal. This implies that we must
first have a model to define what is normal in our datasets.
• If our model doesn’t capture the nuances of “normal” behavior in
our datasets, our anomaly detection algorithms won’t fare well.
• If our model captures all nuances in our data, we would have overfit
the model and wouldn’t be able to identify anomalies properly
• If our model is too generic, then most points would show up as
anomalies.
• Let’s illustrate this
1. Importance of defining what is normal
18. 18
• Consider the points below. Our goal is to build a “normal” model
and an anomaly detection model for the following data set.
• y = [2, 5, 7, 8, 11, 5, 3, 4, 5.8, 8]
• Choose a “normal model”.
• Determine the standard deviation and mark any point that is
outside of the 3σ limit as an anomaly
Model assumption
19. 19
• Consider three possible “normal” models
▫ Line A connects all the points : Overfit : No anomalies
▫ Line B is a linear regression line : Poor fit : Many points listed as
anomalies
▫ Line C is a polynomial fit of degree 4 : Good fit. One point shown as an
anomaly as it is outside of the 3σ band.
This illustrates why the choice of
the “normal” model is critical (see the sketch below)
Model assumption
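A minimal Python sketch of this comparison, using the data from the previous slide. The two candidate models and the 3σ residual rule follow the slides; the variable names and printout are illustrative:

```python
# Compare two candidate "normal" models (linear vs. degree-4 polynomial)
# and flag points whose residuals fall outside the 3-sigma band.
import numpy as np

y = np.array([2, 5, 7, 8, 11, 5, 3, 4, 5.8, 8])
x = np.arange(len(y))

for degree, label in [(1, "Line B: linear fit"), (4, "Line C: degree-4 fit")]:
    coeffs = np.polyfit(x, y, degree)       # least-squares polynomial fit
    residuals = y - np.polyval(coeffs, x)   # deviation from the "normal" model
    sigma = residuals.std()
    anomalies = np.where(np.abs(residuals) > 3 * sigma)[0]
    print(f"{label}: anomalous indices {anomalies}")
```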
20. 20
• We could assume that the points may come from a statistical
distribution. For example from a Gaussian distribution
• We could use an algorithm like k-nearest neighbor to define points
that are close and identify outliers that are at a large distance from
most of the points
• We could use a clustering algorithm to assign membership to clusters
Determining the “normal” model
21. 21
• We will illustrate once more why the choice of the “normal” model
is important.
• We said earlier that the “outlierness” can be quantified as a score.
• One popular score is Z-score.
• A z-score can be calculated from the following formula.
z= (X - μ) / σ
• where z is the z-score, X is the value of the element, μ is the
population mean, and σ is the standard deviation.
• It computes the number of standard deviations by which a data
point deviates from the mean and is used as a proxy to determine outliers (see the sketch below).
• However, this works only if the data comes from a Gaussian distribution
Importance of defining what is normal
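As a quick illustration, a hedged Python sketch of the z-score rule; the synthetic data and the injected outlier are illustrative, and the 3σ threshold follows the slide:

```python
# Z-score outlier check, assuming the data is roughly Gaussian.
import numpy as np

def zscore_outliers(x, threshold=3.0):
    z = (x - x.mean()) / x.std()        # z = (X - mu) / sigma
    return np.where(np.abs(z) > threshold)[0]

data = np.random.default_rng(0).normal(loc=0, scale=1, size=1000)
data[10] = 8.0                           # inject an obvious outlier
print(zscore_outliers(data))             # index 10 should be among those flagged
```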
22. 22
• In the first case, 99.9% of the data is within the 3σ limits. The Z-
score test works well here to detect outliers.
• In the second case, the distribution isn’t normal and is from a Zipf
distribution. Here a Z-score test isn’t valid
Importance of defining what is normal
23. 23
• It is important for the analyst to choose the right representation for
the normal model.
• In the first case, clustering is useless but a linear regression is a
good representation.
• In the second case, clustering is more appropriate.
Importance of defining what is normal
24. 24
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
25. 25
• Defining the normal region that covers all normal points in the
dataset
• Identifying anomalies masquerading as normal data points or not
being able to uncover anomalies due to a weak choice of the
“normal model”
▫ Example : DOS attack vs DDOS attack
2.0 Challenges when dealing with Anomaly Detection
problems
26. 26
• The evolving “normal behavior” in data
▫ Example:
– $100+ credit card transactions for a student – Average 5 per month
– $100+ credit card transactions for a professional – Average 15 per month
• Data and application dependency
▫ Example :
– For AAPL, a +/- $5 fluctuation in a day is an anomaly.
– For a risky stock, up to a +/- $10 fluctuation may be normal and a $15 fluctuation
may be an anomaly
• Lack of labeled data makes it harder to detect anomalies
• Not being able to distinguish noise and anomalies
2.0 Challenges when dealing with Anomaly Detection
problems
27. 27
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
28. 28
• Data objects are usually described by a set of attributes (variables,
features or dimension)
• The term univariate is used when the data has one attribute, while
bivariate and multivariate refer to data with two and more than two
attributes, respectively.
• Attributes can be quantitative or qualitative based on their
characteristics.
3.0 Input data characteristics
29. Dataset, Variables and Observations
Dataset: A rectangular array with rows as observations and
columns as variables
Variable: A characteristic of members of a population (Age, State,
etc.)
Observation: The list of variable values for a member of the
population
30. Types of Data
— A variable is numerical if meaningful arithmetic can be
performed on it.
— Discrete vs. Continuous
— Cross-sectional vs. Longitudinal
— Otherwise, the variable is categorical.
— Binary vs Multivalued
— Ordinal vs Nominal
32. Categorical Variables
• Categorical variables can be coded numerically or left uncoded.
• A dummy variable is a 0–1 coded variable for a specific category. It is
coded as 1 for all observations in that category and 0 for all
observations not in that category.
• Categorizing a numerical variable as categorical is called binning
(putting the data into discrete bins) or discretizing.
33. Numerical Data
• Discrete :
▫ How many cities have you lived in?
▫ How many cars pass through a toll booth?
• Continuous:
▫ What’s your height?
▫ What’s the temperature outside?
▫ What was the interest rate in Dec 2004?
35. 35
• In Cross-sectional records, data instances are independent.
• Data instances could be related to each other (typically longitudinal)
• Types of data records that are typically related to each other
▫ Sequence records:
– Time-series data – temporal continuity
– Genome and protein sequences
▫ Spatial Data : Data is related to neighbors; Spatial continuity
– Traffic data
– Weather data
– Ecological data
▫ Graph data
– Social media data
Relationship between records
36. 36
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
37. 37
Anomalies can be classified into three major categories
1. Point Anomalies
– If an instance is anomalous compared with the rest of the instances, the anomaly
is considered a point anomaly
2. Contextual Anomalies
– If an instance is anomalous in a specific context, the anomaly would be
considered as a contextual anomaly
3. Collective Anomalies
– If a collection of related data records are anomalous with respect to the entire
data set, the anomaly is a collective anomaly
4.0 Anomaly Classification
38. 38
• In the figure, points o1 and o2 are considered point anomalies
• Examples:
▫ A 50% increase in daily stock price
▫ A credit card transaction attempt for $5000 (assuming you have never
had a single transaction for anything above $1000)
Point Anomalies
39. 39
• In the figure, temperature t2 is an anomaly
• Note that t1 is lower than t2 but contextually, t1 is expected and t2
isn’t when compared to records around it.
Contextual Anomalies
40. 40
• Multiple Buy Stock transactions and then a sequence of Sell transactions
around an earnings release date may be anomalous and may indicate
insider trading.
• Consider the sequence of network activities recorded
• Though ssh, buffer-overflow and ftp themselves are not anomalous
activities, a sequence of the three indicates a web-based attack
• Similarly, multiple http requests from an ip address may indicate a
crawler in action.
See https://www.bloomberg.com/graphics/2019-etf-tax-dodge-lets-
investors-save-big/
Collective Anomalies
41. 41
• In medicine, abnormal ECG pattern detection would involve looking
for collective anomalies like Premature Atrial Contraction
Collective Anomalies
http://www.fprmed.com/Pages/Cardio/PAC.html
42. 42
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
43. 43
• The goal of an outlier detection or anomaly detection algorithm is
to identify whether there are anomalies in the data. The outputs take
one of two forms:
▫ Scores : A number generated by the algorithm for each record.
Typically, the scores are sorted and a threshold is chosen to designate
anomalies
▫ Labels : Here the algorithm makes a binary decision on whether each
record is an anomaly or not
5. Outputs of Anomaly Detection algorithms
44. 44
• Based on whether the data is labeled or not, machine learning
algorithms can be used for anomaly detection
• If the historical data is labeled (anomaly/not), supervised techniques
can be used
• If the historical data isn’t labeled, unsupervised algorithms can be
used to figure out if the data is normal/anomalous
5. Outputs of Anomaly Detection algorithms
45. 45
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
46. 46
1. Extreme value analysis
2. Classification-based techniques
3. Statistical techniques
i. Parametric techniques
a. Gaussian model-based models
b. Regression-based models
c. Mixture models
ii. Non-Parametric techniques
a. Histogram-based models
b. Kernel-based models
A tour of Anomaly Detection Techniques
47. 47
4. Proximity-based models
i. Cluster analysis
ii. Nearest neighbor analysis
5. Information theoretic models
6. Meta-algorithms and ensemble techniques
i. Sequential ensembles
ii. Independent ensembles
A tour of Anomaly Detection Techniques
48. 48
• Assumption : Anomalies are extreme values in the data set
• Goal: Determine the statistical tails of the underlying distribution
• Data : Univariate cross-sectional data
• Examples:
▫ Z-score test for a dataset which is assumed to be Normal
▫ Grubbs’ test
▫ Using Box-plots to detect anomalies
1. Extreme value Analysis
49. 49
• Assumption : Labeled data is available
• Goal: Build a classifier that can distinguish between normal and
anomalous data
• Data : Multi-dimensional cross-sectional data ( Numeric/categorical)
• Examples:
▫ Rule-based (Decision trees)
▫ Neural networks
▫ SVM
2. Classification based techniques
50. 50
• Statistical techniques fit a statistical model to given data and
apply a statistical inference test to determine whether an unseen instance
belongs to this model or not.
• Based on the assumed statistical model that describes the data,
anomalies are data points that are assumed to have not been
generated by the model
3. Statistical techniques
51. 51
• Assumption : The underlying distribution is known and the
parameters for the distribution can be estimated
• Goal: Infer if a data point belongs to the distribution or not.
• Data : Depends on the technique
• Techniques:
▫ Gaussian techniques
– IQR test (box plots)
– The region between Q1 − 1.5*IQR and Q3 + 1.5*IQR contains about 99.3% of
observations for Gaussian data, and hence the 1.5*IQR boundary makes the
box-plot rule roughly equivalent to the 3σ technique (see the sketch below).
– Grubbs’ test (later)
– Chi-squared test (later)
3. Statistical techniques - Parametric
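A small Python sketch of the 1.5·IQR box-plot rule; the sample data and the injected outlier (40) are illustrative:

```python
# IQR (box-plot) rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
import numpy as np

x = np.array([2, 5, 7, 8, 11, 5, 3, 4, 5.8, 8, 40])  # 40 is an injected outlier
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(x[(x < lower) | (x > upper)])                   # -> [40.]
```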
52. 52
• Regression based techniques
– Here the goal is to model the data into a lower dimensional sub-space
using linear correlations.
– i.e. summarize data in to a model parameterized by the coefficients and
constant.
– Step 1: Build a regression model with various features
– Step 2: Review the residuals; large residual magnitudes indicate
anomalies.
3. Statistical techniques - Parametric
53. 53
• Mixture models
– Example: Gaussian mixture model
– Here, the data is characterized by a process that is a mixture of Gaussian clusters.
– The parameters are estimated using an EM algorithm
– The goal is to determine the probability of data points being in different clusters.
– Anomalies would have low probability values (see the sketch below)
3. Statistical techniques - Parametric
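A sketch of the mixture-model idea using scikit-learn's GaussianMixture (the workshop labs are in R/Keras; the component count, synthetic data, and 1% likelihood threshold here are illustrative choices):

```python
# Fit a 2-component Gaussian mixture via EM and treat low-likelihood
# points as anomalies.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),
               rng.normal(6, 1, (200, 2)),
               [[3, 12]]])                   # one point far from both clusters

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_density = gmm.score_samples(X)           # per-point log-likelihood
threshold = np.percentile(log_density, 1)    # flag the lowest 1%
print(np.where(log_density < threshold)[0])
```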
54. 54
• Assumption : The data’s distribution isn’t known a priori
• Goal: Infer if a data point belongs to the assumed normal model
• Data : Depends on the technique
• Techniques:
▫ Histogram
– Count/Frequency based: Create histogram. Bins with very few points
indicate anomalies
▫ Kernel-based
– Using density-estimation techniques, build kernel functions and estimate
the probability density function (pdf) for normal instances. Instances lying
in low-probability areas are termed anomalies (see the sketch below)
3. Statistical techniques - Nonparametric
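A kernel-density sketch with scikit-learn; the Gaussian kernel, bandwidth, and 1% density cutoff are illustrative assumptions:

```python
# Nonparametric density estimate: points in low-density regions are flagged.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 300), [9.0]]).reshape(-1, 1)

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)
log_pdf = kde.score_samples(X)                    # estimated log-density per point
print(np.where(log_pdf < np.percentile(log_pdf, 1))[0])
```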
56. 56
• Assumption : Anomalous points are isolated from the rest of the
data
• Goal: Segment the points/space with a goal of identifying
anomalies.
• Data : Typically, multi-dimensional cross-sectional data
• Methods:
▫ Clustering : Unsupervised techniques to group data into clusters using
distances/densities depending on the technique. Anomalies belong to
sparse clusters/ no clusters and are typically far off from “normal”
clusters.
▫ Examples : K-means
4.0 Proximity-based techniques
57. 57
▫ Nearest neighbor techniques: Here, it is assumed that normal points
occur in dense neighborhoods and anomalies are far from neighbors
▫ Here distances/relative densities are used to determine neighborhoods
▫ Examples :
– KNN algorithm : The anomaly score of a data instance is defined as its distance
to its kth nearest neighbor in a given data set.
– Local Outlier Factor scores : The anomaly score (LOF) is equal to the ratio
of average local density of the k-nearest neighbors of the instance and the
local density of the data instance itself.
4.0 Proximity-based techniques
58. 58
• Assumption : Anomalies induce irregularities in the information
content of the data set that increase the size of its information
summary (i.e., reduce its compressibility)
• Goal: Identify data that can’t be summarized into a lower
dimensional space efficiently
• Data : Typically, multi-dimensional cross-sectional data
• Example:
• The first line can be summarized as “AB” repeated 17 times
• With a C present in the second line, the second line can no longer be
succinctly summarized
5. Information theoretic models
59. 59
• Assumption : Using multiple algorithms would help increase the
robustness of the anomaly detection algorithm
• Goal: Use ensembles to enhance the quality of anomaly detection
• Data : Typically, multi-dimensional cross-sectional data
• Methods:
▫ Sequential ensembles : A given algorithm or a sequence of algorithms
is applied sequentially. Example: boosting methods for classification
▫ Independent ensembles : Different algorithms or different instances of
the same algorithm are run and the results combined to detect robust
outliers.
6. Meta-algorithms and ensemble techniques
60. 60
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
61. 61
• For Unsupervised cases, hard as data
isn’t labeled
• For Supervised learning, ROC curve
• The true-positive rate is also known
as sensitivity, or recall in machine
learning.
• The false-positive rate is also known
as the fall-out and can be calculated
as (1 - specificity).
• The ROC curve is thus the sensitivity
as a function of fall-out (see the sketch below).
7.0 Performance evaluation
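A short scikit-learn sketch of the ROC computation for a labeled (supervised) case; the labels and scores below are made up:

```python
# ROC curve from outlier scores vs. ground-truth labels.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1])               # 1 = anomaly
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.9, 0.25, 0.8, 0.4, 0.7, 0.95])

fpr, tpr, thresholds = roc_curve(y_true, scores)  # fall-out vs. sensitivity
print("AUC:", roc_auc_score(y_true, scores))
```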
63. 63
• The F-1 score considers both the precision p and the recall r of the test
to compute the score: F1 = 2pr / (p + r).
• The F1 score is the harmonic mean of precision and recall;
it reaches its best value at 1 (perfect precision and
recall) and its worst at 0 (see the sketch below)
F-1 score
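A quick check of the definition with scikit-learn, on toy labels:

```python
# F1 as the harmonic mean of precision and recall, computed two ways.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1, 1, 1]        # labels from some score threshold
p, r = precision_score(y_true, y_pred), recall_score(y_true, y_pred)
print(f1_score(y_true, y_pred), 2 * p * r / (p + r))  # identical by definition
```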
65. 65
1. Graphical approach
2. Statistical approach
3. Machine learning approach
4. Density based approach
5. Time series approach
Illustration of five methodologies for Anomaly Detection
67. Graphical approaches
• Graphical methods utilize extreme value analysis, in which outliers
correspond to the statistical tails of probability distributions.
• Statistical tails are most commonly used for one-dimensional
distributions, although the same concept can be applied to the
multidimensional case.
• It is important to understand that all extreme values are outliers,
but the reverse may not be true.
• For instance, in the one-dimensional dataset
{1,3,3,3,50,97,97,97,100}, the observation 50 equals the mean and isn’t
considered an extreme value, but since this observation is the
most isolated point, it should be considered an outlier.
67
68. Box plot
• A standardized way of displaying the
variation of data based on the five
number summary, which includes
minimum, first quartile, median, third
quartile, and maximum.
• This plot does not make any assumptions
about the underlying statistical distribution.
• Any data points falling outside the whiskers
(the plot’s “minimum” and “maximum”,
conventionally 1.5×IQR beyond the quartiles)
are considered outliers.
68
70. Scatter plot
• A mathematical diagram that uses Cartesian coordinates to plot ordered
pairs, typically showing the relationship between two random variables.
• An outlier is defined as a data point that doesn’t seem to fit with the rest of the
data points.
• In scatter plots, outliers with respect to either one of the two variables or their
joint distribution can be shown.
70
72. 72
• In statistics, a Q–Q plot is a probability plot, which is a graphical
method for comparing two probability distributions by plotting their
quantiles against each other.
• If the two distributions being compared are similar, the points in the
Q–Q plot will approximately lie on the line y = x (see the sketch below).
Q-Q plot
Source: Wikipedia
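A one-line sketch with SciPy; the synthetic sample and the normal reference distribution are illustrative assumptions:

```python
# Q-Q plot of a sample against the normal distribution.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

sample = np.random.default_rng(0).normal(size=200)
stats.probplot(sample, dist="norm", plot=plt)  # sample vs. normal quantiles
plt.show()                                     # points near y = x if normal
```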
73. Adjusted quantile plot
• This plot identifies possible multivariate outliers by calculating the Mahalanobis
distance of each point from the center of the data.
• The multi-dimensional Mahalanobis distance between vectors x and y in ℝⁿ can be
formulated as:
d(x, y) = √((x − y)ᵀ S⁻¹ (x − y))
where x and y are random vectors of the same distribution with the covariance
matrix S.
• An outlier is defined as a point with a distance larger than some pre-determined
value (see the sketch below).
73
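A NumPy sketch of the distance computation from the center of the data; the 3.5 cutoff is an illustrative pre-determined value, not from the slides:

```python
# Mahalanobis distance of each point from the sample mean.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=500)
mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))

diff = X - mu
d = np.sqrt(np.einsum("ij,jk,ik->i", diff, S_inv, diff))  # d(x, mu) per row
print(np.where(d > 3.5)[0])                                # illustrative cutoff
```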
74. Adjusted quantile plot
• Before applying this method, and many other parametric
multivariate methods, we first need to check whether the data is
multivariate normally distributed using multivariate
normality tests, such as Royston, Mardia, Chi-square,
univariate plots, etc.
• In R, we use the “mvoutlier” package, which implements the
graphical approaches discussed above.
74
75. Adjusted quantile plot
75
Min-Max normalization before diving into analysis
Multivariate normality test
Outlier Boolean vector identifies the
outliers
Alpha defines maximum thresholding proportion
See Graphical_Approach.R
78. Symbol plot
• This plot displays two-dimensional data using robust Mahalanobis distances based
on the minimum covariance determinant (MCD) estimator with adjustment.
• The Minimum Covariance Determinant (MCD) estimator looks for the subset of h
data points whose covariance matrix has the smallest determinant.
• The four ellipsoids drawn in the plot show the Mahalanobis distances corresponding
to the 25%, 50%, 75% and adjusted quantiles of the chi-square distribution.
78
79. Symbol plot
79
See Graphical_Approach.R
Parameter “quan” defines the proportion of observations
used for the minimum covariance determinant
estimation. The default is 0.5.
Alpha defines the proportion of observations used for
calculating the adjusted quantile.
80. Case study 1: Anomaly Detection With Freddie
Mac Data
2016 Copyright QuantUniversity LLC.
82. Hypothesis testing
• This method draws conclusions about a sample point by testing whether it
comes from the same distribution as the training data.
• Statistical tests, such as the t-test and the ANOVA table, can be used on multiple
subsets of the data.
• Here, the level of significance, i.e., the probability of incorrectly rejecting the
true null hypothesis, needs to be chosen.
• To apply this method in R, the “outliers” package, which implements these
statistical tests, is used.
82
83. Chi-square test
• The chi-square test performs a simple test for detecting outliers in univariate data
based on the chi-square distribution of the squared differences between the data
and the sample mean.
• In this test, the sample variance serves as the estimator of the population variance.
• Chi-square test helps us identify the lowest and highest values, since outliers
can exist in both tails of the data.
83
84. 84
When an analyst attempts to fit a statistical model to observed data, he or she may wonder how well the model actually
reflects the data. How "close" are the observed values to those which would be expected under the fitted model? One
statistical test that addresses this issue is the chi-square goodness of fit test.
This test is commonly used to test association of variables in two-way tables where the assumed model of independence is
evaluated against the observed data. In general, the chi-square test statistic is of the form
χ² = Σ (observed − expected)² / expected.
If the computed test statistic is large, then the observed and expected values are not close and the model is a poor fit to the
data (anomaly).
Chi-square test
86. Grubbs’ test
• Test for outliers for univariate data sets assumed to come from a normally
distributed population.
• Grubbs' test detects one outlier at a time. This outlier is expunged from the
dataset and the test is iterated until no outliers are detected.
• This test is defined for the following hypotheses:
H0: There are no outliers in the data set
H1: There is exactly one outlier in the data set
• The Grubbs' test statistic is the largest absolute deviation from the sample mean
in units of the sample standard deviation: G = max |Yᵢ − Ȳ| / s (see the sketch below).
86
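A hedged Python sketch of a single Grubbs iteration (the full procedure repeats it until nothing is flagged). The critical-value formula follows the NIST handbook; the sample data and alpha are illustrative:

```python
# One two-sided Grubbs' test iteration, assuming approximately normal data.
import numpy as np
from scipy import stats

def grubbs_outlier(x, alpha=0.05):
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)
    i = np.argmax(np.abs(x - mean))
    G = abs(x[i] - mean) / sd                        # Grubbs' statistic
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    G_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t**2 / (n - 2 + t**2))
    return (i, x[i]) if G > G_crit else None         # flagged point or None

print(grubbs_outlier(np.array([2.1, 2.3, 2.2, 2.0, 2.4, 9.0])))
```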
89. Scores
• Scores quantify the tendency of a data point to be an outlier by assigning it a
score or probability.
• The most commonly used scores are:
▫ Normal (z) score: (x − mean) / standard deviation
▫ t-Student score: z√(n − 2) / √(n − 1 − z²)
▫ Chi-square score: (x − mean)² / variance
▫ IQR score: based on Q3 − Q1; for points outside [Q1, Q3], the distance to the
nearest quartile divided by the IQR
• By using the “scores” function in R, p-values can be returned instead of scores.
89
91. Scores
91
See Statistical_Approach.R
By setting “prob” to a specific value, a logical vector
flags as outliers the data points whose probabilities
are greater than this cut-off value.
By setting “type” to IQR, all values lower than the first
quartile or greater than the third quartile are considered,
and the difference between each value and the nearest
quartile, divided by the IQR, is calculated.
92. 92
• Linear regression
• Piecewise/segmented regression
• Autoencoder-decoder
• Clustering-based approaches
93. Linear regression
• Linear regression investigates the linear relationships between variables to
predict one variable based on one or more other variables, and it can be
formulated as:
Y = β₀ + Σᵢ₌₁ⁿ βᵢXᵢ
where Y and the Xᵢ are random variables, the βᵢ are regression coefficients and β₀ is a
constant.
• In this model, the ordinary least squares estimator is usually used to minimize the
squared differences between the observed and fitted values of the dependent variable.
93
94. Piecewise/segmented regression
• A method in regression analysis, in which the independent variable is
partitioned into intervals to allow multiple linear models to be fitted to data for
different ranges.
• This model can be applied when there are ‘breakpoints’ and clearly
different linear relationships in the data with a sudden, sharp change in
directionality. Below is a simple segmented regression with one
breakpoint (two segments):
Y = β₀ + b₁X for X ≤ X₁
Y = β₁ + b₂X for X > X₁
where Y is the predicted value, X is the independent variable, β₀ and β₁ are
constant values, b₁ and b₂ are regression coefficients, and X₁ is the
breakpoint (more segments can be added for additional breakpoints).
94
96. Piecewise/segmented regression
• For this example, we use “segmented” package in R to first illustrate piecewise
regression for two dimensional data set, which has a breakpoint around z=0.5.
96
See Piecewise_Regression.R
“pmax” is used for parallel maximization to
create different values for y.
98. Piecewise/segmented regression
• Finally, the outliers can be detected for each segment by setting rules on the
residuals of the model (see the sketch below).
98
See Piecewise_Regression.R
Here, we set a rule on the residuals corresponding to z
less than 0.5; points whose residuals exceed the chosen
threshold are flagged as outliers.
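A Python sketch of the same idea (the workshop lab is Piecewise_Regression.R; here the breakpoint at z = 0.5, the noise level, and the 3σ residual rule are illustrative choices):

```python
# Segmented regression with a known breakpoint at z = 0.5:
# fit one line per segment, then flag large residuals in each segment.
import numpy as np

rng = np.random.default_rng(1)
z = rng.uniform(0, 1, 200)
y = np.where(z < 0.5, 2 * z, 4 - 2 * z) + rng.normal(0, 0.05, 200)
y[0] = 5.0                                      # inject an outlier

for mask in (z < 0.5, z >= 0.5):                # one linear model per segment
    slope, intercept = np.polyfit(z[mask], y[mask], 1)
    resid = y[mask] - (intercept + slope * z[mask])
    flagged = np.where(mask)[0][np.abs(resid) > 3 * resid.std()]
    print("segment outliers at indices:", flagged)
```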
100. 100
• Goal is to have the reconstruction x̂ approximate x
• Interesting applications such as
▫ Data compression
▫ Visualization
▫ Pre-train neural networks
Autoencoder
101. 101
Demo in Keras1
1. https://blog.keras.io/building-autoencoders-in-keras.html
2. https://keras.io/models/model/
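A minimal Keras sketch in the spirit of the blog post above; the architecture, epochs, synthetic data, and 99th-percentile threshold are illustrative choices, not the workshop's exact demo:

```python
# Dense autoencoder: score points by reconstruction error; high error
# suggests an anomaly.
import numpy as np
from tensorflow import keras

X = np.random.default_rng(0).normal(size=(1000, 20)).astype("float32")

inputs = keras.Input(shape=(20,))
encoded = keras.layers.Dense(4, activation="relu")(inputs)      # bottleneck
decoded = keras.layers.Dense(20, activation="linear")(encoded)
autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)      # learn x_hat ~ x

errors = np.mean((X - autoencoder.predict(X, verbose=0)) ** 2, axis=1)
print(np.where(errors > np.percentile(errors, 99))[0])          # top-1% errors
```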
102. 102
Principal Component Analysis
Principal component analysis (PCA) is a statistical
procedure that uses an orthogonal transformation to
convert a set of observations of possibly correlated
variables (entities each of which takes on various
numerical values) into a set of values of linearly
uncorrelated variables called principal components.
In outlier analysis, we perform principal component
analysis and compute p-values to test for outliers (see the sketch below).
https://en.wikipedia.org/wiki/Principal_component_analysis
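A scikit-learn sketch of a reconstruction-error variant (the slide mentions p-values; ranking by low-rank reconstruction error is a common, simpler alternative). The data, component count, and percentile cutoff are illustrative:

```python
# PCA anomaly scoring: project to a few components, reconstruct, and flag
# the points recovered least faithfully.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X[:, 1] = X[:, 0] * 2 + rng.normal(0, 0.1, 500)   # correlated structure
X[7] = 6.0                                        # break the correlation

pca = PCA(n_components=3).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))   # low-rank reconstruction
err = np.mean((X - X_hat) ** 2, axis=1)
print(np.where(err > np.percentile(err, 99))[0])  # likely includes index 7
```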
103. Clustering-based approaches
• These methods are suitable for unsupervised anomaly detection.
• They aim to partition the data into meaningful groups (clusters) based on the
similarities and relationships between the groups found in the data.
• Each data point is assigned a degree of membership for each of the clusters.
• Anomalies are those data points that:
▫ Do not fit into any clusters.
▫ Belong to a particular cluster but are far away from the cluster centroid.
▫ Form small or sparse clusters.
103
104. Clustering-based approaches
• These methods partition the data into k clusters by assigning each data point to
its closest cluster centroid, minimizing the within-cluster sum of squares
(WSS):
WSS = Σₖ₌₁ᴷ Σ_{i∈Cₖ} Σⱼ₌₁ᵖ (xᵢⱼ − μₖⱼ)²
where Cₖ is the set of observations in the kth cluster and μₖⱼ is the mean of the jth
variable of the cluster center of the kth cluster.
• Then, they select the top n points that are the farthest away from their nearest
cluster centers as outliers (see the sketch below).
104
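A scikit-learn sketch of this procedure (the workshop lab uses R's kmod package; the cluster count and top-n choice here are illustrative):

```python
# K-means outliers: cluster, then rank points by distance to their
# assigned centroid; the farthest n points are candidate outliers.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (150, 2)),
               rng.normal(8, 1, (150, 2)),
               [[4, 15]]])                        # far from both clusters

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
print(np.argsort(dist)[-3:])                      # top-3 farthest points
```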
106. Clustering-based approaches
• “Kmod” package in R is used to show the application of K-means model.
106
In this example, the number of clusters is chosen
using an elbow (bend) graph and then passed to the
kmod function.
See Clustering_Approach.R
112. Local Outlier Factor (LOF)
• The local outlier factor (LOF) algorithm first calculates the density of the local
neighborhood of each point.
• Then, for each object p, the LOF score is defined as the average ratio of the
density of p’s nearest neighbors to the density of p itself. The number
of nearest neighbors, k, is given by the user.
• Points with largest LOF scores are considered as outliers.
• In R, both “DMwR” and “Rlof” packages can be used for performing LOF model.
112
113. Local Outlier Factor (LOF)
• The LOF scores for outlying points will be high because they are computed as
ratios to the average neighborhood reachability distances.
• As a result, for data points distributed homogeneously within a cluster, the
LOF scores will be close to one.
• Over a range of values for k, the maximum LOF score can be used to determine
the scores associated with the local outliers (a Python sketch follows the R
examples below).
113
114. Local Outlier Factor (R)
• LOF returns a numeric vector of scores for each observation in the data set.
114
k is the number of neighbors used in the
calculation of the local outlier scores.
See Density_Approach.R
Outlier indexes
115. Local Outlier Factor (R)
115
Local outliers are shown in
red.
See Density_Approach.R
116. 116
Local Outlier Factor (R)
Histogram of regular observations vs outliers
See Density_Approach.R
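For comparison with the R output above, a scikit-learn LOF sketch; the synthetic data and k = 20 are illustrative user choices:

```python
# LOF via scikit-learn: scores near 1 indicate inliers, larger values
# indicate local outliers.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (200, 2)), [[3, 3]]])

lof = LocalOutlierFactor(n_neighbors=20)          # k chosen by the user
labels = lof.fit_predict(X)                       # -1 marks outliers
scores = -lof.negative_outlier_factor_            # sign-flipped LOF scores
print(np.where(labels == -1)[0], scores.max())
```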
118. Time-series method
• The time-series method is used to identify outliers only in univariate time-series
data.
• In order to apply this method, we use the “AnomalyDetection” package in R.
• This package was published by Twitter for detecting anomalies in time-series
data in the presence of seasonality and an underlying trend, using statistical
approaches.
• Since this package uses a specific algorithm to detect anomalies, we go over it
in detail in the next slide.
119. Anomaly detection, R package
• Twitter’s R package: https://github.com/twitter/AnomalyDetection
• Seasonal Hybrid ESD (S-H-ESD), which builds upon the Generalized ESD test, is
the underlying algorithm of this package.
• The algorithm employs time series decomposition and statistical metrics with
ESD test.
• Since time-series data exhibit a huge variety of patterns, time-series
decomposition, which is a statistical method, is used to decompose the data into
its four components.
• The four components are:
1. Trend: refers to the long term progression of the series
2. Cyclical: refers to variations in recognizable cycles
3. Seasonal: refers to seasonal variations or fluctuations
4. Irregular: describes random, irregular influences
120. 120
• The generalized ESD (extreme Studentized deviate) test (Rosner 1983) is used to detect
one or more outliers in a univariate data set that follows an approximately normal
distribution.
• The primary limitation of the Grubbs test is that the suspected number of outliers, k,
must be specified exactly; if k is not specified correctly, this can distort the conclusions
of the test. The generalized ESD test only requires an upper bound on the suspected
number of outliers (see the sketch below).
• https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm
Generalized ESD Test for Outliers
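A hedged Python sketch of the decompose-then-test idea using statsmodels' STL and a robust residual test. This is a simplified stand-in for illustration, not Twitter's exact S-H-ESD implementation; the weekly seasonality, injected spike, and 3.5 cutoff are illustrative:

```python
# Decompose the series with STL, then run a robust (median/MAD)
# deviation test on the remainder.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

rng = np.random.default_rng(0)
n = 365
t = np.arange(n)
y = 10 + 0.01 * t + 2 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.3, n)
y[100] += 5                                      # inject an anomaly
series = pd.Series(y, index=pd.date_range("2019-01-01", periods=n))

resid = STL(series, period=7).fit().resid        # trend/seasonal removed
mad = np.median(np.abs(resid - np.median(resid)))
robust_z = 0.6745 * (resid - np.median(resid)) / mad
print(np.where(np.abs(robust_z) > 3.5)[0])       # -> includes day 100
```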
123. 123
• Also try:
▫ https://github.com/omri374/taganomaly
▫ https://docs.microsoft.com/en-us/azure/machine-learning/studio-
module-reference/anomaly-detection
▫ https://www.youtube.com/watch?v=Ra8HhBLdzHE
Anomaly as a service
124. Summary
124
We have covered anomaly detection:
Introduction
• Definition of anomaly detection and its importance in domains such as energy, healthcare and finance
• Different types of anomaly detection methods: statistical, graphical and machine learning methods
Graphical approach
• Graphical methods consist of the box plot, scatter plot, adjusted quantile plot and symbol plot to demonstrate
outliers graphically
• The main assumption for applying graphical approaches is multivariate normality
• The Mahalanobis distance is mainly used for calculating the distance of a point from the center of a
multivariate distribution
Statistical approach
• Statistical hypothesis testing includes the Chi-square and Grubbs’ tests
• Statistical methods may use either scores or p-values as thresholds to detect outliers
Machine learning approach
• Both supervised and unsupervised learning methods can be used for outlier detection
• Piecewise or segmented regression can be used to identify outliers based on the residuals for each segment
• In the K-means clustering method, outliers are defined as points that don’t belong to any cluster, are far
away from the cluster centroids, or form sparse clusters
• In PCA and autoencoder-decoder methods, we look at points that weren’t reconstructed close to the original points
as anomalies
Density approach
• The local outlier factor algorithm is used to detect local outliers
• The relative density of a data point is compared to the density of its k nearest neighbors; k is mainly chosen by
the user
Time series methods
• Temporal outlier detection to detect anomalies which is robust, from a statistical standpoint, in the presence of
seasonality and an underlying trend.
127. Thank you!
Sri Krishnamurthy, CFA, CAP
Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
Contact
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.
127