Location:
Qcon.ai Conference
San Francisco
April 15th 2019
Anomaly Detection
Techniques and Best Practices
2019 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
www.QuantUniversity.com
sri@quantuniversity.com
2
• Introduction
• Applications of Anomaly Detection
• Break – 10.30-10.45am
• Aspects of Anomaly Detection
• Lunch Break : 12.00-1.00pm
• Techniques- Deep Dive
• Break – 2.30-2.45pm
• Labs and Examples
Agenda
3
- Analytics Advisory services
- Custom training programs
- Architecture assessments, advice and audits
• Founder of QuantUniversity LLC. and
www.analyticscertificate.com
• Advisory and Consultancy for Financial Analytics
• Prior Experience at MathWorks, Citigroup and
Endeca and 25+ financial services and energy
customers
• Regular Columnist for the Wilmott Magazine
• Author of forthcoming book
“Financial Modeling: A case study approach”
published by Wiley
• Chartered Financial Analyst and Certified Analytics
Professional
• Analytics Faculty in the Babson College MBA
program and at Northeastern University, Boston
Sri Krishnamurthy
Founder and CEO
5
6
Quantitative Analytics and Big Data Analytics Bootcamps
• Analytics Certificate program
• Fintech Certificate program
• Deep Learning & AI boot camp
• Natural Language Processing
workshop
• Machine Learning for Finance
• Machine Learning for Healthcare
applications
See www.analyticscertificate.com for
current and future offerings
(MATLAB version also available)
8
9
What is anomaly detection?
• Anomalies or outliers are data points within the datasets
that appear to deviate markedly from expected outputs.
• An outlier is an observation which deviates so much from the other
observations as to arouse suspicions that it was generated by a
different mechanism1
• Anomaly detection refers to the problem of finding
patterns in data that don’t conform to expected behavior
10
1. D. Hawkins. Identification of Outliers, Chapman and Hall, 1980.
11
• Outliers are data points that are considered out of the ordinary or
abnormal. This includes noise.
• Anomalies are a special kind of outlier that carries significant/
critical/actionable information which could be of interest to
analysts.
Anomaly vs Outliers
1
2
All points not in clusters 1 & 2 are Outliers
Point B is an Anomaly (Both X and Y are large)
12
• Note that it is the analyst’s judgement that determines
what is considered just an outlier versus an anomaly.
• Most outlier detection methods generate outputs that
are:
▫ Real-valued outlier scores: quantifies the tendency of a
data point being an outlier by assigning a score or
probability to it.
▫ Binary labels: result of using a threshold to convert
outlier scores to binary labels, inlier or outlier.
Outlierness
13
• Fraud Detection
▫ Credit card fraud detection
– By owner or by operation
▫ Mobile phone fraud/anomaly detection
– Calling behavior, volume etc.
▫ Insurance claim fraud detection
– Medical malpractice
– Auto insurance
▫ Insider trading detection
• E-commerce
▫ Pricing issues
▫ Network issues
Applications of Anomaly Detection
14
• Intrusion detection:
▫ Detect malicious activity in computer systems
▫ This could be host-based or network-based
• Medical anomalies
Examples of Anomaly Detection
15
• Manufacturing and sensors:
▫ Fault detection
▫ Heat, fire sensors
• Text data
▫ Novel topics, events
▫ Plagiarism
Examples of Anomaly Detection
16
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
17
• By definition, anomaly detection deals with identifying patterns and
points that are not considered normal. This implies that we must
first have a model to define what is normal in our datasets.
• If our model doesn’t capture the nuances of “normal” behavior in
our datasets, our anomaly detection algorithms won’t fare well.
• If our model captures all nuances in our data, we would have overfit
the model and wouldn’t be able to identify anomalies properly
• If our model is too generic, then most points would show up as
anomalies.
• Let’s illustrate this
1. Importance of defining what is normal
18
• Consider the points below. Our goal is to build a “normal” model
and an anomaly detection model for the following data set.
• y = [2, 5, 7, 8, 11, 5, 3, 4, 5.8, 8]
• Choose a “normal model”.
• Determine the standard deviation and mark any point that is
outside of the 3σ limit as an anomaly
Model assumption
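To make the exercise concrete, here is a minimal base-R sketch of the 3σ rule above, using the sample mean as the simplest possible “normal” model (the data vector is the one from this slide):

```r
# Flag any point more than 3 standard deviations from the mean
y <- c(2, 5, 7, 8, 11, 5, 3, 4, 5.8, 8)
mu <- mean(y)
sigma <- sd(y)
y[abs(y - mu) > 3 * sigma]   # numeric(0): no point exceeds the 3-sigma band here
```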
19
• Consider three possible “normal” models
▫ Line A connects all the points : Overfit : No anomalies
▫ Line B is a linear regression line : Poor fit : Many points listed as
anomalies
▫ Line C is a polynomial fit of degree 4 : Good fit. One point shown as an
anomaly as it is outside of the 3σ band.
This illustrates why the choice of
the “normal” model is critical
Model assumption
20
• We could assume that the points may come from a statistical
distribution. For example from a Gaussian distribution
• We could use an algorithm like k-nearest neighbor to define points
that are close and identify outliers that are at a large distance from
most of the points
• We could use a clustering algorithm to assign membership to clusters
Determining the “normal” model
21
• We will illustrate once more why the choice of the “normal” model
is important.
• We said earlier that the “outlierness” can be quantified as a score.
• One popular score is Z-score.
• A z-score can be calculated from the following formula.
z= (X - μ) / σ
• where z is the z-score, X is the value of the element, μ is the
population mean, and σ is the standard deviation.
• It computes the number of standard deviations by which the data
varies from the mean and is used as a proxy to determine outliers.
• However, this works only if the data is from a Gaussian distribution.
Importance of defining what is normal
22
• In the first case, 99.7% of the data is within the 3σ limits. The Z-
score test works well here to detect outliers.
• In the second case, the distribution isn’t normal and is from a Zipf
distribution. Here a Z-score test isn’t valid
Importance of defining what is normal
23
• It is important for the analyst to choose the right representation for
the normal model.
• In the first case, Clustering is useless but a Linear regression is a
good representation.
• In the second case, Clustering is more appropriate.
Importance of defining what is normal
24
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
25
• Defining the normal region that covers all normal points in the
dataset
• Identifying anomalies masquerading as normal data points or not
being able to uncover anomalies due to a weak choice of the
“normal model”
▫ Example : DOS attack vs DDOS attack
2.0 Challenges when dealing with Anomaly Detection
problems
26
• The evolving “normal behavior” in data
▫ Example:
– $100+ credit card transactions for a student – Average 5 per month
– $100+ credit card transactions for a professional – Average 15 per month
• Data and application dependency
▫ Example :
– For AAPL, a +/- $5 fluctuation in a day is an anomaly.
– For a risky stock, up to a +/- $10 fluctuation may be normal and a $15
fluctuation may be an anomaly
• Lack of labeled data makes it harder to detect anomalies
• Not being able to distinguish noise and anomalies
2.0 Challenges when dealing with Anomaly Detection
problems
27
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
28
• Data objects are usually described by a set of attributes (variables,
features or dimension)
• The term univariate is used when the data has one attribute, while
bivariate (two attributes) and multivariate (more than two
attributes) describe data with additional attributes.
• Attributes can be quantitative or qualitative based on their
characteristics.
3.0 Input data characteristics
Dataset, variable and Observations
Dataset: A rectangular array with Rows as observations and
columns as variables
Variable: A characteristic of members of a population ( Age, State
etc.)
Observation: List of Variable values for a member of the
population
Types of Data
— A variable is numerical if meaningful arithmetic can be
performed on it.
— Discrete vs. Continuous
— Cross-sectional vs. Longitudinal
— Otherwise, the variable is categorical.
— Binary vs Multivalued
— Ordinal vs Nominal
Categorical Variables
• Ordinal: natural ordering of its possible values.
• Nominal : no natural ordering
Categorical Variables
• Categorical variables can be coded numerically or left uncoded.
• A dummy variable is a 0–1 coded variable for a specific category. It is
coded as 1 for all observations in that category and 0 for all
observations not in that category.
• Categorizing a numerical variable as categorical is called binning
(putting the data into discrete bins) or discretizing.
Numerical Data
• Discrete :
▫ How many cities have you lived in?
▫ How many cars pass through a toll booth?
• Continuous:
▫ What’s your height?
▫ What’s the temperature outside?
▫ What was the interest rate in Dec 2004?
Numerical data
• Longitudinal
• Cross-sectional
35
• In Cross-sectional records, data instances are independent.
• Data instances could be related to each other (typically longitudinal)
• Types of data records that are typically related to each other
▫ Sequence records:
– Time-series data – temporal continuity
– Genome and protein sequences
▫ Spatial Data : Data is related to neighbors; Spatial continuity
– Traffic data
– Weather data
– Ecological data
▫ Graph data
– Social media data
Relationship between records
36
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
37
Anomalies can be classified into three major categories
1. Point Anomalies
– If an instance is anomalous compared with the rest of the instances, the anomaly
is considered a point anomaly
2. Contextual Anomalies
– If an instance is anomalous in a specific context, the anomaly would be
considered as a contextual anomaly
3. Collective Anomalies
– If a collection of related data records are anomalous with respect to the entire
data set, the anomaly is a collective anomaly
4.0 Anomaly Classification
38
• In the figure, points o1 and o2 are considered point anomalies
• Examples:
▫ A 50% increase in daily stock price
▫ A credit card transaction attempt for $5000 (assuming you have never
had a single transaction for anything above $1000)
Point Anomalies
39
• In the figure, temperature t2 is an anomaly
• Note that t1 is lower than t2 but contextually, t1 is expected and t2
isn’t when compared to records around it.
Contextual Anomalies
40
• Multiple Buy Stock transactions and then a sequence of Sell transactions
around an earnings release date may be anomalous and may indicate
insider trading.
• Consider the sequence of network activities recorded
• Though ssh, buffer-overflow and ftp themselves are not anomalous
activities, a sequence of the three indicates a web-based attack
• Similarly, multiple http requests from an IP address may indicate a
crawler in action.
See https://www.bloomberg.com/graphics/2019-etf-tax-dodge-lets-
investors-save-big/
Collective Anomalies
41
• In medicine, abnormal ECG pattern detection would involve looking
for collective anomalies like Premature Atrial Contraction
Collective Anomalies
http://www.fprmed.com/Pages/Cardio/PAC.html
42
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
43
• The goal of an Outlier detection or Anomaly Detection algorithm is
to identify if there are anomalies in the data. The outputs would be
of the form:
▫ Scores : A number generated by the algorithm for each record.
Typically, the scores are sorted and a threshold chosen to designate
anomalies
▫ Labels : Here the algorithm takes a binary decision on whether each
record is an anomaly or not
5. Outputs of Anomaly Detection algorithms
44
• Based on whether the data is labeled or not, machine learning
algorithms can be used for anomaly detection
• If the historical data is labeled (anomaly/not), supervised techniques
can be used
• If the historical data isn’t labeled, unsupervised algorithms can be
used to figure out if the data is normal/anomalous
5. Outputs of Anomaly Detection algorithms
45
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
46
1. Extreme value analysis
2. Classification-based techniques
3. Statistical techniques
i. Parametric techniques
a. Gaussian model-based models
b. Regression-based models
c. Mixture models
ii. Non-Parametric techniques
a. Histogram-based models
b. Kernel-based models
A tour of Anomaly Detection Techniques
47
4. Proximity-based models
i. Cluster analysis
ii. Nearest neighbor analysis
5. Information theoretic models
6. Meta-algorithms and ensemble techniques
i. Sequential ensembles
ii. Independent ensembles
A tour of Anomaly Detection Techniques
48
• Assumption : Anomalies are extreme values in the data set
• Goal: Determine the statistical tails of the underlying distribution
• Data : Univariate cross-sectional data
• Examples:
▫ Z-score test for a dataset which is assumed to be Normal
▫ Grubbs’ test
▫ Using Box-plots to detect anomalies
1. Extreme value Analysis
49
• Assumption : Available labeled data
• Goal: Build a classifier that can distinguish between normal and
anomalous data
• Data : Multi-dimensional cross-sectional data ( Numeric/categorical)
• Examples:
▫ Rule-based (Decision trees)
▫ Neural networks
▫ SVM
2. Classification based techniques
50
• Statistical techniques fit a statistical model to given data and
apply a statistical inference test to determine if an unseen instance
belongs to this model or not.
• Based on the assumed statistical model that describes the data,
anomalies are data points that are assumed to have not been
generated by the model
3. Statistical techniques
51
• Assumption : The underlying distribution is known and the
parameters for the distribution can be estimated
• Goal: Infer if a data point belongs to the distribution or not.
• Data : Depends on the technique
• Techniques:
▫ Gaussian techniques
– IQR test (Box plots)
– The region between Q1 − 1.5*IQR and Q3 + 1.5*IQR contains 99.3% of observations,
and hence the choice of the 1.5*IQR boundary makes the box plot rule equivalent
to the 3σ technique for Gaussian data.
– Grubbs’ test (later)
– Chi-squared test (later)
3. Statistical techniques - Parametric
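As a rough illustration of the 1.5*IQR rule above, here is a base-R sketch (the planted value 8 is an assumption for demonstration):

```r
# The 1.5*IQR box-plot rule, written out explicitly
set.seed(0)
x <- c(rnorm(100), 8)                      # 8 is a planted outlier
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]
# boxplot.stats(x)$out applies the same rule
```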
52
• Regression based techniques
– Here the goal is to model the data into a lower dimensional sub-space
using linear correlations.
– i.e. summarize the data into a model parameterized by the coefficients and a
constant.
– Step 1: Build a regression model with various features
– Step 2: Review the residuals; the magnitude of the residuals indicates
anomalies.
3. Statistical techniques - Parametric
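A minimal base-R sketch of the two steps above (the simulated data, planted anomaly and 3·sd residual threshold are illustrative assumptions):

```r
# Step 1: build a regression model; Step 2: flag large residuals
set.seed(1)
x <- 1:100
y <- 2 * x + rnorm(100, sd = 5)
y[50] <- y[50] + 60                        # plant an anomaly at index 50
fit <- lm(y ~ x)
r <- residuals(fit)
which(abs(r) > 3 * sd(r))                  # index 50 should be flagged
```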
53
• Mixture models
– Example: Gaussian mixture model
– Here, the data is characterized by a process that is a mixture of Gaussian clusters.
– The parameters are estimated using an EM algorithm
– The goal is to determine the probability of data points being in different clusters.
– Anomalies would have low probability values
3. Statistical techniques - Parametric
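A rough base-R sketch of the mixture idea. Note this is a crude stand-in: it approximates the two components with k-means rather than the EM algorithm the slide mentions, then scores each point by its estimated mixture density:

```r
set.seed(1)
x <- c(rnorm(100, 0, 1), rnorm(100, 6, 1), 15)   # 15 is the planted anomaly
km <- kmeans(x, centers = 2)
dens <- numeric(length(x))
for (k in 1:2) {
  xs <- x[km$cluster == k]
  w  <- length(xs) / length(x)                   # component weight
  dens <- dens + w * dnorm(x, mean(xs), sd(xs))  # weighted Gaussian density
}
x[dens == min(dens)]                             # lowest-density point: 15
```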
54
• Assumption : The data’s distribution isn’t known a priori
• Goal: Infer if a data point belongs to the assumed normal model
• Data : Depends on the technique
• Techniques:
▫ Histogram
– Count/frequency based: create a histogram; bins with very few points
indicate anomalies
▫ Kernel-based
– Using density-estimation techniques, build kernel functions and estimate the
probability-distribution function (pdf) for normal instances. Instances lying
in the low-probability areas are termed anomalies
3. Statistical techniques - Nonparametric
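A minimal base-R sketch of both nonparametric ideas above (the bin count and density cut-offs are illustrative assumptions):

```r
set.seed(7)
x <- c(rnorm(200), 6)                      # 6 is a planted anomaly
# Histogram-based: points that fall in near-empty bins
h <- hist(x, breaks = 30, plot = FALSE)
bin <- cut(x, h$breaks, labels = FALSE, include.lowest = TRUE)
x[h$counts[bin] <= 1]
# Kernel-based: points where the estimated pdf is lowest
d <- density(x)
f <- approx(d$x, d$y, xout = x)$y          # pdf estimate at each observation
x[f < quantile(f, 0.01)]
```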
55
Example:
Check https://www.r-bloggers.com/exploratory-data-analysis-combining-histograms-and-density-
plots-to-examine-the-distribution-of-the-ozone-pollution-data-from-new-york-in-r/
56
• Assumption : Anomalous points are isolated from the rest of the
data
• Goal: Segment the points/space with a goal of identifying
anomalies.
• Data : Typically, multi-dimensional cross-sectional data
• Methods:
▫ Clustering : Unsupervised techniques group data into clusters using
distances/densities depending on the technique. Anomalies belong to
sparse clusters or no clusters and are typically far from the “normal”
clusters.
▫ Examples : K-means
4.0 Proximity-based techniques
57
▫ Nearest neighbor techniques: Here, it is assumed that normal points
occur in dense neighborhoods and anomalies are far from neighbors
▫ Here distances/relative densities are used to determine neighborhoods
▫ Examples :
– KNN algorithm : The anomaly score of a data instance is defined as its distance
to its kth nearest neighbor in a given data set.
– Local Outlier Factor scores : The anomaly score (LOF) is equal to the ratio
of average local density of the k-nearest neighbors of the instance and the
local density of the data instance itself.
4.0 Proximity-based techniques
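A minimal base-R sketch of the kth-nearest-neighbor distance score described above (k = 5 and the iris data are assumptions for illustration):

```r
X <- scale(iris[, 1:4])
D <- as.matrix(dist(X))                              # pairwise Euclidean distances
k <- 5
knn_dist <- apply(D, 1, function(d) sort(d)[k + 1])  # sort(d)[1] is the self-distance 0
head(order(knn_dist, decreasing = TRUE))             # most isolated observations first
```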
58
• Assumption : Anomalies induce irregularities in the information
content of the data set that inflate its information summaries
• Goal: Identify data that can’t be summarized into a lower
dimensional space efficiently
• Data : Typically, multi-dimensional cross-sectional data
• Example:
• The first line can be summarized succinctly as “AB × 17”
• With a C present in the second line, it can no longer be
succinctly summarized
5. Information theoretic models
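A small base-R illustration of the idea, using compressed size as a stand-in for the information summary (gzip here is an assumption, not the formal summarization on the slide):

```r
s1 <- strrep("AB", 17)                           # summarizes succinctly as AB x 17
s2 <- paste0(strrep("AB", 8), "C", strrep("AB", 8))
length(memCompress(charToRaw(s1), "gzip"))
length(memCompress(charToRaw(s2), "gzip"))       # the single C typically inflates this
```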
59
• Assumption : Using multiple algorithms would help increase the
robustness of the anomaly detection algorithm
• Goal: Use ensembles to enhance the quality of anomaly detection
• Data : Typically, multi-dimensional cross-sectional data
• Methods:
▫ Sequential ensembles : A given algorithm or a sequence of algorithms
is applied sequentially. Example: Boosting methods for classification
▫ Independent ensembles: Different algorithms or different instances of
the same algorithm are run and results combined to detect robust
outliers.
6. Meta-algorithms and ensemble techniques
60
1. Importance of defining what is normal
2. Challenges when dealing with Anomaly Detection problems
3. Input Data Characteristics
4. Anomaly Classification
5. Outputs of Anomaly Detection
6. Techniques used for Anomaly Detection
7. Performance evaluation
Aspects of Anomaly detection problems
61
• For unsupervised cases, evaluation is hard as the data
isn’t labeled
• For supervised learning, use the ROC curve
• The true-positive rate is also known
as sensitivity, or recall in machine
learning.
• The false-positive rate is also known
as the fall-out and can be calculated
as (1 - specificity).
• The ROC curve is thus the sensitivity
as a function of fall-out.
7.0 Performance evaluation
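For the supervised case, a minimal base-R sketch of how the ROC points (fall-out vs. sensitivity) can be computed from scores and labels (the simulated scores are an assumption for illustration):

```r
roc_points <- function(scores, labels) {
  thr <- sort(unique(scores), decreasing = TRUE)
  t(sapply(thr, function(th) {
    pred <- scores >= th
    c(fpr = sum(pred & !labels) / sum(!labels),   # fall-out = 1 - specificity
      tpr = sum(pred & labels) / sum(labels))     # sensitivity / recall
  }))
}
set.seed(3)
labels <- c(rep(TRUE, 20), rep(FALSE, 80))
scores <- ifelse(labels, rnorm(100, 2), rnorm(100, 0))
head(roc_points(scores, labels))
```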
63
• The F-1 score considers both the precision p and the recall r of the test
to compute the score.
• The F1 score is the harmonic mean of precision and recall,
F1 = 2·p·r / (p + r), and reaches its best value at 1 (perfect precision and
recall) and its worst at 0
F-1 score
64
65
1. Graphical approach
2. Statistical approach
3. Machine learning approach
4. Density based approach
5. Time series approach
Illustration of five methodologies for Anomaly Detection
66
✓ Boxplot
✓ Scatter plot
✓ Adjusted quantile plot
✓ Symbol plot
Graphical approaches
• Graphical methods utilize extreme value analysis, by which outliers
correspond to the statistical tails of probability distributions.
• Statistical tails are most commonly used for one-dimensional
distributions, although the same concept can be applied to the
multidimensional case.
• It is important to understand that all extreme values are outliers
but the reverse may not be true.
• For instance, in the one-dimensional dataset
{1,3,3,3,50,97,97,97,100}, observation 50 equals the mean and isn’t
an extreme value, but since this observation is the
most isolated point, it should be considered an outlier.
67
Box plot
• A standardized way of displaying the
variation of data based on the five-
number summary, which includes the
minimum, first quartile, median, third
quartile, and maximum.
• This plot does not make any assumptions
about the underlying statistical distribution.
• Any data points falling outside the
minimum and maximum (the whisker ends)
are considered outliers.
68
Boxplot
69
See Graphical_Approach.R
Side-by-side boxplot for each variable
Scatter plot
• A mathematical diagram that uses Cartesian coordinates to plot ordered
pairs, typically to show the relationship between two random variables.
• An outlier is defined as a data point that doesn't seem to fit with the rest of the
data points.
• In scatterplots, outliers of either the intersection or the union of the two
variables’ outlier sets can be shown.
70
Scatterplot
71
See Graphical_Approach.R
Scatterplot of Sepal.Width and Sepal.Length
72
• In statistics, a Q–Q plot is a probability plot, which is a graphical
method for comparing two probability distributions by plotting their
quantiles against each other.
• If the two distributions being compared are similar, the points in the
Q–Q plot will approximately lie on the line y = x.
Q-Q plot
Source: Wikipedia
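A one-line base-R check along these lines:

```r
x <- rnorm(100)
qqnorm(x); qqline(x)   # points far off the line suggest departures from normality
```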
Adjusted quantile plot
• This plot identifies possible multivariate outliers by calculating the Mahalanobis
distance of each point from the center of the data.
• The multi-dimensional Mahalanobis distance between vectors x and y in ℝⁿ can be
formulated as:
d(x, y) = √((x − y)ᵀ S⁻¹ (x − y))
where x and y are random vectors of the same distribution with the covariance
matrix S.
• An outlier is defined as a point with a distance larger than some pre-determined
value.
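A minimal base-R sketch of this distance-based rule (the 97.5% chi-square cut-off is a common convention, assumed here for illustration):

```r
X  <- iris[, 1:4]
d2 <- mahalanobis(X, colMeans(X), cov(X))   # squared distances from the center
which(d2 > qchisq(0.975, df = ncol(X)))     # flag points beyond the cut-off
```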
73
Adjusted quantile plot
• Before applying this method and many other parametric
multivariate methods, we first need to check whether the data is
multivariate normally distributed using different
multivariate normality tests, such as Royston, Mardia, Chi-
square, univariate plots, etc.
• In R, we use the “mvoutlier” package, which utilizes
graphical approaches as discussed above.
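A hedged sketch of the package usage, assuming aq.plot is its adjusted-quantile function and that it returns an outlier indicator (argument and result names per our reading of the package documentation):

```r
library(mvoutlier)
X <- swiss[, 1:3]                 # any numeric multivariate data
res <- aq.plot(X, alpha = 0.05)   # draws the plots and returns outlier flags
which(res$outliers)
```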
74
Adjusted quantile plot
75
Min-Max normalization before diving into analysis
Multivariate normality test
Outlier Boolean vector identifies the
outliers
Alpha defines maximum thresholding proportion
See Graphical_Approach.R
Adjusted quantile plot
76
See Graphical_Approach.R
Mahalanobis distances
Covariance matrix
Adjusted quantile plot
77
See Graphical_Approach.R
Symbol plot
• This plot displays two-dimensional data, using robust Mahalanobis distances based
on the minimum covariance determinant (MCD) estimator with adjustment.
• The Minimum Covariance Determinant (MCD) estimator looks for the subset of h
data points whose covariance matrix has the smallest determinant.
• The four ellipsoids drawn in the plot show the Mahalanobis distances corresponding to
the 25%, 50%, 75% and adjusted quantiles of the chi-square distribution.
78
Symbol plot
79
See Graphical_Approach.R
Parameter “quan” defines the fraction of observations
used for the minimum covariance determinant
estimation. The default is 0.5.
Alpha defines the fraction of observations used for
calculating the adjusted quantile.
Case study 1: Anomaly Detection With Freddie
Mac Data
2016 Copyright QuantUniversity LLC.
81
✓ Hypothesis testing (Chi-square test, Grubbs’ test)
✓ Scores
Hypothesis testing
• This method draws conclusions about a sample point by testing whether it
comes from the same distribution as the training data.
• Statistical tests, such as the t-test and the ANOVA table, can be used on multiple
subsets of the data.
• Here, the level of significance, i.e., the probability of incorrectly rejecting the
true null hypothesis, needs to be chosen.
• To apply this method in R, the “outliers” package, which utilizes statistical
tests, is used.
82
Chi-square test
• The Chi-square test performs a simple test for detecting outliers of univariate data
based on the Chi-square distribution of the squared difference between the data and the
sample mean.
• In this test, the sample variance serves as the estimator of the population variance.
• Chi-square test helps us identify the lowest and highest values, since outliers
can exist in both tails of the data.
83
84
When an analyst attempts to fit a statistical model to observed data, he or she may wonder how well the model actually
reflects the data. How "close" are the observed values to those which would be expected under the fitted model? One
statistical test that addresses this issue is the chi-square goodness-of-fit test.
This test is commonly used to test the association of variables in two-way tables, where the assumed model of independence is
evaluated against the observed data. In general, the chi-square test statistic is of the form
χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ
where Oᵢ and Eᵢ are the observed and expected counts.
If the computed test statistic is large, then the observed and expected values are not close and the model is a poor fit to the
data (anomaly).
Chi-square test
Chi-square test
85
See Statistical_Approach.R
This function repeats the Chi-square test until it finds all
the outliers within the data.
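A hedged sketch of a single pass of the test with the “outliers” package used on these slides (the simulated data are an assumption):

```r
library(outliers)
set.seed(2)
x <- c(rnorm(30), 7)
chisq.out.test(x)   # tests the value with the largest squared deviation from the mean
```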
Grubbs’ test
• Test for outliers for univariate data sets assumed to come from a normally
distributed population.
• Grubbs' test detects one outlier at a time. This outlier is expunged from the
dataset and the test is iterated until no outliers are detected.
• This test is defined for the following hypotheses:
H0: There are no outliers in the data set
H1: There is exactly one outlier in the data set
• The Grubbs' test statistic is defined as:
G = maxᵢ |Yᵢ − Ȳ| / s
where Ȳ is the sample mean and s the sample standard deviation.
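A hedged sketch with the “outliers” package, iterating by hand as the slide describes (data and alpha are assumptions):

```r
library(outliers)
set.seed(4)
x <- c(rnorm(30), 9)
grubbs.test(x)   # tests H0 (no outliers) against H1 (exactly one outlier)
# To find all outliers, remove the flagged value and repeat while p < alpha
```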
86
Grubbs’ test
87
See Statistical_Approach.R
The above function repeats the Grubbs’ test until it finds
all the outliers within the data.
Grubbs’ test
88
See Statistical_Approach.R
Histogram of normal observations vs outliers
Scores
• A score quantifies the tendency of a data point being an outlier by assigning it a
score or probability.
• The most commonly used scores are:
▫ Normal score: (x − mean) / standard deviation
▫ t-Student score: t = z·√(n − 2) / √(n − 1 − z²)
▫ Chi-square score: (x − mean)² / sd²
▫ IQR score: Q3 − Q1
• By using the “scores” function in R, p-values can be returned instead of scores.
89
Scores
90
See Statistical_Approach.R
“type” defines the type of the score, such as
normal, t-student, etc.
“prob=1” returns the corresponding p-value.
Scores
91
See Statistical_Approach.R
By setting “prob” to any specific value, a logical vector
flags as outliers the data points whose probabilities are
greater than this cut-off value.
By setting “type” to IQR, all values below the first quartile
and above the third quartile are considered, and the
difference between each value and the nearest quartile,
divided by the IQR, is calculated.
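A hedged sketch of the scores function with the types listed above (the prob and lim cut-offs are illustrative assumptions):

```r
library(outliers)
set.seed(5)
x <- c(rnorm(50), 6)
scores(x, type = "z")                # raw z-scores
scores(x, type = "z", prob = 0.95)   # TRUE where the score exceeds the 95% cut-off
scores(x, type = "iqr", lim = 1.5)   # TRUE where the IQR-scaled distance exceeds 1.5
```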
92
✓ Linear regression
✓ Piecewise/segmented regression
✓ Autoencoder-Decoder
✓ Clustering-based approaches
Linear regression
• Linear regression investigates the linear relationships between variables and
predicts one variable based on one or more other variables. It can be
formulated as:
Y = β₀ + Σᵢ₌₁ⁿ βᵢ Xᵢ
where Y and the Xᵢ are random variables, βᵢ is a regression coefficient and β₀ is a
constant.
• In this model, the ordinary least squares estimator is usually used to minimize the
difference between the observed and the predicted values of the dependent variable.
93
Piecewise/segmented regression
• A method in regression analysis in which the independent variable is
partitioned into intervals to allow multiple linear models to be fitted to the data
over different ranges.
• This model can be applied when there are ‘breakpoints’ and clearly two
different linear relationships in the data with a sudden, sharp change in
directionality. Below is a simple segmented regression with one breakpoint X₁:
Y = a₀ + b₁X for X < X₁
Y = a₁ + b₂X for X > X₁
where Y is the predicted value, X is an independent variable, a₀ and a₁ are
constant values, b₁ and b₂ are regression coefficients, and X₁ is the
breakpoint.
94
95
Anomaly detection vs Supervised learning
Piecewise/segmented regression
• For this example, we use “segmented” package in R to first illustrate piecewise
regression for two dimensional data set, which has a breakpoint around z=0.5.
96
See Piecewise_Regression.R
“pmax” is used for parallel maximization to
create different values for y.
Piecewise/segmented regression
• Then, we use linear regression to predict y values for each segment of z.
97
See Piecewise_Regression.R
Piecewise/segmented regression
• Finally, the outliers can be detected for each segment by setting rules on the
residuals of the model.
98
See Piecewise_Regression.R
Here, we set a rule for the residuals corresponding to z
less than 0.5, by which points with large residuals
are flagged as outliers.
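A hedged end-to-end sketch with the “segmented” package used in Piecewise_Regression.R (the simulated data, planted anomaly and 3·sd residual rule are assumptions):

```r
library(segmented)
set.seed(12)
z <- runif(100)
y <- 3 * pmax(0, z - 0.5) + rnorm(100, sd = 0.1)   # breakpoint near z = 0.5
y[30] <- y[30] + 1                                  # plant an anomaly
fit0 <- lm(y ~ z)
fit  <- segmented(fit0, seg.Z = ~z, psi = 0.5)      # estimate the breakpoint
r <- residuals(fit)
which(abs(r) > 3 * sd(r))                           # flag large-residual points
```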
99
• Motivation1:
Autoencoders
1. http://ai.stanford.edu/~quocle/tutorial2.pdf
100
• Goal is to have x̂ approximate x
• Interesting applications such as
▫ Data compression
▫ Visualization
▫ Pre-train neural networks
Autoencoder
101
Demo in Keras1
1. https://blog.keras.io/building-autoencoders-in-keras.html
2. https://keras.io/models/model/
102
Principal Component Analysis
Principal component analysis (PCA) is a statistical
procedure that uses an orthogonal transformation to
convert a set of observations of possibly correlated
variables (entities each of which takes on various
numerical values) into a set of values of linearly
uncorrelated variables called principal components.
In Outlier analysis, we do principal component
analysis and compute p-values to test for outliers.
https://en.wikipedia.org/wiki/Principal_component_analysis
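A minimal base-R sketch of the same reconstruction idea using PCA: keep a few components, reconstruct, and treat poorly reconstructed points as candidate anomalies (k = 2 and the iris data are assumptions):

```r
X <- scale(iris[, 1:4])
p <- prcomp(X)
k <- 2
Xhat <- p$x[, 1:k] %*% t(p$rotation[, 1:k])   # reconstruct from k components
err <- rowSums((X - Xhat)^2)                  # reconstruction error per row
head(order(err, decreasing = TRUE))           # candidate anomalies
```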
Clustering-based approaches
• These methods are suitable for unsupervised anomaly detection.
• They aim to partition the data into meaningful groups (clusters) based on the
similarities and relationships between the groups found in the data.
• Each data point is assigned a degree of membership for each of the clusters.
• Anomalies are those data points that:
▫ Do not fit into any clusters.
▫ Belong to a particular cluster but are far away from the cluster centroid.
▫ Form small or sparse clusters.
103
Clustering-based approaches
• These methods partition the data into k clusters by assigning each data point to
its closest cluster centroid by minimizing the within-cluster sum of squares
(WSS), which is:
WSS = Σₖ₌₁ᴷ Σ_{i∈Cₖ} Σⱼ₌₁ᵖ (xᵢⱼ − μₖⱼ)²
where Cₖ is the set of observations in the kth cluster and μₖⱼ is the mean of the jth
variable of the cluster center of the kth cluster.
• Then, they select the top n points that are the farthest away from their nearest
cluster centers as outliers.
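A minimal base-R sketch of this distance-to-centroid rule (k = 3, n = 10 and the iris data are assumptions for illustration):

```r
X <- scale(iris[, 1:4])
km <- kmeans(X, centers = 3, nstart = 20)
cent <- km$centers[km$cluster, ]              # centroid assigned to each point
d <- sqrt(rowSums((X - cent)^2))
order(d, decreasing = TRUE)[1:10]             # top n = 10 farthest points
```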
104
105
Anomaly Detection vs Unsupervised Learning
Clustering-based approaches
• The “Kmod” package in R is used to show the application of the K-means model.
106
In this example the number of clusters is chosen
from an elbow graph and then passed to the K-mod
function.
See Clustering_Approach.R
Clustering-based approaches
107
See Clustering_Approach.R
K=4 is the number of clusters and L=10 is
the number of outliers
Clustering-based approaches
108
See Clustering_Approach.R
Scatter plots of normal and outlier data points
Case study 2: Anomaly Detection With German
Credit data
2016 Copyright QuantUniversity LLC.
Case study 3: Anomaly Detection Auto-Encoder
Decoders
2016 Copyright QuantUniversity LLC.
111
✓ Local outlier factor
Local Outlier Factor (LOF)
• The local outlier factor (LOF) algorithm first calculates the density of the local
neighborhood for each point.
• Then, for each object p, the LOF score is defined as the average of the ratios
of the density of p’s nearest neighbors to the density of sample p itself. The number
of nearest neighbors, k, is given by the user.
• Points with the largest LOF scores are considered outliers.
• In R, both the “DMwR” and “Rlof” packages can be used for running the LOF model.
112
Local Outlier Factor (LOF)
• The LOF scores for outlying points will be high because they are computed in
terms of the ratios to the average neighborhood reachability distances.
• As a result, for data points distributed homogeneously within a cluster, the
LOF scores will be close to one.
• Over a range of values for k, the maximum LOF score determines
the scores associated with the local outliers.
113
Local Outlier Factor (R)
• LOF returns a numeric vector of scores for each observation in the data set.
114
k is the number of neighbors used in the
calculation of the local outlier scores.
See Density_Approach.R
Outlier indexes
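A hedged sketch with DMwR::lofactor, one of the packages named above (k = 5 and the iris data are assumptions):

```r
library(DMwR)
X <- scale(iris[, 1:4])
lof <- lofactor(X, k = 5)                # one LOF score per observation
head(order(lof, decreasing = TRUE))      # indexes of the strongest local outliers
```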
Local Outlier Factor (R)
115
Local outliers are shown in
red.
See Density_Approach.R
116
Local Outlier Factor (R)
Histogram of regular observations vs outliers
See Density_Approach.R
117
✓ Twitter Outlier Detection
Time-series method
• The time-series method is used to identify outliers only in univariate time-series
data.
• In order to apply this method, we use the “AnomalyDetection” package in R.
• This package was published by Twitter for detecting anomalies in time-series
data in the presence of seasonality and an underlying trend using statistical
approaches.
• Since this package uses a specific algorithm to detect anomalies, we go over it
in detail in the next slide.
Anomaly detection, R package
• Twitter’s R package: https://github.com/twitter/AnomalyDetection
• Seasonal Hybrid ESD (S-H-ESD), which builds upon the Generalized ESD test, is
the underlying algorithm of this package.
• The algorithm employs time series decomposition and statistical metrics with the
ESD test.
• Since time-series data exhibit a huge variety of patterns, time-series
decomposition, which is a statistical method, is used to decompose the data into
its four components.
• The four components are:
1. Trend: refers to the long term progression of the series
2. Cyclical: refers to variations in recognizable cycles
3. Seasonal: refers to seasonal variations or fluctuations
4. Irregular: describes random, irregular influences
120
• The generalized ESD (extreme Studentized deviate) test (Rosner 1983) is used to detect
one or more outliers in a univariate data set that follows an approximately normal
distribution.
• The primary limitation of the Grubbs test is that the suspected number of outliers, k,
must be specified exactly. If k is not specified correctly, this can distort the conclusions
of the test.
• https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm
Generalized ESD Test for Outliers
Anomaly detection, R package
121
See TimeSeriesAnomalies.ipynb
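A hedged sketch following the package README (github.com/twitter/AnomalyDetection); raw_data is the example series shipped with the package:

```r
# devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)
data(raw_data)
res <- AnomalyDetectionTs(raw_data, max_anoms = 0.02,
                          direction = "both", plot = TRUE)
res$anoms                                # timestamps and values flagged anomalous
```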
122
Anomaly Detection as a service
123
• Also try:
▫ https://github.com/omri374/taganomaly
▫ https://docs.microsoft.com/en-us/azure/machine-learning/studio-
module-reference/anomaly-detection
▫ https://www.youtube.com/watch?v=Ra8HhBLdzHE
Anomaly as a service
Summary
124
We have covered Anomaly detection
Introduction
✓ Definition of anomaly detection and its importance
✓ Different types of anomaly detection methods: statistical, graphical and machine learning methods
Graphical approach
✓ Graphical methods consist of the boxplot, scatterplot, adjusted quantile plot and symbol plot to demonstrate
outliers graphically
✓ The main assumption for applying graphical approaches is multivariate normality
✓ The Mahalanobis distance method is mainly used for calculating the distance of a point from the center of a
multivariate distribution
Statistical approach
✓ Statistical hypothesis testing includes the Chi-square and Grubbs’ tests
✓ Statistical methods may use either scores or p-values as thresholds to detect outliers
Machine learning approach
✓ Both supervised and unsupervised learning methods can be used for outlier detection
✓ Piecewise or segmented regression can be used to identify outliers based on the residuals for each segment
✓ In the K-means clustering method, outliers are defined as points which don’t belong to any cluster, are far
away from the cluster centroids, or form sparse clusters
✓ In the PCA and Auto-encoder decoder methods, we look at points that weren’t recovered close to the original
points as anomalies
Density approach
✓ The local outlier factor algorithm is used to detect local outliers
✓ The relative density of a data point is compared to the density of its k nearest neighbors; k is mainly specified
by the user
Time series methods
✓ Temporal outlier detection to detect anomalies which is robust, from a statistical standpoint, in the presence of
seasonality and an underlying trend.
(MATLAB version also available)
www.analyticscertificate.com
126
Contact
info@qusandbox.com
for access to labs
Thank you!
Sri Krishnamurthy, CFA, CAP
Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
Contact
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.
127
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 

Kürzlich hochgeladen (20)

FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 

Anomaly detection Workshop slides

  • 16. 16 1. Importance of defining what is normal 2. Challenges when dealing with Anomaly Detection problems 3. Input Data Characteristics 4. Anomaly Classification 5. Outputs of Anomaly Detection 6. Techniques used for Anomaly Detection 7. Performance evaluation Aspects of Anomaly detection problems
  • 17. 17 • By definition, anomaly detection deals with identifying patterns and points that are not considered normal. This implies that we must first have a model to define what is normal in our datasets. • If our model doesn’t capture the nuances of “normal” behavior in our datasets, our anomaly detection algorithms won’t fare well. • If our model captures all nuances in our data, we would have overfit the model and wouldn’t be able to identify anomalies properly • If our model is too generic, then most points would show up as anomalies. • Let’s illustrate this 1. Importance of defining what is normal
  • 18. 18 • Consider the points below. Our goal is to build a “normal” model and an anomaly detection model for the following data set. • y = [2, 5, 7, 8, 11, 5, 3, 4, 5.8, 8] • Choose a “normal model”. • Determine the standard deviation and mark any point that is outside of the 3σ limit as an anomaly Model assumption
  • 19. 19 • Consider three possible “normal” models ▫ Line A connects all the points : Over fit : No anomalies ▫ Line B is a linear regression line : Poor fit : Many points listed as anomalies ▫ Line C is a polynomial fit of degree 4 : Good fit. One point shown as anomaly as it is outside of the 3σ band. This illustrates why the choice of the “normal” model is critical Model assumption
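For concreteness, here is a minimal Python sketch of this exercise, assuming the y values from the previous slide and an integer index for x; the degree-4 fit and the 3σ residual band mirror line C:

```python
import numpy as np

# Data from the slide; x is an assumed index since only y is given.
y = np.array([2, 5, 7, 8, 11, 5, 3, 4, 5.8, 8])
x = np.arange(len(y))

# Candidate "normal" model: polynomial fit of degree 4 (line C on the slide).
coeffs = np.polyfit(x, y, deg=4)
fitted = np.polyval(coeffs, x)

# Flag points whose residual falls outside the 3-sigma band.
residuals = y - fitted
sigma = residuals.std()
anomalies = np.where(np.abs(residuals) > 3 * sigma)[0]
print("Anomalous indices:", anomalies)
```

Swapping the degree for 1 (line B) or an interpolating fit (line A) shows how the flagged set changes with the choice of "normal" model.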
  • 20. 20 • We could assume that the points may come from a statistical distribution. For example from a Gaussian distribution • We could use an algorithm like k-nearest neighbor to define points that are close and identify outliers that are at a large distance from most of the points • We could use a clustering algorithm to assign membership to clusters Determining the “normal” model
  • 21. 21 • We will illustrate once more why the choice of the “normal” model is important. • We said earlier that the “outlierness” can be quantified as a score. • One popular score is the Z-score. • A z-score can be calculated from the following formula: z = (X − μ) / σ • where z is the z-score, X is the value of the element, μ is the population mean, and σ is the standard deviation. • It computes the number of standard deviations by which the data varies from the mean and is used as a proxy to determine outliers. • However, this works only if the data is from a Gaussian distribution Importance of defining what is normal
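A minimal sketch of the Z-score recipe in Python (the data vector reuses the earlier example; the |z| > 3 cutoff assumes Gaussian data):

```python
import numpy as np

def z_scores(x):
    """Number of standard deviations each point lies from the mean."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

data = [2, 5, 7, 8, 11, 5, 3, 4, 5.8, 8]
z = z_scores(data)
# Points with |z| > 3 are flagged; valid only under a Gaussian assumption.
print(np.abs(z) > 3)
```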
  • 22. 22 • In the first case, 99.7% of the data is within the 3σ limits. The Z-score test works well here to detect outliers. • In the second case, the distribution isn’t normal and is from a Zipf distribution. Here a Z-score test isn’t valid Importance of defining what is normal
  • 23. 23 • It is important for the analyst to choose the right representation for the normal model. • In the first case, clustering is useless but a linear regression is a good representation. • In the second case, clustering is more appropriate. Importance of defining what is normal
  • 24. 24 1. Importance of defining what is normal 2. Challenges when dealing with Anomaly Detection problems 3. Input Data Characteristics 4. Anomaly Classification 5. Outputs of Anomaly Detection 6. Techniques used for Anomaly Detection 7. Performance evaluation Aspects of Anomaly detection problems
  • 25. 25 • Defining the normal region that covers all normal points in the dataset • Identifying anomalies masquerading as normal data points or not being able to uncover anomalies due to a weak choice of the “normal model” ▫ Example : DOS attack vs DDOS attack 2.0 Challenges when dealing with Anomaly Detection problems
  • 26. 26 • The evolving “normal behavior” in data ▫ Example: – $100+ credit card transactions for a student – Average 5 per month – $100+ credit card transactions for a professional – Average 15 per month • Data and application dependency ▫ Example : – For AAPL, a +/- $5 fluctuation in a day is an anomaly. – For a risky stock, up to a +/- $10 fluctuation may be normal and a $15 fluctuation may be an anomaly • Lack of labeled data makes it harder to detect anomalies • Not being able to distinguish noise and anomalies 2.0 Challenges when dealing with Anomaly Detection problems
  • 27. 27 1. Importance of defining what is normal 2. Challenges when dealing with Anomaly Detection problems 3. Input Data Characteristics 4. Anomaly Classification 5. Outputs of Anomaly Detection 6. Techniques used for Anomaly Detection 7. Performance evaluation Aspects of Anomaly detection problems
  • 28. 28 • Data objects are usually described by a set of attributes (variables, features or dimensions) • The term univariate is used when the data has one attribute, while bivariate (two attributes) and multivariate (more than two attributes) describe data with more. • Attributes can be quantitative or qualitative based on their characteristics. 3.0 Input data characteristics
  • 29. Dataset, variables and observations Dataset: A rectangular array with rows as observations and columns as variables Variable: A characteristic of members of a population (Age, State, etc.) Observation: The list of variable values for a member of the population
  • 30. Types of Data — A variable is numerical if meaningful arithmetic can be performed on it. — Discrete vs. Continuous — Cross-sectional vs. Longitudinal — Otherwise, the variable is categorical. — Binary vs Multivalued — Ordinal vs Nominal
  • 31. Categorial Variables • Ordinal: natural ordering of its possible values. • Nominal : no natural ordering
  • 32. Categorical Variables • Categorical variables can be coded numerically or left uncoded. • A dummy variable is a 0–1 coded variable for a specific category. It is coded as 1 for all observations in that category and 0 for all observations not in that category. • Categorizing a numerical variable as categorical is called binning (putting the data into discrete bins) or discretizing.
  • 33. Numerical Data • Discrete : ▫ How many cities have you lived in? ▫ How many cars pass through a toll booth? • Continuous: ▫ What’s your height? ▫ What’s the temperature outside? ▫ What was the interest rate in Dec 2004?
  • 35. 35 • In Cross-sectional records, data instances are independent. • Data instances could be related to each other (typically longitudinal) • Types of data records that are typically related to each other ▫ Sequence records: – Time-series data – temporal continuity – Genome and protein sequences ▫ Spatial Data : Data is related to neighbors; Spatial continuity – Traffic data – Weather data – Ecological data ▫ Graph data – Social media data Relationship between records
  • 36. 36 1. Importance of defining what is normal 2. Challenges when dealing with Anomaly Detection problems 3. Input Data Characteristics 4. Anomaly Classification 5. Outputs of Anomaly Detection 6. Techniques used for Anomaly Detection 7. Performance evaluation Aspects of Anomaly detection problems
  • 37. 37 Anomalies can be classified into three major categories 1. Point Anomalies – If an instance is anomalous compared with the rest of the instances, the anomaly is considered a point anomaly 2. Contextual Anomalies – If an instance is anomalous in a specific context, the anomaly would be considered as a contextual anomaly 3. Collective Anomalies – If a collection of related data records are anomalous with respect to the entire data set, the anomaly is a collective anomaly 4.0 Anomaly Classification
  • 38. 38 • In the figure, points o1 and o2 are considered point anomalies • Examples: ▫ A 50% increase in daily stock price ▫ A credit card transaction attempt for $5000 (assuming you have never had a single transaction for anything above $1000) Point Anomalies
  • 39. 39 • In the figure, temperature t2 is an anomaly • Note that t1 is lower than t2 but contextually, t1 is expected and t2 isn’t when compared to records around it. Contextual Anomalies
  • 40. 40 • Multiple Buy Stock transactions and then a sequence of Sell transactions around an earnings release date may be anomalous and may indicate insider trading. • Consider the sequence of network activities recorded • Though ssh, buffer-overflow and ftp themselves are not anomalous activities, a sequence of the three indicates a web-based attack • Similarly, multiple http requests from an ip address may indicate a crawler in action. See https://www.bloomberg.com/graphics/2019-etf-tax-dodge-lets- investors-save-big/ Collective Anomalies
  • 41. 41 • In medicine, abnormal ECG pattern detection would involve looking for collective anomalies like Premature Atrial Contraction Collective Anomalies http://www.fprmed.com/Pages/Cardio/PAC.html
  • 42. 42 1. Importance of defining what is normal 2. Challenges when dealing with Anomaly Detection problems 3. Input Data Characteristics 4. Anomaly Classification 5. Outputs of Anomaly Detection 6. Techniques used for Anomaly Detection 7. Performance evaluation Aspects of Anomaly detection problems
  • 43. 43 • The goal of an Outlier detection or Anomaly Detection algorithm is to identify if there are anomalies in the data. The outputs would be of the form: ▫ Scores : A number generated by the algorithm for each record. Typically, the scores are sorted and a threshold chosen to designate anomalies ▫ Labels : Here the algorithm takes a binary decision on whether each record is an anomaly or not 5. Outputs of Anomaly Detection algorithms
  • 44. 44 • Based on whether the data is labeled or not, machine learning algorithms can be used for anomaly detection • If the historical data is labeled (anomaly/not), supervised techniques can be used • If the historical data isn’t labeled, unsupervised algorithms can be used to figure out if the data is normal/anomalous 5. Outputs of Anomaly Detection algorithms
  • 45. 45 1. Importance of defining what is normal 2. Challenges when dealing with Anomaly Detection problems 3. Input Data Characteristics 4. Anomaly Classification 5. Outputs of Anomaly Detection 6. Techniques used for Anomaly Detection 7. Performance evaluation Aspects of Anomaly detection problems
  • 46. 46 1. Extreme value analysis 2. Classification-based techniques 3. Statistical techniques i. Parametric techniques a. Gaussian model-based models b. Regression-based models c. Mixture models ii. Non-Parametric techniques a. Histogram-based models b. Kernel-based models A tour of Anomaly Detection Techniques
  • 47. 47 4. Proximity-based models i. Cluster analysis ii. Nearest neighbor analysis 5. Information theoretic models 6. Meta-algorithms and ensemble techniques i. Sequential ensembles ii. Independent ensembles A tour of Anomaly Detection Techniques
  • 48. 48 • Assumption : Anomalies are extreme values in the data set • Goal: Determine the statistical tails of the underlying distribution • Data : Univariate cross-sectional data • Examples: ▫ Z-score test for a dataset which is assumed to be Normal ▫ Grubbs’ test ▫ Using Box-plots to detect anomalies 1. Extreme value Analysis
  • 49. 49 • Assumption : Available labeled data • Goal: Build a classifier that can distinguish between normal and anomalous data • Data : Multi-dimensional cross-sectional data ( Numeric/categorical) • Examples: ▫ Rule-based (Decision trees) ▫ Neural networks ▫ SVM 2. Classification based techniques
  • 50. 50 • Statistical techniques fit a statistical model to the given data and apply a statistical inference test to determine if an unseen instance belongs to this model or not. • Based on the assumed statistical model that describes the data, anomalies are data points that are assumed to have not been generated by the model 3. Statistical techniques
  • 51. 51 • Assumption : The underlying distribution is known and the parameters for the distribution can be estimated • Goal: Infer if a data point belongs to the distribution or not. • Data : Depends on the technique • Techniques: ▫ Gaussian techniques – IQR test (Box plots) – The region between Q1 − 1.5*IQR and Q3 + 1.5*IQR contains 99.3% of observations, and hence the choice of the 1.5*IQR boundary makes the box plot rule equivalent to the 3σ technique for Gaussian data. – Grubbs’ test (later) – Chi-squared test (later) 3. Statistical techniques - Parametric
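A small Python sketch of the box-plot (1.5*IQR) rule; the sample vector is the toy dataset used later in the graphical-approach slides:

```python
import numpy as np

def iqr_outliers(x):
    """Box-plot rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (x < lower) | (x > upper)

sample = np.array([1, 3, 3, 3, 50, 97, 97, 97, 100])
print(iqr_outliers(sample))
```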
  • 52. 52 • Regression based techniques – Here the goal is to model the data into a lower dimensional sub-space using linear correlations. – i.e., summarize the data into a model parameterized by the coefficients and a constant. – Step 1: Build a regression model with various features – Step 2: Review the residuals; the magnitude of the residuals indicates anomalies. 3. Statistical techniques - Parametric
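A hedged sketch of this two-step recipe, using scikit-learn on synthetic data (the injected anomalies and the 3σ residual cutoff are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, size=200)
y[::50] += 8  # inject a few anomalies for illustration

# Step 1: build the regression model.
model = LinearRegression().fit(X, y)

# Step 2: review the residuals; large residuals indicate anomalies.
residuals = y - model.predict(X)
threshold = 3 * residuals.std()
print(np.where(np.abs(residuals) > threshold)[0])
```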
  • 53. 53 • Mixture models – Example: Gaussian mixture model – Here, the data is characterized by a process that is a mixture of Gaussian clusters. – The parameters are estimated using an EM algorithm – The goal is to determine the probability of data points being in different clusters. – Anomalies would have low probability values 3. Statistical techniques - Parametric
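A minimal sketch with scikit-learn's GaussianMixture (which fits via EM); the two-component data and the 1% log-likelihood cutoff are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
normal = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
data = np.vstack([normal, [[3.0, -6.0]]])  # one injected anomaly

# Fit a two-component Gaussian mixture via the EM algorithm.
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

# Low log-likelihood under the mixture marks anomaly candidates.
log_density = gmm.score_samples(data)
cutoff = np.percentile(log_density, 1)  # an assumed 1% threshold
print(np.where(log_density <= cutoff)[0])
```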
  • 54. 54 • Assumption : The data’s distribution isn’t known a priori • Goal: Infer if a data point belongs to the assumed normal model • Data : Depends on the technique • Techniques: ▫ Histogram – Count/Frequency based: Create a histogram. Bins with very few points indicate anomalies ▫ Kernel-based – Using density-estimation techniques, build kernel functions and estimate the probability-distribution function (pdf) for normal instances. Instances lying in the low-probability areas are termed anomalies 3. Statistical techniques - Nonparametric
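A minimal kernel-based sketch using scikit-learn's KernelDensity; the bandwidth and the 1% low-probability cutoff are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(0, 1, 300), [9.0]]).reshape(-1, 1)

# Estimate the pdf of "normal" behavior with a Gaussian kernel.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(data)
log_pdf = kde.score_samples(data)

# Instances in low-probability regions are termed anomalies.
print(np.where(log_pdf < np.percentile(log_pdf, 1))[0])
```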
  • 56. 56 • Assumption : Anomalous points are isolated from the rest of the data • Goal: Segment the points/space with a goal of identifying anomalies. • Data : Typically, multi-dimensional cross-sectional data • Methods: ▫ Clustering : Unsupervised techniques to group data into clusters using distances/densities depending on the technique. Anomalies belong to sparse clusters/ no clusters and are typically far off from “normal” clusters. ▫ Examples : K-means 4.0 Proximity-based techniques
  • 57. 57 ▫ Nearest neighbor techniques: Here, it is assumed that normal points occur in dense neighborhoods and anomalies are far from neighbors ▫ Here distances/relative densities are used to determine neighborhoods ▫ Examples : – KNN algorithm : Anomaly score of a data instance is defined as its distance to its kth nearest neighbor in a given data set. – Local Outlier Factor scores : The anomaly score (LOF) is equal to the ratio of average local density of the k-nearest neighbors of the instance and the local density of the data instance itself. 4.0 Proximity-based techniques
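A minimal sketch of the kNN anomaly score (distance to the kth nearest neighbor) using scikit-learn; the synthetic cloud and k = 5 are assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
data = np.vstack([rng.normal(0, 1, (200, 2)), [[7.0, 7.0]]])

k = 5
# k + 1 neighbors because each point's nearest neighbor is itself.
nn = NearestNeighbors(n_neighbors=k + 1).fit(data)
distances, _ = nn.kneighbors(data)

# Anomaly score: distance to the kth nearest neighbor.
scores = distances[:, k]
print(np.argsort(scores)[-3:])  # indices of the 3 most anomalous points
```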
  • 58. 58 • Assumption : Anomalies induce irregularities in the information content of the data set that inflate the size of its summary • Goal: Identify data that can’t be summarized into a lower dimensional space efficiently • Data : Typically, multi-dimensional cross-sectional data • Example: • The first line can be summarized as “AB” repeated 17 times • With a C present in the second line, the second line can no longer be succinctly summarized 5. Information theoretic models
  • 59. 59 • Assumption : Using multiple algorithms would help increase the robustness of the anomaly detection algorithm • Goal: Use ensembles to enhance the quality of anomaly detection • Data : Typically, multi-dimensional cross-sectional data • Methods: ▫ Sequential ensembles : A given algorithm or a sequence of algorithms are applied sequentially. Example: Boosting methods for classification ▫ Independent ensembles: Different algorithms or different instances of the same algorithm are run and results combined to detect robust outliers. 6. Meta-algorithms and ensemble techniques
  • 60. 60 1. Importance of defining what is normal 2. Challenges when dealing with Anomaly Detection problems 3. Input Data Characteristics 4. Anomaly Classification 5. Outputs of Anomaly Detection 6. Techniques used for Anomaly Detection 7. Performance evaluation Aspects of Anomaly detection problems
  • 61. 61 • For unsupervised cases, evaluation is hard as the data isn’t labeled • For supervised learning, use the ROC curve • The true-positive rate is also known as sensitivity, or recall in machine learning. • The false-positive rate is also known as the fall-out and can be calculated as (1 - specificity). • The ROC curve is thus the sensitivity as a function of fall-out. 7.0 Performance evaluation
  • 63. 63 • The F1 score considers both the precision p and the recall r of the test to compute the score. • The F1 score is the harmonic mean of precision and recall, F1 = 2pr / (p + r), where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0 F-1 score
  • 65. 65 1. Graphical approach 2. Statistical approach 3. Machine learning approach 4. Density based approach 5. Time series approach Illustration of five methodologies to Anomaly Detection
  • 66. 66 ✓ Boxplot ✓ Scatter plot ✓ Adjusted quantile plot ✓ Symbol plot
  • 67. Graphical approaches • Graphical methods utilize extreme value analysis, by which outliers correspond to the statistical tails of probability distributions. • Statistical tails are most commonly used for one-dimensional distributions, although the same concept can be applied to the multidimensional case. • It is important to understand that all extreme values are outliers but the reverse may not be true. • For instance, in the one-dimensional dataset {1, 3, 3, 3, 50, 97, 97, 97, 100}, the observation 50 roughly equals the mean and isn’t considered an extreme value, but since this observation is the most isolated point, it should be considered an outlier. 67
  • 68. Box plot • A standardized way of displaying the variation of data based on the five-number summary, which includes the minimum, first quartile, median, third quartile, and maximum. • This plot does not make any assumptions about the underlying statistical distribution. • Any data point not included between the minimum and maximum (the whiskers) is considered an outlier. 68
  • 70. Scatter plot • A mathematical diagram which uses Cartesian coordinates to plot ordered pairs, showing the correlation, typically between two random variables. • An outlier is defined as a data point that doesn’t seem to fit with the rest of the data points. • In scatterplots, outliers of either the intersection or the union of the two variables’ outlier sets can be shown. 70
  • 72. 72 • In statistics, a Q–Q plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other. • If the two distributions being compared are similar, the points in the Q–Q plot will approximately lie on the line y = x. Q-Q plot Source: Wikipedia
  • 73. Adjusted quantile plot • This plot identifies possible multivariate outliers by calculating the Mahalanobis distance of each point from the center of the data. • The multi-dimensional Mahalanobis distance between vectors x and y in R^n can be formulated as: d(x, y) = sqrt((x − y)ᵀ S⁻¹ (x − y)) where x and y are random vectors of the same distribution with the covariance matrix S. • An outlier is defined as a point with a distance larger than some pre-determined value. 73
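A minimal NumPy sketch of this distance-from-center calculation (the classical, non-robust estimate; the synthetic correlated data is an assumption):

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of each row of X from the center of the data."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    # sqrt of (x - mu)^T S^-1 (x - mu), computed row-wise.
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, S_inv, diff))

rng = np.random.default_rng(4)
X = np.vstack([rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], 300),
               [[4.0, -4.0]]])  # a point off the correlation axis
d = mahalanobis_distances(X)
print(np.argsort(d)[-3:])  # most distant points are outlier candidates
```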
  • 74. Adjusted quantile plot • Before applying this method and many other parametric multivariate methods, first we need to check if the data is multivariate normally distributed using different multivariate normality tests, such as Royston, Mardia, Chi-square, univariate plots, etc. • In R, we use the “mvoutlier” package, which utilizes graphical approaches as discussed above. 74
  • 75. Adjusted quantile plot 75 Min-Max normalization before diving into analysis Multivariate normality test Outlier Boolean vector identifies the outliers Alpha defines maximum thresholding proportion See Graphical_Approach.R
  • 76. Adjusted quantile plot 76 See Graphical_Approach.R Mahalanobis distances Covariance matrix
  • 77. Adjusted quantile plot 77 See Graphical_Approach.R
  • 78. Symbol plot • This plot displays two-dimensional data using robust Mahalanobis distances based on the minimum covariance determinant (MCD) estimator with adjustment. • The Minimum Covariance Determinant (MCD) estimator looks for the subset of h data points whose covariance matrix has the smallest determinant. • The four drawn ellipsoids in the plot show the Mahalanobis distances corresponding to the 25%, 50%, 75% and adjusted quantiles of the chi-square distribution. 78
  • 79. Symbol plot 79 See Graphical_Approach.R Parameter “quan” defines the fraction of observations used for the minimum covariance determinant estimation. The default is 0.5. Alpha defines the fraction of observations used for calculating the adjusted quantile.
  • 80. Case study 1: Anomaly Detection With Freddie Mac Data 2016 Copyright QuantUniversity LLC.
  • 81. 81 ✓ Hypothesis testing (Chi-square test, Grubbs’ test) ✓ Scores
  • 82. Hypothesis testing • This method draws conclusions about a sample point by testing whether it comes from the same distribution as the training data. • Statistical tests, such as the t-test and the ANOVA table, can be used on multiple subsets of the data. • Here, the level of significance, i.e., the probability of incorrectly rejecting the true null hypothesis, needs to be chosen. • To apply this method in R, the “outliers” package, which utilizes statistical tests, is used. 82
  • 83. Chi-square test • The Chi-square test performs a simple test for detecting outliers of univariate data based on the Chi-square distribution of the squared difference between the data and the sample mean. • In this test, the sample variance counts as the estimator of the population variance. • The Chi-square test helps us identify the lowest and highest values, since outliers can exist in both tails of the data. 83
  • 84. 84 When an analyst attempts to fit a statistical model to observed data, he or she may wonder how well the model actually reflects the data. How "close" are the observed values to those which would be expected under the fitted model? One statistical test that addresses this issue is the chi-square goodness of fit test. This test is commonly used to test association of variables in two-way tables where the assumed model of independence is evaluated against the observed data. In general, the chi-square test statistic is of the form χ² = Σ (observed − expected)² / expected. If the computed test statistic is large, then the observed and expected values are not close and the model is a poor fit to the data (anomaly). Chi-square test
  • 85. Chi-square test 85 See Statistical_Approach.R This function repeats the Chi-square test until it finds all the outliers within the data.
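For readers without R, here is a rough Python analogue of the repeated chi-square outlier test described above (the stopping rule and alpha are assumptions; R's outliers::chisq.out.test differs in details):

```python
import numpy as np
from scipy import stats

def chisq_outliers(x, alpha=0.05):
    """Repeatedly test the most extreme point via (x - mean)^2 / var,
    compared against a chi-square distribution with 1 degree of freedom."""
    x = list(map(float, x))
    flagged = []
    while len(x) > 2:
        arr = np.asarray(x)
        stat = (arr - arr.mean()) ** 2 / arr.var(ddof=1)  # sample variance
        i = int(np.argmax(stat))
        p_value = stats.chi2.sf(stat[i], df=1)
        if p_value >= alpha:
            break  # most extreme remaining point is not significant
        flagged.append(x.pop(i))
    return flagged

print(chisq_outliers([1, 3, 3, 3, 50, 97, 97, 97, 100]))
```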
  • 86. Grubbs’ test • Test for outliers for univariate data sets assumed to come from a normally distributed population. • Grubbs’ test detects one outlier at a time. This outlier is expunged from the dataset and the test is iterated until no outliers are detected. • This test is defined for the following hypotheses: H0: There are no outliers in the data set H1: There is exactly one outlier in the data set • The Grubbs’ test statistic is defined as: G = max |Yi − Ȳ| / s, where Ȳ and s denote the sample mean and standard deviation. 86
  • 87. Grubbs’ test 87 See Statistical_Approach.R The above function repeats the Grubbs’ test until it finds all the outliers within the data.
  • 88. Grubbs’ test 88 See Statistical_Approach.R Histogram of normal observations vs outliers
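A rough Python analogue of one round of Grubbs' test (iterate it, removing the flagged point, to mimic the R loop above; alpha and the toy data are assumptions):

```python
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    """One round of Grubbs' test: G = max|x_i - mean| / s.
    Returns the index of the detected outlier, or None.
    Assumes approximately normal data."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean, s = x.mean(), x.std(ddof=1)
    G = np.max(np.abs(x - mean)) / s
    # Two-sided critical value from the t distribution.
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return int(np.argmax(np.abs(x - mean))) if G > G_crit else None

print(grubbs_test([2, 5, 7, 8, 11, 5, 3, 4, 5.8, 40]))
```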
  • 89. Scores • Scores quantify the tendency of a data point being an outlier by assigning it a score or probability. • The most commonly used scores are: ▫ Normal score: (x − mean) / standard deviation ▫ T-student score: (z·√(n−2)) / √(n−1−z²) ▫ Chi-square score: (x − mean)² / variance ▫ IQR score: distance to the nearest quartile divided by the IQR (Q3 − Q1) • By using the “score” function in R, p-values can be returned instead of scores. 89
  • 90. Scores 90 See Statistical_Approach.R “type” defines the type of the score, such as normal, t-student, etc. “prob=1” returns the corresponding p-value.
  • 91. Scores 91 See Statistical_Approach.R By setting “prob” to any specific value, logical vector returns the data points, whose probabilities are greater than this cut-off value, as outliers. By setting “type” to IQR, all values lower than first and greater than third quartiles are considered and difference between them and nearest quartile divided by IQR is calculated.
  • 92. 92 ✓ Linear regression ✓ Piecewise/segmented regression ✓ Autoencoder-Decoder ✓ Clustering-based approaches
  • 93. Linear regression • Linear regression investigates the linear relationships between variables and predicts one variable based on one or more other variables; it can be formulated as: Y = β0 + Σ_{i=1..k} βi Xi where Y and the Xi are random variables, βi is a regression coefficient and β0 is a constant. • In this model, the ordinary least squares estimator is usually used to minimize the squared differences between the observed and fitted values of the dependent variable. 93
  • 94. Piecewise/segmented regression • A method in regression analysis, in which the independent variable is partitioned into intervals to allow multiple linear models to be fitted to the data for different ranges. • This model can be applied when there are ‘breakpoints’ and clearly two different linear relationships in the data with a sudden, sharp change in directionality. Below is a simple segmented regression for data with one breakpoint (two segments): y = c0 + b1·x for x < x1 ; y = c1 + b2·x for x > x1 where y is a predicted value, x is an independent variable, c0 and c1 are constant values, b1 and b2 are regression coefficients, and x1 is the breakpoint. 94
  • 95. 95 Anomaly detection vs Supervised learning
  • 96. Piecewise/segmented regression • For this example, we use “segmented” package in R to first illustrate piecewise regression for two dimensional data set, which has a breakpoint around z=0.5. 96 See Piecewise_Regression.R “pmax” is used for parallel maximization to create different values for y.
  • 97. Piecewise/segmented regression • Then, we use linear regression to predict y values for each segment of z. 97 See Piecewise_Regression.R
  • 98. Piecewise/segmented regression • Finally, the outliers can be detected for each segment by setting rules on the residuals of the model. 98 See Piecewise_Regression.R Here, we set a rule on the residuals for the segment with z less than 0.5; points whose residuals exceed the chosen cutoff are flagged as outliers.
  • 100. 100 • The goal is to have the reconstruction x̂ approximate x • Interesting applications such as ▫ Data compression ▫ Visualization ▫ Pre-training neural networks Autoencoder
  • 101. 101 Demo in Keras1 1. https://blog.keras.io/building-autoencoders-in-keras.html 2. https://keras.io/models/model/
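A minimal Keras sketch in the spirit of the linked blog post (the layer sizes, synthetic data, and reconstruction-error ranking are illustrative assumptions, not the workshop's exact notebook):

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(5)
x_train = rng.normal(0, 1, (1000, 20)).astype("float32")

# Small dense autoencoder: 20 -> 4 -> 20.
inputs = keras.Input(shape=(20,))
encoded = keras.layers.Dense(4, activation="relu")(inputs)
decoded = keras.layers.Dense(20, activation="linear")(encoded)
autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x_train, x_train, epochs=10, batch_size=32, verbose=0)

# Points the model cannot reconstruct well are anomaly candidates.
recon = autoencoder.predict(x_train, verbose=0)
errors = np.mean((x_train - recon) ** 2, axis=1)
print(np.argsort(errors)[-5:])
```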
  • 102. 102 Principal Component Analysis Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. In Outlier analysis, we do principal component analysis and compute p-values to test for outliers. https://en.wikipedia.org/wiki/Principal_component_analysis
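A minimal sketch of PCA-based outlier scoring via reconstruction error (a simpler stand-in for the p-value computation mentioned above; the data and component count are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
# Correlated 2-D cloud plus one point off the principal axis.
X = rng.multivariate_normal([0, 0], [[3, 2.5], [2.5, 3]], 300)
X = np.vstack([X, [[-3.0, 3.0]]])

# Project onto the first principal component and reconstruct.
pca = PCA(n_components=1).fit(X)
X_recovered = pca.inverse_transform(pca.transform(X))

# Points not recovered close to the original are anomaly candidates.
errors = np.sum((X - X_recovered) ** 2, axis=1)
print(np.argsort(errors)[-3:])
```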
  • 103. Clustering-based approaches • These methods are suitable for unsupervised anomaly detection. • They aim to partition the data into meaningful groups (clusters) based on the similarities and relationships between the groups found in the data. • Each data point is assigned a degree of membership for each of the clusters. • Anomalies are those data points that: ▫ Do not fit into any clusters. ▫ Belong to a particular cluster but are far away from the cluster centroid. ▫ Form small or sparse clusters. 103
  • 104. Clustering-based approaches • These methods partition the data into k clusters by assigning each data point to its closest cluster centroid by minimizing the within-cluster sum of squares (WSS): WSS = Σ_{k=1..K} Σ_{x∈Ck} Σ_{j=1..p} (x_j − μ_kj)² where Ck is the set of observations in the kth cluster and μ_kj is the mean of the jth variable of the cluster center of the kth cluster. • Then, they select the top n points that are the farthest away from their nearest cluster centers as outliers. 104
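A minimal scikit-learn sketch of this recipe (the cluster count and the number of reported outliers, analogous to K and L in the kmod slides, are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
data = np.vstack([rng.normal(0, 1, (100, 2)),
                  rng.normal(8, 1, (100, 2)),
                  [[4.0, 15.0]]])  # one injected anomaly

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Distance of each point to its assigned cluster center.
centers = km.cluster_centers_[km.labels_]
dists = np.linalg.norm(data - centers, axis=1)

n = 3  # an assumed number of outliers to report
print(np.argsort(dists)[-n:])
```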
  • 105. 105 Anomaly Detection vs Unsupervised Learning
  • 106. Clustering-based approaches • The “kmod” package in R is used to show the application of the K-means model. 106 In this example the number of clusters is chosen from an elbow graph and then passed to the kmod function. See Clustering_Approach.R
  • 107. Clustering-based approaches 107 See Clustering_Approach.R K=4 is the number of clusters and L=10 is the number of outliers
  • 108. Clustering-based approaches 108 See Clustering_Approach.R Scatter plots of normal and outlier data points
  • 109. Case study 2: Anomaly Detection With German Credit data 2016 Copyright QuantUniversity LLC.
  • 110. Case study 3: Anomaly Detection Auto-Encoder Decoders 2016 Copyright QuantUniversity LLC.
  • 112. Local Outlier Factor (LOF) • The local outlier factor (LOF) algorithm first calculates the density of the local neighborhood for each point. • Then, for each object p, the LOF score is defined as the average of the ratios of the density of the nearest neighbors of p to the density of p itself. The number of nearest neighbors, k, is given by the user. • Points with the largest LOF scores are considered outliers. • In R, both the “DMwR” and “Rlof” packages can be used to run the LOF model. 112
  • 113. Local Outlier Factor (LOF) • The LOF scores for outlying points will be high because they are computed in terms of ratios to the average neighborhood reachability distances. • As a result, for data points distributed homogeneously within a cluster, the LOF scores will be close to one. • Over a range of values for k, the maximum LOF score can be used as the score associated with a local outlier. 113
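For readers without R, a minimal scikit-learn analogue of the DMwR/Rlof usage on the next slides (the synthetic data and k = 10 are assumptions):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(8)
data = np.vstack([rng.normal(0, 1, (200, 2)), [[5.0, 5.0]]])

lof = LocalOutlierFactor(n_neighbors=10)  # k is chosen by the user
labels = lof.fit_predict(data)            # -1 marks outliers

# sklearn exposes the negated LOF; larger positive LOF = more anomalous.
scores = -lof.negative_outlier_factor_
print(np.where(labels == -1)[0], scores.max())
```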
  • 114. Local Outlier Factor (R) • LOF returns a numeric vector of scores for each observation in the data set. 114 k is the number of neighbors used in the calculation of local outlier scores. See Density_Approach.R Outlier indexes
  • 115. Local Outlier Factor (R) 115 Local outliers are shown in red. See Density_Approach.R
  • 116. 116 Local Outlier Factor (R) Histogram of regular observations vs outliers See Density_Approach.R
  • 118. Time-series method • The time-series model is used to identify outliers only in univariate time-series data. • In order to apply this model, we use the “AnomalyDetection” package in R. • This package was published by Twitter for detecting anomalies in time-series data in the presence of seasonality and an underlying trend using statistical approaches. • Since this package uses a specific algorithm to detect anomalies, we go over it in detail in the next slide.
  • 119. Anomaly detection, R package • Twitter’s R package: https://github.com/twitter/AnomalyDetection • Seasonal Hybrid ESD (S-H-ESD), which builds upon the Generalized ESD test, is the underlying algorithm of this package. • The algorithm employs time-series decomposition and statistical metrics with the ESD test. • Since time-series data exhibit a huge variety of patterns, time-series decomposition, a statistical method, is used to decompose the data into its four components. • The four components are: 1. Trend: refers to the long term progression of the series 2. Cyclical: refers to variations in recognizable cycles 3. Seasonal: refers to seasonal variations or fluctuations 4. Irregular: describes random, irregular influences
  • 120. 120 • The generalized ESD (extreme Studentized deviate) test (Rosner 1983) is used to detect one or more outliers in a univariate data set that follows an approximately normal distribution. • The primary limitation of the Grubbs test is that the suspected number of outliers, k, must be specified exactly. If k is not specified correctly, this can distort the conclusions of the test. The generalized ESD test, in contrast, only requires an upper bound on the suspected number of outliers. • https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm Generalized ESD Test for Outliers
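A hedged Python sketch of the generalized ESD test following the NIST formulation (the toy data and alpha are assumptions; Twitter's S-H-ESD adds seasonal decomposition on top of this):

```python
import numpy as np
from scipy import stats

def generalized_esd(x, max_outliers, alpha=0.05):
    """Generalized ESD test (Rosner 1983): finds up to max_outliers outliers
    in roughly normal univariate data without fixing k exactly in advance."""
    x = np.asarray(x, dtype=float)
    idx, values = list(range(len(x))), list(x)
    candidates, num_outliers = [], 0
    n = len(x)
    for i in range(1, max_outliers + 1):
        arr = np.asarray(values)
        # Test statistic R_i: largest standardized deviation from the mean.
        R = np.abs(arr - arr.mean()) / arr.std(ddof=1)
        j = int(np.argmax(R))
        # Critical value lambda_i from the t distribution.
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lam = (n - i) * t / np.sqrt((n - i - 1 + t**2) * (n - i + 1))
        candidates.append(idx[j])
        if R[j] > lam:
            num_outliers = i  # largest i with R_i > lambda_i
        values.pop(j)
        idx.pop(j)
    return candidates[:num_outliers]

print(generalized_esd([2, 5, 7, 8, 11, 5, 3, 4, 5.8, 40, -25], max_outliers=3))
```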
  • 121. Anomaly detection, R package 121 See TimeSeriesAnomalies.ipynb
  • 123. 123 • Also try: ▫ https://github.com/omri374/taganomaly ▫ https://docs.microsoft.com/en-us/azure/machine-learning/studio- module-reference/anomaly-detection ▫ https://www.youtube.com/watch?v=Ra8HhBLdzHE Anomaly as a service
  • 124. Summary 124 We have covered Anomaly detection Introduction ✓ Definition of anomaly detection and its importance ✓ Different types of anomaly detection methods: statistical, graphical and machine learning methods Graphical approach ✓ Graphical methods consist of the boxplot, scatterplot, adjusted quantile plot and symbol plot to demonstrate outliers graphically ✓ The main assumption for applying graphical approaches is multivariate normality ✓ The Mahalanobis distance is mainly used for calculating the distance of a point from the center of a multivariate distribution Statistical approach ✓ Statistical hypothesis testing includes the Chi-square and Grubbs’ tests ✓ Statistical methods may use either scores or p-values as thresholds to detect outliers Machine learning approach ✓ Both supervised and unsupervised learning methods can be used for outlier detection ✓ Piecewise or segmented regression can be used to identify outliers based on the residuals for each segment ✓ In the K-means clustering method, outliers are defined as points which don’t belong to any cluster, are far away from the cluster centroids, or form sparse clusters ✓ In PCA and autoencoder-decoder methods, we treat points that aren’t reconstructed close to the original points as anomalies Density approach ✓ The local outlier factor algorithm is used to detect local outliers ✓ The relative density of a data point is compared with the density of its k nearest neighbors; k is mainly chosen by the user Time series methods ✓ Temporal outlier detection to detect anomalies that is robust, from a statistical standpoint, in the presence of seasonality and an underlying trend.
  • 125. (MATLAB version also available) www.analyticscertificate.com
  • 127. Thank you! Sri Krishnamurthy, CFA, CAP Founder and CEO QuantUniversity LLC. srikrishnamurthy www.QuantUniversity.com Contact Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be distributed or used in any other publication without the prior written consent of QuantUniversity LLC. 127