1. © 2020 KNIME AG. All Rights Reserved.
Automating Inferences out of Financial Data
Based on the example of credit card fraud detection
Maarit Widmann maarit.widmann@knime.com
Mathilde Humeau mathilde.humeau@knime.com
2. Approaches for a labeled vs. unlabeled dataset
• Situation 1: The dataset has enough fraud examples
– Train a classification model
• Situation 2: The dataset has no (or just a negligible number of) fraud examples
– Use a neural autoencoder
– Use an outlier detection technique, e.g. isolation forest
3. Situation 1: The dataset has enough fraud examples
4. Fraud detection using a labeled dataset
Transactions
• Trx 1
• Trx 2
• Trx 3
• Trx 4
• Trx 5
• Trx 6
• …
Model
5. KNIME Analytics Platform
• An open-source tool for data analysis, manipulation, visualization, and reporting
• Based on the graphical programming paradigm
• Provides a diverse array of extensions:
– Text Mining
– Network Mining
– Cheminformatics
– Many integrations, such as Java, R, Python, Weka, Keras, Plotly, and H2O
6. Model training with labeled data
Workflow on the KNIME Hub:
https://kni.me/w/gwBpbUtj0awOERjg
7. The Final Goal of a Classification Model
Trade-off: contact customers for no reason (false positives) vs. accept a higher amount of fraud (false negatives)
8. Model training with labeled data
Classification based on the predicted positive class score
Optimize the classification threshold on Cohen’s kappa
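A minimal sketch of choosing a threshold on the positive class score by Cohen's kappa. This is not the KNIME workflow itself; the toy labels, scores, candidate thresholds, and the function names `cohens_kappa` and `best_threshold` are assumptions made for illustration.

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels (1 = fraud, 0 = legitimate)."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal class frequencies.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        return 0.0
    return (observed - expected) / (1.0 - expected)


def best_threshold(y_true, scores, candidates):
    """Return (threshold, kappa) for the candidate with the highest kappa."""
    best = None
    for thr in candidates:
        y_pred = [1 if s >= thr else 0 for s in scores]
        kappa = cohens_kappa(y_true, y_pred)
        if best is None or kappa > best[1]:
            best = (thr, kappa)
    return best


# Hypothetical toy data: 8 legitimate and 2 fraudulent transactions.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.20, 0.25, 0.30, 0.35, 0.45, 0.40, 0.90]
candidates = [i / 10 for i in range(1, 10)]

threshold, kappa = best_threshold(y_true, scores, candidates)
print(threshold, round(kappa, 3))
```

On this toy data the default threshold of 0.5 misses one of the two fraud cases, while a lower threshold catches both at the cost of one false alarm and gets a higher kappa.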
9. Classifying Imbalanced Data
10. Classifying Imbalanced Data
• Some accuracy metrics are not informative when the target class is imbalanced
[Figure: two scatter plots (axes x and y) of fraudulent and legitimate transactions. First plot: accuracy = 99.9 %, with 99 % and 51 % correctly classified per class. Second plot: accuracy = 95.4 %, with 98 % and 93 % correctly classified per class.]
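A small sketch of why overall accuracy can mislead on an imbalanced target. The class counts and the "predict everything as legitimate" model are made up for illustration and are not the figures from the slide.

```python
def accuracy(y_true, y_pred):
    """Share of all rows that are classified correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall_per_class(y_true, y_pred):
    """Share of correctly classified rows within each true class."""
    result = {}
    for label in set(y_true):
        rows = [p for t, p in zip(y_true, y_pred) if t == label]
        result[label] = sum(1 for p in rows if p == label) / len(rows)
    return result


# Hypothetical data: 990 legitimate (0) and 10 fraudulent (1) transactions.
y_true = [0] * 990 + [1] * 10
# A useless model that labels every transaction as legitimate.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
recall = recall_per_class(y_true, y_pred)
print(acc, recall)
```

The accuracy looks excellent although not a single fraudulent transaction is detected, which is why class-wise statistics or Cohen's kappa are more informative here.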
11. Handling Imbalanced Data
• Resample the data to make the target class distribution balanced
[Figure: two scatter plots (axes x and y) of fraudulent and legitimate transactions]
12. Resampling Techniques
SMOTE
• Generate synthetic events in the minority class
Undersampling
• Remove a random sample of the majority class events
Oversampling
• Duplicate a random sample of the minority class events
[Figure: four scatter plots (axes x and y) of fraudulent and legitimate transactions, one labeled "Unbalanced data"]
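The three resampling techniques can be sketched in a few lines. This is a simplified illustration, not the KNIME nodes: the function names, the toy points, and the choice of `k=2` neighbors are assumptions. The `smote` function follows the basic idea of interpolating between a minority point and one of its nearest minority neighbors.

```python
import random


def undersample(majority, minority, rng):
    """Keep only a random sample of the majority class, as large as the minority."""
    return rng.sample(majority, len(minority)), list(minority)


def oversample(majority, minority, rng):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


def smote(majority, minority, rng, k=2):
    """Add synthetic minority rows on the line between a row and a near neighbor."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    while len(minority) + len(synthetic) < len(majority):
        a = rng.choice(minority)
        neighbors = sorted((p for p in minority if p is not a),
                           key=lambda p: dist2(a, p))[:k]
        b = rng.choice(neighbors)
        t = rng.random()  # position between a (t = 0) and b (t = 1)
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return list(majority), list(minority) + synthetic


# Hypothetical toy data: 20 legitimate and 4 fraudulent points.
rng = random.Random(42)
majority = [(float(i), float(i % 5)) for i in range(20)]
minority = [(100.0, 100.0), (101.0, 100.0), (100.0, 101.0), (101.0, 101.0)]

under_major, under_minor = undersample(majority, minority, rng)
over_major, over_minor = oversample(majority, minority, rng)
smote_major, smote_minor = smote(majority, minority, rng)
print(len(under_major), len(over_minor), len(smote_minor))
```

Resampling should be applied to the training set only, so that the model is still evaluated on the original class distribution.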
13. Situation 2: The dataset has no fraud examples
14. Fraud detection using an unlabeled dataset
[Figure: application areas and data types shown on the slide: fault detection, fraud detection, predictive maintenance, intrusion, medicine, heart beat, sensor data, assembling details, transactions, networks, finance, IoT, weather information, system health monitoring]
15. What is an autoencoder?
Input Layer → Hidden Layers → Output Layer
• Input 𝒙: feature vector of a transaction (time, amount, etc.)
• Linear transformation of the feature vector
• Output 𝒙′: reconstructed feature vector of a transaction (time, amount, etc.)
• Distance between 𝒙 and 𝒙′ → fraudulent or legitimate
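The reconstruction-error idea can be sketched without a deep learning library. As a stand-in for a trained neural autoencoder, the sketch below fits a linear 2 → 1 → 2 autoencoder in closed form (projection onto the first principal axis of the training data). The two features, the toy data, and the rule "threshold = largest error seen on legitimate training data" are assumptions made for illustration.

```python
import math


def fit_linear_autoencoder(rows):
    """Fit a linear 2 -> 1 -> 2 autoencoder in closed form (first principal axis)."""
    n = len(rows)
    mean = [sum(r[j] for r in rows) / n for j in range(2)]
    cxx = sum((r[0] - mean[0]) ** 2 for r in rows) / n
    cyy = sum((r[1] - mean[1]) ** 2 for r in rows) / n
    cxy = sum((r[0] - mean[0]) * (r[1] - mean[1]) for r in rows) / n
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)  # angle of the first principal axis
    return mean, (math.cos(theta), math.sin(theta))


def reconstruct(x, model):
    mean, direction = model
    # Encoder: compress the two features into a single number.
    code = sum((xi - mi) * di for xi, mi, di in zip(x, mean, direction))
    # Decoder: map the single number back to two features.
    return [mi + code * di for mi, di in zip(mean, direction)]


def reconstruction_error(x, model):
    """Squared Euclidean distance between x and its reconstruction x'."""
    return sum((xi - ri) ** 2 for xi, ri in zip(x, reconstruct(x, model)))


# Hypothetical legitimate transactions: the second feature is about twice the first.
legit = [(float(i), 2.0 * i + (0.1 if i % 2 else -0.1)) for i in range(20)]
model = fit_linear_autoencoder(legit)
threshold = max(reconstruction_error(x, model) for x in legit)


def is_suspicious(x):
    return reconstruction_error(x, model) > threshold


print(is_suspicious((10.5, 21.0)), is_suspicious((10.0, 60.0)))
```

A transaction that follows the pattern of the training data is reconstructed almost perfectly, while one that breaks the pattern has a large distance between 𝒙 and 𝒙′ and is flagged.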
16. Example of an autoencoder
Training with numbers: Input → Encoder → Compressed representation → Decoder → Reconstructed input
Applying the trained autoencoder: Input → Encoder → Decoder → Reconstructed input
[Figure: in one example the difference between the input and the reconstructed input is small, in the other it is big]
17. Fraud detection using an autoencoder
Workflow on the KNIME Hub:
https://kni.me/w/9qFNMrsuN4PH1hRg
18. Fraud detection using isolation forest
Workflow on the KNIME Hub:
https://kni.me/w/xSIWSAh_u-fwgi5B
19. Isolation forest algorithm
Idea: Outliers can be isolated with fewer random splits
[Figure: isolation trees with random splits on the features 𝑥1 and 𝑥2]
→ shorter mean path length, i.e. fewer random splits
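The isolation idea can be sketched as follows. This is a simplified illustration rather than a full isolation forest: it skips the subsampling and score normalization of the published algorithm, and the function names, toy data, depth limit, and tree count are assumptions.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=12):
    """Count the random splits needed until `point` is alone in its partition."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Only features that still vary in this partition can be split.
    splittable = [j for j in range(len(point))
                  if min(r[j] for r in data) < max(r[j] for r in data)]
    if not splittable:
        return depth
    j = rng.choice(splittable)
    lo = min(r[j] for r in data)
    hi = max(r[j] for r in data)
    split = rng.uniform(lo, hi)
    # Follow the branch that contains the point of interest.
    same_side = [r for r in data if (r[j] < split) == (point[j] < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=100, seed=0):
    """Average path length over many random trees; shorter means more anomalous."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


# Hypothetical data: a dense 10 x 10 grid of normal points plus one outlier.
normal = [(x / 10, y / 10) for x in range(10) for y in range(10)]
outlier = (5.0, 5.0)
data = normal + [outlier]

outlier_length = mean_path_length(outlier, data)
inlier_length = mean_path_length((0.4, 0.5), data)
print(outlier_length, inlier_length)
```

The outlier is usually separated from the rest by the first or second split, while a point in the middle of the dense region needs many more splits.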
20. Fraud Detection in Labeled and Non-Labeled Data
• Fraud Detection Using a Neural Autoencoder as #13 most read article on
• Fraud Detection using Random Forest, Neural Autoencoder, and Isolation Forest techniques tutorial on
Follow the KNIME blog for more articles:
https://www.knime.com/blog
21. The KNIME Hub
https://hub.knime.com
22. The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by
KNIME AG under license from KNIME GmbH, and are registered in the United States.
KNIME® is also registered in Germany.
Thank You