SlideShare ist ein Scribd-Unternehmen logo
1 von 28
Large Scale Machine Learning for Fraud
Prevention
Disclaimer: views expressed here are my own and do not
necessarily represent the views of PayPal, its affiliates or
subsidiaries.
Agenda Introduction
Fraud Prevention
Algorithm
Experiments
Conclusion
©2016 PayPal Inc. Confidential and proprietary.
INTRODUCTION
© 2016 PayPal Inc. Confidential and proprietary.
About Me
• Software Engineer/Data Scientist/ML Researcher
• Ph. D Computer Science
• Research in Face Recognition, Phishing/Spam, Fraud Prevention
4
Vibrant
developer
ecosyste
m
Efficient
payment
processing
with low SLA
active customer
accounts
200M
Secure
data
storage
&
handling
Traditional
& NoSQL
databases
PayPal operates
one of the largest
PRIVATE
CLOUDS
in the world
We have transformed
core business
processes into robust
SERVICE-BASED
PLATFORMS
The power of
our platform
Our technology transformation enables us to:
• Process payments at tremendous scale
• Accelerate the innovation of new products
• Engage world-class developers & technologists
About PayPal
FRAUD PREVENTION
Fraud Prevention @ PayPal
Robust feature engineering, machine
learning and statistical models
Highly scalable and multi-layered
infrastructure software
Superior team of data scientists,
researchers, financial and intelligence
analysts
Images source:
Fraud Prevention @ PayPal
• Employs advanced machine learning and statistical models to flag
fraudulent behavior up-front
• More sophisticated algorithms after transaction is complete
Transaction Level
• Monitor account level activity to identify abusive behavior
• Abusive pattern include frequent payments, suspicious profile
changes
Account Level
• Monitor account-to-account interaction
• Frequent transfer of money from several accounts to one central
account
Network Level
Fraud Prevention – What are we up against?
Fraudsters are becoming increasingly smarter and adaptive
Need cost-effective solutions that can model complex attack
patterns not previously observed
Need scalable and computationally efficient prediction models
© 2016 PayPal Inc. Confidential and proprietary.
Fraud Prevention – What are we up against?
• Much harder to get performance lift on
our flagship models
• Need to re-look at all aspects of
traditional model building
• Need out-of-the-box thinking
10
Area we are missing (AUC 0.96)
© 2016 PayPal Inc. Confidential and proprietary.
Fraud Prevention – What can we do to build better models?
11
feature1 …. featureN ……… Target
(Label)
d1
d2
…
dM
…..
Better
feature
Better
labeling
Advanced ML
Algorithms
Bigger
better data
LSML ALGORITHMS – ACTIVE LEARNING,
DEEP LEARNING, GBT
© 2016 PayPal Inc. Confidential and proprietary.
Active Learning – What is it?
• Supervised learning algorithms require
data to be labeled
• Labelling is difficult, time-consuming
and expensive : Active Learning to the
rescue
• Idea – ML Algorithm can achieve better
accuracy if it is allowed to “choose the
data” from which it learns*
• Overcome labelling bottleneck by asking
queries (unlabeled data) to be labeled by
human
13
Unlabeled
Data
Labeled Data
Human Annotator
Machine Learning Model
(Re)Build Model
Select Queries
Source*: Burr Settles
© 2016 PayPal Inc. Confidential and proprietary.
Active Learning – What is it?
• Scenarios
• Membership Query Synthesis – request labels for ‘any’
unlabeled instance in input space
• Stream-based Selective Sampling – unlabeled instance is drawn
one at a time & learner decides whether to discard or query
• Pool-based Sampling – instances are queried from a pool
according to informative-ness measure
14
© 2016 PayPal Inc. Confidential and proprietary.
Active Learning – What is it?
• Query Strategy Frameworks
• Uncertainty Sampling
• Query-By-Committee
• Expected Model Change
• Expected Error Reduction
• Variance Reduction
• Density Weighted Methods
15
© 2016 PayPal Inc. Confidential and proprietary.
Active Learning –Toy Example
16
Toy data – 400 instances Model using random sampling
70% accuracy
Model using active learning
Uncertainty sampling – 90% accuracy
© 2016 PayPal Inc. Confidential and proprietary.
Active Learning For Fraud Prevention – Why is it unique?
17
• Data is unbalanced
• Fraud labelling require trained experts. Can’t be outsourced
• Fraud labelling is time consuming
• Fraud labelling require more than just individual instances. Require before
& after transactions
• Fraud labelling require data from other entities (ex: IP address)
• Fraud labelling require aggregate data
• Fraud tag mature at different times (ex: chargeback) & not instantaneous
© 2016 PayPal Inc. Confidential and proprietary.
Active Learning For Fraud Prevention – High Level Framework
18
Labeled
Data
Create Bags
Deep Learning
Model
GBT Model
(Re)Build Models
Unlabeled
Data
Predict
Query By Committee
Human Expert
Create
Statistics
Active
Feature
Engineering
Simulate
Features
© 2016 PayPal Inc. Confidential and proprietary.
Modeling Algorithm – Deep Learning
19
Input Layer
Hidden Layers
Output Layer
• If a network has many layers of non-linearity, it is “deep”
• Need scalable platform
• Need lots of training data
© 2016 PayPal Inc. Confidential and proprietary.
Modeling Algorithm – Deep Learning
20
•NetworkTopology – Feed forward
•Key Parameters
• # of hidden layers
• # of neurons @ each hidden layer
• Regularization
• Activation function
© 2016 PayPal Inc. Confidential and proprietary.
Modeling Algorithm – Gradient BoostingTrees
21
• GBT = Gradient Descent + Boosting
• Fit an additive (ensemble) model in forward stage wise manner
• In each stage introduce a new model to compensate the shortcomings
of existing models
© 2016 PayPal Inc. Confidential and proprietary.
Modeling Algorithm – Gradient BoostingTrees
22
• Strengths
• No pre-processing required
• Robust
• Scalable
• Weaknesses
• Overfits (Need to find proper stopping point)
• Sensitive to noise
• Key Parameters
• # of trees
• Max depth
• Max observations
• Learning rate
EXPERIMENTS
© 2016 PayPal Inc. Confidential and proprietary.
Datasets
24
• Training Data
• 1 year
• 11 million transactions (1 million for active labelling)
• Test Data
• 4 months
• 4 million transactions
• # of features
• 500 - 600
© 2016 PayPal Inc. Confidential and proprietary.
Tools
25
• H2O
• Open source
• Scalable
• Robust
• Deep Learning & GBM implementations
• R
• Open source
• Active learning package
© 2016 PayPal Inc. Confidential and proprietary. 26
# of instances queried AUC (*weighted)
0 0.960
1000 0.961
10000 0.963
50000 0.971
100000 0.975
500000 0.977
1000000 0.979
Early Results – Active Learning Shows Promise…
CONCLUSIONS
© 2016 PayPal Inc. Confidential and proprietary.
Conclusions
28
• Deep learning & GBT has shown tremendous performance for fraud
detection.
• Active learning shows promise in improving performance of these
champion models
• Active learning also significantly reduce our labelling cost

Weitere ähnliche Inhalte

Was ist angesagt?

Webinar - Risky Business: How to Balance Innovation & Risk in Big Data
Webinar - Risky Business: How to Balance Innovation & Risk in Big DataWebinar - Risky Business: How to Balance Innovation & Risk in Big Data
Webinar - Risky Business: How to Balance Innovation & Risk in Big DataZaloni
 
Accelerating Insight - Smart Data Lake Customer Success Stories
Accelerating Insight - Smart Data Lake Customer Success StoriesAccelerating Insight - Smart Data Lake Customer Success Stories
Accelerating Insight - Smart Data Lake Customer Success StoriesCambridge Semantics
 
Ovum Fireside Chat: Governing the data lake - Understanding what's in there
Ovum Fireside Chat: Governing the data lake - Understanding what's in thereOvum Fireside Chat: Governing the data lake - Understanding what's in there
Ovum Fireside Chat: Governing the data lake - Understanding what's in thereZaloni
 
The 5 Keys to a Killer Data Lake
The 5 Keys to a Killer Data LakeThe 5 Keys to a Killer Data Lake
The 5 Keys to a Killer Data LakeDataWorks Summit
 
Marketing Digital Command Center
Marketing Digital Command CenterMarketing Digital Command Center
Marketing Digital Command CenterDataWorks Summit
 
Key Considerations for Putting Hadoop in Production SlideShare
Key Considerations for Putting Hadoop in Production SlideShareKey Considerations for Putting Hadoop in Production SlideShare
Key Considerations for Putting Hadoop in Production SlideShareMapR Technologies
 
Building A Self Service Analytics Platform on Hadoop
Building A Self Service Analytics Platform on HadoopBuilding A Self Service Analytics Platform on Hadoop
Building A Self Service Analytics Platform on HadoopCraig Warman
 
Data Mesh at CMC Markets: Past, Present and Future
Data Mesh at CMC Markets: Past, Present and FutureData Mesh at CMC Markets: Past, Present and Future
Data Mesh at CMC Markets: Past, Present and FutureLorenzo Nicora
 
Should a Graph Database Be in Your Next Data Warehouse Stack?
Should a Graph Database Be in Your Next Data Warehouse Stack?Should a Graph Database Be in Your Next Data Warehouse Stack?
Should a Graph Database Be in Your Next Data Warehouse Stack?Cambridge Semantics
 
Sustainability Investment Research Using Cognitive Analytics
Sustainability Investment Research Using Cognitive AnalyticsSustainability Investment Research Using Cognitive Analytics
Sustainability Investment Research Using Cognitive AnalyticsCambridge Semantics
 
Webinar -Data Warehouse Augmentation: Cut Costs, Increase Power
Webinar -Data Warehouse Augmentation: Cut Costs, Increase PowerWebinar -Data Warehouse Augmentation: Cut Costs, Increase Power
Webinar -Data Warehouse Augmentation: Cut Costs, Increase PowerZaloni
 
Big Data with Not Only SQL
Big Data with Not Only SQLBig Data with Not Only SQL
Big Data with Not Only SQLPhilippe Julio
 
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...DataWorks Summit/Hadoop Summit
 
Introduction to Anzo Unstructured
Introduction to Anzo UnstructuredIntroduction to Anzo Unstructured
Introduction to Anzo UnstructuredCambridge Semantics
 
Agile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachAgile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachSoftServe
 
Hadoop Big Data Lakes Keynote
Hadoop Big Data Lakes KeynoteHadoop Big Data Lakes Keynote
Hadoop Big Data Lakes KeynoteMark van Rijmenam
 

Was ist angesagt? (20)

Webinar - Risky Business: How to Balance Innovation & Risk in Big Data
Webinar - Risky Business: How to Balance Innovation & Risk in Big DataWebinar - Risky Business: How to Balance Innovation & Risk in Big Data
Webinar - Risky Business: How to Balance Innovation & Risk in Big Data
 
Accelerating Insight - Smart Data Lake Customer Success Stories
Accelerating Insight - Smart Data Lake Customer Success StoriesAccelerating Insight - Smart Data Lake Customer Success Stories
Accelerating Insight - Smart Data Lake Customer Success Stories
 
Ovum Fireside Chat: Governing the data lake - Understanding what's in there
Ovum Fireside Chat: Governing the data lake - Understanding what's in thereOvum Fireside Chat: Governing the data lake - Understanding what's in there
Ovum Fireside Chat: Governing the data lake - Understanding what's in there
 
The 5 Keys to a Killer Data Lake
The 5 Keys to a Killer Data LakeThe 5 Keys to a Killer Data Lake
The 5 Keys to a Killer Data Lake
 
Marketing Digital Command Center
Marketing Digital Command CenterMarketing Digital Command Center
Marketing Digital Command Center
 
Key Considerations for Putting Hadoop in Production SlideShare
Key Considerations for Putting Hadoop in Production SlideShareKey Considerations for Putting Hadoop in Production SlideShare
Key Considerations for Putting Hadoop in Production SlideShare
 
Big Data Telecom
Big Data TelecomBig Data Telecom
Big Data Telecom
 
Building A Self Service Analytics Platform on Hadoop
Building A Self Service Analytics Platform on HadoopBuilding A Self Service Analytics Platform on Hadoop
Building A Self Service Analytics Platform on Hadoop
 
Data Mesh at CMC Markets: Past, Present and Future
Data Mesh at CMC Markets: Past, Present and FutureData Mesh at CMC Markets: Past, Present and Future
Data Mesh at CMC Markets: Past, Present and Future
 
Should a Graph Database Be in Your Next Data Warehouse Stack?
Should a Graph Database Be in Your Next Data Warehouse Stack?Should a Graph Database Be in Your Next Data Warehouse Stack?
Should a Graph Database Be in Your Next Data Warehouse Stack?
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Sustainability Investment Research Using Cognitive Analytics
Sustainability Investment Research Using Cognitive AnalyticsSustainability Investment Research Using Cognitive Analytics
Sustainability Investment Research Using Cognitive Analytics
 
Webinar -Data Warehouse Augmentation: Cut Costs, Increase Power
Webinar -Data Warehouse Augmentation: Cut Costs, Increase PowerWebinar -Data Warehouse Augmentation: Cut Costs, Increase Power
Webinar -Data Warehouse Augmentation: Cut Costs, Increase Power
 
Big Data with Not Only SQL
Big Data with Not Only SQLBig Data with Not Only SQL
Big Data with Not Only SQL
 
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
 
Hybrid Cloud Strategy for Big Data and Analytics
Hybrid Cloud Strategy for Big Data and Analytics Hybrid Cloud Strategy for Big Data and Analytics
Hybrid Cloud Strategy for Big Data and Analytics
 
Introduction to Anzo Unstructured
Introduction to Anzo UnstructuredIntroduction to Anzo Unstructured
Introduction to Anzo Unstructured
 
Agile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachAgile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric Approach
 
Ibm big data
Ibm big dataIbm big data
Ibm big data
 
Hadoop Big Data Lakes Keynote
Hadoop Big Data Lakes KeynoteHadoop Big Data Lakes Keynote
Hadoop Big Data Lakes Keynote
 

Ähnlich wie Large Scale ML for Fraud Prevention Using Active Learning and Advanced Algorithms

ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...DATAVERSITY
 
Practical model management in the age of Data science and ML
Practical model management in the age of Data science and MLPractical model management in the age of Data science and ML
Practical model management in the age of Data science and MLQuantUniversity
 
Saama-POI Summit Speaker Deck April 2016 Final
Saama-POI Summit Speaker Deck April 2016 FinalSaama-POI Summit Speaker Deck April 2016 Final
Saama-POI Summit Speaker Deck April 2016 FinalDan Maxwell
 
Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Roger Barga
 
Think Big | Enterprise Artificial Intelligence
Think Big | Enterprise Artificial IntelligenceThink Big | Enterprise Artificial Intelligence
Think Big | Enterprise Artificial IntelligenceData Science Milan
 
predictive analysis and usage in procurement ppt 2017
predictive analysis and usage in procurement  ppt 2017predictive analysis and usage in procurement  ppt 2017
predictive analysis and usage in procurement ppt 2017Prashant Bhatmule
 
Predictive analytics from a to z
Predictive analytics from a to zPredictive analytics from a to z
Predictive analytics from a to zalpinedatalabs
 
Predictive Analytics: Better Commerce Insight | Ariba LIVE Rome
Predictive Analytics: Better Commerce Insight | Ariba LIVE RomePredictive Analytics: Better Commerce Insight | Ariba LIVE Rome
Predictive Analytics: Better Commerce Insight | Ariba LIVE RomeSAP Ariba
 
Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Turi, Inc.
 
Pixels.camp - Machine Learning: Building Successful Products at Scale
Pixels.camp - Machine Learning: Building Successful Products at ScalePixels.camp - Machine Learning: Building Successful Products at Scale
Pixels.camp - Machine Learning: Building Successful Products at ScaleAntónio Alegria
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamSri Ambati
 
Presentación Paco Bermejo - La Noche del Sector Financiero
Presentación Paco Bermejo - La Noche del Sector FinancieroPresentación Paco Bermejo - La Noche del Sector Financiero
Presentación Paco Bermejo - La Noche del Sector FinancieroJorge Puebla Fernández
 
Launching PayPal - The eBay PayPal Tech Separation
Launching PayPal - The eBay PayPal Tech SeparationLaunching PayPal - The eBay PayPal Tech Separation
Launching PayPal - The eBay PayPal Tech SeparationSri Shivananda
 
Flink Forward Berlin 2017: Bas Geerdink, Martijn Visser - Fast Data at ING - ...
Flink Forward Berlin 2017: Bas Geerdink, Martijn Visser - Fast Data at ING - ...Flink Forward Berlin 2017: Bas Geerdink, Martijn Visser - Fast Data at ING - ...
Flink Forward Berlin 2017: Bas Geerdink, Martijn Visser - Fast Data at ING - ...Flink Forward
 
Skylads - Big Data for Telcos
Skylads - Big Data for TelcosSkylads - Big Data for Telcos
Skylads - Big Data for TelcosXavier Litt
 
"The Hunt For Alpha Among Alternative Data Sources" by Dr. Michael Halls-Moor...
"The Hunt For Alpha Among Alternative Data Sources" by Dr. Michael Halls-Moor..."The Hunt For Alpha Among Alternative Data Sources" by Dr. Michael Halls-Moor...
"The Hunt For Alpha Among Alternative Data Sources" by Dr. Michael Halls-Moor...Quantopian
 

Ähnlich wie Large Scale ML for Fraud Prevention Using Active Learning and Advanced Algorithms (20)

Active Learning for Fraud Prevention
Active Learning for Fraud PreventionActive Learning for Fraud Prevention
Active Learning for Fraud Prevention
 
A6 big data_in_the_cloud
A6 big data_in_the_cloudA6 big data_in_the_cloud
A6 big data_in_the_cloud
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
 
Knowledge Discovery
Knowledge DiscoveryKnowledge Discovery
Knowledge Discovery
 
Practical model management in the age of Data science and ML
Practical model management in the age of Data science and MLPractical model management in the age of Data science and ML
Practical model management in the age of Data science and ML
 
Saama-POI Summit Speaker Deck April 2016 Final
Saama-POI Summit Speaker Deck April 2016 FinalSaama-POI Summit Speaker Deck April 2016 Final
Saama-POI Summit Speaker Deck April 2016 Final
 
Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Barga Galvanize Sept 2015
Barga Galvanize Sept 2015
 
Think Big | Enterprise Artificial Intelligence
Think Big | Enterprise Artificial IntelligenceThink Big | Enterprise Artificial Intelligence
Think Big | Enterprise Artificial Intelligence
 
predictive analysis and usage in procurement ppt 2017
predictive analysis and usage in procurement  ppt 2017predictive analysis and usage in procurement  ppt 2017
predictive analysis and usage in procurement ppt 2017
 
Managing AI Products
Managing AI ProductsManaging AI Products
Managing AI Products
 
Predictive analytics from a to z
Predictive analytics from a to zPredictive analytics from a to z
Predictive analytics from a to z
 
Predictive Analytics: Better Commerce Insight | Ariba LIVE Rome
Predictive Analytics: Better Commerce Insight | Ariba LIVE RomePredictive Analytics: Better Commerce Insight | Ariba LIVE Rome
Predictive Analytics: Better Commerce Insight | Ariba LIVE Rome
 
Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)
 
Pixels.camp - Machine Learning: Building Successful Products at Scale
Pixels.camp - Machine Learning: Building Successful Products at ScalePixels.camp - Machine Learning: Building Successful Products at Scale
Pixels.camp - Machine Learning: Building Successful Products at Scale
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
 
Presentación Paco Bermejo - La Noche del Sector Financiero
Presentación Paco Bermejo - La Noche del Sector FinancieroPresentación Paco Bermejo - La Noche del Sector Financiero
Presentación Paco Bermejo - La Noche del Sector Financiero
 
Launching PayPal - The eBay PayPal Tech Separation
Launching PayPal - The eBay PayPal Tech SeparationLaunching PayPal - The eBay PayPal Tech Separation
Launching PayPal - The eBay PayPal Tech Separation
 
Flink Forward Berlin 2017: Bas Geerdink, Martijn Visser - Fast Data at ING - ...
Flink Forward Berlin 2017: Bas Geerdink, Martijn Visser - Fast Data at ING - ...Flink Forward Berlin 2017: Bas Geerdink, Martijn Visser - Fast Data at ING - ...
Flink Forward Berlin 2017: Bas Geerdink, Martijn Visser - Fast Data at ING - ...
 
Skylads - Big Data for Telcos
Skylads - Big Data for TelcosSkylads - Big Data for Telcos
Skylads - Big Data for Telcos
 
"The Hunt For Alpha Among Alternative Data Sources" by Dr. Michael Halls-Moor...
"The Hunt For Alpha Among Alternative Data Sources" by Dr. Michael Halls-Moor..."The Hunt For Alpha Among Alternative Data Sources" by Dr. Michael Halls-Moor...
"The Hunt For Alpha Among Alternative Data Sources" by Dr. Michael Halls-Moor...
 

Mehr von DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Kürzlich hochgeladen

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 

Kürzlich hochgeladen (20)

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 

Large Scale ML for Fraud Prevention Using Active Learning and Advanced Algorithms

  • 1. Large Scale Machine Learning for Fraud Prevention Disclaimer: views expressed here are my own and do not necessarily represent the views of PayPal, its affiliates or subsidiaries.
  • 4. © 2016 PayPal Inc. Confidential and proprietary. About Me • Software Engineer/Data Scientist/ML Researcher • Ph. D Computer Science • Research in Face Recognition, Phishing/Spam, Fraud Prevention 4
  • 5. Vibrant developer ecosyste m Efficient payment processing with low SLA active customer accounts 200M Secure data storage & handling Traditional & NoSQL databases PayPal operates one of the largest PRIVATE CLOUDS in the world We have transformed core business processes into robust SERVICE-BASED PLATFORMS The power of our platform Our technology transformation enables us to: • Process payments at tremendous scale • Accelerate the innovation of new products • Engage world-class developers & technologists About PayPal
  • 7. Fraud Prevention @ PayPal Robust feature engineering, machine learning and statistical models Highly scalable and multi-layered infrastructure software Superior team of data scientists, researchers, financial and intelligence analysts Images source:
  • 8. Fraud Prevention @ PayPal • Employs advanced machine learning and statistical models to flag fraudulent behavior up-front • More sophisticated algorithms after transaction is complete Transaction Level • Monitor account level activity to identify abusive behavior • Abusive pattern include frequent payments, suspicious profile changes Account Level • Monitor account-to-account interaction • Frequent transfer of money from several accounts to one central account Network Level
  • 9. Fraud Prevention – What are we up against? Fraudsters are becoming increasingly smarter and adaptive Need cost-effective solutions that can model complex attack patterns not previously observed Need scalable and computationally efficient prediction models
  • 10. © 2016 PayPal Inc. Confidential and proprietary. Fraud Prevention – What are we up against? • Much harder to get performance lift on our flagship models • Need to re-look at all aspects of traditional model building • Need out-of-the-box thinking 10 Area we are missing (AUC 0.96)
  • 11. © 2016 PayPal Inc. Confidential and proprietary. Fraud Prevention – What can we do to build better models? 11 feature1 …. featureN ……… Target (Label) d1 d2 … dM ….. Better feature Better labeling Advanced ML Algorithms Bigger better data
  • 12. LSML ALGORITHMS – ACTIVE LEARNING, DEEP LEARNING, GBT
  • 13. © 2016 PayPal Inc. Confidential and proprietary. Active Learning – What is it? • Supervised learning algorithms require data to be labeled • Labelling is difficult, time-consuming and expensive : Active Learning to the rescue • Idea – ML Algorithm can achieve better accuracy if it is allowed to “choose the data” from which it learns* • Overcome labelling bottleneck by asking queries (unlabeled data) to be labeled by human 13 Unlabeled Data Labeled Data Human Annotator Machine Learning Model (Re)Build Model Select Queries Source*: Burr Settles
  • 14. © 2016 PayPal Inc. Confidential and proprietary. Active Learning – What is it? • Scenarios • Membership Query Synthesis – request labels for ‘any’ unlabeled instance in input space • Stream-based Selective Sampling – unlabeled instance is drawn one at a time & learner decides whether to discard or query • Pool-based Sampling – instances are queried from a pool according to informative-ness measure 14
  • 15. © 2016 PayPal Inc. Confidential and proprietary. Active Learning – What is it? • Query Strategy Frameworks • Uncertainty Sampling • Query-By-Committee • Expected Model Change • Expected Error Reduction • Variance Reduction • Density Weighted Methods 15
  • 16. © 2016 PayPal Inc. Confidential and proprietary. Active Learning –Toy Example 16 Toy data – 400 instances Model using random sampling 70% accuracy Model using active learning Uncertainty sampling – 90% accuracy
  • 17. © 2016 PayPal Inc. Confidential and proprietary. Active Learning For Fraud Prevention – Why is it unique? 17 • Data is unbalanced • Fraud labelling require trained experts. Can’t be outsourced • Fraud labelling is time consuming • Fraud labelling require more than just individual instances. Require before & after transactions • Fraud labelling require data from other entities (ex: IP address) • Fraud labelling require aggregate data • Fraud tag mature at different times (ex: chargeback) & not instantaneous
  • 18. © 2016 PayPal Inc. Confidential and proprietary. Active Learning For Fraud Prevention – High Level Framework 18 Labeled Data Create Bags Deep Learning Model GBT Model (Re)Build Models Unlabeled Data Predict Query By Committee Human Expert Create Statistics Active Feature Engineering Simulate Features
  • 19. © 2016 PayPal Inc. Confidential and proprietary. Modeling Algorithm – Deep Learning 19 Input Layer Hidden Layers Output Layer • If a network has many layers of non-linearity, it is “deep” • Need scalable platform • Need lots of training data
  • 20. © 2016 PayPal Inc. Confidential and proprietary. Modeling Algorithm – Deep Learning 20 •NetworkTopology – Feed forward •Key Parameters • # of hidden layers • # of neurons @ each hidden layer • Regularization • Activation function
  • 21. © 2016 PayPal Inc. Confidential and proprietary. Modeling Algorithm – Gradient BoostingTrees 21 • GBT = Gradient Descent + Boosting • Fit an additive (ensemble) model in forward stage wise manner • In each stage introduce a new model to compensate the shortcomings of existing models
  • 22. © 2016 PayPal Inc. Confidential and proprietary. Modeling Algorithm – Gradient BoostingTrees 22 • Strengths • No pre-processing required • Robust • Scalable • Weaknesses • Overfits (Need to find proper stopping point) • Sensitive to noise • Key Parameters • # of trees • Max depth • Max observations • Learning rate
  • 24. © 2016 PayPal Inc. Confidential and proprietary. Datasets 24 • Training Data • 1 year • 11 million transactions (1 million for active labelling) • Test Data • 4 months • 4 million transactions • # of features • 500 - 600
  • 25. © 2016 PayPal Inc. Confidential and proprietary. Tools 25 • H2O • Open source • Scalable • Robust • Deep Learning & GBM implementations • R • Open source • Active learning package
  • 26. © 2016 PayPal Inc. Confidential and proprietary. 26 # of instances queried AUC (*weighted) 0 0.960 1000 0.961 10000 0.963 50000 0.971 100000 0.975 500000 0.977 1000000 0.979 Early Results – Active Learning Shows Promise…
  • 28. © 2016 PayPal Inc. Confidential and proprietary. Conclusions 28 • Deep learning & GBT has shown tremendous performance for fraud detection. • Active learning shows promise in improving performance of these champion models • Active learning also significantly reduce our labelling cost