Detecting advanced email attacks at scale is a challenging ML problem, particularly due to the rarity of attacks, the adversarial nature of the problem, and the scale of the data. To move quickly and adapt to the newest threats, we needed to build a Continuous Integration / Continuous Delivery (CI/CD) pipeline for the entire ML detection stack. Our goal is to enable detection engineers and data scientists to change any part of the stack, including joined datasets for hydration, feature extraction code, and detection logic, and to develop and train ML models.
In this talk, we discuss why we decided to build this pipeline, how it is used to accelerate development and ensure quality, and dive into the nitty-gritty details of building such a system on top of an Apache Spark + Databricks stack.
1. Continuous ML Integration & Delivery for Advanced Email Attack Detection
Jeshua Bratman & Justin Young
2. www.abnormalsecurity.com
The Detection Problem
From: “Josephine Wright” <invoicing@edisonpower.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
September invoice is ready! Please pay the attached invoice amount of
$883,000 for electricity services for Northwest Mercy Hospitals.
ABA: 12321001
Routing#: 123456789
-Jo
Invoice Payment Fraud!
3. The Detection Problem
[Diagram: spectrum of email categories, ordered by increasing damage, sophistication, and rarity. Legitimate email, spam, and graymail make up the bulk of mail (the ~50% and ~25% slices); phishing, spear phishing, and malware are rare (<0.1% and <0.01% of emails); and advanced social engineering attacks (business email compromise, extortion, compromised employee, invoice fraud, heists, scams, compromised vendor) range from fewer than 1 in 100k down to fewer than 1 in 10 million emails.]
4. The Detection Problem
This is a hard machine learning problem
1. Rarity of attacks
2. Adversarial attack landscape
3. High-dimensional and high data volume
4. Need extremely high precision and recall simultaneously
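To see why rarity alone makes the last two points hard, here is a quick back-of-the-envelope precision calculation (illustrative numbers only, not the deck's figures):

```python
def precision_at(attack_rate, false_positive_rate, recall=1.0):
    """Precision of a detector given the base rate of attacks and its
    false positive rate on legitimate mail (all rates are fractions)."""
    true_positives = attack_rate * recall
    false_positives = (1 - attack_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# At a 1-in-a-million attack rate, even a detector that flags only 0.01%
# of legitimate mail has precision below 1%: ~100 false alarms per catch.
print(precision_at(attack_rate=1e-6, false_positive_rate=1e-4))
```

At these base rates the false positive rate has to be pushed several more orders of magnitude down before precision becomes usable, which is exactly why high precision and high recall are so hard to hold simultaneously.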
5. Move Fast!
Lightning-speed iteration to get ahead of new attacks
Don’t Break Things!
We don’t want to stop catching old attacks
Continuous Integration and Delivery (CI/CD) for our ENTIRE ML Detection Engine
8. What happens if we *do not* have CI/CD?
● No idea if a code change breaks the system
● Engineers fixing each other’s bugs all the time
● Pushing bad code to production
In modern software development it would be insane not to have CI/CD
10. What happens if we *do not* have CI/CD?
● Cannot safely change the system to fix a false negative (FN) or false positive (FP)
● May degrade the system unintentionally when shipping improvements
● Cannot know the overall impact of a new model on the entire system
Most ML products run blind like this! It greatly hampers development speed and product stability.
11. Adversarial!
From: “Josephine Wright” <invoicing@edisonpower.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
September invoice ready! Please pay the attached invoice
amount of $883,000 for electricity services for Northwest Mercy
Hospitals.
ABA: 12321001
Routing#: 123456789
-Jo
From: “Josephine Wright” <invoicing@edisonpovver.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
Just wanted to update you, we recently had to switch banks
(long story) but our account number has changed for future
invoices. See attached document for updated banking details.
-Josephine
Attachment: BankDetails.pdf
New Attack Strategy
Billing Account Update Fraud!
Invoice Payment Fraud!
12. OK, how would we use this?
From: “Josephine Wright” <invoicing@edisonpovver.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
Just wanted to update you, we recently had to switch banks
(long story) but our account number has changed for future
invoices. See attached document for updated banking details.
-Josephine
Attachment: BankDetails.pdf
Billing Account Update Fraud!
● New or improved NLP models to identify language around changing bank accounts
● New code to parse PDFs and extract bank account numbers from them
● New counting features for how often a sender uses a particular domain, new code with a feature extractor, and a model that uses those features
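As a sketch of the last item, a counting feature like "how often has this sender used this domain" can be modeled as a simple keyed counter. All names here are illustrative, not the production extractor:

```python
from collections import defaultdict

class SenderDomainCounter:
    """Toy counting-feature extractor: tracks how often each sender
    has been seen using each domain."""

    def __init__(self):
        self._counts = defaultdict(int)

    def update(self, sender, domain):
        # Called as historical mail is processed
        self._counts[(sender, domain)] += 1

    def extract(self, sender, domain):
        # Feature value at scoring time; unseen pairs score 0
        return self._counts[(sender, domain)]
```

A near-zero count for a lookalike domain (edisonpovver.com vs. edisonpower.com) is exactly the kind of signal the new model can consume.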
14. Requirements of good CI/CD for ML
Accurate
● Rescoring analytics reflect performance in production
● Training data is unbiased (including time travel to avoid future leakage)
ML Engineer Effectiveness
● Easy and fast for engineers to run for retraining and evaluation
● Can add new models, datasets, and features easily
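One way rescoring analytics can gate a change in CI is to compare precision and recall on the same labeled sample set before and after the change. A minimal sketch (function names, thresholds, and regression budgets are made-up numbers, not the production gate):

```python
def precision_recall(scored_examples, threshold=0.5):
    """Compute (precision, recall) from (score, is_attack) pairs."""
    tp = sum(1 for s, y in scored_examples if s >= threshold and y)
    fp = sum(1 for s, y in scored_examples if s >= threshold and not y)
    fn = sum(1 for s, y in scored_examples if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

def passes_gate(old_metrics, new_metrics, max_regression=0.01):
    # Block the change if precision or recall drops beyond the budget
    return all(new >= old - max_regression
               for old, new in zip(old_metrics, new_metrics))
```

The gate runs on every change, so a new feature that silently degrades detection of old attacks is caught before it ships.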
16. So how do we do this?
This is a big data problem! Data, models, and code are all part of the software system we’re testing.
So, we’ll use Spark to simulate our online system. But things get complicated fast...
[Diagram: the ML detection engine at the center, fed by code, models, datasets, model training, labeled samples, and rescoring analytics.]
17. A Familiar ML Story
From: “Josephine Wright” <invoicing@edisonpovver.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
Just wanted to update you, we recently had to switch banks
(long story) but our account number has changed for future
invoices. See attached document for updated banking details.
-Josephine
Attachment: BankDetails.pdf
Billing Account Update Fraud!
New counting features for how often a sender uses a particular domain, new code with a feature extractor, and a model that uses those features.
A data scientist has a great new feature… but how do we safely get it into production?
Domain Count Dataset:
...
("Josephine Wright", "edisonpower.com"): 1000,
("Josephine Wright", "edisonpovver.com"): 0,
...
18. A Familiar ML Story
A data scientist has a great new feature… but how do we safely get it into production?
For just the new domain count feature:
1. Domain Count Dataset
2. Feature extraction code
3. New sub-model?
Domain Count Dataset:
...
("Josephine Wright", "edisonpower.com"): 1000,
("Josephine Wright", "edisonpovver.com"): 0,
...
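A sketch of how features hydrated from this dataset might be consumed downstream (the function and field names are hypothetical):

```python
def domain_count_features(counts, sender, domain):
    """Features hydrated from the domain count dataset.
    `counts` maps (sender, domain) -> historical usage count."""
    count = counts.get((sender, domain), 0)
    return {
        "sender_domain_count": count,
        # A sender who has never used this domain before is suspicious
        "is_first_time_domain": count == 0,
    }

counts = {
    ("Josephine Wright", "edisonpower.com"): 1000,
    ("Josephine Wright", "edisonpovver.com"): 0,
}
```

The lookalike domain edisonpovver.com produces a zero count and a first-time-domain flag, while the legitimate edisonpower.com does not.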
19. What does it look like to test this new feature?
In a typical software test, we can mock out complex dependencies.
But for ML, we can’t mock the data!
Does every data scientist have to become a data engineer?
[Diagram: the ML detection engine and its inputs (code, models, datasets, model training, labeled samples, rescoring analytics), now including the new Domain Count Dataset.]
20. Adding Our New Dataset
What would it look like for our data scientist to add the new dataset?
SparkFiles: download the dataset to disk on each executor
Broadcast Variable: broadcast the dataset in memory in each PySpark process
21. Adding Our New Dataset
SparkFiles: download the dataset to disk on each executor

from pyspark import SparkFiles
# Add a Spark file so that every executor will download it
sc.addFile(remote_dataset_path)
# Now the file can be loaded in any Spark operation from local_dataset_path
local_dataset_path = SparkFiles.get(
    os.path.basename(remote_dataset_path)[: -len(".tar.gz")]
)

Broadcast Variable: broadcast the dataset in memory in each PySpark process

# Broadcast the variable to every executor
small_ip_dataset = {"1.2.3.4": 123, "5.6.7.8": 567}
ip_broadcast = sc.broadcast(small_ip_dataset)
# hydrate_with_ip_count can use the small_ip_dataset dictionary
hydrated_rdd = rdd.map(
    lambda message: hydrate_with_ip_count(message, ip_broadcast.value)
)
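For reference, a minimal pure-Python sketch of what a hydrater like hydrate_with_ip_count could do with the broadcast dictionary. The function body and the "sender_ip" / "sender_ip_count" field names are assumptions for illustration, not the production code:

```python
def hydrate_with_ip_count(message, ip_counts):
    """Attach the historical count for the message's sender IP.
    `ip_counts` is the small broadcast dictionary; unseen IPs get 0."""
    hydrated = dict(message)
    hydrated["sender_ip_count"] = ip_counts.get(message.get("sender_ip"), 0)
    return hydrated

small_ip_dataset = {"1.2.3.4": 123, "5.6.7.8": 567}
```

Because the dictionary is broadcast once and read on every executor, the hydrater itself stays a plain function over one message, which is what makes it easy to unit test.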
22. Adding Our New Dataset
What would it look like for our data scientist to add the new dataset?
SparkFiles: download the dataset to disk on each executor
Broadcast Variable: broadcast the dataset in memory in each PySpark process
Spark Join: join large distributed datasets via Spark operations
27. Deep Dive: Re-hydrating Behavior Graph
# Index every event by key and day, and take event ID to avoid passing around large objects
keyed_event_id_rdd = _expand_events_by_key_day(event_rdd)
# Index every count by key and day
keyed_counts_rdd = _expand_counts_by_key_day(time_sliced_counts_rdds)
# Join date-indexed event IDs with date-indexed counts, by common key
joined_event_id_and_daily_counts_rdd = keyed_event_id_rdd.leftOuterJoin(keyed_counts_rdd)
# In memory, sum up cumulative counts and key by event ID
cumulative_counts_by_event_id_rdd = joined_event_id_and_daily_counts_rdd.flatMap(
_extract_cumulative_counts
)
# Join actual events back in by event ID
joined_event_and_cumulative_counts_rdd = cumulative_counts_by_event_id_rdd.join(
event_rdd.keyBy(_get_id_from_event)
)
# Hydrate every event with cumulative counts
hydrated_events_rdd = joined_event_and_cumulative_counts_rdd.map(
_hydrate_event_with_counts
)
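The time-slicing matters for correctness: an event must only see counts from before it occurred ("time travel" to avoid future leakage). A pure-Python sketch of that convention, greatly simplified from the join pipeline above (the real code accumulates per key and day via the Spark joins shown):

```python
def cumulative_count(daily_counts, key, day):
    """Sum daily counts for `key` over all days strictly before `day`,
    so the feature never leaks information from the event's future."""
    return sum(
        count
        for (k, d), count in daily_counts.items()
        if k == key and d < day
    )
```

Training a model on counts that include the event's own day, or later days, would make offline metrics look better than production ever will.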
28. Back To Our ML Story
So we can do all of this in Spark. But no data scientist should ever have to think about this!
Data engineers should go to great efforts to provide a simple platform that hides these details.
Data scientists should spend as much time as possible doing data science.
29. Re-scoring Is Part of the MLOps Platform
Data engineers have to make re-scoring as easy to use as traditional CI/CD.
This means providing a playbook that’s as easy as adding unit tests.
30. Re-scoring Is Part of the MLOps Platform
class TimeSlicedStatsEventHydrater(Generic[Stat, Event]):
    # Builder for the set of stats to look up
    _lookup_stats_builder: LookupStatsBuilder
    # How to hydrate the Event with the Stats
    _hydrate_event: EventHydrater
    # Takes in an event and returns the date on which it occurred
    _get_date_from_event: DateExtractor
    # Takes in an event and returns its ID
    _get_id_from_event: IdExtractor
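A toy instantiation of that interface, with plain callables and a dict standing in for the real builder and extractor types (every name and type here is a simplified assumption):

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

@dataclass
class SimpleTimeSlicedHydrater:
    # How to hydrate the event with the looked-up stat
    hydrate_event: Callable[[dict, int], dict]
    # Takes an event and returns the date on which it occurred
    get_date_from_event: Callable[[dict], str]
    # Takes an event and returns its ID
    get_id_from_event: Callable[[dict], Any]
    # Pre-built stats keyed by (event_id, date)
    stats: Dict[Tuple[Any, str], int]

    def hydrate(self, event: dict) -> dict:
        key = (self.get_id_from_event(event), self.get_date_from_event(event))
        return self.hydrate_event(event, self.stats.get(key, 0))
```

Because the platform owns the joins, a data scientist only supplies these four small pieces, which is the playbook-level simplicity the previous slides argue for.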
31. Requirements of good CI/CD for ML
Accurate
● Rescoring analytics reflect performance in production
● Training data is unbiased (including time travel to avoid future leakage)
ML Engineer Effectiveness
● Easy and fast for engineers to run for retraining and evaluation
● Can add new models, datasets, and features easily
Data Engineer Jobs-to-be-done
● Provide a simple API that just works
● Make the system efficient enough to run both on a regular schedule and ad hoc
32. What happens if we DO have CI/CD?
● Quickly iterate
● Know if things break
● Train models on old examples
● You will have a better and more flexible product
● You will be able to address customer requests quickly
● You will be able to support a larger team of ML engineers working in parallel