Detecting advanced email attacks at scale is a challenging ML problem, particularly due to the rarity of attacks, the adversarial nature of the problem, and the scale of the data. To move quickly and adapt to the newest threats, we needed to build a Continuous Integration / Continuous Delivery (CI/CD) pipeline for the entire ML detection stack. Our goal is to enable detection engineers and data scientists to change any part of the stack, including joined datasets for hydration, feature extraction code, and detection logic, and to develop and train ML models.
In this talk, we discuss why we decided to build this pipeline, how it is used to accelerate development and ensure quality, and dive into the nitty-gritty details of building such a system on top of an Apache Spark + Databricks stack.
1. Continuous ML Integration & Delivery for Advanced Email Attack Detection
Jeshua Bratman & Justin Young
2. www.abnormalsecurity.com
The Detection Problem
From: “Josephine Wright” <invoicing@edisonpower.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
September invoice is ready! Please pay the attached invoice amount of
$883,000 for electricity services for Northwest Mercy Hospitals.
ABA: 12321001
Routing#: 123456789
-Jo
Invoice Payment Fraud!
3. The Detection Problem
[Diagram: spectrum of email categories, ordered by increasing damage, sophistication, and rarity. Legitimate email, spam, and graymail make up the bulk of mail (the ~50% and ~25% slices); phishing, spear phishing, and malware are rare (<0.1% and <0.01% of emails); and advanced social engineering attacks (business email compromise, extortion, compromised employee, invoice fraud, heists, scams, compromised vendor) range from fewer than 1 in 100k down to fewer than 1 in 10 million emails.]
4. The Detection Problem
This is a hard machine learning problem
1. Rarity of attacks
2. Adversarial attack landscape
3. High-dimensional and high data volume
4. Need extremely high precision and recall simultaneously
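To see why rarity alone makes the last two points hard, here is a quick back-of-the-envelope precision calculation (illustrative numbers only, not the deck's figures):

```python
def precision_at(attack_rate, false_positive_rate, recall=1.0):
    """Precision of a detector given the base rate of attacks and its
    false positive rate on legitimate mail (all rates are fractions)."""
    true_positives = attack_rate * recall
    false_positives = (1 - attack_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# At a 1-in-a-million attack rate, even a detector that flags only 0.01%
# of legitimate mail has precision below 1%: ~100 false alarms per catch.
print(precision_at(attack_rate=1e-6, false_positive_rate=1e-4))
```

At these base rates the false positive rate has to be pushed several more orders of magnitude down before precision becomes usable, which is exactly why high precision and high recall are so hard to hold simultaneously.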
5. Move Fast!
Lightning-speed iteration to get ahead of new attacks
Don’t Break Things!
We don’t want to stop catching old attacks
Continuous Integration and Delivery (CI/CD) for our ENTIRE ML Detection Engine
8. What happens if we *do not* have CI/CD?
● No idea if a code change breaks the system
● Engineers fixing each other’s bugs all the time
● Pushing bad code to production
In modern software development it would be insane not to have CI/CD
10. What happens if we *do not* have CI/CD?
● Cannot safely change the system to fix a false negative (FN) or false positive (FP)
● May degrade the system unintentionally when shipping improvements
● Cannot know the overall impact of a new model on the entire system
Most ML products run blind like this! It greatly hampers development speed and product stability.
11. Adversarial!
From: “Josephine Wright” <invoicing@edisonpower.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
September invoice ready! Please pay the attached invoice
amount of $883,000 for electricity services for Northwest Mercy
Hospitals.
ABA: 12321001
Routing#: 123456789
-Jo
From: “Josephine Wright” <invoicing@edisonpovver.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
Just wanted to update you, we recently had to switch banks
(long story) but our account number has changed for future
invoices. See attached document for updated banking details.
-Josephine
Attachment: BankDetails.pdf
New Attack Strategy
Billing Account Update Fraud!
Invoice Payment Fraud!
12. OK, how would we use this?
From: “Josephine Wright” <invoicing@edisonpovver.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
Just wanted to update you, we recently had to switch banks
(long story) but our account number has changed for future
invoices. See attached document for updated banking details.
-Josephine
Attachment: BankDetails.pdf
Billing Account Update Fraud!
● New or improved NLP models to identify language around changing bank accounts
● New code to parse PDFs and extract bank account numbers from them
● New counting features for how often a sender uses a particular domain, new code with a feature extractor, and a model that uses those features
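As a sketch of the last item, a counting feature like "how often has this sender used this domain" can be modeled as a simple keyed counter. All names here are illustrative, not the production extractor:

```python
from collections import defaultdict

class SenderDomainCounter:
    """Toy counting-feature extractor: tracks how often each sender
    has been seen using each domain."""

    def __init__(self):
        self._counts = defaultdict(int)

    def update(self, sender, domain):
        # Called as historical mail is processed
        self._counts[(sender, domain)] += 1

    def extract(self, sender, domain):
        # Feature value at scoring time; unseen pairs score 0
        return self._counts[(sender, domain)]
```

A near-zero count for a lookalike domain (edisonpovver.com vs. edisonpower.com) is exactly the kind of signal the new model can consume.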
14. Requirements of good CI/CD for ML
Accurate
● Rescoring analytics reflect performance in production
● Training data is unbiased (including time travel to avoid future leakage)
ML Engineer Effectiveness
● Easy and fast for engineers to run for retraining and evaluation
● Can add new models, datasets, and features easily
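One way rescoring analytics can gate a change in CI is to compare precision and recall on the same labeled sample set before and after the change. A minimal sketch (function names, thresholds, and regression budgets are made-up numbers, not the production gate):

```python
def precision_recall(scored_examples, threshold=0.5):
    """Compute (precision, recall) from (score, is_attack) pairs."""
    tp = sum(1 for s, y in scored_examples if s >= threshold and y)
    fp = sum(1 for s, y in scored_examples if s >= threshold and not y)
    fn = sum(1 for s, y in scored_examples if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

def passes_gate(old_metrics, new_metrics, max_regression=0.01):
    # Block the change if precision or recall drops beyond the budget
    return all(new >= old - max_regression
               for old, new in zip(old_metrics, new_metrics))
```

The gate runs on every change, so a new feature that silently degrades detection of old attacks is caught before it ships.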
16. So how do we do this?
This is a big data problem! Data, models, and code are all part of the software system we’re testing.
So, we’ll use Spark to simulate our online system. But things get complicated fast...
[Diagram: the ML detection engine at the center, fed by code, models, datasets, model training, labeled samples, and rescoring analytics.]
17. A Familiar ML Story
From: “Josephine Wright” <invoicing@edisonpovver.com>
To: “Tim James” <accounts@northwestmercyhospitals.com>
Subject: “Invoice details for September electricity service”
Hi Tim,
Just wanted to update you, we recently had to switch banks
(long story) but our account number has changed for future
invoices. See attached document for updated banking details.
-Josephine
Attachment: BankDetails.pdf
Billing Account Update Fraud!
New counting features for how often a sender uses a particular domain, new code with a feature extractor, and a model that uses those features.
A data scientist has a great new feature… but how do we safely get it into production?
Domain Count Dataset:
...
("Josephine Wright", "edisonpower.com"): 1000,
("Josephine Wright", "edisonpovver.com"): 0,
...
18. A Familiar ML Story
A data scientist has a great new feature… but how do we safely get it into production?
For just the new domain count feature:
1. Domain Count Dataset
2. Feature extraction code
3. New sub-model?
Domain Count Dataset:
...
("Josephine Wright", "edisonpower.com"): 1000,
("Josephine Wright", "edisonpovver.com"): 0,
...
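A sketch of how features hydrated from this dataset might be consumed downstream (the function and field names are hypothetical):

```python
def domain_count_features(counts, sender, domain):
    """Features hydrated from the domain count dataset.
    `counts` maps (sender, domain) -> historical usage count."""
    count = counts.get((sender, domain), 0)
    return {
        "sender_domain_count": count,
        # A sender who has never used this domain before is suspicious
        "is_first_time_domain": count == 0,
    }

counts = {
    ("Josephine Wright", "edisonpower.com"): 1000,
    ("Josephine Wright", "edisonpovver.com"): 0,
}
```

The lookalike domain edisonpovver.com produces a zero count and a first-time-domain flag, while the legitimate edisonpower.com does not.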
19. What does it look like to test this new feature?
In a typical software test, we can mock out complex dependencies.
But for ML, we can’t mock the data!
Does every data scientist have to become a data engineer?
[Diagram: the ML detection engine and its inputs (code, models, datasets, model training, labeled samples, rescoring analytics), now including the new Domain Count Dataset.]
20. Adding Our New Dataset
What would it look like for our data scientist to add the new dataset?
SparkFiles: download the dataset to disk on each executor
Broadcast Variable: broadcast the dataset in memory in each PySpark process
21. Adding Our New Dataset
SparkFiles: download the dataset to disk on each executor

from pyspark import SparkFiles
# Add a Spark file so that every executor will download it
sc.addFile(remote_dataset_path)
# Now the file can be loaded in any Spark operation from local_dataset_path
local_dataset_path = SparkFiles.get(
    os.path.basename(remote_dataset_path)[: -len(".tar.gz")]
)

Broadcast Variable: broadcast the dataset in memory in each PySpark process

# Broadcast the variable to every executor
small_ip_dataset = {"1.2.3.4": 123, "5.6.7.8": 567}
ip_broadcast = sc.broadcast(small_ip_dataset)
# hydrate_with_ip_count can use the small_ip_dataset dictionary
hydrated_rdd = rdd.map(
    lambda message: hydrate_with_ip_count(message, ip_broadcast.value)
)
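For reference, a minimal pure-Python sketch of what a hydrater like hydrate_with_ip_count could do with the broadcast dictionary. The function body and the "sender_ip" / "sender_ip_count" field names are assumptions for illustration, not the production code:

```python
def hydrate_with_ip_count(message, ip_counts):
    """Attach the historical count for the message's sender IP.
    `ip_counts` is the small broadcast dictionary; unseen IPs get 0."""
    hydrated = dict(message)
    hydrated["sender_ip_count"] = ip_counts.get(message.get("sender_ip"), 0)
    return hydrated

small_ip_dataset = {"1.2.3.4": 123, "5.6.7.8": 567}
```

Because the dictionary is broadcast once and read on every executor, the hydrater itself stays a plain function over one message, which is what makes it easy to unit test.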
22. Adding Our New Dataset
What would it look like for our data scientist to add the new dataset?
SparkFiles: download the dataset to disk on each executor
Broadcast Variable: broadcast the dataset in memory in each PySpark process
Spark Join: join large distributed datasets via Spark operations
27. Deep Dive: Re-hydrating Behavior Graph
# Index every event by key and day, and take event ID to avoid passing around large objects
keyed_event_id_rdd = _expand_events_by_key_day(event_rdd)
# Index every count by key and day
keyed_counts_rdd = _expand_counts_by_key_day(time_sliced_counts_rdds)
# Join date-indexed event IDs with date-indexed counts, by common key
joined_event_id_and_daily_counts_rdd = keyed_event_id_rdd.leftOuterJoin(keyed_counts_rdd)
# In memory, sum up cumulative counts and key by event ID
cumulative_counts_by_event_id_rdd = joined_event_id_and_daily_counts_rdd.flatMap(
_extract_cumulative_counts
)
# Join actual events back in by event ID
joined_event_and_cumulative_counts_rdd = cumulative_counts_by_event_id_rdd.join(
event_rdd.keyBy(_get_id_from_event)
)
# Hydrate every event with cumulative counts
hydrated_events_rdd = joined_event_and_cumulative_counts_rdd.map(
_hydrate_event_with_counts
)
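The time-slicing matters for correctness: an event must only see counts from before it occurred ("time travel" to avoid future leakage). A pure-Python sketch of that convention, greatly simplified from the join pipeline above (the real code accumulates per key and day via the Spark joins shown):

```python
def cumulative_count(daily_counts, key, day):
    """Sum daily counts for `key` over all days strictly before `day`,
    so the feature never leaks information from the event's future."""
    return sum(
        count
        for (k, d), count in daily_counts.items()
        if k == key and d < day
    )
```

Training a model on counts that include the event's own day, or later days, would make offline metrics look better than production ever will.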
28. Back To Our ML Story
So we can do all of this in Spark. But no data scientist should ever have to think about this!
Data engineers should go to great efforts to provide a simple platform that hides these details.
Data scientists should spend as much time as possible doing data science.
29. Re-scoring Is Part of the MLOps Platform
Data engineers have to make re-scoring as easy to use as traditional CI/CD.
This means providing a playbook that’s as easy as adding unit tests.
30. Re-scoring Is Part of the MLOps Platform
class TimeSlicedStatsEventHydrater(Generic[Stat, Event]):
    # Builder for the set of stats to look up
    _lookup_stats_builder: LookupStatsBuilder
    # How to hydrate the Event with the Stats
    _hydrate_event: EventHydrater
    # Takes in an event and returns the date on which it occurred
    _get_date_from_event: DateExtractor
    # Takes in an event and returns its ID
    _get_id_from_event: IdExtractor
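A toy instantiation of that interface, with plain callables and a dict standing in for the real builder and extractor types (every name and type here is a simplified assumption):

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

@dataclass
class SimpleTimeSlicedHydrater:
    # How to hydrate the event with the looked-up stat
    hydrate_event: Callable[[dict, int], dict]
    # Takes an event and returns the date on which it occurred
    get_date_from_event: Callable[[dict], str]
    # Takes an event and returns its ID
    get_id_from_event: Callable[[dict], Any]
    # Pre-built stats keyed by (event_id, date)
    stats: Dict[Tuple[Any, str], int]

    def hydrate(self, event: dict) -> dict:
        key = (self.get_id_from_event(event), self.get_date_from_event(event))
        return self.hydrate_event(event, self.stats.get(key, 0))
```

Because the platform owns the joins, a data scientist only supplies these four small pieces, which is the playbook-level simplicity the previous slides argue for.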
31. Requirements of good CI/CD for ML
Accurate
● Rescoring analytics reflect performance in production
● Training data is unbiased (including time travel to avoid future leakage)
ML Engineer Effectiveness
● Easy and fast for engineers to run for retraining and evaluation
● Can add new models, datasets, and features easily
Data Engineer Jobs-to-be-done
● Provide a simple API that just works
● Make the system efficient enough to run both on a regular schedule and ad hoc
32. What happens if we DO have CI/CD?
● Quickly iterate
● Know if things break
● Train models on old examples
● You will have a better and more flexible product
● You will be able to address customer requests quickly
● You will be able to support a larger team of ML engineers working in parallel