Sailthru uses MongoDB to store user profile data, including interests and behaviors, as single documents rather than records split across multiple SQL tables. This lets any part of their system access a user's profile and personalize content. They run MongoDB on Amazon EC2 across 11 servers: five two-member replica sets plus one backup server, holding around 1TB of data. Key advantages of MongoDB for them include flexibility, performance, and no downtime for schema changes.
2. Sailthru
• API-based transactional email led to...
• Mass campaign email led to...
• Intelligence and user behavior
• Three engineers built the ESP we always
wanted to use
• Some Clients: Huffpo-AOL, Thrillist,
Refinery 29, Flavorpill, Business Insider,
Lot18, Fab, New York Observer
3. How We Got To MongoDB from SQL
• JSON was part of Sailthru infrastructure
from start (SQL columns and S3)
• Kept a close eye on CouchDB project
• MongoDB felt like natural fit
• Used for user profiles and analytics initially
• Migrated one table at a time (very, very
carefully)
4. Sailthru Architecture
• User interface to display stats, build
campaigns and templates, etc (PHP/EC2)
• API, link rewriting, and onsite endpoints
(PHP/EC2)
• Core mailer engine (Java/EC2 and colo)
• Modified-postfix SMTP servers (colo)
• 11 database servers on EC2 (for now)
5. MongoDB Overview
• 11 instances on EC2 (5 two-member
replica sets, 1 backup server)
• About 40 collections
• About 1TB
• Largest single collection is 500m docs
6. Users are Documents
• Users aren’t records split among multiple
tables
• End user’s lists, clickstream interests, geolocation, browser, time of day, and purchase history become one ever-growing document
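• For example, a single profile document might look something like this (a sketch; field names are illustrative, not the actual schema):
{
  _id: 'user@example.com',
  lists: ['daily_digest', 'weekly_deals'],
  interests: { dresses: 31, shoes: 12 },  // clickstream-derived weights
  geo: { city: 'New York, NY US' },
  browser: 'firefox',
  purchase_incomplete: {
    items: [{ qty: 1, url: 'http://example.com/dress', title: 'Green Dress' }]
  }
}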
8. Profiles Accessible Everywhere
• Put abandoned shopping cart notifications
within a mass email
{if profile.purchase_incomplete}
<p>This is what’s in your cart:</p>
{foreach profile.purchase_incomplete.items as item}
{item.qty} <a href="{item.url}">{item.title}</a><br/>
{/foreach}
{/if}
9. Profiles Accessible Everywhere
• Show a section of content conditional on
the user’s location
{if profile.geo.city['New York, NY US']}
<div>Come to the New York Meetup on the 27th!</div>
{/if}
10. Profiles Accessible Everywhere
• Show different content depending on user
interests as measured by on-site behavior
{select}
{case horizon_interest('black,dark')}
<img src="http://example.com/dress-image-black.jpg" />
{/case}
{case horizon_interest('green')}
<img src="http://example.com/dress-image-green.jpg" />
{/case}
{case horizon_interest('purple,polka_dot,pattern')}
<img src="http://example.com/dress-image-polkadot.jpg" />
{/case}
{/select}
11. Profiles Accessible Everywhere
• Pick top content from a data feed based on
tags
{content = horizon_select(content,10)}
{foreach content as c}
<a href="{c.url}">{c.title}</a><br/>
{/foreach}
12. Other Advantages of MongoDB
• High performance
• Take any parameters from our clients
• Really flexible development
• Great for analytics (internal and external)
• No more downtime for schema migrations
or reindexing
13. How We Run mongod
• mongod --dbpath /path/to/db --logpath /path/to/log/mongodb.log --logappend --fork --rest --replSet main1 --journal
• Don’t ever run without replication
• Don’t ever kill -9
• Don’t run without writing to a log
• Run behind a firewall
• Use journaling now that it’s there
• Use --rest, it’s handy
14. Separate DBs By Collections
• Lower-effort than auto-sharding
• Separate databases for different usage
patterns
• Consider consequences of database failure/
unavailability
• But make sure your backup and monitoring
strategy is prepared for multiple DBs
15. Our Five Replica Sets
• main: most of the stuff on the UI, lots of
small/medium collections
• horizon: realtime onsite browsing data
• profile: user profile data (60m user docs)
• message: last three months of emails
• archive: emails older than three months
16. Monitoring
• Some stuff to monitor: faults/sec, index
misses, % locked, queue size, load average
• We check basic status once/minute on all database servers (SMS alerts if down), email warnings on thresholds every 10 minutes
• Have been beta-testing 10gen’s MMS product
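• Those counters come out of serverStatus in the shell; a rough sketch (field layout varies by server version):
var s = db.serverStatus();
s.extra_info.page_faults;       // faults
s.indexCounters.btree.misses;   // index misses
s.globalLock.ratio;             // % locked
s.globalLock.currentQueue;      // queue size (waiting readers/writers)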
17. Backups
• Used to use mongodump - don’t do that
anymore
• Have single node of each replica set on a
backup server
• Two-hour slave delay
• fsync/lock, freeze xfs file system, EBS
snapshot, unfreeze, unlock
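• The snapshot dance sketched from the shell (the filesystem and EBS steps run outside mongo; exact helper names vary by shell version):
db.runCommand({ fsync: 1, lock: 1 });  // flush to disk and block writes
// xfs_freeze -f /data                 # freeze the xfs filesystem
// ec2-create-snapshot <volume-id>     # take the EBS snapshot
// xfs_freeze -u /data                 # unfreeze
db.fsyncUnlock();                      // release the write lock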
18. The Great EC2 EBS Outage Adventure
• We survived
• Most of our nodes unavailable for 2-4 days
• Were able to spin up new instances from the backup server and snapshots, and get operational within hours
• Wasn’t fun
19. EC2 Future Plans
• EC2 is great overall
• EBS performance a little too inconsistent (even with RAID 0 or 10)
• Moving to rely on physical hardware (with SSDs) in colo
• Retain some nodes and backups on EC2
• Let you know how it goes in a few months
21. Develop Your Mental Model of MongoDB
• You don’t need to look at the internals
• But try to gain a working understanding of
how MongoDB operates, especially RAM
and indexes
22. Big-Picture Design Questions
• What is the data I want to store?
• How will I want to use that data later?
• How big will the data get?
• If the answers are “I don’t know yet”, guess
with your best YAGNI
23. “But premature optimization is evil”
• Knuth said that about code, which is
flexible and easy to optimize later
• Data is not as flexible as code
• So doing some planning for performance is
usually good when it comes to your data
24. Specific MongoDB Design Questions
• Embed vs top-level collection?
• Denormalize (double-store data)?
• How many/which indexes?
• Arrays vs hashes for embedding?
• Implicit schema (field names and types)
25. Short Field Names?
• Disk space: cheap
• RAM: not cheap
• Developer Time: expensive
• Err towards compact, readable field names
• Might be worth writing a mapper
• Probably wish we’d used c instead of
client_id
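• A mapper can be as thin as a lookup table; an entirely hypothetical sketch:
// expand terse stored names into readable ones at the app boundary
var FIELDS = { c: 'client_id', e: 'email', t: 'template' };
function expand(doc) {
  var out = {};
  for (var k in doc) out[FIELDS[k] || k] = doc[k];
  return out;
}
// expand({ c: 76, e: 'a@b.com' }) -> { client_id: 76, email: 'a@b.com' }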
26. Favor Human-Readable Foreign Keys
• DBRefs are a bit cumbersome
• Referencing by MongoId often means doing
extra lookups
• Build human-readable references to save
you doing lookups and manual joins
27. Example
• Store the Template and the Email as strings
on the message object
• { template: "Internal - Blast Notify", email: "support-alerts@sailthru.com" }
• No external reference lookups required
• The tradeoff is basically just disk space
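• So a lookup stays a single query (assuming a message collection shaped as above):
db.message.find({ template: 'Internal - Blast Notify' });
db.message.findOne({ email: 'support-alerts@sailthru.com' });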
28. Embed vs Top-Level Collections?
• Major question of MongoDB schema design
• If you can ask the question at all, you might
want to err on the side of embedding
• Don’t embed if the embedding could get
huge
• Don’t feel too bad about denormalizing by
embedding AND storing in a top-level
collection
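• A sketch of that denormalization (collections hypothetical): the summary is embedded for fast single-document reads, the full record stays top-level for mass queries
// full record in its own collection
db.purchase.save({ _id: 'p1', user: 'ploki', total: 5000 });
// summary embedded on the profile with an atomic $push
db.profile.update(
  { _id: 'ploki' },
  { $push: { recent_purchases: { id: 'p1', total: 5000 } } }
);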
29. Typical Properties of Top-Level Collections
• Independence: They don’t “belong”
conceptually to another collection
• Nouns: the building blocks of your system
• Easily referenceable and updatable
30. Embedding Pros
• Super-fast retrieval of document with
related data
• Atomic updates
• “Ownership” of embedded document is
obvious
• Usually maps well to code structures
31. Embedding Cons
• Harder to get at, do mass queries
• Does not size up infinitely, will hit 16MB
limit
• Hard to create references to embedded
object
• Limited ability to indexed-sort the
embedded objects
32. If You Think You Can Embed
• You probably should
• I take advantage of embedding in my
designs more often now than I did three
years ago
• It’s a gift MongoDB gives you in exchange
for giving up your joins
33. Design Example: User Permissions
• Users can have various broad permission
levels for any number of clients
• For example, user ‘ploki’ might have
permission level ‘admin’ for client 76 and
permission level ‘reports_only’ for client
450
34. How Will We Use This Data?
• Retrieve all clients for a given user
• Retrieve all users for a given client
• Retrieve a permission level for a given
client for a given user
35. How Will This Data Grow?
• In the medium term, it will stay small
• Number of clients and number of users can
both grow infinitely
36. Back in SQL-land
• There’s a fairly standard way to do it
• It’s a many-many relationship, so
• Use a join table (client_user)
37. Should We Use a New Top-Level Collection?
db.client.user.save({
  client_id: 76,
  username: 'ploki',
  permission: 'admin'
});
db.client.user.save({
  client_id: 450,
  username: 'ploki',
  permission: 'reports_only'
});
db.client.user.ensureIndex({ client_id: 1 });
db.client.user.ensureIndex({ username: 1 });
// get all users belonging to a client
db.client.user.find({ client_id: 76 });
// get all clients a user has access to
db.client.user.find({ username: 'ibwhite' });
// get permissions for our current user
db.client.user.findOne({ username: user.name });
38. Probably Not
• Only needed if we have lots of clients per
user AND lots of users per client
• This is a case where we can embed, so let’s
do so
39. Three Ways to Embed
• Object (not good: can't do a multikeys index on the keys of a hash)
'clients': {
  '76': 'admin',
  '450': 'reports_only'
}
index: ???
• Array (okay, but you have to search through the array of objects on the retrieved doc to find by _id)
'clients': [
  { '_id': 76, 'access': 'admin' },
  { '_id': 450, 'access': 'reports_only' }
]
index: { 'clients._id': 1 }
• Array and object (our approach; the two fields sit next to each other alphabetically)
'clients': [ 76, 450 ],
'clients_access': {
  '76': 'admin',
  '450': 'reports_only'
}
index: { clients: 1 }
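• A sketch of the chosen design in the shell (user collection and names hypothetical):
db.user.save({
  _id: 'ploki',
  clients: [76, 450],
  clients_access: { '76': 'admin', '450': 'reports_only' }
});
db.user.ensureIndex({ clients: 1 });  // multikeys index over the array
// all clients for a given user: one document fetch
db.user.findOne({ _id: 'ploki' }).clients;
// all users for a given client: hits the index
db.user.find({ clients: 76 });
// permission for one client: direct hash lookup on the fetched doc
db.user.findOne({ _id: 'ploki' }).clients_access['76'];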
40. Indexes
• Index all high-frequency queries
• Do less-indexed queries only on
secondaries
• Reduce the size of indexes wherever you can on big collections
• Don’t sweat the medium-sized collections,
focus on the big wins
41. Take Advantage of Multiple-Field Indexes
• Order matters
• If you have an index on {client_id:
1, email: 1 }
• Then you also have the {client_id:
1} index “for free”
• but not { email: 1}
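• A quick illustration of the prefix rule (collection name hypothetical):
db.message.ensureIndex({ client_id: 1, email: 1 });
db.message.find({ client_id: 76 });                    // uses the index (prefix)
db.message.find({ client_id: 76, email: 'a@b.com' });  // uses the index
db.message.find({ email: 'a@b.com' });                 // does NOT use it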
42. Use your _id
• Every collection carries a mandatory _id index, which will cost you index size
• So do something useful with _id
43. Take advantage of fast ^indexes
• Messages have _ids like: 32423.00000341
• Need all messages in blast 32423:
• db.message.blast.find(
{ _id: /^32423./ } );
• (Yeah, I know the . is ugly. Don’t use a dot if you do this.)
44. Manual Range Partitioning
• We moved a big message.blast collection
into per-day collections:
• message.blast.20110605
message.blast.20110606
message.blast.20110607
etc...
• Keeps working set indexes smaller
• When we move data into the archive,
drop() is much faster than remove()
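• A sketch of the rollover (collection names and dates hypothetical):
// writes go to today's collection
db['message.blast.20110607'].insert({ blast: 32423, email: 'a@b.com' });
// archiving a whole day: copy out, then drop
db['message.blast.20110605'].find().forEach(function (doc) {
  db['archive.blast'].insert(doc);
});
db['message.blast.20110605'].drop();  // near-instant vs a slow remove()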