Sailthru uses MongoDB to store user profile data, including interests and behaviors, as single documents rather than records split across multiple SQL tables. This lets any part of their system access a user's profile and personalize content. They run MongoDB on Amazon EC2 across 11 servers: five two-member replica sets plus one backup server, holding around 1TB of data. Key advantages of MongoDB for them include flexibility, performance, and no downtime for schema changes.
2. Sailthru
• API-based transactional email led to...
• Mass campaign email led to...
• Intelligence and user behavior
• Three engineers built the ESP we always
wanted to use
• Some Clients: Huffpo-AOL, Thrillist,
Refinery 29, Flavorpill, Business Insider,
Lot18, Fab, New York Observer
3. How We Got To MongoDB from SQL
• JSON was part of Sailthru infrastructure
from start (SQL columns and S3)
• Kept a close eye on CouchDB project
• MongoDB felt like natural fit
• Used for user profiles and analytics initially
• Migrated one table at a time (very, very
carefully)
4. Sailthru Architecture
• User interface to display stats, build
campaigns and templates, etc (PHP/EC2)
• API, link rewriting, and onsite endpoints
(PHP/EC2)
• Core mailer engine (Java/EC2 and colo)
• Modified-postfix SMTP servers (colo)
• 11 database servers on EC2 (for now)
5. MongoDB Overview
• 11 instances on EC2 (5 two-member
replica sets, 1 backup server)
• About 40 collections
• About 1TB
• Largest single collection is 500m docs
6. Users are Documents
• Users aren’t records split among multiple
tables
• End user’s lists, clickstream interests, geolocation, browser, time of day, and purchase history become one ever-growing document
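• For example, a single profile document might look something like this (a sketch; field names are illustrative, not the actual schema):
{
  _id: 'user@example.com',
  lists: ['daily_digest', 'weekly_deals'],
  interests: { dresses: 31, shoes: 12 },  // clickstream-derived weights
  geo: { city: 'New York, NY US' },
  browser: 'firefox',
  purchase_incomplete: {
    items: [{ qty: 1, url: 'http://example.com/dress', title: 'Green Dress' }]
  }
}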
8. Profiles Accessible Everywhere
• Put abandoned shopping cart notifications
within a mass email
{if profile.purchase_incomplete}
<p>This is what’s in your cart:</p>
{foreach profile.purchase_incomplete.items as item}
{item.qty} <a href="{item.url}">{item.title}</a><br/>
{/foreach}
{/if}
9. Profiles Accessible Everywhere
• Show a section of content conditional on
the user’s location
{if profile.geo.city['New York, NY US']}
<div>Come to the New York Meetup on the 27th!</div>
{/if}
10. Profiles Accessible Everywhere
• Show different content depending on user
interests as measured by on-site behavior
{select}
{case horizon_interest('black,dark')}
<img src="http://example.com/dress-image-black.jpg" />
{/case}
{case horizon_interest('green')}
<img src="http://example.com/dress-image-green.jpg" />
{/case}
{case horizon_interest('purple,polka_dot,pattern')}
<img src="http://example.com/dress-image-polkadot.jpg" />
{/case}
{/select}
11. Profiles Accessible Everywhere
• Pick top content from a data feed based on
tags
{content = horizon_select(content,10)}
{foreach content as c}
<a href="{c.url}">{c.title}</a><br/>
{/foreach}
12. Other Advantages of MongoDB
• High performance
• Take any parameters from our clients
• Really flexible development
• Great for analytics (internal and external)
• No more downtime for schema migrations
or reindexing
13. How We Run mongod
• mongod --dbpath /path/to/db --logpath /path/to/log/mongodb.log --logappend --fork --rest --replSet main1 --journal
• Don’t ever run without replication
• Don’t ever kill -9
• Don’t run without writing to a log
• Run behind a firewall
• Use journaling now that it’s there
• Use --rest, it’s handy
14. Separate DBs By Collections
• Lower-effort than auto-sharding
• Separate databases for different usage
patterns
• Consider consequences of database failure/
unavailability
• But make sure your backup and monitoring
strategy is prepared for multiple DBs
15. Our Five Replica Sets
• main: most of the stuff on the UI, lots of
small/medium collections
• horizon: realtime onsite browsing data
• profile: user profile data (60m user docs)
• message: last three months of emails
• archive: emails older than three months
16. Monitoring
• Some stuff to monitor: faults/sec, index
misses, % locked, queue size, load average
• We check basic status once/minute on all database servers (SMS alerts if down), email warnings on thresholds every 10 minutes
• Have been beta-testing 10gen’s MMS product
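• Those counters come out of serverStatus in the shell; a rough sketch (field layout varies by server version):
var s = db.serverStatus();
s.extra_info.page_faults;       // faults
s.indexCounters.btree.misses;   // index misses
s.globalLock.ratio;             // % locked
s.globalLock.currentQueue;      // queue size (waiting readers/writers)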
17. Backups
• Used to use mongodump - don’t do that
anymore
• Have single node of each replica set on a
backup server
• Two-hour slave delay
• fsync/lock, freeze xfs file system, EBS
snapshot, unfreeze, unlock
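• The snapshot dance sketched from the shell (the filesystem and EBS steps run outside mongo; exact helper names vary by shell version):
db.runCommand({ fsync: 1, lock: 1 });  // flush to disk and block writes
// xfs_freeze -f /data                 # freeze the xfs filesystem
// ec2-create-snapshot <volume-id>     # take the EBS snapshot
// xfs_freeze -u /data                 # unfreeze
db.fsyncUnlock();                      // release the write lock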
18. The Great EC2 EBS Outage Adventure
• We survived
• Most of our nodes unavailable for 2-4 days
• Were able to spin up new instances from the backup server and snapshots, and get operational within hours
• Wasn’t fun
19. EC2 Future Plans
• EC2 is great overall
• EBS performance a little too inconsistent (even with RAID 0 or 10)
• Moving to rely on physical hardware (with SSDs) in colo
• Retain some nodes and backups on EC2
• Let you know how it goes in a few months
21. Develop Your Mental Model of MongoDB
• You don’t need to look at the internals
• But try to gain a working understanding of
how MongoDB operates, especially RAM
and indexes
22. Big-Picture Design Questions
• What is the data I want to store?
• How will I want to use that data later?
• How big will the data get?
• If the answers are “I don’t know yet”, guess
with your best YAGNI
23. “But premature optimization is evil”
• Knuth said that about code, which is
flexible and easy to optimize later
• Data is not as flexible as code
• So doing some planning for performance is
usually good when it comes to your data
24. Specific MongoDB Design Questions
• Embed vs top-level collection?
• Denormalize (double-store data)?
• How many/which indexes?
• Arrays vs hashes for embedding?
• Implicit schema (field names and types)
25. Short Field Names?
• Disk space: cheap
• RAM: not cheap
• Developer Time: expensive
• Err towards compact, readable field names
• Might be worth writing a mapper
• Probably wish we’d used c instead of
client_id
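• A mapper can be as thin as a lookup table; an entirely hypothetical sketch:
// expand terse stored names into readable ones at the app boundary
var FIELDS = { c: 'client_id', e: 'email', t: 'template' };
function expand(doc) {
  var out = {};
  for (var k in doc) out[FIELDS[k] || k] = doc[k];
  return out;
}
// expand({ c: 76, e: 'a@b.com' }) -> { client_id: 76, email: 'a@b.com' }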
26. Favor Human-Readable Foreign Keys
• DBRefs are a bit cumbersome
• Referencing by MongoId often means doing
extra lookups
• Build human-readable references to save
you doing lookups and manual joins
27. Example
• Store the Template and the Email as strings
on the message object
• { template: "Internal - Blast Notify", email: "support-alerts@sailthru.com" }
• No external reference lookups required
• The tradeoff is basically just disk space
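• So a lookup stays a single query (assuming a message collection shaped as above):
db.message.find({ template: 'Internal - Blast Notify' });
db.message.findOne({ email: 'support-alerts@sailthru.com' });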
28. Embed vs Top-Level Collections?
• Major question of MongoDB schema design
• If you can ask the question at all, you might
want to err on the side of embedding
• Don’t embed if the embedding could get
huge
• Don’t feel too bad about denormalizing by
embedding AND storing in a top-level
collection
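• A sketch of that denormalization (collections hypothetical): the summary is embedded for fast single-document reads, the full record stays top-level for mass queries
// full record in its own collection
db.purchase.save({ _id: 'p1', user: 'ploki', total: 5000 });
// summary embedded on the profile with an atomic $push
db.profile.update(
  { _id: 'ploki' },
  { $push: { recent_purchases: { id: 'p1', total: 5000 } } }
);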
29. Typical Properties of Top-Level Collections
• Independence: They don’t “belong”
conceptually to another collection
• Nouns: the building blocks of your system
• Easily referenceable and updatable
30. Embedding Pros
• Super-fast retrieval of document with
related data
• Atomic updates
• “Ownership” of embedded document is
obvious
• Usually maps well to code structures
31. Embedding Cons
• Harder to get at, do mass queries
• Does not size up infinitely, will hit 16MB
limit
• Hard to create references to embedded
object
• Limited ability to indexed-sort the
embedded objects
32. If You Think You Can Embed
• You probably should
• I take advantage of embedding in my
designs more often now than I did three
years ago
• It’s a gift MongoDB gives you in exchange
for giving up your joins
33. Design Example: User Permissions
• Users can have various broad permission
levels for any number of clients
• For example, user ‘ploki’ might have
permission level ‘admin’ for client 76 and
permission level ‘reports_only’ for client
450
34. How Will We Use This Data?
• Retrieve all clients for a given user
• Retrieve all users for a given client
• Retrieve a permission level for a given
client for a given user
35. How Will This Data Grow?
• In the medium term, it will stay small
• Number of clients and number of users can
both grow infinitely
36. Back in SQL-land
• There’s a fairly standard way to do it
• It’s a many-many relationship, so
• Use a join table (client_user)
37. Should We Use a New Top-Level Collection?
db.client.user.save({
  client_id: 76,
  username: 'ploki',
  permission: 'admin'
});
db.client.user.save({
  client_id: 450,
  username: 'ploki',
  permission: 'reports_only'
});
db.client.user.ensureIndex({ client_id: 1 });
db.client.user.ensureIndex({ username: 1 });
// get all users belonging to a client
db.client.user.find({ client_id: 76 });
// get all clients a user has access to
db.client.user.find({ username: 'ibwhite' });
// get permissions for our current user
db.client.user.findOne({ username: user.name });
38. Probably Not
• Only needed if we have lots of clients per
user AND lots of users per client
• This is a case where we can embed, so let’s
do so
39. Three Ways to Embed
• Object (not good: can't do a multikeys index on the keys of a hash)
'clients': {
  '76': 'admin',
  '450': 'reports_only'
}
index: ???
• Array (okay, but you have to search through the array of objects on the retrieved doc to find by _id)
'clients': [
  { '_id': 76, 'access': 'admin' },
  { '_id': 450, 'access': 'reports_only' }
]
index: { 'clients._id': 1 }
• Array and object (our approach; the two fields sit next to each other alphabetically)
'clients': [ 76, 450 ],
'clients_access': {
  '76': 'admin',
  '450': 'reports_only'
}
index: { clients: 1 }
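• A sketch of the chosen design in the shell (user collection and names hypothetical):
db.user.save({
  _id: 'ploki',
  clients: [76, 450],
  clients_access: { '76': 'admin', '450': 'reports_only' }
});
db.user.ensureIndex({ clients: 1 });  // multikeys index over the array
// all clients for a given user: one document fetch
db.user.findOne({ _id: 'ploki' }).clients;
// all users for a given client: hits the index
db.user.find({ clients: 76 });
// permission for one client: direct hash lookup on the fetched doc
db.user.findOne({ _id: 'ploki' }).clients_access['76'];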
40. Indexes
• Index all high-frequency queries
• Do less-indexed queries only on
secondaries
• Reduce the size of indexes wherever you can on big collections
• Don’t sweat the medium-sized collections,
focus on the big wins
41. Take Advantage of Multiple-Field Indexes
• Order matters
• If you have an index on {client_id:
1, email: 1 }
• Then you also have the {client_id:
1} index “for free”
• but not { email: 1}
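• A quick illustration of the prefix rule (collection name hypothetical):
db.message.ensureIndex({ client_id: 1, email: 1 });
db.message.find({ client_id: 76 });                    // uses the index (prefix)
db.message.find({ client_id: 76, email: 'a@b.com' });  // uses the index
db.message.find({ email: 'a@b.com' });                 // does NOT use it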
42. Use your _id
• Every collection carries a mandatory _id index, which will cost you index size
• So do something useful with _id
43. Take advantage of fast ^indexes
• Messages have _ids like: 32423.00000341
• Need all messages in blast 32423:
• db.message.blast.find(
{ _id: /^32423./ } );
• (Yeah, I know the . is ugly. Don’t use a dot if you do this.)
44. Manual Range Partitioning
• We moved a big message.blast collection
into per-day collections:
• message.blast.20110605
message.blast.20110606
message.blast.20110607
etc...
• Keeps working set indexes smaller
• When we move data into the archive,
drop() is much faster than remove()
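• A sketch of the rollover (collection names and dates hypothetical):
// writes go to today's collection
db['message.blast.20110607'].insert({ blast: 32423, email: 'a@b.com' });
// archiving a whole day: copy out, then drop
db['message.blast.20110605'].find().forEach(function (doc) {
  db['archive.blast'].insert(doc);
});
db['message.blast.20110605'].drop();  // near-instant vs a slow remove()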