SlideShare ist ein Scribd-Unternehmen logo
1 von 45
MongoDB at Sailthru

         Ian White
        @eonwhite
      MongoNYC 2011
           6/7/11
Sailthru
• API-based transactional email led to...
• Mass campaign email led to...
• Intelligence and user behavior
• Three engineers built the ESP we always
  wanted to use
• Some Clients: Huffpo-AOL, Thrillist,
  Refinery 29, Flavorpill, Business Insider,
  Lot18, Fab, New York Observer
How We Got To
 MongoDB from SQL
• JSON was part of Sailthru infrastructure
  from start (SQL columns and S3)
• Kept a close eye on CouchDB project
• MongoDB felt like natural fit
• Used for user profiles and analytics initially
• Migrated one table at a time (very, very
  carefully)
Sailthru Architecture
• User interface to display stats, build
  campaigns and templates, etc (PHP/EC2)
• API, link rewriting, and onsite endpoints
  (PHP/EC2)
• Core mailer engine (Java/EC2 and colo)
• Modified-postfix SMTP servers (colo)
• 11 database servers on EC2 (for now)
MongoDB Overview

• 11 instances on EC2 (5 two-member
  replica sets, 1 backup server)
• About 40 collections
• About 1TB
• Largest single collection is 500m docs
Users are Documents

• Users aren’t records split among multiple
  tables
• End user’s lists, clickstream interests,
  geolocation, browser, time of day, purchase
  history becomes one ever-growing
  document
User Profile
{ "_id" : ObjectId("4b2d368aed948543a5fca4b4"), "browser" : { "Chrome" : 3, "Firefox" : 1, "iPhone" : 2 }, "click_count" : 1, "click_time" :
 "Wed Feb 17 2010 09:03:37 GMT-0500 (EST)", "client_id" : 450, "email" : "ibwhite@gmail.com", "email_hour" : { "13" : 1, "14" : 2, "16" : 2,
 "17" : 2, "18" : 3, "21" : 2 }, "geo" : { "city" : { "New York, NY US" : 3, "Sterling, VA US" : 1 }, "count" : 6, "country" : { "US" : 6 },
       "state" : { "NY US" : 3, "VA US" : 1 }, "zip" : { "10011 US" : 1, "10065 US" : 1 } }, "horizon" : { "admob" : 1, "advertising" : 3,
"afghanistan" : 1, "aig" : 2, "airline-industry" : 2, "alleyinsider" : 45, "analyst-research" : 1, "apple" : 25, "apple-tablet" : 5, "att" :
8, "bailout" : 5, "banks" : 6, "barack-obama" : 25, "ben-bernanke" : 1, "big-tech" : 17, "billionaires" : 1, "boats" : 1, "bonus" : 6, "bp" :
1, "budget" : 1, "cable" : 1, "caribbean" : 2, "cars" : 5, "chart-of-the-day" : 3, "china" : 3, "clusterstock" : 36, "cnbc" : 1, "comcast" :
   1, "commodities" : 3, "conan-obrien" : 6, "crime" : 2, "curbedcom" : 1, "death-of-tv" : 1, "debt" : 7, "deepwater-horizon-oil-spill" : 1,
      "dell" : 4, "development" : 1, "dick-fuld" : 1, "economy" : 10, "education" : 1, "employment" : 2, "entertainment" : 7, "europe" : 1,
   "facebook" : 4, "features" : 13, "financial-crisis" : 7, "financial-services" : 2, "fox" : 4, "fraud" : 1, "futures" : 1, "gadgets" : 21,
 "gas" : 1, "gawker" : 5, "gold" : 3, "goldman-sachs" : 1, "google" : 7, "green" : 5, "green-tech" : 2, "health" : 5, "health-care-reform" :
      7, "hedge-funds" : 3, "hires-and-fires" : 1, "housing-crisis" : 1, "hp" : 4, "hulu" : 2, "humor" : 1, "iad" : 1, "international" : 3,
    "investing" : 5, "ios" : 1, "ipad" : 2, "iphone" : 10, "jay-leno" : 5, "jim-cramer" : 1, "jobs" : 2, "john-gruber" : 2, "law-firms" : 1,
         "lawreview" : 3, "lehman-brothers" : 1, "litigation" : 5, "luxury" : 1, "mac" : 1, "magazines" : 1, "markets" : 7, "media" : 20,
        "mercedesbenz" : 4, "microsoft" : 1, "mining" : 1, "mobile" : 14, "mobile-ads" : 2, "moguls" : 1, "money" : 6, "money-media" : 2,
"moneygame" : 16, "morningstar" : 3, "mortgages" : 1, "mtv" : 1, "nbc" : 6, "new-york" : 1, "new-york-times" : 4, "news" : 9, "newspapers" :
   5, "nouriel-roubini" : 6, "oil" : 1, "online" : 10, "optimum-energy" : 4, "paul-krugman" : 3, "people" : 5, "politics" : 26, "radio" : 1,
     "real-estate" : 2, "recession" : 4, "regulation" : 12, "sai" : 15, "satellite-radio" : 1, "scandals" : 5, "security" : 1, "senate" : 4,
 "silicon-alley-insider" : 1, "sirius" : 1, "social-networking" : 3, "sports" : 1, "startups" : 1, "steve-jobs" : 1, "stimulus" : 1, "stock-
 market" : 5, "stocks" : 3, "tax-cuts" : 1, "taxes" : 1, "tbi" : 163, "tbi-live" : 3, "terrorism" : 3, "the-atlantic" : 1, "the-way-we-live-
  now" : 1, "themoneygame" : 3, "thewire" : 17, "tim-geithner" : 3, "time-warner-cable" : 1, "transportation" : 7, "treasury" : 2, "tv" : 7,
"tv-everywhere" : 1, "twitter" : 3, "uk" : 1, "unemployment" : 2, "us-government" : 8, "verizon" : 4, "video" : 6, "wall-st-cheat-sheet" : 1,
  "wall-street" : 25, "wall-street-journal" : 1, "warren-buffett" : 1, "white-house" : 4, "wwdc-2010" : 1, "yachts" : 1, "10gen" : 1, "2010-
   world-cup" : 1 }, "horizon_count" : 303, "horizon_time" : "Tue Dec 07 2010 15:26:35 GMT-0500 (EST)", "lists" : [ "TBI Research 1 - Beta",
"Dedicated Email", "TBI Research", "411" ], "lists_signup" : { "BI_iphone App" : null, "Clusterstock Chart Of The Day" : null, "Clusterstock
Select" : null, "Dedicated Email" : "Tue Dec 22 2009 13:29:43 GMT-0500 (EST)", "Dedicated Email - The Ladders" : null, "Green Sheet Select" :
null, "Insider 411" : null, "Insider 411 - Economist" : null, "Insider 411 - Ooyala" : null, "Insider 411 - The Wire Promo" : null, "Insider
  411- Economist" : null, "Law Review Select" : null, "Media Select" : null, "Silicon Alley Insider Chart Of The Day" : null, "Silicon Alley
     Insider Select" : null, "TBI Research" : "Tue Jan 05 2010 13:58:09 GMT-0500 (EST)", "TBI Research 1 - Beta" : "Mon Nov 09 2009 12:34:58
  GMT-0500 (EST)", "TBI Select" : null, "The Money Game Select" : null, "War Room Select" : null, "z_sailthru" : null, "10 Things Before the
      Opening Bell" : null, "411" : "Wed Jul 07 2010 11:28:03 GMT-0400 (EDT)" }, "open_count" : 11, "open_time" : "Tue Dec 07 2010 13:30:31
  GMT-0500 (EST)", "optout_templates" : [ ], "order" : 12, "signup_time" : "Mon Nov 09 2009 12:34:58 GMT-0500 (EST)", "site_hour" : { "20" :
 1 }, "status" : null, "status_time" : "Thu Jan 06 2011 11:09:54 GMT-0500 (EST)", "ts" : "Thu Jan 06 2011 11:09:54 GMT-0500 (EST)", "urls" :
                            [ "http://www.businessinsider.com/" ], "urls_count" : 1, "vars" : { "name" : "eonwhite" } }
Profiles Accessible
       Everywhere
• Put abandoned shopping cart notifications
  within a mass email
{if profile.purchase_incomplete}
 <p>This is what’s in your cart:</p>
 {foreach profile.purchase_incomplete.items as item}
   {item.qty} <a href=”{item.url}”>{item.title}</a><br/>
 {/foreach}
{/if}
Profiles Accessible
       Everywhere
• Show a section of content conditional on
  the user’s location

{if profile.geo.city[‘New York, NY US’]}
  <div>Come to the New York Meetup on the 27th!</div>
{/if}
Profiles Accessible
        Everywhere
• Show different content depending on user
   interests as measured by on-site behavior
{select}
  {case horizon_interest('black,dark')}
    <img src="http://example.com/dress-image-black.jpg" />
  {/case}
  {case horizon_interest('green')}
    <img src="http://example.com/dress-image-green.jpg" />
  {/case}
  {case horizon_interest('purple,polka_dot,pattern')}
    <img src="http://example.com/dress-image-polkadot.jpg" />
  {/case}
{/select}
Profiles Accessible
        Everywhere
• Pick top content from a data feed based on
   tags


{content = horizon_select(content,10)}

{foreach content as c}
  <a href=”{c.url}”>{c.title}</a><br/>
{/foreach}
Other Advantages of
     MongoDB
• High performance
• Take any parameters from our clients
• Really flexible development
• Great for analytics (internal and external)
• No more downtime for schema migrations
  or reindexing
How We Run mongod
•   mongod --dbpath /path/to/db --logpath /path/to/log/
    mongodb.log --logappend --fork --rest --replSet
    main1 --journal


• Don’t ever run without replication
• Don’t ever kill -9
• Don’t run without writing to a log
• Run behind a firewall
• Use journaling now that it’s there
• Use --rest, it’s handy
Separate DBs By
       Collections
• Lower-effort than auto-sharding
• Separate databases for different usage
  patterns
• Consider consequences of database failure/
  unavailability
• But make sure your backup and monitoring
  strategy is prepared for multiple DBs
Our Five Replica Sets
• main: most of the stuff on the UI, lots of
  small/medium collections
• horizon: realtime onsite browsing data
• profile: user profile data (60m user docs)
• message: last three months of emails
• archive: emails older than three months
Monitoring

• Some stuff to monitor: faults/sec, index
  misses, % locked, queue size, load average
• we check basic status once/minute on all
  database servers (SMS alerts if down), email
  warnings on thresholds every 10 minutes
• have been beta-ing 10gen’s MMS product
Backups
• Used to use mongodump - don’t do that
  anymore
• Have single node of each replica set on a
  backup server
• Two-hour slave delay
• fsync/lock, freeze xfs file system, EBS
  snapshot, unfreeze, unlock
The Great EC2 EBS
  Outage Adventure
• We survived
• Most of our nodes unavailable for 2-4 days
• Were able to spin up new instances from
  backup server, snapshots, and get
  operational within hours
• Wasn’t fun
EC2 Future Plans
• EC2 is great overall
• EBS performance a little too inconsistent
  (even with RAID 0 or10)
• Moving to relying on physical hardware
  (with SSD) in colo
• Retain some nodes and backups on EC2
• Let you know how it goes in a few months
DESIGN
Develop Your Mental
 Model of MongoDB

• You don’t need to look at the internals
• But try to gain a working understanding of
  how MongoDB operates, especially RAM
  and indexes
Big-Picture Design
        Questions
• What is the data I want to store?
• How will I want to use that data later?
• How big will the data get?
• If the answers are “I don’t know yet”, guess
  with your best YAGNI
“But premature
  optimization is evil”
• Knuth said that about code, which is
  flexible and easy to optimize later
• Data is not as flexible as code
• So doing some planning for performance is
  usually good when it comes to your data
Specific MongoDB
    Design Questions
• Embed vs top-level collection?
• Denormalize (double-store data)?
• How many/which indexes?
• Arrays vs hashes for embedding?
• Implicit schema (field names and types)
Short Field Names?
• Disk space: cheap
• RAM: not cheap
• Developer Time: expensive
• Err towards compact, readable fieldnames
• Might be worth writing a mapper
• Probably wish we’d used c instead of
  client_id
Favor Human-Readable
     Foreign Keys
• DBRefs are a bit cumbersome
• Referencing by MongoId often means doing
  extra lookups
• Build human-readable references to save
  you doing lookups and manual joins
Example



• Store the Template and the Email as strings
    on the message object
•   { template: “Internal - Blast Notify”, email:
    “support-alerts@sailthru.com” }


• No external reference lookups required
• The tradeoff is basically just disk space
Embed vs Top-Level
     Collections?
• Major question of MongoDB schema design
• If you can ask the question at all, you might
  want to err on the side of embedding
• Don’t embed if the embedding could get
  huge
• Don’t feel too bad about denormalizing by
  embedding AND storing in a top-level
  collection
Typical Properties of
Top-Level Collections

• Independence: They don’t “belong”
  conceptually to another collection
• Nouns: the building blocks of your system
• Easily referenceable and updatable
Embedding Pros
• Super-fast retrieval of document with
  related data
• Atomic updates
• “Ownership” of embedded document is
  obvious
• Usually maps well to code structures
Embedding Cons

• Harder to get at, do mass queries
• Does not size up infinitely, will hit 16MB
  limit
• Hard to create references to embedded
  object
• Limited ability to indexed-sort the
  embedded objects
If You Think You Can
          Embed
• You probably should
• I take advantage of embedding in my
  designs more often now than I did three
  years ago
• It’s a gift MongoDB gives you in exchange
  for giving up your joins
Design Example:
     User Permissions
• Users can have various broad permission
  levels for any number of clients
• For example, user ‘ploki’ might have
  permission level ‘admin’ for client 76 and
  permission level ‘reports_only’ for client
  450
How Will We Use This
      Data?

• Retrieve all clients for a given user
• Retrieve all users for a given client
• Retrieve a permission level for a given
  client for a given user
How Will This Data
      Grow?

• In the medium term, it will stay small
• Number of clients and number of users can
  both grow infinitely
Back in SQL-land

• There’s a fairly standard way to do it
• It’s a many-many relationship, so
• Use a join table (client_user)
Should We Use a New
Top-Level Collection?
  db.client.user.save( {
    client_id: 76,
    username: ‘ploki’,
    permission: ‘admin’,
  });
  db.client.user.save( {
    client_id: 450,
    username: ‘ploki’,
    permission: ‘reports_only’,
  });

  db.client.user.ensureIndex( { client_id: 1 } );
  db.client.user.ensureIndex( { username: 1 } );

  // get all users belonging to a client
  db.client.user.find( { client_id: 76 } );

  // get all clients a user has access to
  db.client.user.find( { username: ‘ibwhite’ } );

  // get permissions for our current user
  db.client.user.findOne( { username: user.name } );
Probably Not


• Only needed if we have lots of clients per
  user AND lots of users per client
• This is a case where we can embed, so let’s
  do so
Three Ways to Embed
             ‘clients’: {
                ‘76’: ‘admin’,                                   Not good:
  Object        ‘450’: ‘reports_only’,                   can’t do a multikeys index
             },                                            on the keys of a hash
             index:???


                                                                  Okay:
  Array      ‘clients’: [
                {‘_id’: 76, ‘access’: ‘admin’},             but have to search
                                                              through array
of objects   },
                {‘_id’: 450, ‘access’: ‘reports_only’}
                                                              to find by _id
             index: { ‘clients._id’: 1 }                     on retrieved doc


             ‘clients’: [ 76, 450 ],
                                                            Our approach:
  Array
             ‘clients_access’: {
               ’76’: ‘admin’,                             Fields next to each
                                                          other alphabetically
and object
               ‘450’: ‘reports_only’,
             }
             index: { clients: 1 }
Indexes
• Index all highly frequent queries
• Do less-indexed queries only on
  secondaries
• Reduce the size of indexes whereever you
  can on big collections
• Don’t sweat the medium-sized collections,
  focus on the big wins
Take Advantage of
Multiple-Field Indexes
• Order matters
• If you have an index on {client_id:
  1, email: 1 }

• Then you also have the {client_id:
  1} index “for free”

• but not {   email: 1}
Use your _id


• You must use an _id for every collection,
  which will cost you index size
• So do something useful with _id
Take advantage of fast
      ^indexes
• Messages have _ids like: 32423.00000341
• Need all messages in blast 32423:
• db.message.blast.find(
        { _id: /^32423./ } );

•   (Yeah, I know the . is ugly. Don’t use a dot if you do this.)
Manual Range
              Partioning
• We moved a big message.blast collection
    into per-day collections:
•   message.blast.20110605
    message.blast.20110606
    message.blast.20110607
    etc...


• Keeps working set indexes smaller
• When we move data into the archive,
    drop() is much faster than remove()
Questions?
Looking for a job?
     ian@sailthru.com
   twitter.com/eonwhite

Weitere ähnliche Inhalte

Was ist angesagt?

Back to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB ApplicationBack to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB ApplicationMongoDB
 
MongoDB Europe 2016 - MongoDB 3.4 preview and introduction to MongoDB Atlas
MongoDB Europe 2016 - MongoDB 3.4 preview and introduction to MongoDB AtlasMongoDB Europe 2016 - MongoDB 3.4 preview and introduction to MongoDB Atlas
MongoDB Europe 2016 - MongoDB 3.4 preview and introduction to MongoDB AtlasMongoDB
 
Building a Social Network with MongoDB
  Building a Social Network with MongoDB  Building a Social Network with MongoDB
Building a Social Network with MongoDBFred Chu
 
JSON-LD: JSON for Linked Data
JSON-LD: JSON for Linked DataJSON-LD: JSON for Linked Data
JSON-LD: JSON for Linked DataGregg Kellogg
 
MongoDB Schema Design: Practical Applications and Implications
MongoDB Schema Design: Practical Applications and ImplicationsMongoDB Schema Design: Practical Applications and Implications
MongoDB Schema Design: Practical Applications and ImplicationsMongoDB
 
High Performance Applications with MongoDB
High Performance Applications with MongoDBHigh Performance Applications with MongoDB
High Performance Applications with MongoDBMongoDB
 
Apache CouchDB Presentation @ Sept. 2104 GTALUG Meeting
Apache CouchDB Presentation @ Sept. 2104 GTALUG MeetingApache CouchDB Presentation @ Sept. 2104 GTALUG Meeting
Apache CouchDB Presentation @ Sept. 2104 GTALUG MeetingMyles Braithwaite
 

Was ist angesagt? (7)

Back to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB ApplicationBack to Basics Webinar 2: Your First MongoDB Application
Back to Basics Webinar 2: Your First MongoDB Application
 
MongoDB Europe 2016 - MongoDB 3.4 preview and introduction to MongoDB Atlas
MongoDB Europe 2016 - MongoDB 3.4 preview and introduction to MongoDB AtlasMongoDB Europe 2016 - MongoDB 3.4 preview and introduction to MongoDB Atlas
MongoDB Europe 2016 - MongoDB 3.4 preview and introduction to MongoDB Atlas
 
Building a Social Network with MongoDB
  Building a Social Network with MongoDB  Building a Social Network with MongoDB
Building a Social Network with MongoDB
 
JSON-LD: JSON for Linked Data
JSON-LD: JSON for Linked DataJSON-LD: JSON for Linked Data
JSON-LD: JSON for Linked Data
 
MongoDB Schema Design: Practical Applications and Implications
MongoDB Schema Design: Practical Applications and ImplicationsMongoDB Schema Design: Practical Applications and Implications
MongoDB Schema Design: Practical Applications and Implications
 
High Performance Applications with MongoDB
High Performance Applications with MongoDBHigh Performance Applications with MongoDB
High Performance Applications with MongoDB
 
Apache CouchDB Presentation @ Sept. 2104 GTALUG Meeting
Apache CouchDB Presentation @ Sept. 2104 GTALUG MeetingApache CouchDB Presentation @ Sept. 2104 GTALUG Meeting
Apache CouchDB Presentation @ Sept. 2104 GTALUG Meeting
 

Ähnlich wie Mongo at Sailthru (MongoNYC 2011)

Real-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studyReal-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studydeep.bi
 
Clickstream data with spark
Clickstream data with sparkClickstream data with spark
Clickstream data with sparkMarissa Saunders
 
IOOF IT System Modernisation
IOOF IT System ModernisationIOOF IT System Modernisation
IOOF IT System ModernisationMongoDB
 
Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework MongoDB
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and PythonTravis Oliphant
 
Big data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at KitwareBig data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at Kitwarebigdataviz_bay
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use CasesMax De Marzi
 
Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015StampedeCon
 
MongoDB Tick Data Presentation
MongoDB Tick Data PresentationMongoDB Tick Data Presentation
MongoDB Tick Data PresentationMongoDB
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014ALTER WAY
 
MySQL Performance Monitoring
MySQL Performance MonitoringMySQL Performance Monitoring
MySQL Performance Monitoringspil-engineering
 
Webinar: How Banks Use MongoDB as a Tick Database
Webinar: How Banks Use MongoDB as a Tick DatabaseWebinar: How Banks Use MongoDB as a Tick Database
Webinar: How Banks Use MongoDB as a Tick DatabaseMongoDB
 
Our Data Ourselves, Pydata 2015
Our Data Ourselves, Pydata 2015Our Data Ourselves, Pydata 2015
Our Data Ourselves, Pydata 2015kingsBSD
 
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data WorldSearching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data WorldOpenSource Connections
 
Dynamic Apps with WebSockets and MQTT - IBM Impact 2014
Dynamic Apps with WebSockets and MQTT - IBM Impact 2014Dynamic Apps with WebSockets and MQTT - IBM Impact 2014
Dynamic Apps with WebSockets and MQTT - IBM Impact 2014Bryan Boyd
 
tranSMART Community Meeting 5-7 Nov 13 - Session 2: MongoDB: What, Why And When
tranSMART Community Meeting 5-7 Nov 13 - Session 2: MongoDB: What, Why And WhentranSMART Community Meeting 5-7 Nov 13 - Session 2: MongoDB: What, Why And When
tranSMART Community Meeting 5-7 Nov 13 - Session 2: MongoDB: What, Why And WhenDavid Peyruc
 
Glynn Bird – Cloudant – Building applications for success.- NoSQL matters Bar...
Glynn Bird – Cloudant – Building applications for success.- NoSQL matters Bar...Glynn Bird – Cloudant – Building applications for success.- NoSQL matters Bar...
Glynn Bird – Cloudant – Building applications for success.- NoSQL matters Bar...NoSQLmatters
 
Blazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkBlazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkMongoDB
 
Open source for customer analytics
Open source for customer analyticsOpen source for customer analytics
Open source for customer analyticsMatthias Funke
 
Maintainable Machine Learning Products
Maintainable Machine Learning ProductsMaintainable Machine Learning Products
Maintainable Machine Learning ProductsAndrew Musselman
 

Ähnlich wie Mongo at Sailthru (MongoNYC 2011) (20)

Real-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studyReal-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case study
 
Clickstream data with spark
Clickstream data with sparkClickstream data with spark
Clickstream data with spark
 
IOOF IT System Modernisation
IOOF IT System ModernisationIOOF IT System Modernisation
IOOF IT System Modernisation
 
Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
Big data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at KitwareBig data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at Kitware
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use Cases
 
Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015
 
MongoDB Tick Data Presentation
MongoDB Tick Data PresentationMongoDB Tick Data Presentation
MongoDB Tick Data Presentation
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
 
MySQL Performance Monitoring
MySQL Performance MonitoringMySQL Performance Monitoring
MySQL Performance Monitoring
 
Webinar: How Banks Use MongoDB as a Tick Database
Webinar: How Banks Use MongoDB as a Tick DatabaseWebinar: How Banks Use MongoDB as a Tick Database
Webinar: How Banks Use MongoDB as a Tick Database
 
Our Data Ourselves, Pydata 2015
Our Data Ourselves, Pydata 2015Our Data Ourselves, Pydata 2015
Our Data Ourselves, Pydata 2015
 
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data WorldSearching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
 
Dynamic Apps with WebSockets and MQTT - IBM Impact 2014
Dynamic Apps with WebSockets and MQTT - IBM Impact 2014Dynamic Apps with WebSockets and MQTT - IBM Impact 2014
Dynamic Apps with WebSockets and MQTT - IBM Impact 2014
 
tranSMART Community Meeting 5-7 Nov 13 - Session 2: MongoDB: What, Why And When
tranSMART Community Meeting 5-7 Nov 13 - Session 2: MongoDB: What, Why And WhentranSMART Community Meeting 5-7 Nov 13 - Session 2: MongoDB: What, Why And When
tranSMART Community Meeting 5-7 Nov 13 - Session 2: MongoDB: What, Why And When
 
Glynn Bird – Cloudant – Building applications for success.- NoSQL matters Bar...
Glynn Bird – Cloudant – Building applications for success.- NoSQL matters Bar...Glynn Bird – Cloudant – Building applications for success.- NoSQL matters Bar...
Glynn Bird – Cloudant – Building applications for success.- NoSQL matters Bar...
 
Blazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkBlazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & Spark
 
Open source for customer analytics
Open source for customer analyticsOpen source for customer analytics
Open source for customer analytics
 
Maintainable Machine Learning Products
Maintainable Machine Learning ProductsMaintainable Machine Learning Products
Maintainable Machine Learning Products
 

Kürzlich hochgeladen

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 

Kürzlich hochgeladen (20)

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 

Mongo at Sailthru (MongoNYC 2011)

  • 1. MongoDB at Sailthru Ian White @eonwhite MongoNYC 2011 6/7/11
  • 2. Sailthru • API-based transactional email led to... • Mass campaign email led to... • Intelligence and user behavior • Three engineers built the ESP we always wanted to use • Some Clients: Huffpo-AOL, Thrillist, Refinery 29, Flavorpill, Business Insider, Lot18, Fab, New York Observer
  • 3. How We Got To MongoDB from SQL • JSON was part of Sailthru infrastructure from start (SQL columns and S3) • Kept a close eye on CouchDB project • MongoDB felt like natural fit • Used for user profiles and analytics initially • Migrated one table at a time (very, very carefully)
  • 4. Sailthru Architecture • User interface to display stats, build campaigns and templates, etc (PHP/EC2) • API, link rewriting, and onsite endpoints (PHP/EC2) • Core mailer engine (Java/EC2 and colo) • Modified-postfix SMTP servers (colo) • 11 database servers on EC2 (for now)
  • 5. MongoDB Overview • 11 instances on EC2 (5 two-member replica sets, 1 backup server) • About 40 collections • About 1TB • Largest single collection is 500m docs
  • 6. Users are Documents • Users aren’t records split among multiple tables • End user’s lists, clickstream interests, geolocation, browser, time of day, purchase history becomes one ever-growing document
  • 7. User Profile { "_id" : ObjectId("4b2d368aed948543a5fca4b4"), "browser" : { "Chrome" : 3, "Firefox" : 1, "iPhone" : 2 }, "click_count" : 1, "click_time" : "Wed Feb 17 2010 09:03:37 GMT-0500 (EST)", "client_id" : 450, "email" : "ibwhite@gmail.com", "email_hour" : { "13" : 1, "14" : 2, "16" : 2, "17" : 2, "18" : 3, "21" : 2 }, "geo" : { "city" : { "New York, NY US" : 3, "Sterling, VA US" : 1 }, "count" : 6, "country" : { "US" : 6 }, "state" : { "NY US" : 3, "VA US" : 1 }, "zip" : { "10011 US" : 1, "10065 US" : 1 } }, "horizon" : { "admob" : 1, "advertising" : 3, "afghanistan" : 1, "aig" : 2, "airline-industry" : 2, "alleyinsider" : 45, "analyst-research" : 1, "apple" : 25, "apple-tablet" : 5, "att" : 8, "bailout" : 5, "banks" : 6, "barack-obama" : 25, "ben-bernanke" : 1, "big-tech" : 17, "billionaires" : 1, "boats" : 1, "bonus" : 6, "bp" : 1, "budget" : 1, "cable" : 1, "caribbean" : 2, "cars" : 5, "chart-of-the-day" : 3, "china" : 3, "clusterstock" : 36, "cnbc" : 1, "comcast" : 1, "commodities" : 3, "conan-obrien" : 6, "crime" : 2, "curbedcom" : 1, "death-of-tv" : 1, "debt" : 7, "deepwater-horizon-oil-spill" : 1, "dell" : 4, "development" : 1, "dick-fuld" : 1, "economy" : 10, "education" : 1, "employment" : 2, "entertainment" : 7, "europe" : 1, "facebook" : 4, "features" : 13, "financial-crisis" : 7, "financial-services" : 2, "fox" : 4, "fraud" : 1, "futures" : 1, "gadgets" : 21, "gas" : 1, "gawker" : 5, "gold" : 3, "goldman-sachs" : 1, "google" : 7, "green" : 5, "green-tech" : 2, "health" : 5, "health-care-reform" : 7, "hedge-funds" : 3, "hires-and-fires" : 1, "housing-crisis" : 1, "hp" : 4, "hulu" : 2, "humor" : 1, "iad" : 1, "international" : 3, "investing" : 5, "ios" : 1, "ipad" : 2, "iphone" : 10, "jay-leno" : 5, "jim-cramer" : 1, "jobs" : 2, "john-gruber" : 2, "law-firms" : 1, "lawreview" : 3, "lehman-brothers" : 1, "litigation" : 5, "luxury" : 1, "mac" : 1, "magazines" : 1, "markets" : 7, "media" : 20, "mercedesbenz" : 4, "microsoft" : 1, "mining" : 1, "mobile" : 14, "mobile-ads" : 2, "moguls" : 1, "money" : 6, "money-media" : 2, "moneygame" : 16, "morningstar" : 3, "mortgages" : 1, "mtv" : 1, "nbc" : 6, "new-york" : 1, "new-york-times" : 4, "news" : 9, "newspapers" : 5, "nouriel-roubini" : 6, "oil" : 1, "online" : 10, "optimum-energy" : 4, "paul-krugman" : 3, "people" : 5, "politics" : 26, "radio" : 1, "real-estate" : 2, "recession" : 4, "regulation" : 12, "sai" : 15, "satellite-radio" : 1, "scandals" : 5, "security" : 1, "senate" : 4, "silicon-alley-insider" : 1, "sirius" : 1, "social-networking" : 3, "sports" : 1, "startups" : 1, "steve-jobs" : 1, "stimulus" : 1, "stock- market" : 5, "stocks" : 3, "tax-cuts" : 1, "taxes" : 1, "tbi" : 163, "tbi-live" : 3, "terrorism" : 3, "the-atlantic" : 1, "the-way-we-live- now" : 1, "themoneygame" : 3, "thewire" : 17, "tim-geithner" : 3, "time-warner-cable" : 1, "transportation" : 7, "treasury" : 2, "tv" : 7, "tv-everywhere" : 1, "twitter" : 3, "uk" : 1, "unemployment" : 2, "us-government" : 8, "verizon" : 4, "video" : 6, "wall-st-cheat-sheet" : 1, "wall-street" : 25, "wall-street-journal" : 1, "warren-buffett" : 1, "white-house" : 4, "wwdc-2010" : 1, "yachts" : 1, "10gen" : 1, "2010- world-cup" : 1 }, "horizon_count" : 303, "horizon_time" : "Tue Dec 07 2010 15:26:35 GMT-0500 (EST)", "lists" : [ "TBI Research 1 - Beta", "Dedicated Email", "TBI Research", "411" ], "lists_signup" : { "BI_iphone App" : null, "Clusterstock Chart Of The Day" : null, "Clusterstock Select" : null, "Dedicated Email" : "Tue Dec 22 2009 13:29:43 GMT-0500 (EST)", "Dedicated Email - The Ladders" : null, "Green Sheet Select" : null, "Insider 411" : null, "Insider 411 - Economist" : null, "Insider 411 - Ooyala" : null, "Insider 411 - The Wire Promo" : null, "Insider 411- Economist" : null, "Law Review Select" : null, "Media Select" : null, "Silicon Alley Insider Chart Of The Day" : null, "Silicon Alley Insider Select" : null, "TBI Research" : "Tue Jan 05 2010 13:58:09 GMT-0500 (EST)", "TBI Research 1 - Beta" : "Mon Nov 09 2009 12:34:58 GMT-0500 (EST)", "TBI Select" : null, "The Money Game Select" : null, "War Room Select" : null, "z_sailthru" : null, "10 Things Before the Opening Bell" : null, "411" : "Wed Jul 07 2010 11:28:03 GMT-0400 (EDT)" }, "open_count" : 11, "open_time" : "Tue Dec 07 2010 13:30:31 GMT-0500 (EST)", "optout_templates" : [ ], "order" : 12, "signup_time" : "Mon Nov 09 2009 12:34:58 GMT-0500 (EST)", "site_hour" : { "20" : 1 }, "status" : null, "status_time" : "Thu Jan 06 2011 11:09:54 GMT-0500 (EST)", "ts" : "Thu Jan 06 2011 11:09:54 GMT-0500 (EST)", "urls" : [ "http://www.businessinsider.com/" ], "urls_count" : 1, "vars" : { "name" : "eonwhite" } }
  • 8. Profiles Accessible Everywhere • Put abandoned shopping cart notifications within a mass email {if profile.purchase_incomplete} <p>This is what’s in your cart:</p> {foreach profile.purchase_incomplete.items as item} {item.qty} <a href=”{item.url}”>{item.title}</a><br/> {/foreach} {/if}
  • 9. Profiles Accessible Everywhere • Show a section of content conditional on the user’s location {if profile.geo.city[‘New York, NY US’]} <div>Come to the New York Meetup on the 27th!</div> {/if}
  • 10. Profiles Accessible Everywhere • Show different content depending on user interests as measured by on-site behavior {select} {case horizon_interest('black,dark')} <img src="http://example.com/dress-image-black.jpg" /> {/case} {case horizon_interest('green')} <img src="http://example.com/dress-image-green.jpg" /> {/case} {case horizon_interest('purple,polka_dot,pattern')} <img src="http://example.com/dress-image-polkadot.jpg" /> {/case} {/select}
  • 11. Profiles Accessible Everywhere • Pick top content from a data feed based on tags {content = horizon_select(content,10)} {foreach content as c} <a href=”{c.url}”>{c.title}</a><br/> {/foreach}
  • 12. Other Advantages of MongoDB • High performance • Take any parameters from our clients • Really flexible development • Great for analytics (internal and external) • No more downtime for schema migrations or reindexing
  • 13. How We Run mongod • mongod --dbpath /path/to/db --logpath /path/to/log/ mongodb.log --logappend --fork --rest --replSet main1 --journal • Don’t ever run without replication • Don’t ever kill -9 • Don’t run without writing to a log • Run behind a firewall • Use journaling now that it’s there • Use --rest, it’s handy
  • 14. Separate DBs By Collections • Lower-effort than auto-sharding • Separate databases for different usage patterns • Consider consequences of database failure/ unavailability • But make sure your backup and monitoring strategy is prepared for multiple DBs
  • 15. Our Five Replica Sets • main: most of the stuff on the UI, lots of small/medium collections • horizon: realtime onsite browsing data • profile: user profile data (60m user docs) • message: last three months of emails • archive: emails older than three months
  • 16. Monitoring • Some stuff to monitor: faults/sec, index misses, % locked, queue size, load average • we check basic status once/minute on all database servers (SMS alerts if down), email warnings on thresholds every 10 minutes • have been beta-ing 10gen’s MMS product
  • 17. Backups • Used to use mongodump - don’t do that anymore • Have single node of each replica set on a backup server • Two-hour slave delay • fsync/lock, freeze xfs file system, EBS snapshot, unfreeze, unlock
  • 18. The Great EC2 EBS Outage Adventure • We survived • Most of our nodes unavailable for 2-4 days • Were able to spin up new instances from backup server, snapshots, and get operational within hours • Wasn’t fun
  • 19. EC2 Future Plans • EC2 is great overall • EBS performance a little too inconsistent (even with RAID 0 or10) • Moving to relying on physical hardware (with SSD) in colo • Retain some nodes and backups on EC2 • Let you know how it goes in a few months
  • 21. Develop Your Mental Model of MongoDB • You don’t need to look at the internals • But try to gain a working understanding of how MongoDB operates, especially RAM and indexes
  • 22. Big-Picture Design Questions • What is the data I want to store? • How will I want to use that data later? • How big will the data get? • If the answers are “I don’t know yet”, guess with your best YAGNI
  • 23. “But premature optimization is evil” • Knuth said that about code, which is flexible and easy to optimize later • Data is not as flexible as code • So doing some planning for performance is usually good when it comes to your data
  • 24. Specific MongoDB Design Questions • Embed vs top-level collection? • Denormalize (double-store data)? • How many/which indexes? • Arrays vs hashes for embedding? • Implicit schema (field names and types)
  • 25. Short Field Names? • Disk space: cheap • RAM: not cheap • Developer Time: expensive • Err towards compact, readable fieldnames • Might be worth writing a mapper • Probably wish we’d used c instead of client_id
  • 26. Favor Human-Readable Foreign Keys • DBRefs are a bit cumbersome • Referencing by MongoId often means doing extra lookups • Build human-readable references to save you doing lookups and manual joins
  • 27. Example • Store the Template and the Email as strings on the message object • { template: “Internal - Blast Notify”, email: “support-alerts@sailthru.com” } • No external reference lookups required • The tradeoff is basically just disk space
  • 28. Embed vs Top-Level Collections? • Major question of MongoDB schema design • If you can ask the question at all, you might want to err on the side of embedding • Don’t embed if the embedding could get huge • Don’t feel too bad about denormalizing by embedding AND storing in a top-level collection
  • 29. Typical Properties of Top-Level Collections • Independence: They don’t “belong” conceptually to another collection • Nouns: the building blocks of your system • Easily referenceable and updatable
  • 30. Embedding Pros • Super-fast retrieval of document with related data • Atomic updates • “Ownership” of embedded document is obvious • Usually maps well to code structures
  • 31. Embedding Cons • Harder to get at, do mass queries • Does not size up infinitely, will hit 16MB limit • Hard to create references to embedded object • Limited ability to indexed-sort the embedded objects
  • 32. If You Think You Can Embed • You probably should • I take advantage of embedding in my designs more often now than I did three years ago • It’s a gift MongoDB gives you in exchange for giving up your joins
  • 33. Design Example: User Permissions • Users can have various broad permission levels for any number of clients • For example, user ‘ploki’ might have permission level ‘admin’ for client 76 and permission level ‘reports_only’ for client 450
  • 34. How Will We Use This Data? • Retrieve all clients for a given user • Retrieve all users for a given client • Retrieve a permission level for a given client for a given user
  • 35. How Will This Data Grow? • In the medium term, it will stay small • Number of clients and number of users can both grow infinitely
  • 36. Back in SQL-land • There’s a fairly standard way to do it • It’s a many-many relationship, so • Use a join table (client_user)
  • 37. Should We Use a New Top-Level Collection? db.client.user.save( { client_id: 76, username: ‘ploki’, permission: ‘admin’, }); db.client.user.save( { client_id: 450, username: ‘ploki’, permission: ‘reports_only’, }); db.client.user.ensureIndex( { client_id: 1 } ); db.client.user.ensureIndex( { username: 1 } ); // get all users belonging to a client db.client.user.find( { client_id: 76 } ); // get all clients a user has access to db.client.user.find( { username: ‘ibwhite’ } ); // get permissions for our current user db.client.user.findOne( { username: user.name } );
  • 38. Probably Not • Only needed if we have lots of clients per user AND lots of users per client • This is a case where we can embed, so let’s do so
  • 39. Three Ways to Embed ‘clients’: { ‘76’: ‘admin’, Not good: Object ‘450’: ‘reports_only’, can’t do a multikeys index }, on the keys of a hash index:??? Okay: Array ‘clients’: [ {‘_id’: 76, ‘access’: ‘admin’}, but have to search through array of objects }, {‘_id’: 450, ‘access’: ‘reports_only’} to find by _id index: { ‘clients._id’: 1 } on retrieved doc ‘clients’: [ 76, 450 ], Our approach: Array ‘clients_access’: { ’76’: ‘admin’, Fields next to each other alphabetically and object ‘450’: ‘reports_only’, } index: { clients: 1 }
  • 40. Indexes • Index all highly frequent queries • Do less-indexed queries only on secondaries • Reduce the size of indexes whereever you can on big collections • Don’t sweat the medium-sized collections, focus on the big wins
  • 41. Take Advantage of Multiple-Field Indexes • Order matters • If you have an index on {client_id: 1, email: 1 } • Then you also have the {client_id: 1} index “for free” • but not { email: 1}
  • 42. Use your _id • You must use an _id for every collection, which will cost you index size • So do something useful with _id
  • 43. Take advantage of fast ^indexes • Messages have _ids like: 32423.00000341 • Need all messages in blast 32423: • db.message.blast.find( { _id: /^32423./ } ); • (Yeah, I know the . is ugly. Don’t use a dot if you do this.)
  • 44. Manual Range Partioning • We moved a big message.blast collection into per-day collections: • message.blast.20110605 message.blast.20110606 message.blast.20110607 etc... • Keeps working set indexes smaller • When we move data into the archive, drop() is much faster than remove()
  • 45. Questions? Looking for a job? ian@sailthru.com twitter.com/eonwhite

Hinweis der Redaktion

  1. \n
  2. \n
  3. \n
  4. \n
  5. \n
  6. \n
  7. \n
  8. \n
  9. \n
  10. \n
  11. \n
  12. \n
  13. \n
  14. \n
  15. \n
  16. \n
  17. \n
  18. \n
  19. \n
  20. \n
  21. \n
  22. \n
  23. \n
  24. \n
  25. \n
  26. \n
  27. \n
  28. \n
  29. \n
  30. \n
  31. \n
  32. \n
  33. \n
  34. \n
  35. \n
  36. \n
  37. \n
  38. \n
  39. \n
  40. \n
  41. \n
  42. \n
  43. \n
  44. \n
  45. \n