2. 2JUNE 2014
Agenda
• CMP.LY and CommandPost
• What is MongoDB Management Service?
• Performance Tuning
• MongoDB Issues we’ve faced
• Slow response times and delayed writes
• Unindexed queries
• Increased Replication Lag and Plummeting oplog Window
• Keep your deployment healthy with MMS
• Using MMS Alerts
• Using MMS Backups
3. 3JUNE 2014
CMP.LY: a venture-funded NYC startup that offers proprietary social media monitoring,
measurement, insight and compliance solutions for Fortune 100 companies.
CommandPost: a Monitoring, Measurement & Insights (MMI) tool for managed social
communications.
4. 4JUNE 2014
Use CommandPost to:
• Track and measure across platforms in real time
• Identify and attribute high-value engagement
• Analyze and segment engaged audience
• Optimize content and engagement strategies
• Address compliance needs
6. 6JUNE 2014
MongoDB Management Service
• Free MongoDB Monitoring
• MongoDB Backup in the Cloud
• Free cloud service, or available to run on-prem with Standard or Enterprise subscriptions
• Automation coming soon—FTW!
Ops
Makes MongoDB easier to use and manage
7. 7JUNE 2014
Who Is MMS for?
• Developers
• Ops Team
• MongoDB Technical Service Team
9. 9JUNE 2014
How To Do Performance Tuning?
• Assess the problem and establish acceptable behavior.
• Measure the performance before modification.
• Identify the bottleneck.
• Remove the bottleneck.
• Measure performance after modification to confirm.
• Keep it or revert it and repeat.
Adapted from [http://en.wikipedia.org/wiki/Performance_tuning]
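The measure, change, measure again, keep-or-revert loop above can be sketched in a few lines. This is only an illustration: the helper names (`medianMs`, `keepOrRevert`), the sample timings and the 10% improvement threshold are assumptions, not something from our tooling.

```javascript
// Median of a list of response times (ms); less sensitive to outliers than the mean.
function medianMs(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Compare measurements taken before and after a change.
// Keep the change only if the median improved by at least minGain (a fraction).
function keepOrRevert(beforeMs, afterMs, minGain = 0.1) {
  const before = medianMs(beforeMs);
  const after = medianMs(afterMs);
  const gain = (before - after) / before;
  return { before, after, gain, decision: gain >= minGain ? "keep" : "revert" };
}

// Hypothetical timings for one API endpoint, before and after a change.
const tuningResult = keepOrRevert([120, 130, 125], [60, 70, 65]);
console.log(tuningResult.decision, tuningResult.gain.toFixed(2));
```

The point is that the "before" numbers are collected first; without them there is nothing to compare against.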
13. 13JUNE 2014
Concurrency
• What is it?
• How did it affect us?
• How did MMS help identify it?
• How did we diagnose the issue in our app and fix it?
• Today
14. 14JUNE 2014
Concurrency in MongoDB
• MongoDB uses a readers-writer lock
• Many read operations can use a read lock
• When a write lock is held, a single write operation holds it exclusively
• No other read or write operations can share the lock
• Locks are “writer-greedy”
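A toy model can make "writer-greedy" concrete. The class below is a simplified sketch written for this deck, not MongoDB's actual lock implementation; the class name and the request names are made up.

```javascript
// Toy readers-writer lock that always prefers waiting writers over waiting readers.
class WriterGreedyLock {
  constructor() {
    this.activeReaders = 0;
    this.writerActive = false;
    this.waiting = [];   // queued requests: { kind: "read" | "write", name }
    this.grantOrder = []; // names in the order the lock was granted
  }
  request(kind, name) {
    this.waiting.push({ kind, name });
    this.grant();
  }
  release(kind) {
    if (kind === "write") this.writerActive = false;
    else this.activeReaders -= 1;
    this.grant();
  }
  grant() {
    if (this.writerActive) return; // a write lock is exclusive
    const writerIdx = this.waiting.findIndex((w) => w.kind === "write");
    if (writerIdx !== -1) {
      // A writer is waiting: admit no new readers, and hand over once readers drain.
      if (this.activeReaders === 0) {
        const [writer] = this.waiting.splice(writerIdx, 1);
        this.writerActive = true;
        this.grantOrder.push(writer.name);
      }
      return;
    }
    // No writers waiting: every waiting reader can share the lock.
    while (this.waiting.length > 0) {
      const reader = this.waiting.shift();
      this.activeReaders += 1;
      this.grantOrder.push(reader.name);
    }
  }
}

const lock = new WriterGreedyLock();
lock.request("write", "w1"); // granted immediately
lock.request("read", "r1");  // must wait behind w1
lock.request("write", "w2"); // arrives after r1
lock.release("write");       // w1 done: w2 jumps ahead of r1
lock.release("write");       // w2 done: r1 finally runs
console.log(lock.grantOrder.join(" -> "));
```

In this model a steady stream of writes can keep readers waiting, which is the behavior that hurt our API response times.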
15. 15JUNE 2014
How Did This Affect Us?
• Slow API response times due to slow database operations
• Delayed writes
• Backed up queues
17. 17JUNE 2014
Lock % Greater than 100%?!?!?
• Lock % is the time spent in the write-lock state. It is the sum of the global lock %
and the lock % of the hottest database at that time, so the value can exceed 100%.
• It is a derived metric:
% of time in global lock (small number)
+
% of time locked by hottest (“most locked”) database
• Because the data is sampled and then combined, it is possible to see values over 100%.
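As a worked example of that sum, here is a small sketch. The function name and the sample percentages are hypothetical; MMS computes the metric itself, this only mirrors the arithmetic described above.

```javascript
// Derived lock %: global lock % plus the lock % of the most-locked database.
function derivedLockPercent(globalLockPct, perDatabaseLockPct) {
  let hottest = { name: null, pct: 0 };
  for (const [name, pct] of Object.entries(perDatabaseLockPct)) {
    if (pct > hottest.pct) hottest = { name, pct };
  }
  return { hottest: hottest.name, value: globalLockPct + hottest.pct };
}

// Hypothetical sample: a small global lock % plus one very busy database.
const lockSample = derivedLockPercent(4, { app: 97.5, analytics: 35, local: 2 });
console.log(lockSample.hottest, lockSample.value + "%"); // app 101.5%
```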
18. 18JUNE 2014
Diagnosis
• Identified the write-heavy collections in our applications
• Used application logs to identify slow API responses
• Analyzed MongoDB logs to identify slow database queries
22. 22JUNE 2014
Message Queues
• Controlled writes to specific collections using Pub/Sub
• We chose Amazon SQS
• Other options include Redis, Beanstalkd, IronMQ or any other message queue
• Created consistent flow of writes versus bursts
• Reduced length and frequency of write locks by controlling flow/speed of writes
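The effect of putting a queue in front of the database can be shown with a small simulation. This is a sketch only: it does not use SQS, and the function name, burst sizes and per-tick write limit are assumptions chosen for illustration.

```javascript
// Simulate a consumer that drains a queue at a capped rate.
// arrivalsPerTick: how many writes the app produces in each time slice (bursty).
// Returns how many writes actually hit the database in each time slice.
function smoothWrites(arrivalsPerTick, maxWritesPerTick) {
  if (maxWritesPerTick <= 0) throw new Error("maxWritesPerTick must be positive");
  const writesPerTick = [];
  let backlog = 0;
  let tick = 0;
  while (tick < arrivalsPerTick.length || backlog > 0) {
    backlog += arrivalsPerTick[tick] || 0; // new messages land in the queue
    const batch = Math.min(backlog, maxWritesPerTick);
    writesPerTick.push(batch);             // the consumer writes at most one batch
    backlog -= batch;
    tick += 1;
  }
  return writesPerTick;
}

// Two bursts (50 and 30 writes) become a steady 10 writes per tick.
console.log(smoothWrites([50, 0, 0, 0, 30, 0, 0, 0], 10));
```

The same number of writes reaches MongoDB, but no single moment carries the whole burst, so each write lock is shorter.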
23. 23JUNE 2014
Using Multiple Databases
• As of version 2.2, MongoDB implements locks at a per database granularity for
most read and write operations
• Planned to be at the document level in version 2.8
• Moved write-heavy collections to new (separate) databases
24. 24JUNE 2014
Using Sharding
• Improves concurrency by distributing databases across multiple mongod
instances
• Locks are per-mongod instance
27. 27JUNE 2014
Indexing
• What is it?
• How did it affect us?
• How did MMS help identify it?
• How did we diagnose the issue in our app and fix it?
• Today
28. 28JUNE 2014
Indexing with MongoDB
• Support for efficient execution of queries
• Without indexes, MongoDB must scan every document
• Example
Wed Jul 17 13:40:14 [conn28600] query x.y [snip] ntoreturn:16 ntoskip:0 nscanned:16779 scanAndOrder:1 keyUpdates:0 numYields: 906 locks(micros) r:46877422 nreturned:16 reslen:6948 38172ms
38 seconds! Scanned 17k documents, returned 16
• Create indexes to support all queries, especially common and user-facing ones
• Collection scans can push entire working set out of RAM
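The numbers in that log line can be pulled out programmatically. The parser below is a minimal sketch for this one log format (the 2.x-era slow query line shown above); `parseSlowQuery` is a made-up helper, and for real log analysis a dedicated tool is a better choice.

```javascript
const slowQueryLine =
  "Wed Jul 17 13:40:14 [conn28600] query x.y [snip] ntoreturn:16 ntoskip:0 " +
  "nscanned:16779 scanAndOrder:1 keyUpdates:0 numYields: 906 locks(micros) " +
  "r:46877422 nreturned:16 reslen:6948 38172ms";

// Extract the fields that matter when hunting for missing indexes.
function parseSlowQuery(logLine) {
  const num = (re) => {
    const match = logLine.match(re);
    return match ? Number(match[1]) : null;
  };
  const nscanned = num(/\bnscanned:\s*(\d+)/);
  const nreturned = num(/\bnreturned:\s*(\d+)/);
  return {
    nscanned,
    nreturned,
    millis: num(/\s(\d+)ms\s*$/),                     // total duration at end of line
    scanAndOrder: /\bscanAndOrder:1\b/.test(logLine), // sort was not served by an index
    scannedPerReturned: nscanned / nreturned,         // close to 1 is the goal
  };
}

console.log(parseSlowQuery(slowQueryLine));
```

A scanned-to-returned ratio in the thousands, as here, is a strong hint that the query is not using a suitable index.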
29. 29JUNE 2014
How Did this Affect Us?
• Our web apps became slow
• Queries began to time out
• Longer operations mean longer lock times
30. 30JUNE 2014
MMS: Identifying Indexing Issues
Page Faults
• The number of times that MongoDB requires data not located in physical memory,
and must read from virtual memory.
31. 31JUNE 2014
Diagnosis
• Log Analysis
• Use mtools to analyze MongoDB logs
• mlogfilter
• filter logs for slow queries, collection scans, etc.
• mplotqueries
• graph query response times and volumes
• https://github.com/rueckstiess/mtools
32. 32JUNE 2014
Diagnosis
• Monitoring application logs
• Enabling ‘notablescan’ option in development and testing versions of apps
• MongoDB profiling
33. 33JUNE 2014
The MongoDB Profiler
• Collects fine-grained data about MongoDB write operations, cursors, and database
commands on a running mongod instance.
• The default slow-operation threshold (slowms) is 100 ms; it can be changed from the mongo shell
34. 34JUNE 2014
Our Remedies
• Add indexes!
• Make sure queries are covered
• Utilize the projection specification to limit fields (data) returned
35. 35JUNE 2014
Adding Indexes
• Improved performance for common queries
• Alleviates the need to go to disk for many operations
36. 36JUNE 2014
Projection Specification
Controls the amount of data that needs to be (de-)serialized for use in your app
• We used it to limit data returned in embedded documents and arrays
db.inventory.find( { type: 'food' }, { item: 1, qty: 1 } )
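To see why a projection reduces the data your app has to handle, here is a toy stand-in for what the server does with `{ item: 1, qty: 1 }`. It is a sketch only: `projectInclude` is a made-up helper that handles top-level inclusion fields and the default `_id`, and the sample document is hypothetical.

```javascript
// Toy version of an inclusion projection on top-level fields.
// As in MongoDB, _id is returned unless the projection sets _id: 0.
function projectInclude(doc, spec) {
  const out = {};
  if (spec._id !== 0 && "_id" in doc) out._id = doc._id;
  for (const [field, flag] of Object.entries(spec)) {
    if (flag === 1 && field in doc) out[field] = doc[field];
  }
  return out;
}

// Hypothetical inventory document with a bulky embedded array.
const inventoryDoc = {
  _id: 1,
  type: "food",
  item: "apple",
  qty: 10,
  history: Array.from({ length: 200 }, (_, i) => ({ day: i, sold: i % 7 })),
};

const projected = projectInclude(inventoryDoc, { item: 1, qty: 1 });
console.log(projected);
console.log(JSON.stringify(inventoryDoc).length, "chars vs", JSON.stringify(projected).length);
```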
39. 39JUNE 2014
Replication
• What is it?
• How did it affect us?
• How did MMS help identify it?
• How did we diagnose the issue in our app?
• How did we fix it?
• Today
40. 40JUNE 2014
What is Replication?
• A replica set is a group of mongod
processes that maintain the same data
set.
• Replica sets provide redundancy and
high availability, and are the basis for all
production deployments
41. 41JUNE 2014
What Is the Oplog?
• A special capped collection that keeps a rolling record of all operations that
modify the data stored in your databases.
• Operations are first applied on the primary and then recorded to its oplog.
• Secondary members then copy and apply these operations in an asynchronous
process.
42. 42JUNE 2014
What is Replication Lag?
• A delay between an operation on the primary and the application of that
operation from the oplog to the secondary.
• Effects of excessive lag
• “Lagged” members ineligible to quickly become primary
• Increases the possibility that distributed read operations will be inconsistent.
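Lag can be estimated by comparing the last applied operation time of each secondary with the primary's. The sketch below works on an object shaped like the output of `rs.status()` (the `members`, `name`, `stateStr` and `optimeDate` fields); the host names and timestamps are made up, and the function name is hypothetical.

```javascript
// Lag per secondary = primary's last op time minus the secondary's last applied op time.
function replicationLagSeconds(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) throw new Error("no primary found in status");
  return status.members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => ({
      name: m.name,
      lagSeconds: (primary.optimeDate.getTime() - m.optimeDate.getTime()) / 1000,
    }));
}

// Hypothetical replica set status: one healthy secondary, one two minutes behind.
const sampleStatus = {
  members: [
    { name: "db1:27017", stateStr: "PRIMARY", optimeDate: new Date("2014-06-01T12:00:30Z") },
    { name: "db2:27017", stateStr: "SECONDARY", optimeDate: new Date("2014-06-01T12:00:28Z") },
    { name: "db3:27017", stateStr: "SECONDARY", optimeDate: new Date("2014-06-01T11:58:30Z") },
  ],
};

console.log(replicationLagSeconds(sampleStatus));
```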
43. 43JUNE 2014
How did this affect us?
• Degraded overall health of our production deployment.
• Distributed reads are no longer eventually consistent.
• Unable to bring new secondary members online.
• Caused MMS Backups to do full re-syncs.
45. 45JUNE 2014
Diagnosis
• Possible causes of replication lag include network latency, disk throughput,
concurrency and/or an inappropriate write concern
• Size of operations to be replicated
• Confirmed Non-Issues for us
• Network latency
• Disk throughput
• Possible Issues for us
• Concurrency/write concern
• Size of op is an issue because entire document is written to oplog
46. 46JUNE 2014
Concurrency/Write Concern
• Our applications apply many updates very quickly
• All operations need to be replicated to secondary members
• We use the default write concern, Acknowledged
• The mongod confirms receipt of the write operation
• Allows clients to catch network, duplicate key and other errors
48. 48JUNE 2014
Operation Size Was the Issue
Collection A (most active)
Total Updates: 3,373
Total Size of updates: 6.5 GB
Activity accounted for nearly 87% of total traffic
Collection B (next most active)
Total Updates: 85,423
Total Size of updates: 740 MB
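Dividing total size by update count shows how different the two collections are. The arithmetic below uses only the figures on this slide; the one assumption is the unit conversion (1 GB = 1024 MB), which the slide does not specify.

```javascript
// Figures from the slide.
const collectionA = { updates: 3373, totalMB: 6.5 * 1024 }; // 6.5 GB, assuming 1 GB = 1024 MB
const collectionB = { updates: 85423, totalMB: 740 };

const avgUpdateMB = (c) => c.totalMB / c.updates;

const avgA = avgUpdateMB(collectionA); // roughly 2 MB per update
const avgB = avgUpdateMB(collectionB); // roughly 9 KB per update

console.log("A:", avgA.toFixed(2), "MB per update");
console.log("B:", (avgB * 1024).toFixed(1), "KB per update");
console.log("A's average update is about", Math.round(avgA / avgB), "times larger");
```

Collection A had about 25 times fewer updates than Collection B, yet each one was a couple of hundred times larger, which is why it dominated the oplog.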
49. 49JUNE 2014
Fast Growing oplog causes issues
Replication oplog Window – approximate hours available in the primary’s oplog
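The window is simply the time between the oldest and newest entries still in the oplog, and for a fixed-size oplog it shrinks as the write volume grows. Here is a sketch of both views; the function names, timestamps, oplog size and write rates are hypothetical. In the mongo shell, `db.getReplicationInfo()` reports a comparable figure as `timeDiffHours`.

```javascript
const MS_PER_HOUR = 60 * 60 * 1000;

// Window from the first and last oplog entry times.
function oplogWindowHours(firstEntryTime, lastEntryTime) {
  return (lastEntryTime.getTime() - firstEntryTime.getTime()) / MS_PER_HOUR;
}

// Rough estimate: a capped oplog of a given size holds size / rate hours of writes.
function estimatedWindowHours(oplogSizeMB, oplogWriteMBPerHour) {
  if (oplogWriteMBPerHour <= 0) throw new Error("write rate must be positive");
  return oplogSizeMB / oplogWriteMBPerHour;
}

console.log(oplogWindowHours(new Date("2014-06-01T00:00:00Z"), new Date("2014-06-01T18:00:00Z"))); // 18
console.log(estimatedWindowHours(1000, 250));  // 4 hours at 250 MB/hour
console.log(estimatedWindowHours(1000, 1000)); // 1 hour if the write volume quadruples
```

A secondary (or a backup agent) that falls further behind than the window can no longer catch up from the oplog and needs a full resync.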
50. 50JUNE 2014
How We Fixed It
• Changed our schema
• Changed the types of updates that were made to documents
• Both allowed us to utilize atomic operations
• Led to smaller updates
• Smaller updates == less oplog space used
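The difference between rewriting a whole document and sending only the change can be illustrated as follows. This is a sketch under loud assumptions: the document shape is invented, and JSON text length is used as a rough stand-in for size (real oplog entries are BSON, so actual byte counts differ).

```javascript
// Hypothetical document: a post with a large embedded array.
const post = {
  _id: 42,
  title: "Launch announcement",
  likes: 1200,
  comments: Array.from({ length: 500 }, (_, i) => ({ user: "user" + i, text: "Nice post!" })),
};

// Rough size proxy only; not the real oplog format.
const approxSize = (value) => JSON.stringify(value).length;

// Style 1: read the document, change it in the app, write the whole thing back.
const fullRewrite = { ...post, likes: post.likes + 1 };

// Style 2: send only the change, using an atomic update operator.
// In the mongo shell this would look like:
//   db.posts.update({ _id: 42 }, { $inc: { likes: 1 } })   // "posts" is a hypothetical collection
const atomicUpdate = { $inc: { likes: 1 } };

console.log("full rewrite:", approxSize(fullRewrite), "chars");
console.log("atomic update:", approxSize(atomicUpdate), "chars");
```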
55. 55JUNE 2014
Watch for Warnings
• Be warned if
• You are running outdated versions
• You have startup warnings
• A mongod is publicly visible
• Pay attention to these warnings
56. 56JUNE 2014
MMS Backups
• Engineered by MongoDB
• Continuous backup with point-in-time recovery
• Fully managed backups
57. 57JUNE 2014
Using MMS Backups
• Seeding new secondaries
• Repairing replica set members
• Development and testing databases
• Restores are free!
58. 58JUNE 2014
Summary
• Know what’s expected and “normal” in your systems
• Know when and what changes in your systems
• Utilize MMS alerts, visualizations and warnings to keep things running smoothly
Developers :: what we’re focused on today – track bottlenecks
Ops team :: great for small teams where your developers are also part of your ops team (DevOps) – monitor health of clusters, back up databases, automate updates and add capacity
MongoDB technical service team :: helps them help you
Important for us because we maintain a small tech team
PRO-TIP: Know what is “normal” for your system.
Know what changed when something happens: what do you expect normal behavior to be, and what are your normal MMS metrics?
readers-writer lock allows concurrent read access to the db,
but exclusive access to a single write
“Writer-greedy” - When both a read and write are waiting for a
lock, MongoDB grants the lock to the write.
The exclusivity of write locks is one of the keys to why getting
our lock % under control is so important.
Lock %
Our issue: the primary database was showing a lock % of 150-175%
Global lock percentage has remained about the same
Primary client-facing database has seen lock % drop
Developed by a MongoDB engineer
- Purple bar indicates downtime
- Alerts for down hosts, down agents and more
- According to Technical Services, in many cases fixing warnings will fix issues