This presentation was given by Art van Scheppingen at Percona Live 2017 in Santa Clara, CA, and covers what you need to know to monitor MongoDB effectively.
MongoDB Monitoring - Become a MongoDB DBA
1. Copyright 2017 Severalnines AB
MongoDB Monitoring
Art van Scheppingen
Senior Support Engineer, Severalnines
Become a MongoDB DBA - Monitoring Essentials
● Monitoring and trending
● Why do we collect data?
● What metrics to collect from MongoDB?
● Key MongoDB metrics in depth
● Available MongoDB monitoring tools
● How to monitor MongoDB using ClusterControl
Agenda
There is only one person who can land a plane without instruments
● Monitoring system (e.g. Nagios)
○ Checks if services are healthy
○ Sends pages
● Trending system (e.g. Cacti, Graphite, Prometheus)
○ Collects metrics
○ Generates graphs
Monitoring vs Trending
● Do more than just open a connection
○ Measure true status of nodes and cluster
○ Test read/write
○ Open essential databases and collections
○ Keep an eye on the replication lag
■ Increase oplog size?
○ Check the full topology
Monitoring: Availability
● Trending
○ Plot trends of key (performance) metrics
○ Create timelines of metrics
○ Correlate various metrics
○ Find problems before they arise
○ Pre-emptive problem management
● Trending tools
○ Granularity of sampling
○ More datapoints = better
Trending: why do we need trends?
● Periodic (daily/weekly) healthchecks
● Insight into all aspects of the database operations
● Post mortem and proactive monitoring
● Capacity planning
Why do we collect data?
● Healthchecks are a pain
● You want to see aggregated data
● You want to be able to drill down to a particular host
● You want to see the most important data first and dig in later on
Healthchecks
● Ability to dig into past data
● Data granularity of 5s or even less (hardware-dependent)
● Fine granularity allows you to catch the issue as it evolves - no need to wait 5 minutes for a graph to refresh
Post mortem and proactive monitoring
● Graphs based on MongoDB status metrics
● Overall status and per-node graphs
● Ability to get time-shifted graphs - useful for comparing workload changes over time
Insight into internals, capacity planning
● Quite similar to other database systems
○ Host metrics
○ Operational metrics
○ Storage engine metrics
○ Replication metrics
○ Shard metrics
Type of metrics to collect
● Similar to most other databases
● Understand the utilization of the machine
● Capacity planning
● Determine the type of an issue
○ I/O related?
○ CPU related?
○ Network related?
Host metrics: what for?
● CPU utilization (should I add more nodes to the cluster?)
● Network utilization (am I running out of bandwidth?)
● Ping (how badly does latency affect my MongoDB cluster?)
● Disk throughput and IOPS (am I within my hardware limits?)
● Disk space (do I have to plan for larger disks?)
● Memory utilization (do I suffer from a memory leak?)
Host metrics: what to look for?
● Similar to most other databases
● Throughput of the cluster
● Relate throughput to cluster performance
● Determine the type of an issue
○ Request spikes?
○ Write amplification related?
○ Queueing?
Operational metrics
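Throughput counters such as `db.serverStatus().opcounters` are cumulative, so a trending system has to turn two samples into a rate before request spikes become visible. The following is a minimal sketch of that step, not part of the original talk. The sample wrapper objects (`takenAt` plus `opcounters`) and the function name are hypothetical, and the numbers are made up.

```javascript
// Sketch: turn two cumulative opcounters samples into per-second rates.
// `prev` and `curr` are hypothetical wrappers: { takenAt: <ms>, opcounters: {...} }.
function opsPerSecond(prev, curr) {
  const seconds = (curr.takenAt - prev.takenAt) / 1000;
  if (seconds <= 0) {
    throw new Error("samples must be in chronological order");
  }
  const rates = {};
  for (const op of Object.keys(curr.opcounters)) {
    const delta = curr.opcounters[op] - (prev.opcounters[op] || 0);
    // A negative delta means the counters were reset (for example after a restart).
    rates[op] = delta >= 0 ? delta / seconds : null;
  }
  return rates;
}

// Made-up samples taken 10 seconds apart.
const earlier = { takenAt: 0, opcounters: { insert: 1000, query: 5000, update: 200 } };
const later = { takenAt: 10000, opcounters: { insert: 1500, query: 9000, update: 250 } };
console.log(opsPerSecond(earlier, later));
```

Returning `null` for a reset counter keeps a restart from showing up as a huge negative spike in the graph.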
● Storage engine specific
○ MMAPv1
○ WiredTiger
○ MongoRocks
● Insight into how the engine performs
● Internal congestion
Storage engine metrics
● Throughput of the replication
● Durability of the oplog
● Replication lag
● Cluster replication acknowledgement
○ Quorum based
○ At least one secondary needs to acknowledge
Replication metrics
● Shard chunks and balancing
○ Chunks per shard
○ Disk usage
● Non-sharded collections
○ Sharding has to be enabled on collection level
○ Non-sharded collections get a primary shard assigned
○ Once the primary shard is full, no writes can happen
● Connection pool (mongos)
○ All queries will be sent to the primary in a shard
○ Range queries will block connections of the connection pool
Sharding related metrics
● Oplog: a special collection containing all transactions
○ Limited in size (configurable)
○ Eviction of transactions (FIFO)
○ Comparable to a ringbuffer
● Used for replication
○ Secondaries copy transactions from the oplog on other nodes
○ Full data sync necessary once the last executed transaction has been evicted
● Replication window
○ Time between first and last transaction in the oplog
○ Time that allows your secondary to be offline before performing a full sync
Oplog
From the ClusterControl advisor:
function getReplicationWindow(host) {
    var replwindow = {};
    replwindow['newset'] = false;
    // Fetch the first and last record from the oplog and take its timestamp
    var res = host.executeMongoQuery("local", '{find: "oplog.rs", sort: { $natural: 1}, limit: 1}');
    replwindow['first'] = res["result"]["cursor"]["firstBatch"][0]["ts"]["$timestamp"]["t"];
    // A freshly initiated replica set has no meaningful window yet
    if (res["result"]["cursor"]["firstBatch"][0]["o"]["msg"] == "initiating set") {
        replwindow['newset'] = true;
    }
    res = host.executeMongoQuery("local", '{find: "oplog.rs", sort: { $natural: -1}, limit: 1}');
    replwindow['last'] = res["result"]["cursor"]["firstBatch"][0]["ts"]["$timestamp"]["t"];
    // Replication window in seconds: newest minus oldest oplog timestamp
    replwindow['replwindow'] = replwindow['last'] - replwindow['first'];
    return replwindow;
}
Oplog: replication window
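The replication window is only useful once it is compared against how long a secondary may be offline. The sketch below is an illustration and not ClusterControl code. It assumes the two timestamps are seconds since the epoch, as in the `t` field of an oplog `ts`, and the function name and example values are hypothetical.

```javascript
// Sketch: interpret a replication window (timestamps in seconds since the epoch).
function describeReplicationWindow(firstTs, lastTs, secondsOffline) {
  const windowSeconds = lastTs - firstTs;
  return {
    windowSeconds: windowSeconds,
    windowHours: windowSeconds / 3600,
    // A secondary that was offline longer than the window can no longer
    // find its last applied entry in the oplog and needs a full sync.
    needsFullSync: secondsOffline > windowSeconds,
  };
}

// Made-up example: a 24 hour window and a node that was down for one hour.
console.log(describeReplicationWindow(1493000000, 1493086400, 3600));
```

An alert threshold on `windowHours` (for instance, warn when it drops below the longest expected maintenance window) is one possible use.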
● CPU, IO or lock related
● Outcome:
○ Secondary not used by Mongo client drivers
○ Puts larger strain on other secondaries
○ Less likely to be elected during a failover
■ If it is elected, it could be disastrous
○ Lagging too far behind could cause a full sync
Replication lag
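Replication lag can be derived from the output of `rs.status()` by comparing each secondary's `optimeDate` with the primary's. The following is a rough sketch under that assumption. The function name, host names and dates are made up, and a real check would also have to handle members in other states.

```javascript
// Sketch: per-secondary replication lag in seconds from an rs.status()-like document.
function replicationLagSeconds(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) {
    return null; // no primary visible, lag cannot be computed
  }
  const lag = {};
  for (const m of status.members) {
    if (m.stateStr === "SECONDARY") {
      lag[m.name] = (primary.optimeDate.getTime() - m.optimeDate.getTime()) / 1000;
    }
  }
  return lag;
}

// Made-up status document with one primary and two secondaries.
const exampleStatus = {
  members: [
    { name: "mongo1:27017", stateStr: "PRIMARY", optimeDate: new Date("2017-04-25T10:00:30Z") },
    { name: "mongo2:27017", stateStr: "SECONDARY", optimeDate: new Date("2017-04-25T10:00:28Z") },
    { name: "mongo3:27017", stateStr: "SECONDARY", optimeDate: new Date("2017-04-25T09:59:30Z") },
  ],
};
console.log(replicationLagSeconds(exampleStatus));
```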
● Like any other database: availability
● Client drivers may support connection pooling
○ Multiple non-blocking queries can use the same connection
○ Spawns new connections when the pool runs low
● Increase of connections
○ Locking issues
○ Application request bursts
Connections
From the MongoDB CLI
mongo_replica_0:PRIMARY> db.serverStatus().connections
{ "current" : 25, "available" : 794, "totalCreated" : NumberLong(122418) }
From any mongo client
mongo_replica_0:PRIMARY> db.runCommand( { serverStatus: 1 } ).connections
{ "current" : 25, "available" : 794, "totalCreated" : NumberLong(122418) }
Connections
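The `current` and `available` values only become an alertable metric when they are combined into a utilization figure. A minimal sketch, using the numbers from the output above, could look like this. The function name is hypothetical, and `totalCreated` is written as a plain number because `NumberLong` exists only in the mongo shell.

```javascript
// Sketch: fraction of the connection limit currently in use.
function connectionUtilization(connections) {
  const limit = connections.current + connections.available;
  return limit === 0 ? 0 : connections.current / limit;
}

// Values taken from the serverStatus output shown on the slide.
const connSample = { current: 25, available: 794, totalCreated: 122418 };
const connUsed = connectionUtilization(connSample);
console.log((connUsed * 100).toFixed(1) + "% of connections in use");
```

Trending this ratio next to `totalCreated` helps to tell a slow climb caused by locking apart from a short application request burst.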
● Atomicity on document level
○ WiredTiger and MongoRocks
● No “real” transactions
● Write data with the $isolated operator
○ Similar to READ UNCOMMITTED in MySQL (dirty reads in ANSI SQL)
○ No rollback
○ Does not work on shards
Transactions
● Optimistic concurrency control
○ If two write operations conflict, the transaction will be paused and retried
● Document level locking
● Tickets (threads)
○ Read
○ Write
Locks (WiredTiger)
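Ticket usage can be watched as a ratio of tickets in use to tickets in total. The sketch below assumes the `wiredTiger.concurrentTransactions` section of `db.serverStatus()`, with `out`, `available` and `totalTickets` fields for `read` and `write`, as found in MongoDB versions from around the time of this talk; check the output of your own version first. The function name and sample values are made up.

```javascript
// Sketch: share of WiredTiger read/write tickets currently in use.
// Assumes a document shaped like serverStatus().wiredTiger.concurrentTransactions.
function ticketUsage(concurrentTransactions) {
  const usage = {};
  for (const kind of ["read", "write"]) {
    const t = concurrentTransactions[kind];
    usage[kind] = t.out / t.totalTickets;
  }
  return usage;
}

// Made-up sample: reads are idle, writes are queueing up.
const ticketSample = {
  read: { out: 2, available: 126, totalTickets: 128 },
  write: { out: 96, available: 32, totalTickets: 128 },
};
console.log(ticketUsage(ticketSample));
```

A write ratio that stays close to 1 suggests operations are waiting for tickets, which is the kind of internal congestion the storage engine metrics are meant to expose.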
● MongoDB uses three tiers of cache
○ Filesystem
○ Active memory
○ Storage engine (WiredTiger / MongoRocks)
● Page faults
○ Cache miss
● Evictions
Cache
From the MongoDB CLI
mongo_replica_0:PRIMARY> db.serverStatus().extra_info.page_faults
37912924
mongo_replica_0:PRIMARY> db.serverStatus().wiredTiger.cache
{
"bytes currently in the cache" : 887889617,
"modified pages evicted" : 561514,
"tracked dirty pages in the cache" : 626,
"unmodified pages evicted" : 15823118
}
Page faults and cache usage
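Both values above are cumulative counters, so they are easier to read as a ratio or a rate. The following sketch derives two such figures from the fields shown on the slide. The function names and the second page-fault sample are hypothetical.

```javascript
// Sketch: share of evicted pages that were modified (dirty) when evicted.
function modifiedEvictionShare(cache) {
  const modified = cache["modified pages evicted"];
  const unmodified = cache["unmodified pages evicted"];
  const total = modified + unmodified;
  return total === 0 ? 0 : modified / total;
}

// Sketch: page faults per second between two samples of extra_info.page_faults.
function pageFaultsPerSecond(prevFaults, currFaults, intervalSeconds) {
  return (currFaults - prevFaults) / intervalSeconds;
}

// Cache values taken from the output shown on the slide.
const cacheSample = {
  "bytes currently in the cache": 887889617,
  "modified pages evicted": 561514,
  "tracked dirty pages in the cache": 626,
  "unmodified pages evicted": 15823118,
};
console.log(modifiedEvictionShare(cacheSample));
// The second page-fault reading is made up for illustration.
console.log(pageFaultsPerSecond(37912924, 37913524, 60));
```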
● Shards make write scaling transparent
● Sharding can be solved with two methods:
○ Hash key distribution (limited)
○ Shard lookup table
● MongoDB uses a combination of hash key distribution and shard lookup table
○ Hash key (or range key) distribution gets divided into chunks (ranges)
○ The chunk metadata gets stored in the config server
● The config server is the most important component in a MongoDB sharded cluster!
● The shard router is the second most important component
● Shards can get out of balance
○ Non-sharded collections
○ Heavy / large writes on a single chunk
○ Auto balancing by the primary of the config server (3.4 and later) or by mongos (3.2 and earlier)
Shard metrics
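Whether shards are out of balance can be estimated by counting chunks per shard from the chunk metadata. This sketch assumes an array of documents with a `shard` field, as stored in the `config.chunks` collection. The function name, shard names and sample data are made up.

```javascript
// Sketch: count chunks per shard and report the spread between the fullest
// and emptiest shard. Shards that own no chunks at all do not appear here.
function chunkBalance(chunks) {
  const perShard = {};
  for (const chunk of chunks) {
    perShard[chunk.shard] = (perShard[chunk.shard] || 0) + 1;
  }
  const counts = Object.values(perShard);
  const spread = counts.length === 0 ? 0 : Math.max(...counts) - Math.min(...counts);
  return { perShard: perShard, spread: spread };
}

// Made-up chunk metadata: shard0 owns three chunks, shard1 owns one.
const chunkSample = [
  { shard: "shard0" },
  { shard: "shard0" },
  { shard: "shard0" },
  { shard: "shard1" },
];
console.log(chunkBalance(chunkSample));
```

Chunk counts say nothing about chunk size, so this is best trended together with disk usage per shard.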
● Open Source
○ Nagios
○ Zabbix
● Subscription based
○ MongoDB Cloud Manager
○ VividCortex
○ ClusterControl
Alerting solutions
● Nagios-MongoDB
○ https://github.com/mzupan/nagios-plugin-mongodb/
○ Performs some very important checks
■ Replication lag
■ Lock time percentage
■ Index miss ratio
Nagios
● MongoDB Zabbix monitoring plugin
○ https://github.com/nightw/mikoomi-zabbix-mongodb-monitoring
○ All the necessary metrics and more
■ Entries in oplog
○ Pre-canned triggers
Zabbix
● Percona MongoDB Monitoring Templates
○ https://www.percona.com/doc/percona-monitoring-plugins/1.1/cacti/mongodb-templates.html
Cacti
● PMM
○ https://www.percona.com/doc/percona-monitoring-and-management/
○ Open Source Monitoring & Management framework
○ Can deploy, manage and monitor MySQL & MongoDB
○ Uses Prometheus and Grafana
Percona Monitoring & Management sessions:
● MySQL Monitoring with Percona Monitoring and Management, Tue 11:30 - 12:20 in Ballroom E
● Hipster MySQL Monitoring: Serving a deconstructed PMM, Tue 11:30 - 12:20 in Ballroom H
● Monitoring production environment with Percona Monitoring and Management (PMM), Thu 3:00 - 3:50 in room 209
Orchestration systems: Percona Monitoring & Management
● Blog series: Become a MongoDB DBA
○ http://severalnines.com/blog-categories/mongodb
● Webinar series: Become a MongoDB DBA
○ http://severalnines.com/upcoming-webinars
● Visit our website for more resources!
○ http://www.severalnines.com
● Stop by our booth in the exhibit hall
● Other sessions by Severalnines at Percona Live 2017
○ MySQL Load Balancers - MaxScale, ProxySQL, HAProxy, MySQL Router & nginx - a close up look, Wed 11:10am - 12:00pm in Ballroom D
○ MySQL (NDB) Cluster Best Practices (Die Hard VIII), Wed 3:30pm - 4:20pm in Room 210
Additional resources