Mongo DB Monitoring - Become a MongoDB DBA

Copyright 2017 Severalnines AB
MongoDB Monitoring
Art van Scheppingen
Senior Support Engineer, Severalnines
Become a MongoDB DBA - Monitoring Essentials

● Monitoring and trending
● Why do we collect data?
● What metrics to collect from MongoDB?
● Key MongoDB metrics in depth
● Available MongoDB monitoring tools
● How to monitor MongoDB using ClusterControl
Agenda

Monitoring and trending

Do you need monitoring and trending?

There is only one person who can land a plane without instruments

● Monitoring system (e.g. Nagios)
○ Checks if services are healthy
○ Sends pages
● Trending system (e.g. Cacti, Graphite, Prometheus)
○ Collects metrics
○ Generates graphs
Monitoring vs Trending

● Do more than just opening a connection
○ Measure true status of nodes and cluster
○ Test read/write
○ Open essential databases and collections
○ Keep an eye on the replication lag
■ Increase oplog size?
○ Check the full topology
Monitoring: Availability

● Trending
○ Plot trends of key (performance) metrics
○ Create timelines of metrics
○ Correlate various metrics
○ Find problems before they arise
○ Pre-emptive problem management
● Trending tools
○ Granularity of sampling
○ More datapoints = better
Trending: why do we need trends?

Why do we collect data?

● Periodical (daily/weekly) healthchecks
● Insight into all aspects of the database operations
● Post mortem and proactive monitoring
● Capacity planning
Why do we collect data?

● Healthchecks are a pain
● You want to see aggregated data
● You want to be able to drill down to a particular host
● You want to see the most important data first and dig in later on
Healthchecks

● Ability to dig into past data
● Sampling granularity of less than 5s is possible (hardware-dependent)
● Fine granularity lets you catch an issue as it evolves, with no need to wait 5 minutes for a graph to refresh
Post mortem and proactive monitoring

● Graphs based on MongoDB status metrics
● Overall status and per-node graphs
● Ability to get time-shifted graphs, useful for comparing workload changes over time
Insight into internals, capacity planning

What metrics to collect from MongoDB?

● Quite similar to other database systems
○ Host metrics
○ Operational metrics
○ Storage engine metrics
○ Replication metrics
○ Shard metrics
Type of metrics to collect

● Similar to most other databases
● Understand the utilization of the machine
● Capacity planning
● Determine the type of an issue
○ I/O related?
○ CPU related?
○ Network related?
Host metrics: what for?

● CPU utilization (should I add more nodes to the cluster?)
● Network utilization (am I running out of bandwidth?)
● Ping (how badly does latency affect my MongoDB cluster?)
● Disk throughput and IOPS (am I within my hardware limits?)
● Disk space (do I have to plan for larger disks?)
● Memory utilization (do I suffer from a memory leak?)
Host metrics: what to look for?

● Similar to most other databases
● Throughput of the cluster
● Relate throughput to cluster performance
● Determine the type of an issue
○ Request spikes?
○ Write amplification related?
○ Queueing?
Operational metrics

● Storage engine specific
○ MMAP
○ WiredTiger
○ MongoRocks
● Insight into how the engine performs
● Internal congestion
Storage engine metrics

● Throughput of the replication
● Durability of the oplog
● Replication lag
● Cluster replication acknowledgement
○ Quorum based
○ At least one secondary needs to acknowledge
Replication metrics

Eventual consistency

● Shard chunks and balancing
○ Chunks per shard
○ Disk usage
● Non-sharded collections
○ Sharding has to be enabled on collection level
○ Non-sharded collections get a primary shard assigned
○ Once the primary shard is full, no writes can happen
● Connection pool (mongos)
○ All queries will be sent to the primary in a shard
○ Range queries will block connections of the connection pool
Sharding related metrics

Key MongoDB metrics to know about

● Oplog: a special collection containing all transactions
○ Limited in size (configurable)
○ Eviction of transactions (FIFO)
○ Comparable to a ringbuffer
● Used for replication
○ Secondaries copy transactions from the oplog on other nodes
○ Full data sync necessary once the last executed transaction has been evicted
● Replication window
○ Time between first and last transaction in the oplog
○ Time that allows your secondary to be offline before performing a full sync
Oplog

From the MongoDB CLI
mongo_replica_0:PRIMARY> db.getReplicationInfo()
{
"logSizeMB" : 1895.7751951217651,
"usedMB" : 419.86,
"timeDiff" : 281419,
"timeDiffHours" : 78.17,
"tFirst" : "Fri Jul 08 2016 10:56:01 GMT+0000 (UTC)",
"tLast" : "Mon Jul 11 2016 17:06:20 GMT+0000 (UTC)",
"now" : "Mon Jul 11 2016 17:15:06 GMT+0000 (UTC)"
}
Oplog: replication window
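The replication window can be derived from the first and last oplog timestamps. The following is a minimal sketch in plain JavaScript, assuming the timestamps are available as seconds since the epoch; `replicationWindow` is a hypothetical helper name, and the input values are taken from the `db.getReplicationInfo()` output above.

```javascript
// Hypothetical helper: derive the replication window from the first and last
// oplog timestamps, all given in seconds since the epoch.
function replicationWindow(firstTs, lastTs, nowTs) {
  var windowSeconds = lastTs - firstTs;
  return {
    windowSeconds: windowSeconds,
    // Rounded to two decimals, like timeDiffHours in getReplicationInfo()
    windowHours: Math.round((windowSeconds / 3600) * 100) / 100,
    // How long ago the newest oplog entry was written
    secondsSinceLastWrite: nowTs - lastTs
  };
}

// tFirst, tLast and now from the output shown above, written as ISO dates
var first = Date.parse("2016-07-08T10:56:01Z") / 1000;
var last = Date.parse("2016-07-11T17:06:20Z") / 1000;
var now = Date.parse("2016-07-11T17:15:06Z") / 1000;
var info = replicationWindow(first, last, now);
```

With these inputs the window matches the `timeDiff` and `timeDiffHours` fields reported by the shell, which is a useful sanity check before alerting on the value.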

From the ClusterControl advisor:
function getReplicationWindow(host) {
    var replwindow = {};
    replwindow['newset'] = false;
    // Fetch the first and last record from the oplog and take its timestamp
    var res = host.executeMongoQuery("local", '{find: "oplog.rs", sort: { $natural: 1}, limit: 1}');
    replwindow['first'] = res["result"]["cursor"]["firstBatch"][0]["ts"]["$timestamp"]["t"];
    // A freshly initiated replica set starts its oplog with this message
    if (res["result"]["cursor"]["firstBatch"][0]["o"]["msg"] == "initiating set") {
        replwindow['newset'] = true;
    }
    res = host.executeMongoQuery("local", '{find: "oplog.rs", sort: { $natural: -1}, limit: 1}');
    replwindow['last'] = res["result"]["cursor"]["firstBatch"][0]["ts"]["$timestamp"]["t"];
    // The replication window is the time between the first and last entry
    replwindow['replwindow'] = replwindow['last'] - replwindow['first'];
    return replwindow;
}
Oplog: replication window

● CPU, IO or lock related
● Outcome:
○ Secondary not used by Mongo client drivers
○ Puts larger strain on other secondaries
○ Less likely to be elected during a failover
■ If it is elected, the result could be disastrous
○ Lagging behind too far could cause a full sync
Replication lag

my_mongodb_0:PRIMARY> db.runCommand( { replSetGetStatus: 1 } )
{
…
"members" : [
{
"_id" : 0,
"name" : "10.10.32.11:27017",
"stateStr" : "PRIMARY",
"optime" : {
"ts" : Timestamp(1466247801, 5),
"t" : NumberLong(1)
},
},
{
"_id" : 1,
"name" : "10.10.32.12:27017",
"stateStr" : "SECONDARY",
"optime" : {
"ts" : Timestamp(1466247801, 5),
"t" : NumberLong(1)
},
},
…
],
"ok" : 1
}
Replication lag
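Replication lag can be estimated by comparing each secondary's optime with the primary's. The sketch below works on a plain object shaped like the `replSetGetStatus` output above; it assumes the seconds part of each timestamp is available as `optime.ts.t`, and both the function name `replicationLag` and the third member are hypothetical, added only to show a lagging node.

```javascript
// Assumption: each member carries optime.ts.t, the seconds part of its last
// applied oplog timestamp, as a plain number.
function replicationLag(status) {
  var primary = status.members.filter(function (m) {
    return m.stateStr === "PRIMARY";
  })[0];
  if (!primary) {
    return null; // without a primary there is nothing to compare against
  }
  var lag = {};
  status.members.forEach(function (m) {
    if (m.stateStr === "SECONDARY") {
      lag[m.name] = primary.optime.ts.t - m.optime.ts.t;
    }
  });
  return lag;
}

var status = {
  members: [
    { name: "10.10.32.11:27017", stateStr: "PRIMARY", optime: { ts: { t: 1466247801, i: 5 } } },
    { name: "10.10.32.12:27017", stateStr: "SECONDARY", optime: { ts: { t: 1466247801, i: 5 } } },
    // Hypothetical third member that is 30 seconds behind
    { name: "10.10.32.13:27017", stateStr: "SECONDARY", optime: { ts: { t: 1466247771, i: 2 } } }
  ]
};
var lag = replicationLag(status);
```

The result maps each secondary to its lag in seconds. Comparing that number with the replication window shows how close a node is to needing a full sync.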

● Like any other database: availability
● Client drivers may support connection pooling
○ Multiple non-blocking queries can use the same connection
○ Spawns new connections when the pool runs low on free connections
● Increase of connections
○ Locking issues
○ Application request bursts
Connections

mongo_replica_0:PRIMARY> db.serverStatus().connections
{ "current" : 25, "available" : 794, "totalCreated" : NumberLong(122418) }
From any mongo client
mongo_replica_0:PRIMARY> db.runCommand( { serverStatus: 1 } ).connections
{ "current" : 25, "available" : 794, "totalCreated" : NumberLong(122418) }
Connections
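A simple way to trend connections is to express `current` as a share of the total the server will accept (`current` plus `available`). This is a sketch only; `connectionUsage` and the 80% warning threshold are assumptions for illustration, not values from the original slides.

```javascript
// Hypothetical helper: turn the connections section of serverStatus into a
// usage ratio and a warning flag.
function connectionUsage(connections, warnAt) {
  var limit = connections.current + connections.available;
  var ratio = connections.current / limit;
  return { limit: limit, ratio: ratio, warn: ratio >= warnAt };
}

// Values from the serverStatus output above; 0.8 is an assumed threshold
var usage = connectionUsage({ current: 25, available: 794 }, 0.8);
```

For the output above this gives a limit of 819 connections with roughly 3% in use, so no warning is raised.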

● Atomicity on document level
○ WiredTiger and MongoRocks
● No “real” transactions
● Write data with the $isolated operator
○ Similar to READ UNCOMMITTED in MySQL (dirty reads in ANSI SQL)
○ No rollback
○ Does not work on shards
Transactions

Transactions
mongo_replica_0:PRIMARY> db.serverStatus().opcounters
{
"insert" : 1355272,
"query" : 20712,
"update" : 8995,
"delete" : 0,
"getmore" : 400791,
"command" : 2405749
}
From any mongo client
mongo_replica_0:PRIMARY> db.runCommand({serverStatus: 1}).opcounters
{
"insert" : 1355272,
"query" : 20712,
"update" : 8995,
"delete" : 0,
"getmore" : 400791,
"command" : 2405749
}
Transactions
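The opcounters are cumulative since the server started, so a trending system has to turn two consecutive samples into rates. The sketch below assumes two samples taken a known number of seconds apart; `opRates` is a hypothetical helper, the first sample is the output above, and the second sample is invented for illustration.

```javascript
// Hypothetical helper: convert two cumulative opcounter samples into
// operations per second. A negative delta means the counters were reset
// (for example by a restart), so no rate is reported for that operation.
function opRates(prev, curr, intervalSeconds) {
  var rates = {};
  Object.keys(curr).forEach(function (op) {
    var delta = curr[op] - prev[op];
    rates[op] = delta >= 0 ? delta / intervalSeconds : null;
  });
  return rates;
}

// First sample: the opcounters shown above
var prev = { insert: 1355272, query: 20712, update: 8995, "delete": 0, getmore: 400791, command: 2405749 };
// Second sample: invented values, assumed to be taken 10 seconds later
var curr = { insert: 1356272, query: 20762, update: 8995, "delete": 0, getmore: 400991, command: 2406249 };
var rates = opRates(prev, curr, 10);
```

Plotting these rates over time, rather than the raw counters, makes request spikes visible.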

● Three levels of (generic) locking
○ Global
○ Database
○ Collection
● Global lock hardly ever happens (full lock on MongoDB)
● Database locks occur when dropping a collection
● Collection locks occur mostly in MMAP
Locks

mongo_replica_0:PRIMARY> db.serverStatus().locks
{
"Global" : {
"acquireCount" : {
"r" : NumberLong(6050583),
"w" : NumberLong(2416551),
"R" : NumberLong(1),
"W" : NumberLong(7)
},
"acquireWaitCount" : {
"r" : NumberLong(1),
"w" : NumberLong(1),
"W" : NumberLong(1)
},
…
}
Locks (generic)
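One way to read the lock counters is to relate the number of acquisitions that had to wait to the total number of acquisitions per lock mode. This is a minimal sketch, assuming the `NumberLong` values have already been converted to plain numbers; `lockWaitRatios` is a hypothetical helper name.

```javascript
// Hypothetical helper: share of lock acquisitions that had to wait, per mode.
// Modes missing from acquireWaitCount never waited, so they count as 0.
function lockWaitRatios(lockStats) {
  var waitCounts = lockStats.acquireWaitCount || {};
  var ratios = {};
  Object.keys(lockStats.acquireCount).forEach(function (mode) {
    var waits = waitCounts[mode] || 0;
    ratios[mode] = waits / lockStats.acquireCount[mode];
  });
  return ratios;
}

// The Global section from the output above, as plain numbers
var globalLock = {
  acquireCount: { r: 6050583, w: 2416551, R: 1, W: 7 },
  acquireWaitCount: { r: 1, w: 1, W: 1 }
};
var ratios = lockWaitRatios(globalLock);
```

In this sample the intent locks (`r`, `w`) almost never wait, while one in seven exclusive global locks (`W`) did.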

● Optimistic concurrency control
○ If two write operations conflict, the transaction will be paused and retried
● Document level locking
● Tickets (threads)
○ Read
○ Write
Locks (WiredTiger)

mongo_replica_0:PRIMARY> db.serverStatus().wiredTiger.concurrentTransactions
{
"write" : {
"out" : 0,
"available" : 128,
"totalTickets" : 128
},
"read" : {
"out" : 0,
"available" : 128,
"totalTickets" : 128
}
}
Locks (WiredTiger)
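Ticket usage can be trended as the fraction of tickets currently handed out. The following sketch assumes an object shaped like the `concurrentTransactions` output above; `ticketUsage` is a hypothetical helper and the busy sample is invented for illustration.

```javascript
// Hypothetical helper: fraction of WiredTiger read and write tickets in use.
function ticketUsage(tickets) {
  var usage = {};
  ["read", "write"].forEach(function (kind) {
    usage[kind] = tickets[kind].out / tickets[kind].totalTickets;
  });
  return usage;
}

// Idle server, as in the output above
var idle = ticketUsage({
  write: { out: 0, available: 128, totalTickets: 128 },
  read: { out: 0, available: 128, totalTickets: 128 }
});

// Invented busy sample: most write tickets are handed out
var busy = ticketUsage({
  write: { out: 96, available: 32, totalTickets: 128 },
  read: { out: 16, available: 112, totalTickets: 128 }
});
```

A usage value that stays close to 1 suggests operations are queueing for tickets inside the storage engine.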

● MongoDB uses three tiers of cache
○ Filesystem
○ Active memory
○ Storage engine (WiredTiger / MongoRocks)
● Page faults
○ Cache miss
● Evictions
Cache

mongo_replica_0:PRIMARY> db.serverStatus().extra_info.page_faults
37912924
mongo_replica_0:PRIMARY> db.serverStatus().wiredTiger.cache
{
"bytes currently in the cache" : 887889617,
"modified pages evicted" : 561514,
"tracked dirty pages in the cache" : 626,
"unmodified pages evicted" : 15823118
}
Page faults and cache usage
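The eviction counters can be combined into a single figure: the share of evicted pages that were modified. This is a sketch under the assumption that a rising share of modified evictions points at write pressure on the cache; `evictionProfile` is a hypothetical helper name.

```javascript
// Hypothetical helper: total evictions and the share that were modified pages.
function evictionProfile(cache) {
  var modified = cache["modified pages evicted"];
  var unmodified = cache["unmodified pages evicted"];
  var total = modified + unmodified;
  return {
    totalEvicted: total,
    modifiedShare: total === 0 ? 0 : modified / total
  };
}

// Values from the wiredTiger.cache output above
var profile = evictionProfile({
  "bytes currently in the cache": 887889617,
  "modified pages evicted": 561514,
  "tracked dirty pages in the cache": 626,
  "unmodified pages evicted": 15823118
});
```

Like the opcounters, these counters are cumulative, so in practice the same calculation would be applied to the difference between two samples.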

● Shards make write scaling transparent
● Sharding can be solved with two methods:
○ Hash key distribution (limited)
○ Shard lookup table
● MongoDB uses a combination of hash key distribution and shard lookup table
○ Hash key (or range key) distribution gets divided into chunks (ranges)
○ The chunk metadata gets stored in the config server
● The config server is the most important component in a MongoDB sharded cluster!
● The shard router is the second most important component
● Shards can get out of balance
○ Non-sharded collections
○ Heavy / large writes on a single chunk
○ Auto balancing by the primary of the config server (3.4) or by mongos (3.2 and earlier)
Shard metrics

From the MongoDB CLI:
mongos> sh.status()
--- Sharding Status ---
…
databases:
{ "_id" : "shardtest", "primary" : "sh1", "partitioned" : true }
shardtest.collection
shard key: { "_id" : 1 }
unique: false
balancing: true
chunks:
sh1 1
sh2 2
sh3 1
From any mongo client:
mongos> use config
switched to db config
mongos> db.config.runCommand({aggregate: "chunks", pipeline: [{$group: {"_id": {"ns": "$ns", "shard": "$shard"}, "total_chunks": {$sum: 1}}}]})
{ "_id" : { "ns" : "test.usertable", "shard" : "mongo_replica_1" }, "total_chunks" : 330 }
Shard chunks and balancing
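Once the chunk counts per shard are collected, the imbalance can be summarised as the difference between the fullest and emptiest shard. The sketch below is plain JavaScript; `chunkSpread` is a hypothetical helper, and the input mirrors the chunk counts in the `sh.status()` output above.

```javascript
// Hypothetical helper: find the shards with the most and fewest chunks and
// the difference between them. Ties keep the first shard encountered.
function chunkSpread(chunksPerShard) {
  var result = { most: null, fewest: null, spread: 0 };
  Object.keys(chunksPerShard).forEach(function (shard) {
    var count = chunksPerShard[shard];
    if (result.most === null || count > chunksPerShard[result.most]) {
      result.most = shard;
    }
    if (result.fewest === null || count < chunksPerShard[result.fewest]) {
      result.fewest = shard;
    }
  });
  if (result.most !== null) {
    result.spread = chunksPerShard[result.most] - chunksPerShard[result.fewest];
  }
  return result;
}

// Chunk counts for shardtest.collection from the output above
var spread = chunkSpread({ sh1: 1, sh2: 2, sh3: 1 });
```

A spread that keeps growing while the balancer is enabled is a sign that writes are concentrated on a few chunks.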

From the ClusterControl non-sharded collection advisor:
use config;
// Collect the namespaces of all sharded collections
var shard_collections = db.collections.find();
var sharded_names = {};
while (shard_collections.hasNext()) {
    shard = shard_collections.next();
    sharded_names[shard._id] = 1;
}
// Walk every database and print the collections that are not sharded
var admin_db = db.getSiblingDB("admin");
dbs = admin_db.runCommand({ "listDatabases": 1 }).databases;
dbs.forEach(function(database) {
    if (database.name != "config") {
        db = db.getSiblingDB(database.name);
        cols = db.getCollectionNames();
        cols.forEach(function(col) {
            if (col != "system.indexes") {
                if (sharded_names[database.name + "." + col] != 1) {
                    print(database.name + "." + col);
                }
            }
        });
    }
});
Non-sharded collections

mongos> db.runCommand( { "connPoolStats" : 1 } )
{
"numClientConnections" : 10,
"numAScopedConnections" : 0,
"totalInUse" : 4,
"totalAvailable" : 8,
"totalCreated" : 23,
"hosts" : {
"10.10.34.11:27019" : {
"inUse" : 1,
"available" : 1,
"created" : 1
},
"10.10.34.12:27018" : {
"inUse" : 3,
"available" : 1,
"created" : 2
}
},
...
"ok" : 1
}
Connection pool
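The per-host figures in `connPoolStats` can be reduced to an in-use ratio for each backend. This is a minimal sketch, assuming an object shaped like the output above; `poolUsageByHost` is a hypothetical helper name.

```javascript
// Hypothetical helper: share of pooled connections in use, per backend host.
function poolUsageByHost(stats) {
  var usage = {};
  Object.keys(stats.hosts).forEach(function (host) {
    var h = stats.hosts[host];
    var open = h.inUse + h.available;
    usage[host] = open === 0 ? 0 : h.inUse / open;
  });
  return usage;
}

// The hosts section from the connPoolStats output above
var poolUsage = poolUsageByHost({
  hosts: {
    "10.10.34.11:27019": { inUse: 1, available: 1, created: 1 },
    "10.10.34.12:27018": { inUse: 3, available: 1, created: 2 }
  }
});
```

A host whose ratio stays near 1 is a candidate for the blocked-connection problem described in the sharding metrics.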

Available MongoDB monitoring tools

● Open Source
○ Nagios
○ Zabbix
● Subscription based
○ MongoDB Cloud Manager
○ VividCortex
○ ClusterControl
Alerting solutions

● Nagios-MongoDB
○ https://github.com/mzupan/nagios-plugin-mongodb/
○ Performs some very important checks
■ Replication lag
■ Lock time percentage
■ Index miss ratio
Nagios

● MongoDB Zabbix monitoring plugin
○ https://github.com/nightw/mikoomi-zabbix-mongodb-monitoring
○ All the necessary metrics and more
■ Entries in oplog
○ Pre-canned triggers
Zabbix

● Trending tools
○ Statsd/Grafana
○ Cacti
○ Zabbix
● Subscription based
○ MongoDB Cloud Manager
○ VividCortex
○ ClusterControl
Trending solutions

● Percona MongoDB Monitoring Templates
○ https://www.percona.com/doc/percona-monitoring-plugins/1.1/cacti/mongodb-templates.html
Cacti

● PMM
○ https://www.percona.com/doc/percona-monitoring-and-management/
○ Open Source Monitoring & Management framework
○ Can deploy, manage and monitor MySQL & MongoDB
○ Uses Prometheus and Grafana
Orchestration systems: Percona Monitoring & Management

Percona Monitoring & Management sessions:
● MySQL Monitoring with Percona Monitoring and Management, Tue 11:30 - 12:20 in Ballroom E
● Hipster MySQL Monitoring: Serving a deconstructed PMM, Tue 11:30 - 12:20 in Ballroom H
● Monitoring production environment with Percona Monitoring and Management (PMM), Thu 3:00 - 3:50 in room 209
Orchestration systems: Percona Monitoring & Management

How to monitor MongoDB using ClusterControl

● ClusterControl
○ http://www.severalnines.com
○ Deploy Mongo shards & replicasets
○ Monitor and trend
○ Manage configuration and backups
○ Scale
○ Community edition
Orchestration systems: ClusterControl

Easily deploy and import MongoDB replicaSets and Shards

Monitor and trend

Cluster management

Scale replicaSets and Shards

Convert replicaSet into a Sharded cluster

Q & A

● Blog series: Become a MongoDB DBA
○ http://severalnines.com/blog-categories/mongodb
● Webinar series: Become a MongoDB DBA
○ http://severalnines.com/upcoming-webinars
● Visit our website for more resources!
○ http://www.severalnines.com
● Stop by our booth in the exhibit hall
● Other sessions by Severalnines at Percona Live 2017
○ MySQL Load Balancers - MaxScale, ProxySQL, HAProxy, MySQL Router & nginx - a close up look, Wed 11:10am - 12:00pm in Ballroom D
○ MySQL (NDB) Cluster Best Practices (Die Hard VIII), Wed 3:30pm - 4:20pm in Room 210
Additional resources
