Scale out a mongod node
Senior Cyber Intelligence Analyst with:
Lockheed Martin Computer Incident Response Team
Network defense for Lockheed Martin
My Background
3 years working in NASA’s Mission Control Center in Houston, TX
Mission Focused
International Collaboration solving problems
3 years working for the Computer Incident Response Team
Mission Focused
International Collaboration
This is a story of lessons learned building a distributed enterprise monitoring framework, specifically the data storage and retrieval subsystem, and of beating up MongoDB until it gave us the performance we wanted. And we got it to work.
What IS:
Ability To Extract Information
Tap a network
Largely Technical
Beyond typical NIDS/deep packet inspection
Garner Intelligence from that Information
Query focused data sets
Influence
Starts with an Application to apply the intelligence to the information to extract and store metadata.
Make actionable decisions (block a malicious email)
The whole thing wrapped up is just a tool for a person. At the end of the day, its main job is to Enable Critical Thinking.
Do NOT want users simply following a standard process as that hides authority and accountability (the process said so)
The Primary focus of this capability is critical thinking enablement. Allow the user to flow uninterrupted thoughts.
Data is everywhere. Bandwidth is limited
We don’t have infinite storage, and even if we did, old data is better suited as a compressed blob on tape than as entries in a queryable data store
Which leads to Access.
Tools must support the user, and the user both wants and needs a simple way to access the data. I’ll make note that simple does not necessarily mean easy.
MongoDB scales out Extremely well inside a data center.
<click>
There’s endless resources available for building a mongodb cluster in a data center
One option would be to pull all your information back to the data center
<click>
However:
<next slide>
In our use case it’s simply unreasonable to pull all information back to the data center. The sheer volume of information would overwhelm the link back to the data center.
It’s sometimes unreasonable to pull even the METADATA back to the data center
Sometimes we have the case where metadata happens to contain more information than the raw data itself.
Move your technical influence to the data.
For each information source, ship a “pizza box”
Store your metadata Locally on that “pizza box” to minimize wan traffic.
Make the data available for querying from home base.
Don’t levy a requirement that Field Engineers be MongoDB certified DBAs
Continuing the background:
Each pizza box looks at information, applies intelligence from another field, and takes appropriate actions.
This could be active such as blocking certain network traffic
This could be passive such as alerting operators of fishy information transiting a gateway
Of course, the action could be generate and store the metadata for later analysis.
Inside each one of these pizza boxes is a fully contained system that can process data, drive actions, and most importantly (for this audience) store the schema-less metadata.
Let’s return to the single pizza box
WHAT volume of information can a single pizza box support?
More importantly, how many boxes do I need to buy to support X amount of data?
How fast can I pump the orange circles while still being able to take the minimal “log” action.
One way to measure throughput is to count the number of orange circles I can push through the pizza box every second.
To be consistent with Mongo terminology, I’ll refer to this as documents per second from here on out.
This number is highly dependent on a large number of configuration details. Deliberately, I’ll describe it in relative terms rather than in absolute terms.
Obviously, we want to maximize the documents/second the database can support.
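One way to sketch such a measurement: time a run of batch inserts and divide total documents by elapsed seconds. The snippet below uses an in-memory list as a stand-in for the insert call (the function names and batch sizes are illustrative); point it at a real collection's insert to benchmark an actual mongod.

```python
import time

def docs_per_second(insert_batch, batch, rounds=5):
    """Measure sustained insert throughput: total documents / elapsed seconds."""
    start = time.perf_counter()
    total = 0
    for _ in range(rounds):
        insert_batch(batch)      # swap in e.g. a real collection's insert call
        total += len(batch)
    elapsed = time.perf_counter() - start
    return total / elapsed

# In-memory stand-in so the sketch runs anywhere
store = []
rate = docs_per_second(store.extend, [{"n": i} for i in range(1000)])
```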
What is Mongo’s recommendation?
The consensus
–pause—
is that ALL DATA should fit into RAM.
--pause--
If you can’t fit the data in RAM, you should at least keep your indexes in RAM.
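That consensus can be written down as a simple decision rule. This is a sketch: the function name and return labels are my own, not a MongoDB API.

```python
GIB = 2**30

def working_set_advice(data_bytes, index_bytes, ram_bytes):
    """The consensus recommendation as a decision: best case, ALL data fits
    in RAM; failing that, at least keep the indexes in RAM."""
    if data_bytes + index_bytes <= ram_bytes:
        return "all data in RAM"
    if index_bytes <= ram_bytes:
        return "indexes in RAM"
    return "working set exceeds RAM: scale up or segment the data"
```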
Throw new boxes into the cloud
Self Managing:
Put each pizza box into the cloud
<click>
Should be as simple as copy/paste to add new nodes into the cloud
<click>
The cloud should also allow nodes to come online/offline at will
<click>
This could be for standard maintenance, or even part of the operational concept.
Perhaps there is a node that is intentionally available during sunny hours only
<click>
I found there was no feasible method to manage this cluster using ‘standard’ MongoDB methods.
Note ‘standard’: there were tons of interesting ideas floating around the community; this presentation covers one.
In a dream world, the rate at which I can throw documents at MongoDB would not be in any way related to total database size.
Because I enjoy realistic dreams, I’ll acknowledge that the throughput can’t be infinite.
As the total data size goes up
<click>
As time goes on
–point to x-axis—
It would be great if the data size
--point to y-axis–
grew nice and linearly.
It would be really nice if the documents per second would stay constant throughout
<click>
--point to other y axis--
It should be able to maintain a high insert rate regardless of disk size.
As I’m sure you expect, the reality isn’t as promising.
Again, read this as a behavior, not a benchmark
This curve is a general behavior
Think about it! This really isn't that bad, once you get past the initial dropoff, it becomes “roughly” linear. It isn’t exactly what we wanted, but we might be able to work with this.
Of course you can alter the data or indexes and change what is meaningful to you, but the general trend will still look like this.
Of course you can “lift up” the green line (the document throughput) by adding memory sufficient to fit all data and indexes into RAM. You can use faster spinning disks, splurge for solid state drives, or even look at more exotic options such as NAND flash.
At the end of the day, you are scaling !UP! In order to accomplish the performance increase.
You can scale out by making each pizza box itself a sharded cluster with a handful of mongods, a config server, and a mongos, but MongoDB won’t let you stack a mongos on top of another mongos. You’d have to write your “global” mongos from scratch anyway.
MongoDB scales well inside the data center, but not so well in the field.
The big deal here is the ever increasing disk size required. We need a simple way to manage actual disk utilization.
Again, In a perfect world, the rate at which I can throw documents at MongoDB would not be in any way related to total database size.
As the total data size goes up, and the capped collection kicks in:
<click>
Remember, this is real utilized disk space and we are running the mongod on real, physical hardware. You must be absolutely, 100% sure the database size always fits in the allocated disk space.
There is VERY little room to “accidentally” raise the blue line. That fact makes it very difficult to use TTL collections, as you may not know ahead of time the size of the data and when it is appropriate to expire/remove the data while optimizing data retention periods.
We’re engineers… we know things in life aren’t free, so we know something interesting is bound to happen here
<click>
The point where mongo transforms from growing with the data to overwriting itself.
But we aren’t sure what yet
<click>
It would be really nice if the documents per second would stay constant.
<click>
With a little bit of hackery, we got it.
BUT… remember, we just settled for the diminished throughput because we couldn’t afford to scale up to the point our data would fit in RAM, making it look more like this
<click>
It should be able to maintain a high insert rate regardless of disk size, and regardless of retention methods.
As I’m sure you expect, the reality isn’t as promising.
<click>
Remember, the Mongo database is still tied to actual physical disk space.
Let’s assume we have a constant stream of information.
It is happily filling up the database according to the graph on the previous page.
<click>
Eventually, the database is going to fill up. What should we do?
For lack of a better idea, we might as well grab the oldest data
<Click>
And throw it in the trash
<click>
to make room for the next document.
<click>
This particular process happily plays itself out over and over again, and is actually built right into Mongo as a “capped collection”.
The capped collection also has a cousin, the TTL index on documents, which likewise manages document retention.
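For reference, both retention mechanisms are a one-time setup. This sketch builds the server command documents you would hand to MongoDB's `runCommand`; the collection name, size cap, and TTL window here are placeholder values.

```python
def capped_create_cmd(name, size_bytes):
    """The MongoDB 'create' command for a capped collection: once size_bytes
    is full, the oldest documents are overwritten by new inserts."""
    return {"create": name, "capped": True, "size": size_bytes}

def ttl_index_cmd(coll, field="ts", seconds=7 * 24 * 3600):
    """The 'createIndexes' command for a TTL index: documents whose `field`
    timestamp is older than `seconds` get removed by a background task."""
    return {
        "createIndexes": coll,
        "indexes": [{
            "key": {field: 1},
            "name": f"{field}_ttl",
            "expireAfterSeconds": seconds,
        }],
    }
```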
Pause - Absorb
As mongod enters the phase where it is managing the roll-off of old data,
<click>
It experiences a huge penalty “going off the cliff” when capped collection kicks in
<click>
I don’t intend to deep dive into how MongoDB extents are laid out for the documents and “next” pointers as they exist on disk.
But at a high level, Remember for each and every new document or batch coming into the database, mongo may be responsible for overwriting one or more documents from the database.
More importantly to this cliff, each and every new document or batch of documents containing an indexed field must be built into Mongo’s B-TREE indexing.
EVEN more disastrous, each document or batch of document being trashed must be REMOVED from the B-TREE index.
How bad is it that you pay a performance penalty… to permanently REMOVE data from your index?
Of course you can “lift up” the green line (the document throughput) by scaling up.
Add memory to fit all data in RAM.
use faster spinning disks,
splurge for solid state,
or buy exotic NAND flash.
One of the great things about Mongo is that it scales out well in the data center, but it seems to require scaling UP within the node.
Querying the Cloud:
Still want this cloud to look and function like a query to MongoS.
<click>One query in,
<click>
one result set out.
Again, this cloud should scale-out like MongoS. However, it needs to perform the scale-out without the nodes talking to, or knowing about each other.
If a node is unreachable or unresponsive, the system should SIMPLY tell me about it. Let ME, the user decide what to do with that knowledge and whether to accept the data, or query again at a different time.
<click>
I envision a result set similar to this, a list of nodes that responded, a list of nodes that didn’t, and the documents matching the query.
This is an extremely simplified vision, but nevertheless provides the behavior we desire.
It would be even better if the cloud could tell me WHY a node didn’t respond.
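A minimal sketch of that result envelope. It runs sequentially here for clarity; a real implementation would dispatch to the nodes in parallel with per-node timeouts, and `run_query` is a hypothetical per-node query function, not a MongoDB API.

```python
def scatter_gather(nodes, query, run_query, timeout=2.0):
    """Fan a query out to independent nodes. One failed or slow node must not
    block the result: report it, return everything else, let the USER decide."""
    responded, failed, docs = [], {}, []
    for node in nodes:
        try:
            docs.extend(run_query(node, query, timeout))
            responded.append(node)
        except Exception as exc:          # unreachable, overloaded, timed out...
            failed[node] = str(exc)       # ...and WHY it didn't respond
    return {"responded": responded, "failed": failed, "documents": docs}
```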
Until ‘recently’, there wasn’t a good way within MongoS to allow queries to complete if one of the target shards is unresponsive.
One “slow” node should not hamper the entire cluster. If it is slow/overloaded, tell me, but don’t hold up results from the rest of the cloud.
If we can achieve this system, we will have successfully architected a scale-out cloud MongoDB cluster
Let’s go back to critical thinking
Tools MUST Support The Analyst
Let’s think about the standard information retrieval that you might expect out of an O-L-T-P data store in a security context.
You get an IP address to think about.
Dramatic pause… does everyone have it? Dot 2-4-7?
Your O-L-T-P gives you back some data
<click>
Pause a moment
Let that sink in: after 1.0 seconds, you are already beginning to forget WHY you were looking for this specific address.
You moved into turning-the-crank rather than critical thinking
Extending that for a moment
<click>
Pause for reading
10 seconds.
45 plus years ago we solved the riddle.
10 seconds waiting for an answer, and our ability to maintain critical thinking with a problem-solving mindset is lost.
10 seconds.
What does this tell us?
In the out-of-the-box mongo world, new data is always streaming into one giant database that holds all the data. As time goes on, mongo happily manages your extents, adding disk size as needed.
<click>
If you use the capped collection, it allocates all your disk space up front, and happily overwrites old data with new data when the time comes.
<click>
We are already assuming the data size is greater than RAM size. Probably close to disk size.
If you can make the assumption that you are –more-- interested in recent data –this one here at the end-- than you are in historical context, there isn’t a great reason to query the entire database,
<click>
just to get this data.
Remember there is some psychological value gained by ensuring the database can field queries within 10 seconds, or better yet, 1 second.
Rather than send one query to one database, if we can segment the data into buckets
Again, this hinges on the assumption that whatever you search for, you would want recent results –these up here– returned first.
Rather than dispatch the query to the entire database which is far outside ram at this point, you can dispatch the query to each segment.
<click>
In theory, this most recent bucket is “warmed”, so I’ve noted it in RED.
You can make the assumption that this bucket is taking a stream of insert operations. So… it’s fair to assume that the host OS allows it to remain in RAM.
The other buckets may not be so lucky and reside only in virtual memory.
I hope we agree, bucketing the data makes sense.
What's the best way to technically bucket the data inside mongo?
<click>
Make each bucket its own database!
Here comes the challenge. How do we manage databases as time goes on?
We need a way to generate new buckets as the previous bucket fills up.
<click>
As time goes on, the generator is responsible for creating new empty buckets
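A sketch of the generator's routing logic. The naming scheme and doc-count trigger are illustrative only; a real trigger would watch bytes on disk, and each name corresponds to a separate MongoDB database.

```python
class Generator:
    """Route inserts to the current bucket (database); when the bucket
    fills up, create a new, empty one and route there instead."""

    def __init__(self, max_docs, prefix="meta"):
        self.max_docs, self.count, self.seq = max_docs, 0, 0
        self.prefix = prefix

    @property
    def current(self):
        # Sequential names keep sort order == age order for the destructor
        return f"{self.prefix}_{self.seq:06d}"

    def route(self, doc):
        if self.count >= self.max_docs:   # previous bucket is full
            self.seq += 1                 # generator creates a new empty bucket
            self.count = 0
        self.count += 1
        return self.current               # the database to insert `doc` into
```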
<click>
One easy way to scale-out mongod is to tightly manage the working set and make sure it stays in RAM.
Since only one bucket is “warmed” with the constant stream of inserts, that entire database is the current working set.
This means that only a single database, the current “warm” one, must remain in RAM; others may go in and out with queries/map reduces at the host OS’s discretion.
Rather than one giant database, we focused on forcing the entire active bucket or database to stay in memory by making the bucket size smaller than RAM size
<click>
As each bucket fills up, the generator creates a new empty bucket and begins routing inserts to the new bucket.
<click>
But remember, we are still tied to physical disk space, and that space will fill up eventually.
We need to start trashing data
<click>
Similar to a capped collection, we should throw away the oldest data first, but rather than throw away 1 or 2 documents at a time, lets throw away the whole bucket to clean up space quickly.
<click>
We have successfully cleaned up disk space to allocate a new bucket.
We need a way to automate this process so we don’t manually curate the buckets.
<click>
Eventually, the field of buckets will fill, and the system will enter a stage where it needs to delete old data to make room for new data.
As each bucket gets close to full, a destructor will come by and delete the next oldest bucket to make room for new data.
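The destructor's core decision is simple: sort buckets by age and pick the oldest until we are back under the limit. A sketch (it leans on the generator's chronological naming, and each victim maps to a single dropDatabase call rather than per-document deletes):

```python
def destructor(buckets, keep):
    """Return the oldest buckets to drop so at most `keep` remain.
    Dropping a whole database frees space in one cheap operation instead
    of removing documents one at a time from a B-tree index."""
    return sorted(buckets)[: max(0, len(buckets) - keep)]
```

Each returned name would then be dropped, e.g. `client.drop_database(name)` in a driver.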
It looks something like this:
<click>?
It does rely on disk IO, as the database is not “clean” until the dropDatabase command is executed.
Generator creates new buckets which are big enough to hold a “substantial” amount of data.
Substantial is a subjective measure that will need to take into account the hardware Mongo is running on.
This bucket must be small enough so that the entire bucket, data and indexes and all, fits into RAM.
It should actually be small enough so that, in addition to completely fitting in RAM, at least one other bucket can be paged into RAM as well, so concurrent queries aren’t fighting for physical resources.
It should be big enough such that you get a marked improvement in your “capped” collection.
Remember, the “floor” of this concept is almost as inefficient as the capped collection. If you make each bucket only big enough to fit one document, you are really just trading the B-TREE index manipulation for destructor disk IO
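Putting the sizing rules into arithmetic (a sketch: the one-quarter fraction reflects where our system ended up, per the tuning notes later; your hardware will land somewhere different):

```python
def bucket_size_bytes(ram_bytes, fraction=0.25):
    """Size each bucket as a fraction of RAM: small enough that the warm
    bucket plus one paged-in historical bucket fit in RAM together, big
    enough that whole-bucket drops stay far cheaper than per-doc deletes."""
    return int(ram_bytes * fraction)

def bucket_count(disk_bytes, bucket_bytes):
    """How many buckets fit on the allocated disk (floor: never over-allocate,
    the database must ALWAYS fit in the physical disk space)."""
    return disk_bytes // bucket_bytes
```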
Working set
Only one database is “heated” with inserts
Only one database must be in RAM, others may go in and out with queries/map reduces.
Again, In a perfect world, the rate at which I can throw documents at MongoDB would not be in any way related to total database size.
As the total data size goes up, and the capped collection kicks in:
<click>
Remember, this is real disk backing this, so we must be absolutely 100% sure the database size always fits in the allocated disk space.
We don’t have room to “accidentally” raise the blue line.
Same from before. It would be really nice if the documents per second would stay constant.
<click>
Remember, we just settled for the diminished throughput because we couldn’t afford to scale up the node to the point our data would fit in RAM, making it look more like this
<click>
Then we agreed that we really like the capped collection behavior, which made our graph look more like this.
<click>
Apply a little bit of customization and we help mongo keep the most recent, dare I say the most important data in RAM.
We helped Mongo expire data, not on a per document level where it was forced to manage its giant B-TREE index, but on a “bucket” level, allowing Mongo to throw its data to the OS for removal.
It’s noisy, but in general it maintains performance regardless of data size and whether it is working to prune data.
With only a little bit of effort, we scaled-out a single mongod, without scaling-up any of the physical resources.
The system hums along happily when each bucket is sized somewhere between one-quarter to one-third RAM.
This is MongoR implementation specific!
Follow standard Mongo practices like fstab parameters and numa control first.
Assumes the client application itself isn’t a HEAVY user of RAM.
Our implementation has the client application poll MongoR, checking if there is a “new” database handle.
If there are many client applications on each box, their polling may not be aligned, and whenever that rotate happens, there may be 2 ‘active’ buckets for a period of minutes or hours, depending on how often the clients poll for a new database.
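A sketch of that client-side poll. The `get_active_bucket` callable stands in for however a client asks MongoR for the active database name; the real MongoR interface may differ.

```python
class MongoRClient:
    """Cache the active database handle; swap only when a poll reports
    that MongoR has rotated to a new bucket."""

    def __init__(self, get_active_bucket):
        self._get_active = get_active_bucket   # hypothetical call to MongoR
        self._handle = get_active_bucket()

    def db_for_insert(self):
        active = self._get_active()
        if active != self._handle:             # rotation happened since last poll
            self._handle = active
        return self._handle                    # database name to insert into
```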
There needs to be headroom for queries / aggregation / map reduce for the historical data to be pulled up into RAM without kicking the “warm” data out of RAM.
Remember the sawtooth pattern
We set the rotation to occur overnight to have minimal impact around the possibility of 2 active buckets.
MongoDB allocates space in 2GB increments, so if the system absorbs more than 2GB between rotation checks, it could go “over” the allocated space.
We found no limit to the number of buckets. We run about 100 30GB buckets per pizza-box.
What do we want?
I don’t think MongoR is the end-all solution to this problem. When I built this system, I got the feeling that a lot of people had this problem, but everyone dealt with it separately.
We formed a great relationship with our MongoDB contact who gave us enough hints that we should be concerned with our working set.
This behavior is valuable to us, I hope the behavior is valuable to others. The best case scenario is that we convince MongoDB to build a behavior set like this directly into MongoDB so I can abandon my implementation.