M202 July 2014 Final weeks

The first few videos of week 7 covered write concern and some important changes to basic parts of the system since version 2.4. It is good these subjects are explicitly covered, because MongoDB is going through a lot of functional upgrades. Without the emphasis, you might not realize there is now a more elegant way to do things.

My personal opinion is that write concern should hold no mystery this late in the course. Like the other topics, it’s information you could find in the release notes.

Snapshot of operations

The next topic of the videos was interesting, but could have had more depth to it. A video was spent on the database command currentOp, which will spew data on the operations of the current mongod process. This command generates a lot of data, but the most important thing to remember is that it is a snapshot in time. The videos went over some basic scenarios on how to use the command and the reliability of the information.
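As a small sketch of working with that snapshot (the threshold and printed fields are my own choices, and this requires a connection to a running mongod):

```javascript
// Hypothetical mongo shell fragment; requires a connection to a mongod.
// Print operations from the currentOp snapshot running longer than 2 seconds.
db.currentOp().inprog.forEach(function (op) {
  if (op.secs_running && op.secs_running > 2) {
    printjson({ opid: op.opid, ns: op.ns, secs_running: op.secs_running });
  }
});
```

Because the output is just a snapshot, running this twice can give completely different results.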

This is all nice, until your system is under heavy load and capturing the problem becomes too difficult for a human with a simple shell. That’s when you need tools that can capture and process data faster than most humans can.


During the course there were some videos on mtools. If you didn’t know it’s related to MongoDB, you would stumble onto several other software tools and packages with that name. This gets confusing when you search your favorite package manager: you will probably not find the tool you’re looking for.

The mtools mentioned in this course are a collection of tools written in Python to simplify several tasks when experimenting with MongoDB or analyzing its output. This software isn’t officially supported by MongoDB Inc., but was written by a current employee and found its way across his peers.

The source and installation instructions can be found on GitHub. You might have some trouble setting up the software correctly, because some library dependencies are not pinned to a particular version or need to be forcefully updated. This can be very frustrating and give the impression the tools don’t work, whereas installing the right versions of all dependencies will make them work as intended.

So what can mtools do? It can set up basic architectures of MongoDB systems. For instance, its interface for setting up a sharded cluster is quite nice. The course material demonstrated it and I prefer the mtools way to the mongo shell’s ShardingTest object. Both mtools and the mongo shell can only start processes on the local system, which is good enough for experiments.

It can also filter, combine and plot log data. There are a lot of tools that can do that, but mtools doesn’t need tweaking to understand MongoDB’s log format and comes with some default reports that make sense. That being said, there are definitely tools that have a sleeker appearance, create more attractive reports or just integrate better with other tools.

If I were faced with an immediate problem in a MongoDB cluster, churning through the logs with mtools would be my preference. MongoDB Management Service would be a second choice, because it takes some time for data to show up and it’s not as detailed. In time I would look for other tools that specifically suit my needs.

M202 July 2014 Fifth and sixth week

My first plan for this series was to write a blog post for each week. That didn’t work out in the past, and because both the fifth and sixth week cover sharding, I decided to make one big post on sharding.

Sharding is used to scale horizontally. Instead of upgrading a MongoDB replica set with more computing or network capacity, you add more replica sets and divide the load among them. Each replica set will have a shard of the entire data. It’s not mandatory to use replica sets for your shards, single servers will do. However, when you consider sharding it’s very likely you want the redundancy and failover of replica sets as well. The reason for this is that once a shard fails and becomes unreachable, the data in that shard will be unavailable for both reads and writes. With MongoDB’s failover functionality this becomes less of an availability problem.

During previous courses the basic steps for sharding databases and collections in MongoDB were already explained. The M202 course assumes you won’t need to dig deep for this knowledge, but if you do, there’s always the tutorial in the MongoDB documentation. In my opinion sharding data got a little bit easier in version 2.6. The first step is to orchestrate all the pieces of the infrastructure, which can be simplified with configuration management tools. After that you need to explicitly enable sharding on databases and collections.
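As a reminder of those explicit steps, run against a mongos (the database, collection and shard key names here are made up for illustration):

```javascript
// Hypothetical mongo shell fragment, run while connected to a mongos.
sh.enableSharding("mydb")                         // allow sharding for the database
sh.shardCollection("mydb.users", { user_id: 1 })  // shard the collection on a chosen key
```

Until both commands have been issued, data simply stays on the primary shard of the database.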

Mongos process

The recommended way to communicate with the cluster is through one of the mongos processes. These processes perform several tasks in the system. The most important task is to act as a proxy for connections made to other parts of the system. When clients query through a mongos process, their queries will be routed to the right shards and the results will be merged before returning to the client. This makes the cluster behave as one big database, although there are limits on the amount of data a mongos process can reasonably sort and process.

Another important task of the mongos process is to provide an interface to manage the cluster. Through one of the mongos processes you are able to configure which databases and collections are sharded, what their shard key is and how data is distributed between shards. Once data is in the cluster, any mongos process is able to push it around by updating the cluster configuration and kicking off the necessary commands on the shards. In order to keep the configuration up to date these mongos processes communicate with the config servers.

Config servers

Config servers are separate mongod processes dedicated to holding metadata on the shards of a cluster. This metadata tells mongos processes which databases and collections are sharded, how they are sharded and several other important details. Typical production environments will have three config servers in the cluster. This number coincidentally matches the recommended size of replica sets, but config servers never form a replica set. Instead they are three independent servers synchronized by mongos processes. All instances should be available; if one of them goes down, the metadata becomes read-only in the cluster.

The videos put emphasis on monitoring config servers. Metadata of a cluster is kept in a special database you’re never supposed to touch unless instructed to do so. The videos make this very clear, and in general it’s a good idea, because a manual change might make one config server’s data out of sync with the rest. If the metadata can’t be synchronized to all of them, then data chunks can’t be balanced and the config server that’s out of sync will be considered broken. Manual intervention is needed to fix a failing config server.

Using ShardingTest

One of the most useful improvements to the Mongo shell that caught my attention was the ability to set up a sharded cluster with relative ease using a ShardingTest object. Of course, this is only for testing purposes, because the complete cluster will be running on your current machine. During the fifth and sixth week of the course I made a habit of using tmux with a vertical split, so that on one side I could run the test cluster and on the other I had my connection to the mongos to try out commands.

First, you need to start the cluster. You start by invoking the mongo shell without a connection to a database (mongo --nodb). Then do something similar to the code below, which I took from the course and formatted a bit, to start a small cluster for testing purposes.

config = {
  d0 : { smallfiles : "",
         noprealloc : "",
         nopreallocj: "" },
  d1 : { smallfiles : "",
         noprealloc : "",
         nopreallocj: "" },
  d2 : { smallfiles : "",
         noprealloc : "",
         nopreallocj: "" } };
cluster = new ShardingTest( { shards : config } );

Once you hit return on that last line your screen will fill up with lots of logging, which most of the time is just the different nodes booting and connecting into a sharded cluster. You could wait until the stream slows down, but the mongos node might already be running at port 30999.

When you’re done, interrupt the mongo shell where you started the cluster and it will shut down everything. I tried scripting this and making the mongo shell execute it, but once it hits the end of the script it shuts down the cluster. I see no other way than running this while keeping the mongo shell open.

M202 July 2014 Fourth week

The fourth week of the course covered fault tolerance and availability. The main purpose of the week was to highlight and deal with network traffic to the MongoDB cluster or replica set.

Availability during maintenance

The week started with a few videos on doing maintenance on a replica set. One such task would be updates to the MongoDB server, perhaps an upgrade from 2.4 to 2.6. Depending on the application requirements and permitted downtime, pulling down nodes in a production cluster for maintenance proves more complex than I thought.

The key to minimising the impact of nodes going down is proper planning. It starts with scheduling the update during a maintenance period. Then you make a backup plan, a good practice which usually seems a waste of time until you need to roll back. Once the time is right, you roll through your nodes and perform the updates. It’s important to do the primary last. Stepping down the primary node should ideally be the most painful step of the entire operation, because there will be downtime until a new primary is elected.

After watching the videos I believe the following steps will yield the best result:

  1. Log in to the node
  2. If this is a primary, step it down and wait for the election to finish
  3. Issue a db.shutdownServer() command
  4. Do maintenance; there are several paths but the most important thing is that during maintenance the node must not be started with production parameters. This means a different port, but also that it shouldn’t get the --replSet parameter.
  5. Restart the node with production parameters
  6. Wait for it to catch up and move on to the next node

These steps should get you through any maintenance on live systems without worrying too much about the details of availability to the replica set.
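Steps 2 and 3 above can be sketched in the mongo shell like this (the step-down timeout is an illustrative value, and this needs a connection to the node in question):

```javascript
// Hypothetical mongo shell fragment, run while connected to the node.
if (db.isMaster().ismaster) {
  rs.stepDown(120);  // step down; refuse re-election for 120 seconds
}
db.getSiblingDB("admin").shutdownServer();  // cleanly stop the mongod process
```

Note that rs.stepDown() closes the shell’s connection, so expect an error message there even when it worked.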

Then a frequently asked question was answered: can you use a load balancer in front of MongoDB? From what I understood, you can try, but most of the time this will cause problems. Clients that started a session, for instance in the middle of reading all the data from a cursor, will get an error because this kind of state isn’t shared between nodes. I ground my teeth and went on with the next videos.

Managing connections

What followed was a bunch of videos about connection options in various drivers, with a focus on the Java driver. Those videos made me doze off a little, so when I got to the videos about managing connections I was shocked by how quickly connections and memory usage (every connection reserves roughly 1 MB) stack up.

Connections come in from everywhere. There are clients connecting to the primary node, but in some configurations to secondaries too; there are heartbeats between all nodes in a replica set; there are a few connections between a node and the node it is synchronizing from; finally, a monitoring agent might connect to all nodes as well. Most of this traffic will go to the primary, because clients default to it for reads and definitely connect to it for writes.

In a cluster there is even more traffic: connections from mongos processes, connections to config servers and connections between shards to balance out the data add up quickly. More connections mean more network traffic and more memory needed to manage the connections. This means less memory for data. A new problem has arisen. Then the videos were so kind to mention that once heartbeat responses are slow, elections might be held and the new primary might get as overwhelmed as the old one did. Scary stuff.

This week taught me the way to get more predictability is by enforcing a maximum number of connections on the mongos nodes. By assigning an explicit value, the mongos process will act as a sort of safeguard for the system and prevent the number of connections to a primary from being a factor in availability problems.

During the course a formula was given to calculate the setting for the mongos process parameter. The presented formula is as follows:

(desired max. connections on primary - (number of secondaries * 3) - (number of other processes * 3)) / number of expected mongos processes

Where the variable for other processes is defined as the number of processes other than mongos, mongod, config server or mongo client. Basically this means any monitoring agents.


The first thing I didn’t like was the magic number 3. It popped out immediately and I couldn’t think of any other reason than it standing for the number of nodes in a replica set. Then there’s probably a reasonable amount of what, for lack of better words, I call dark connections. They’re network connections in the system but not something covered in the material of the course. I guess that’s why there’s mention of a safeguard percentage on the first variable in the formula.

If you follow the recommendations in the homework assignment, you end up with 90% of the desired max. connections as your first variable. It’s a rough guess to have at least some buffer. In the end you should adjust the parameters based on what your monitoring tools tell you.
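As a sketch of the arithmetic (the function name and all numbers below are my own, purely illustrative), this is how the formula works out once the 90% buffer is applied:

```javascript
// Sketch of the course formula with a 10% safety buffer on the first term.
// All names and numbers are illustrative assumptions, not recommendations.
function maxConnsPerMongos(desiredMaxOnPrimary, secondaries, otherProcesses, mongosCount) {
  var budget = desiredMaxOnPrimary * 0.9;  // keep ~10% headroom for "dark" connections
  budget -= secondaries * 3;               // overhead attributed to each secondary
  budget -= otherProcesses * 3;            // e.g. monitoring agents
  return Math.floor(budget / mongosCount); // connection limit per mongos process
}

// Example: cap of 10000 on the primary, 2 secondaries, 1 agent, 4 mongos processes.
maxConnsPerMongos(10000, 2, 1, 4);  // 2247
```

The division happens last, which is why the parentheses in the formula matter.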

There was a recap of the configuration options for read preferences. Sadly it didn’t tie in with the rest of the videos in a meaningful way for me.


Rollbacks are automated when the amount of rollback data is less than 300 MB. When the amount of rolled-back data exceeds that limit, the node will not automatically re-join the replica set. This means you should somehow be able to monitor that: 1) a rollback is needed, and 2) whether it occurred automatically or manual intervention is needed.
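A crude way to watch for the second condition from the mongo shell could look like this (a sketch; real monitoring belongs in an agent, and this requires a connection to the replica set):

```javascript
// Hypothetical mongo shell fragment: flag members stuck in ROLLBACK state.
rs.status().members.forEach(function (m) {
  if (m.stateStr === "ROLLBACK") {
    print(m.name + " is rolling back");
  }
});
```

A member lingering in that state is a hint that the automatic rollback gave up and manual intervention is needed.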

After a rollback some data is left on the node, which has to be examined manually because it was thrown out of the node that re-entered the replica set.

This video seemed like another loose end to me, but there was a blog post plugged on how to simulate a rollback, and the method is still relevant for version 2.6. In the video a different approach was used, because it’s possible to create a ReplSetTest in the Mongo shell, which has some handy tricks to speed up the setup of various replica set configurations.

Last stretch of the week

There was a video on the existing states a mongod process can be in. At the moment there are eleven. Some feedback was given, which I will probably forget if I don’t read it back several times. Then it was on to the homework.

I noticed the answer video of the first homework assignment (4.1) was already incorrectly linked to a quiz in the course. Others noticed it as well. I did try the assignment after all, but the excitement of finding the answer by myself was gone. This led to me having a fit about professionalism and such but I kept that rant to myself (neighbours probably didn’t hear my thoughts). On to the other assignments it was.

It wasn’t too difficult to complete this week, despite believing there’s a bug in the MongoProc tool. Somehow I managed to get a node stuck trying to roll back data. It turns out all the parameters for the node resembled what the running replica set expected, but the data didn’t. The new node kept looking for a common point it was never going to find. I guess this experience is the same as falling off the oplog.

I am glad that it’s possible to watch and learn at my convenience, because if this was a course with set times then it would be much harder to finish at the moment.

M202 July 2014 Third week

The third week’s topic was disaster recovery and backup.

Disaster recovery

The first videos covered what disaster recovery is. There was an emphasis on how to configure the MongoDB replica sets and clusters, those being the most common production systems.

There are two ways of dealing with network or hardware failure: manual and automatic. From the videos I learned that at the moment automatic failover requires a minimum of three distinct network groups. For convenience the videos used the term data centre, but any configuration that can keep at least two network groups up should suffice. The reason is that elections require a majority vote, or none of the remaining systems will automatically get promoted to primary.

So the key to successful disaster recovery is: location, location, location. With the smallest recommended number of nodes, three, one location holds the primary and the other two either each have a secondary node, or one has a secondary and the other a light-weight arbiter. Additional mongod processes should be appropriately scattered across these three locations.

When it comes to config servers, the minimum number that needs to be up is one, even though the system is only truly healthy when all three are up. Three is the maximum number of config servers, which means that in a scenario with more than three locations there will be locations without config servers.


The videos explained several ways to do backups. You can either do live backups (with slight performance degradation in certain scenarios) or take down instances and use more conventional backup processes.

The recommended scenario is of course to do a backup while the system is live. This so-called point-in-time backup is supported by many systems, for instance the LVM layer in Linux and the EBS volumes in Amazon.

The other scenario is to stop writes to disk using db.fsyncLock() and then copy the data to a backup location using conventional tools such as cp, scp or rsync. Although some file systems provide a way to become read-only, in case you want to back up more than just mongod data, it’s necessary to explicitly tell the mongod process to stop attempting writes to disk. Otherwise lots of pointless errors will follow.
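The sequence looks roughly like this (paths and copy tools depend on your setup, and the fragment requires a connection to the mongod):

```javascript
// Hypothetical mongo shell fragment; requires a connection to the mongod.
db.fsyncLock();    // flush pending writes and block new writes to disk
// ... from another terminal: cp/scp/rsync the dbpath to the backup location ...
db.fsyncUnlock();  // resume normal write operations
```

Forgetting the unlock step leaves the node refusing writes, so keep the locking window as short as possible.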

When you make a backup of data, you will usually make a backup of the oplog as well. The oplog will help get a restored system up to speed. Without a backup you’re crippling the restore process, increasing the chance of failure.

Backing up with RAID

A special scenario was the backup of data stored in a RAID configuration. Depending on the exact timing of the backup, data across the RAID might be in an inconsistent state. More specifically, part of a stripe in a striped volume might not be written yet. This chance increases with the number of disks. Solutions to this are found in more abstract layers such as LVM which, according to the videos, can guarantee point-in-time snapshots. There’s a special case for RAID 10 on EBS volumes in the MongoDB documentation, since this seems to be a difficult task.

Test your restore process

What often happens is that people make backups and they feel safe. They are right, it is safe to have a backup, but pointless if you can’t restore from it (fast enough).

The most important part of backups is that you can always restore from them to get your data back to a known state. This goes for the latest backup to the oldest one. One of the videos mentioned a situation where people did test restores from the latest backup, but weren’t able to restore the backup from last week because the compression software contained a bug that crippled the data.

In the case of MongoDB you want to make sure your restore is fast enough. For this you need to take into consideration not only how fast the backed-up data is moved to a (new) location, but also how fast it will be available and caught up with the rest of a replica set.

The time available for the restore process is mainly dictated by the amount of time between the first and last operation in the replica set’s oplog, also referred to as the oplog window. The course recommends a decent buffer so that there’s time for error. Useful techniques to at least warm up the node, and speed up the restore process, were covered in week 2 during the pre-heating videos.

Mongodump and its use

In the second week I posted a question on the discussion forum. My question was why mongodump/mongorestore was not mentioned as a way to shrink disk usage. The teaching assistant said the discussion could be picked up after the completion of the second week, but that didn’t happen. As I expected, there was some sort of answer in the third week’s videos, although not as complete as you might hope.

So mongodump is an excellent tool to dump the contents of collections, either online on a live system or offline when the data files are accessible. It can even record some of the oplog to replay later. There are however a few disadvantages to a dump with this tool:

  • indexes are only stored as descriptions. They need to be rebuilt after restoring, because their data isn’t dumped
  • on a live system running mongodump will page all data into memory, which will likely force other data out
  • on large data sets creating a dump will take a lot of time, the same goes for restoring large datasets

In my mind collection compaction and repairDatabase() also take a lot of time and memory, with the difference that they at least seem to keep indexes intact.

Backup and MMS

MongoDB Management Service (MMS) has some functionality to do backups as well. This SaaS solution creates several backups with the option to request point-in-time restores. These are created from the last backup, patched with replication oplog data up to the requested time. Currently data is not encrypted at rest, because that obstructs the point-in-time restore in some way. The only layer of security is that two-factor authentication is needed to restore data. It is possible to get the same functionality on-premise with the enterprise edition of MongoDB.


The homework of week 3 looked daunting at first and there were some tough questions. With some logical thinking and making notes on paper I made it through.

This week gave me new insights into not just how to manage MongoDB, but also how to improve my skills in general. The fourth week will cover fault tolerance and availability. I am looking forward to what I might learn then.

M202 July 2014 Second week

My assumption was that the second week would be all about tuning and sizing the system. When I glanced through the video titles it seemed a lot of configuration and its impact would be presented.

Judging by the content of the syllabus, completion of this week would probably leave me with insights into, amongst others, the performance impact of the oplog, virtualization, geography, file systems and hardware on MongoDB.

While watching the videos of the second week, I noticed there are lots of pointers to external information on specific hardware configurations or file systems. There was much emphasis on the usefulness of the (latest) MongoDB production notes when configuring your systems.

Perhaps I will look into more of those informational sources at some later time, but the pressure I put on myself to finish the week didn’t allow for delving deep into some of that external material. My decision to finish the first week’s videos first and some private trouble with staying awake made it harder to actually finish the week on time.

MongoDB memory

The video modules had some clear chapters, the first being about MongoDB itself. The videos in this section mainly covered the usage of memory by MongoDB and how it displays in MMS.

The most important data in resident memory (all memory allocated to MongoDB) is the working set. This is the active data used by MongoDB: collections and indexes. Aside from that there’s process data, which holds things like connections, journaling, etc.

MongoDB uses mmap to load data into memory, and on top of that Linux usually likes to load as much as possible into the file system cache as well. When something isn’t found in memory it causes a hard page fault, as opposed to a soft page fault, which occurs when the data is in memory but not associated with MongoDB yet. Hard page faults in general cause performance issues; that’s why there are some videos on how to pre-heat, either by touching a collection or, more efficiently, using targeted queries that load a good chunk of the working set into memory.
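Both pre-heating approaches can be sketched in the shell (the collection and field names are made up, and the fragments require a connection to a mongod):

```javascript
// Hypothetical mongo shell fragments; require a connection to a mongod.
// Touch an entire collection plus its indexes into memory:
db.runCommand({ touch: "users", data: true, index: true });

// Or warm just the hot part of the working set with a targeted query,
// here everything active in the last 24 hours:
db.users.find({ last_login: { $gt: new Date(Date.now() - 86400000) } }).itcount();
```

The targeted query is more efficient because it avoids pulling cold data into memory along with the hot data.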

Choosing disks

There were some videos on the benefits and choices to make when selecting the storage solution(s) for your data. Most data access in MongoDB is random, which means SSDs or RAID increase performance because they have lower seek times. Capped collections, which include the special oplog collection, are usually stored sequentially and can be efficiently stored on spinning disks. Separating the data over different types of storage is possible by specifying parameters on the mongod process.

Network storage usually incurs penalties because of the extra overhead the network adds on top of whatever storage media backs it.

Choosing file systems

MongoDB tends to allocate a lot of space which it might not use right away. EXT4 and XFS are the recommended file systems because they support a low-level call to allocate space, while EXT3 requires zeroing out space to actually allocate it. There was a brief mention of BtrFS and ZFS, but these file systems aren’t thoroughly tested by MongoDB Inc. at the moment.

There’s one video explaining why and how to use swap partitions. In short, you want to set up and configure swap properly on Linux to prevent mongod processes from being killed when the kernel detects the system is running out of memory.

Configuring file systems

Even if you pick a recommended file system, it still needs some tweaking to configure it optimally for MongoDB. In particular you should disable atime on the mount, because it slows down performance and serves no benefit in case of MongoDB.

Managing file systems

Some of the videos covered how MongoDB uses disk space — allocating ahead of using it to avoid some penalties — and how to reclaim it.

By default a mongod process will easily start allocating in excess of 3 GB of disk space without actually putting data in it. There are options to configure this according to the situation, which in some cases is advisable.

As a collection is written to, the data in the data files might get fragmented. This can lead to additional disk usage, and to performance degradation as well when memory is restricted. One of the ways to remedy fragmentation is by compacting the fragmented collection(s), including existing indexes. MongoDB will then block activity for the entire database and start moving data around. A possible side-effect is that the data files use more disk space after completion.
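Compacting is a per-collection database command (the collection name below is made up, and it requires a connection to the mongod); remember that it blocks the database while it runs:

```javascript
// Hypothetical mongo shell fragment; requires a connection to the mongod.
db.runCommand({ compact: "users" });  // defragment collection data and rebuild its indexes
```

On a replica set you would typically run this on secondaries one at a time rather than on the primary.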

The curriculum only mentions using repairDatabase() on a stand-alone MongoDB or a rolling rebuild of nodes in a replica set. I believe mongoexport is an option as well.

Final videos

The last videos covered the topics of virtualization and replica sets bigger than three nodes.

Although support on common virtualization systems such as VMware and Xen is well established, recently jails and containers have gained popularity. These last two options are at the moment less explored and supported in official ways. The main point made in the videos was that virtualization gives you a way to split up hardware into more manageable chunks. If needed, you will have the benefit of relocating particularly busy virtualized systems to dedicated hardware, should your virtualization option support this.

Bigger replica sets

Official documentation recommends replica sets of three nodes in either primary-secondary-secondary or primary-secondary-arbiter configuration. Of course it’s possible and documented to use bigger replica sets, but there are limits to the total size and number of voters in a set. The last two videos cover replica sets bigger than three nodes and why you might want to use those configurations.

As an example the course went through the implications of a geographically distributed replica set. It’s key to consider network latency and failover scenarios in this case, as well as the, new to me, option of chaining in the replica set. Although the option to specify a replica set member to synchronize from has been available since version 2.2, and my first MongoDB version was 2.4, I never came across this option before. This was also the first time I heard of new members in a replica set picking the closest member as the one to start synchronizing from.
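Chaining can be controlled through the replica set configuration, and a sync source can be suggested per member (a sketch; the host name is made up and the fragments require connections to the right members):

```javascript
// Hypothetical mongo shell fragments.
// On the primary: force all members to sync directly from the primary.
var cfg = rs.conf();
cfg.settings = cfg.settings || {};
cfg.settings.chainingAllowed = false;  // disable chained replication
rs.reconfig(cfg);

// On a specific member: suggest a particular sync source instead.
rs.syncFrom("node2.example.net:27017");
```

Disabling chaining trades lower replication lag for more load on the primary, which is exactly the kind of geography-dependent trade-off the videos discussed.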

This second week was a lot to take in, but not overwhelming. I guess the most benefit is, as always, in going through the materials and concepts referenced in the video. Read more, practice more and actually try to delve deeper into the resources than you would by simply completing the course.

Catching up to week 1 videos

During week one I had trouble with a homework assignment, which meant I skipped some videos. Normally you resume at the last section you visited, but in the rush to get the homework done I switched around too much to know where to continue with the videos. It turns out I was at the part where configuration of alerts is covered.

The videos were based on the blog post “Five MMS Monitoring Alerts to Keep Your MongoDB Deployment on Track”. The blog post is focused on MMS, but the concepts are easily applicable to just about any alert configuration. You should figure out what “normal” behaviour is, determine a warning threshold and know what condition should escalate the alert. The goal is a notification setup without noise (alerts for situations that are normal) that is still able to warn on incidents, with sirens going off when disaster strikes.

After the videos about alerts there was some information on setting up a manageable overview of your configuration in MMS. For instance, you might want to split all your servers into groups to maintain a sane overview and avoid pagination of servers. There was also advice on assigning users to those groups. This was all very useful information which I will hopefully remember once I get to the large numbers that require this kind of separation.

The final videos were about the usage of netstat, iostat and the mongo db.serverStatus() output. MMS is a great tool to get an overall view of your system’s health and monitor it in a decent way. There is one drawback: the data is collected and sent once a minute (or less frequently in case of network problems). So during troubleshooting you need to rely on tools and information locally available on the system.

I knew a little about netstat, but never went through all the parameters. Knowing that in the case of MongoDB it’s useful to keep an eye on the aggregate statistics netstat -s generates, and in particular which fields to monitor, was an excellent guide to seeing network problems in progress.

Honestly, I was not aware of the iostat command and its possibilities. Not having any formal training in Linux administration, this command and the information given in the video showed me that it’s in general a useful troubleshooting aid for systems with bad performance. It can show statistics on I/O performance, which might help locate bottlenecks and see what is causing the bad performance.

The last video was about the uses of the db.serverStatus() command. MMS actually uses this command for lots of information, which led me to read the serverStatus page in the MongoDB documentation. There I found what I was looking for, the explanation to the metrics I didn’t quite understand during this first week’s overview of MMS. I’m glad I didn’t skip any of these videos.
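A couple of the metrics discussed can be pulled straight from that command’s output (a sketch; requires a connection to a mongod, and the page-faults field only appears on Linux):

```javascript
// Hypothetical mongo shell fragment.
var s = db.serverStatus();
printjson({
  current_connections:   s.connections.current,
  available_connections: s.connections.available,
  page_faults: s.extra_info ? s.extra_info.page_faults : "n/a"
});
```

MMS essentially samples this same document once a minute, which is why the local command is the better tool mid-incident.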

By watching these videos I had fully completed the first week of M202 and was ready for the second week.

M202 July 2014 First week

The course has started and immediately I noticed how much I already forgot. I haven’t practiced my MongoDB DBA skills and feel embarrassed at how little I do remember.

Mongo University changes

The Mongo University changed a lot since the last time I took a course. Looking back through my blog posts, it has been almost a year since I completed a course, slightly less since obtaining a Mongo DBA certificate (free from the pilot program).

In the updated M102 course there’s a chapter dedicated to the MongoDB Management Service (MMS). Another change to Mongo University courses is a tool called MongoProc. It’s a graphical improvement over the Python scripts that used to verify results. Yeah, I would say it looks more professional, even if the GUI still isn’t as polished as it could be.

The lessons

At first I guessed all lessons starting with MMS in the title could just be from M102. Once I started the best practices video I knew it was an M202 lesson.

After going through the videos I got a basic idea of what’s going on. It was hard to find an online description or proper documentation on monitoring and best practices, so I hope this course gets me that information. Somehow I feel I actually need to retake the M102 course; there are probably a lot of changes to it by now.

Doing the homework

The first thing I did was log in and try to pass installing MongoProc; this assignment is a quiz, so there’s no limit on tries. I was afraid to immediately hit the test button of the first homework assignment, because I wasn’t sure if it would count as a try even though there’s a “turn in” button next to it. There are only three tries, and I didn’t want to waste any this early just to test what a button does.

Perhaps I was tired, but I had trouble figuring out how the instructor got the replica set in the first homework assignment. I ended up matching my setup to what I saw in the video.

Over the weekend I struggled with homework assignment 1.2, which generates some data for the metrics in MMS. It seemed not a lot was happening, and the feedback of the MongoProc tool is too sparse, making you worry whether things are still working. It was when I was about to give up on this course that I switched from a Mac host for the given VM to a Windows host. This seemed to work better, but I threw away the previous data so I can’t compare anymore. At least I managed to pass all homework assignments.

The start of this course was a little unpleasant for me, hopefully this feeling won’t stay throughout the coming weeks.