M202 July 2014 Fourth week

The fourth week of the course covered fault tolerance and availability. The main purpose of the week was to highlight, and deal with, the network traffic flowing to a MongoDB cluster or replica set.

Availability during maintenance

The week started with a few videos on doing maintenance on a replica set. One such task would be updates to the MongoDB server, perhaps an upgrade from 2.4 to 2.6. Depending on the application requirements and permitted downtime, pulling down nodes in a production cluster for maintenance proves more complex than I thought.

The key to minimising the impact of nodes going down is proper planning. It starts with scheduling the update during a maintenance period. Then you make a backup plan, a good practice which usually seems a waste of time until you need to roll back. Once the time is right, you roll through your nodes and perform the updates. It’s important to do the primary last. Stepping down the primary node should ideally be the most painful part of the entire operation, because there will be downtime until a new primary is elected.

After watching the videos I believe the following steps will yield the best result:

  1. Log in to the node
  2. If this is a primary, step it down and wait for the election to finish
  3. Issue a db.shutdownServer() command
  4. Do maintenance; there are several paths but the most important thing is that during maintenance the node must not be started with production parameters. This means a different port, but also that it shouldn’t get the --replSet parameter.
  5. Restart the node with production parameters
  6. Wait for it to catch up and move on to the next node
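As a sketch of steps 2 through 5 in shell commands, assuming a hypothetical node node1.example.com with production port 27017, data path /data/db and replica set name rs0:

```shell
# 2. If this node is the primary, step it down (blocks re-election for 60s)
mongo --host node1.example.com --eval 'if (db.isMaster().ismaster) rs.stepDown(60)'

# 3. Shut the mongod down cleanly
mongo --host node1.example.com admin --eval 'db.shutdownServer()'

# 4. Restart for maintenance on a different port, WITHOUT --replSet
mongod --port 37017 --dbpath /data/db
# ... perform maintenance, then shut down again ...

# 5. Restart with the production parameters
mongod --port 27017 --dbpath /data/db --replSet rs0
```

The host name, port numbers and paths are made up for the example; the point is that the maintenance restart in step 4 uses neither the production port nor the --replSet parameter.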

These steps should get you through any maintenance on live systems without worrying too much about the details of availability to the replica set.

Then a frequently asked question was answered: can you use a load balancer in front of MongoDB? From what I understood, you can try but most of the time this will cause problems. A client with state on a specific node, for instance an open cursor that hasn’t returned all its data yet, will get an error when it’s routed to another node, because this kind of state isn’t shared between nodes. I gritted my teeth and went on with the next videos.

Managing connections

What followed was a bunch of videos about connection options on various drivers, with a focus on the Java driver. Those videos made me doze off a little, so when I got to the videos about managing connections I was shocked at how quickly connections and memory usage (since every connection reserves roughly 1 MB) stack up.

Connections come in from everywhere. There are clients connecting to the primary node, and in some configurations to secondaries too; there are heartbeats between all nodes in a replica set; there are a few connections between a node and the node it is synchronizing from; finally, a monitoring agent might connect to all nodes as well. Most of this traffic goes to the primary, because clients default to it for reads and always connect to it for writes.

In a cluster there is even more traffic: connections from mongos servers, connections to config servers and connections between shards to balance out the data add up quickly. More connections mean more network traffic and more memory needed to manage them, which leaves less memory for data. A new problem has arisen. Then the videos were so kind as to mention that once heartbeat responses become slow, elections might be held and the new primary might get as overwhelmed as the old one did. Scary stuff.

This week taught me that the way to get more predictability is by enforcing a maximum number of connections on the mongos nodes. By assigning an explicit value, the mongos process acts as a sort of safeguard for the system and prevents the number of connections to a primary from becoming a factor in availability problems.

During the course a formula was given to calculate the setting for the mongos process parameter. The presented formula is as follows:

(desired max. connections on primary - (number of secondaries * 3) - (number of other processes * 3)) / number of expected mongos processes

Where the variable for other processes is defined by the number of processes other than mongos, mongod, config server or mongo client. Basically this means any monitoring agents.


The first thing I didn’t like was the magic number 3. It popped out immediately and I couldn’t think of any explanation other than it standing for the number of nodes in a replica set. Then there’s probably a reasonable amount of what, for lack of a better word, I call dark connections: network connections in the system that aren’t covered in the material of the course. I guess that’s why there’s mention of a safeguard percentage on the first variable in the formula.

If you follow the recommendations in the homework assignment, you end up with 90% of the desired max. connections as your first variable. It’s a rough guess to have at least some buffer. In the end you should adjust the parameters based on what your monitoring tools tell you.
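As an illustration, the formula with that 90% buffer can be written as a small function; the numbers in the example run are made up:

```javascript
// Estimate the connection ceiling for each mongos process, following the
// course formula with a configurable safety buffer on the primary's maximum.
function maxConnsPerMongos(primaryMax, secondaries, otherProcs, mongosCount, buffer = 0.9) {
  // Reserve a buffer on the primary's ceiling, subtract roughly 3
  // connections per secondary and per other process, and split the
  // remainder over the expected mongos processes.
  const usable = primaryMax * buffer - secondaries * 3 - otherProcs * 3;
  return Math.floor(usable / mongosCount);
}

// e.g. a primary capped at 10000 connections, 2 secondaries,
// 1 monitoring agent and 6 mongos processes:
console.log(maxConnsPerMongos(10000, 2, 1, 6)); // 1498
```

The resulting value would then be enforced on each mongos, so the primary can never be flooded past its ceiling even when all mongos processes max out.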

There was a recap of the configuration options for read preferences. Sadly it didn’t tie in with the rest of the videos in a meaningful way for me.


Rollbacks are automated when the amount of rollback data is less than 300 MB. When the amount of rolled-back data exceeds that limit, the node will not automatically re-join the replica set. This means you should somehow be able to monitor: 1) that a rollback is needed, and 2) whether it occurred automatically or manual intervention is needed.

After a rollback some data is left on the node, which has to be examined manually because it was thrown out of the node that re-entered the replica set.

This video seemed like another loose end to me, but there was a plug attached for a blog post on how to simulate a rollback, and the method was still relevant for version 2.6. In the video a different approach was used: it’s possible to create a ReplSetTest in the Mongo shell, which has some handy tricks to speed up the setup of various replica set configurations.

Last stretch of the week

There was a video on the states a mongod process can be in. At the moment there are eleven. Some feedback was given, which I will probably forget if I don’t read it back several times. Then it was on to the homework.

I noticed the answer video for the first homework assignment (4.1) was already incorrectly linked in a quiz in the course. Others noticed it as well. I did try the assignment anyway, but the excitement of finding the answer by myself was gone. This led to me having a fit about professionalism and such, but I kept that rant to myself (the neighbours probably didn’t hear my thoughts). On to the other assignments it was.

It wasn’t too difficult to complete this week, despite believing there’s a bug in the MongoProc tool. Somehow I managed to get a node stuck trying to roll back data. Turns out all the parameters for the node matched what the running replica set expected, but the data didn’t. The new node kept looking for a common point which it was never going to find. I guess this experience is the same as falling off the oplog.

I am glad that it’s possible to watch and learn at my convenience, because if this was a course with set times then it would be much harder to finish at the moment.


M202 July 2014 Third week

The third week’s topic was disaster recovery and backup.

Disaster recovery

The first videos covered what disaster recovery is. There was an emphasis on how to configure the MongoDB replica sets and clusters, those being the most common production systems.

There are two ways of dealing with network or hardware failure: manual and automatic. From the videos I learned that at the moment automatic failover requires a minimum of three distinct network groups. For convenience the videos used the term data centre, but any configuration that can keep at least two network groups up should suffice. The reason is that elections require a majority vote; without one, none of the remaining systems will automatically get promoted to primary.

So the key to successful disaster recovery is: location, location, location. When dealing with the smallest recommended number of nodes, three, one location holds the primary, and the other two either each hold a secondary node, or one holds a secondary and the other a lightweight arbiter. Additional mongod processes should be appropriately scattered across these three locations.

When it comes to config servers the minimum number that needs to be up is one, even if the system is truly healthy when three are up. Three is the maximum number of config servers, which means that in a scenario of more than three locations there will be locations without config servers.


The videos explained several ways to do backups. You can either do live backups (with slight performance degradation in certain scenarios) or take down instances and use more conventional backup processes.

The recommended scenario is of course to do a backup while the system is live. This so-called point-in-time backup is supported by many systems, for instance the LVM layer in Linux and the EBS volumes in Amazon.

The other scenario is to stop writes to disk using db.fsyncLock() and then copy the data to a backup location using conventional tools such as cp, scp or rsync. Although some file systems provide a way to become read-only, which is useful in case you want to back up more than just the mongod data, it’s still necessary to explicitly tell the mongod process to stop attempting writes to disk. Otherwise lots of pointless errors will follow.
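A minimal sketch of that flow, assuming the data lives in /data/db and a hypothetical backup host:

```shell
# flush pending writes and block further writes to disk
mongo admin --eval 'db.fsyncLock()'

# copy the data files with a conventional tool
rsync -a /data/db/ backup-host:/backups/mongo-$(date +%F)/

# don't forget to unlock, or the node stays write-blocked
mongo admin --eval 'db.fsyncUnlock()'
```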

When you make a backup of data, you will usually make a backup of the oplog as well. The oplog will help get a restored system up to speed. Without an oplog backup you’re crippling the restore process, increasing the chance of failure.

Backing up with RAID

A special scenario was the backup of data stored on a RAID configuration. Depending on the exact timing of the backup, the data across the RAID might be in an inconsistent state; more specifically, part of a stripe in a striped volume might not be written yet. This chance increases with the number of disks. Solutions lie in more abstract layers such as LVM which, according to the videos, can guarantee point-in-time snapshots. There’s a special case for RAID 10 on EBS volumes in the MongoDB documentation, since this seems to be a difficult task.

Test your restore process

What often happens is that people make backups and they feel safe. They are right, it is safe to have a backup, but pointless if you can’t restore from it (fast enough).

The most important part of backups is that you can always restore from them to get your data back to a known state. This goes for every backup, from the latest to the oldest. One of the videos mentioned a situation where people did test restores from the latest backup, but weren’t able to restore the backup from the week before because the compression software contained a bug that crippled the data.

In the case of MongoDB you want to make sure your restore is fast enough. For this you need to take into consideration not only how fast the backed-up data is moved to a (new) location, but also how fast it will be available and caught up with the rest of the replica set.

The time available for the restore process is mainly dictated by the amount of time between the first and last operation in the replica set’s oplog, also referred to as the oplog window. The course recommends a decent buffer so that there’s room for error. Useful techniques to at least warm up the node, and speed up the restore process, were covered in week 2 in the pre-heating videos.
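A rough back-of-the-envelope sketch: given the oplog size and an average write rate you can estimate the window, and from that a restore time budget with a safety buffer. All the numbers below are illustrative:

```javascript
// Estimate the oplog window in hours from its size and the write rate.
function oplogWindowHours(oplogSizeMB, writeRateMBPerHour) {
  return oplogSizeMB / writeRateMBPerHour;
}

// Time budget for a restore, keeping a fraction of the window as buffer.
function restoreBudgetHours(windowHours, bufferFraction = 0.5) {
  return windowHours * (1 - bufferFraction);
}

// e.g. a 10 GB oplog with 512 MB/hour of writes gives a 20-hour window,
// of which you'd only want to spend 10 hours on the restore itself.
console.log(restoreBudgetHours(oplogWindowHours(10240, 512))); // 10
```

If the restore plus catch-up takes longer than the window, the node falls off the oplog and needs a full resync.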

Mongodump and its use

In the second week I posted a question on the discussion forum. My question was why mongodump/mongorestore was not mentioned as a way to shrink disk usage. The teaching assistant said the discussion could be picked up after the completion of the second week, but that didn’t happen. As expected, there was some sort of answer in the third week’s videos, although not as complete as you might hope.

So mongodump is an excellent tool to dump the contents of collections either online on a live system or offline when data files are accessible. It can even record some of the oplog to replay later. There are however a few disadvantages to a dump with this tool:

  • indexes are only stored as descriptions. They need to be rebuilt after restoring, because their data isn’t dumped
  • on a live system running mongodump will page all data into memory, which will likely force other data out
  • on large data sets creating a dump will take a lot of time, the same goes for restoring large datasets
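For reference, a dump/restore round trip on a live member might look like this (host names and paths are made up):

```shell
# dump a live node, including the oplog entries recorded during the dump
mongodump --host node1.example.com --oplog --out /backups/dump

# restore elsewhere and replay the captured oplog slice;
# indexes are rebuilt afterwards from their stored descriptions
mongorestore --host node2.example.com --oplogReplay /backups/dump
```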

In my mind collection compaction and repairDatabase() also take a lot of time and memory, with the difference that they at least seem to keep indexes intact.

Backup and MMS

MongoDB Management Service (MMS) has some functionality to do backups as well. This SaaS solution creates several backups, with the option to request point-in-time restores; these are created from the last backup, patched with replication oplog data up to the requested time. Currently data is not encrypted at rest, because that would obstruct the point-in-time restore in some way. The only layer of security is the two-factor authentication needed to restore data. It’s possible to get the same functionality on-premise with the enterprise edition of MongoDB.


The homework of week 3 looked daunting at first and there were some tough questions. With some logical thinking and making notes on paper I made it through.

This week gave me new insights in not just how to manage MongoDB, but also on how to improve my skills in general. The fourth week will cover fault tolerance and availability. I am looking forward to what I might learn then.

M202 July 2014 Second week

My assumption was that the second week would be all about tuning and sizing the system. When I glanced through the video titles it seemed a lot of configuration and its impact would be presented.

Judging by the content of the syllabus, completion of this week would probably leave me with insights into, amongst others, the performance impact of the oplog, virtualization, geography, file systems and hardware on MongoDB.

While watching the videos of the second week, I noticed there are lots of pointers to external information on specific hardware configurations or file systems. There was much emphasis on the usefulness of the (latest) MongoDB production notes when configuring your systems.

Perhaps I will look into more of those informational sources at some later time, but the pressure I put on myself to finish the week didn’t allow for delving deep into some of that external material. My decision to finish the first week’s videos first, and some private troubles with staying awake, made it harder to actually finish the week on time.

MongoDB memory

The video modules had some clear chapters, the first being about MongoDB itself. The videos in this section mainly covered the usage of memory by MongoDB and how it displays in MMS.

The most important data in resident memory, all memory allocated to MongoDB, is the working set. This is the active data used by MongoDB, collections and indexes. Aside from that there’s process data which has stuff like connections, journaling, etc.

MongoDB uses mmap to load data into memory, and on top of that Linux usually likes to load as much as possible into the file system cache as well. When something isn’t found in memory it causes a hard page fault, as opposed to a soft page fault, which occurs when the data is in memory but not yet associated with MongoDB. Hard page faults generally cause performance issues, which is why there are some videos on how to pre-heat a node, either by touching a collection or, more efficiently, by using targeted queries that load a good chunk of the working set into memory.
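As a sketch of both pre-heating approaches, run against a hypothetical orders collection in a database called mydb:

```shell
# load the whole collection and its indexes into memory
# (the touch command existed in the 2.x releases)
mongo mydb --eval 'db.runCommand({ touch: "orders", data: true, index: true })'

# or, more targeted: run a query that pulls just the hot part of the
# working set into memory before sending production traffic to the node
mongo mydb --eval 'db.orders.find({ status: "active" }).itcount()'
```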

Choosing disks

There were some videos on the benefits and choices to make when selecting the storage solution(s) for your data. Most data access in MongoDB is random, which means SSDs or RAID arrays increase performance because of their lower seek times. Capped collections, which include the special oplog collection, are usually stored sequentially and could be efficiently kept on spinning disks. Separating the data over different types of storage is possible by specifying parameters on the mongod process.

Network storage usually has penalties because of the extra overhead of the network on whatever storage media used to back it.

Choosing file systems

MongoDB tends to allocate a lot of space which it might not use right away. EXT4 and XFS are the recommended file systems because they support a low-level call (fallocate) to allocate space, while EXT3 requires zeroing out space to actually allocate it. There was a brief mention of BtrFS and ZFS, but these file systems aren’t thoroughly tested by MongoDB Inc. at the moment.

There’s one video explaining why and how to use swap partitions; in short, you want to set up and configure swap properly on Linux just to avoid mongod processes being killed when the kernel detects the system is running out of memory.

Configuring file systems

Even if you pick a recommended file system, it still needs some tweaking to configure it optimally for MongoDB. In particular you should disable atime on the mount, because it slows down performance and serves no benefit in the case of MongoDB.
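For example, an /etc/fstab entry for a dedicated data volume might look like this (device and mount point are illustrative):

```shell
# disable access-time updates on the MongoDB data volume
/dev/sdb1  /data/mongodb  ext4  defaults,noatime  0  2
```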

Managing file systems

Some of the videos covered how MongoDB uses disk space — allocating ahead of using it to avoid some penalties — and how to reclaim it.

By default a mongod process will easily start allocating in excess of 3 GB of disk space without actually putting data in it. There are options to configure this according to the situation, which in some cases is advisable.

As a collection is written to, the data in the data files might get fragmented. This could lead to additional disk usage, and to performance degradation as well when memory is restricted. One of the ways to remedy fragmentation is by compacting the fragmented collection(s), including existing indexes. MongoDB will then block activity for the entire database and start moving data around. A possible side effect is that the data files use more disk space after completion.
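A minimal example of compacting a single collection from the shell; the database and collection names are made up, and on a 2.x replica set primary the force flag is required:

```shell
# blocks the database while data is moved around and indexes are rebuilt
mongo mydb --eval 'db.runCommand({ compact: "orders", force: true })'
```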

The curriculum only mentions using repairDatabase() on a stand-alone MongoDB or a rolling rebuild of nodes in a replica set. I believe mongoexport is an option as well.

Final videos

The last videos covered the topics of virtualization and replica sets bigger than three nodes.

Although the support on common virtualization systems such as VMWare and Xen is well established, recently the options of jails and containers gained popularity. These last two options are at the moment less explored and supported in official ways. The main point made in the videos was that virtualization gives you a way to split up hardware into more manageable chunks. If needed, you will have the benefit of relocating particularly busy virtualized systems to dedicated hardware should your virtualization option support this.

Bigger replica sets

Official documentation recommends replica sets of three nodes in either primary-secondary-secondary or primary-secondary-arbiter configuration. Of course it’s possible and documented to use bigger replica sets, but there are limits to the total size and number of voters in a set. The last two videos cover replica sets bigger than three nodes and why you might want to use those configurations.

As an example the course went through the implications of a geographically distributed replica set. It’s key to consider network latency and failover scenarios in this case, as well as the (new to me) option of chaining in the replica set. Although the option to specify a replica set member to synchronize from has been available since version 2.2 and my first MongoDB version was 2.4, I never came across this option before. This was also the first time I heard of new members in a replica set picking the closest member as the one to start synchronizing from.

This second week was a lot to take in, but not overwhelming. I guess the most benefit is, as always, in going through the materials and concepts referenced in the video. Read more, practice more and actually try to delve deeper into the resources than you would by simply completing the course.

Catching up to week 1 videos

During week one I had trouble with a homework assignment. This meant I skipped some videos. Normally the last section you visited is where you resume, but in the rush to get the homework done I switched around too much to know where to continue with the videos. Turns out I was at the part where configuration of alerts is covered.

The videos were based on the blog post “Five MMS Monitoring Alerts to Keep Your MongoDB Deployment on Track”. The blog post is focused on the MongoDB Monitoring Service, but its concepts are easily applicable to just about any alert configuration. You should figure out what “normal” behaviour is, determine a warning threshold and know which condition should escalate the alert. The goal is a notification setup without noise (alerts for situations that are normal), but still able to warn on incidents, with sirens going off when disaster strikes.

After the videos about alerts there was some information on properly setting up a manageable overview of your configuration in MMS. For instance, you might want to split all your servers into groups to maintain a sane overview and avoid pagination of servers. There was also advice on assigning users to those groups. This was all very useful information which I will hopefully remember once I get to the large numbers that require this kind of separation.

The final videos were about the usage of netstat, iostat and the mongo db.serverStatus() output. MMS is a great tool to get an overall view of your system’s health and monitor it in a decent way. There is one drawback: the data is collected and sent once a minute (or less frequently in case of network problems). So during troubleshooting you need to rely on tools and information locally available on the system.

I knew a little about netstat, but had never gone through all the parameters. Learning that in the case of MongoDB it’s useful to keep an eye on the aggregate statistics netstat -s generates, and in particular which fields to monitor, was an excellent guide to spotting network problems in progress.

Honestly, I was not aware of the iostat command and its possibilities. Not having any formal training in Linux administration, this command and the information given in the video showed me that it’s a generally useful troubleshooting aid for systems with bad performance. It shows statistics on I/O performance, which can help locate bottlenecks and see what is causing the bad performance.
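The two commands as I noted them down, with the flags the videos focused on:

```shell
# aggregate protocol statistics; watch retransmits and listen-queue drops
netstat -s

# extended device statistics every 2 seconds; watch await and %util
iostat -x 2
```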

The last video was about the uses of the db.serverStatus() command. MMS actually uses this command for lots of information, which led me to read the serverStatus page in the MongoDB documentation. There I found what I was looking for, the explanation to the metrics I didn’t quite understand during this first week’s overview of MMS. I’m glad I didn’t skip any of these videos.

By watching these videos I had fully completed the first week of M202 and was ready for the second week.

M202 July 2014 First week

The course has started and immediately I noticed how much I already forgot. I haven’t practiced my MongoDB DBA skills and feel embarrassed at how little I do remember.

Mongo University changes

The Mongo University changed a lot since the last time I took a course. Looking back through my blog posts it has been almost a year since I completed a course, slightly less on obtaining a Mongo DBA certificate (free from the pilot program).

In the updated M102 course there’s a chapter dedicated to the MongoDB Management Service (MMS). Another change to Mongo University courses is a tool called MongoProc. It’s a graphical improvement over the Python scripts that used to verify results. Yeah, I would say it looks more professional, even if the GUI still isn’t as polished as it could be.

The lessons

At first I guessed all lessons starting with MMS in the title could just be from M102. Once I started the best practices video I knew it was an M202 lesson already.

After going through the videos I got a basic idea of what’s going on. It was hard to find an online description or proper documentation on monitoring and best practices so I hope this course gets me that information. Somehow I feel I actually need to retake the M102 course, there’s probably a lot of changes to it by now.

Doing the homework

The first thing I did was log in and try to pass installing MongoProc; this assignment is a quiz, so there’s no limit on tries. I was afraid to immediately hit the test button of the first homework assignment, because I wasn’t sure whether it would count as a try, even though there’s a “turn in” button next to it. There are only three tries, and I didn’t want to waste any this early just to test what a button does.

Perhaps I was tired, but I had trouble figuring out how the instructor got the replica set in the first homework assignment. I ended up matching my setup to what I saw in the video.

Over the weekend I struggled with homework assignment 1.2, which generates some data for the metrics in MMS. It seemed not a lot was happening, and the feedback from the MongoProc tool was sparse enough to make me worry whether things were still working. Just as I was about to give up on this course, I switched from a Mac host for the given VM to a Windows host. This seemed to work better, but I threw away the previous data so I can’t compare anymore. At least I got to pass all the homework assignments.

The start of this course was a little unpleasant for me, hopefully this feeling won’t stay throughout the coming weeks.

Starting the M202 course soon

One of my colleagues wondered if I already participated in the April 2014 M202 course, but I was too late. This July I will try to finish this course.

M202 is the course for “MongoDB Advanced Deployment and Operations”, which means more depth in sharding, disaster recovery, upgrade strategies and other best practices. The course was supposed to start last week, July 8, but it was postponed because the implementation of some course updates took longer than expected.

The prerequisite for this course is any M102 (MongoDB for DBAs) course. The FAQ mentions that users with a Windows or Mac system should have a Linux VM, because that’s what the course will use. Although not a big problem, my past experience with VirtualBox showed me that the shared folder just doesn’t have enough I/O performance for MongoDB and production release notes suggest that NFS might not be an option either. This would mean enabling a large secondary disk on the VM I will use. VirtualBox used to crash on me in the past, resulting in a corrupted file system for the VM so I am not really looking forward to that.

In a few hours or at most a day from now I hope to view the first videos of this new course. Not yet excited, but that will change once the course starts.

How to pick an insane FQDN for your cloud server

Yesterday I learned the logic behind the fully qualified domain naming convention of one of the AWS instances.

The FQDN had the form [environment][year][month][day][hour][minute].company.com, which means a test server started at 2014-01-20 08:01 would have the address test201401200801.company.com. It is hard to remember this server’s address should you ever need it for something important, perhaps a dashboard with health status vital to the operations team. The solution to that last bit is to use a more sensible name and tie it to the server in DNS with a CNAME.

Although I had bookmarked the URL for the server, I wasn’t aware there was some logic to the name, and definitely not what the format was. So yesterday I heard that the naming convention existed purely because nobody had bothered to figure out how to manage Puppet agents and their certificates properly. The quick solution was to generate a unique name based on the format, since at that time it wasn’t likely that instances would spin up fast enough to cause a collision.

Strangely, my improved naming conventions of using the epoch time or a SHA-1 hash value weren’t acceptable alternatives 🙂