M202 July 2014 Fifth and sixth week

My first plan for this series was to write a blog post for each week. That didn’t work out in previous weeks, and because both the fifth and sixth week cover sharding, I decided to combine them into one big post on sharding.

Sharding is used to scale horizontally. Instead of upgrading a MongoDB replica set with more computing or network capacity, you add more replica sets and divide the load among them. Each replica set then holds a shard of the entire data set. It’s not mandatory to use replica sets for your shards; single servers will do. However, when you consider sharding it’s very likely you want the redundancy and failover of replica sets as well. The reason is that once a shard fails, becoming unreachable for whatever reason, the data in that shard is unavailable for both reads and writes. With MongoDB’s failover functionality this becomes less of an availability problem.

During previous courses the basic steps for sharding databases and collections in MongoDB were already explained. The M202 course assumes you won’t need to dig deep for this knowledge, but if you do, there’s always the sharding tutorial in the MongoDB documentation. In my opinion sharding data got a little easier in version 2.6. The first step is to orchestrate all the pieces of the infrastructure, which can be simplified with configuration management tools. After that you need to explicitly enable sharding on databases and collections.
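As a quick refresher, enabling sharding boils down to two shell helpers run against a mongos. The database name, collection and shard key below are made-up examples:

```
// Run against a mongos; "mydb", "users" and "user_id" are hypothetical names.
sh.enableSharding("mydb")                          // allow sharding for the database
sh.shardCollection("mydb.users", { user_id : 1 })  // shard the collection on user_id
```

After that, sh.status() should show the collection as sharded.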

Mongos process

The recommended way to communicate with the cluster is through one of the mongos processes. These processes perform several tasks in the system. The most important one is to act as a proxy for connections made to other parts of the system. When clients query through a mongos process, their queries are routed to the right shards and the results are merged before being returned to the client. This makes the cluster behave as one big database, although there are limits on the amount of data a mongos process can reasonably sort and process.

Another important task of the mongos process is to provide an interface to manage the cluster. Through one of the mongos processes you are able to configure which databases and collections are sharded, what their shard key is and how data is distributed between shards. Once data is in the cluster, any mongos process is able to push it around by updating the cluster configuration and kicking off the necessary commands on the shards. In order to keep the configuration up to date these mongos processes communicate with the config servers.
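A sketch of what that management interface looks like in the shell; the namespace and shard name here are hypothetical:

```
sh.status()            // overview of shards, sharded databases and chunk distribution
sh.getBalancerState()  // is automatic chunk balancing currently enabled?
// Manually push a chunk to another shard; "mydb.users", the chunk holding
// { user_id : 4000 } and the shard name "shard0001" are example values.
sh.moveChunk("mydb.users", { user_id : 4000 }, "shard0001")
```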

Config servers

Config servers are separate mongod processes dedicated to holding metadata about the shards of a cluster. This metadata tells mongos processes which databases and collections are sharded, how they are sharded, and several other important details. Typical production environments have three config servers in the cluster. This number happens to be similar to the recommended size of replica sets, but config servers never form a replica set. Instead they are three independent servers kept synchronized by the mongos processes. All instances should be available; if one of them drops out, the cluster’s metadata becomes read-only.

The videos put emphasis on monitoring config servers. The metadata of a cluster is kept in a special database you’re never supposed to touch unless instructed to do so. The videos make this very clear, and in general it’s a good idea, because manually changing any of that data can put one config server out of sync with the others. If the metadata can’t be synchronized to all of them, chunks can’t be balanced and the config server that’s out of sync will be considered broken. Manual intervention is then needed to fix the failing config server.
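That special database is the config database, which you can safely read (but should never write to) through a mongos:

```
// Read-only peek at the cluster metadata via a mongos connection.
var cfg = db.getSiblingDB("config");
cfg.shards.find()            // the shards that make up the cluster
cfg.databases.find()         // which databases have sharding enabled
cfg.chunks.find().limit(5)   // how sharded data is split into chunks
```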

Using ShardingTest

One of the most useful improvements to the Mongo shell that caught my attention was the ability to set up a sharded cluster with relative ease using a ShardingTest object. Of course, this is only for testing purposes because the complete cluster will be running on your current machine. During the fifth and sixth week of the course I made a habit of using tmux with a vertical split so that on one side I could run the test cluster and on the other I had my connection to the mongos to try out commands.

First, you need to start the cluster. You start by invoking the mongo shell without a database connection (mongo --nodb). Then run something similar to the code below, which I took from the course and formatted a bit, to start a small cluster for testing purposes.

config = {
  d0 : { smallfiles : "",
         noprealloc : "",
         nopreallocj: "" },
  d1 : { smallfiles : "",
         noprealloc : "",
nopreallocj: "" },
  d2 : { smallfiles : "",
         noprealloc : "",
         nopreallocj: "" } };
cluster = new ShardingTest( { shards : config } );

Once you hit return on that last line your screen will fill up with lots of logging, most of which is just the different nodes booting and connecting into a sharded cluster. You could wait until the stream slows down, but the mongos process may already be accepting connections on port 30999.
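From a second mongo shell you can then connect to that mongos and poke around; this assumes the defaults used by the ShardingTest setup above:

```
// In another shell, started with: mongo --nodb
db = connect("localhost:30999/admin");  // 30999 is ShardingTest's default mongos port
sh.status()                             // should list the three test shards
```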

When you’re done, interrupt the mongo shell where you started the cluster and it will shut everything down. I tried scripting this and having the mongo shell execute it, but once it hits the end of the script it shuts down the cluster. I see no other way than running this while keeping the mongo shell open.

