M202 July 2014 Fifth and sixth week

Shard keys and tags

The first decision when sharding a collection is on the shard key to use. The shard key determines which field is used to split data in to chunks on the cluster. Each chunk will have data between a lower and upper bound based on the value of the shard key in the data. A mongos uses the shard key to determine where to send a particular request. In case the request doesn’t contain the shard key it will be sent to every shard. Shard keys are immutable, that’s why in every MongoDB course there’s an emphasis on picking the right one. The aim in any case is to increase throughput of the whole system by distributing the load over as many shards as possible, picking the right shard key is the most important decision to make.

Starting version 2.4 of it’s possible to use built-in hash-based shard keys. This option ensures that writes to the database are distributed evenly among all shards. Reads, however, will be sent to all shards if the shard key is not in the query. This is something to keep in mind when considering to use hash-based keys.

In some scenarios data should be tied to certain shards. One scenario is where user data from regions are tied to shards in the regional data center. You can achieve this through tagging. By tagging a chunk it will stored on any of the shards with the same tag, or any other tag if no shard has the tag. The videos didn’t go deep into this subject, mentioning it and encouraging discussion in the forum.

Chunk pre-splitting

Just like databases can be pre-heated to avoid the initial penalty of using them, it’s a good idea to pre-split the collection before putting data in. This will allow the cluster to divide the chunks evenly between the shards, increasing the chance of dividing the load of writing data to the cluster.

The shell provides some helper functions to split chunks at a certain value. Most of the pre-splitting work is manual, which means early on in the videos you’re already encouraged to script this process.

There are a few dangers when pre-splitting: you might create chunks that are never used or chunks that will eventually grow so big they will cause problems. One of the videos mentions jumbo chunks. These are chunks so large that they can’t be moved or split any more. They are the result of somehow getting too much data into a range, either through poor pre-splitting or heavy writing into the chunk.

When dealing with shards you definitely want to think ahead about the kind of data and how (fast) this data is coming in so you can create the right sharding strategies.

MongoDB balancer

A main process in the cluster is the balancing of chunks between shards. This is a task scheduled regularly but can also be triggered based on certain predefined parameters in MongoDB. For instance, writes to chunks are monitored and when a certain amount of documents is detected in a chunk it’s eligible for splitting. Another trigger is an imbalance between the number of chunks on each shard. Through video and reading documentation you get to learn the limits.

An important fact is that the mongo balancer uses metrics based on the number of chunks to balance out data. As I mentioned, there is a trigger for number of documents but there is none for the size of the documents. In other words, even if the balancer has balanced all the chunks it might not have balanced the data. It’s possible in a two shard system to have all the data on one shard, but an equal amount of chunks on each shard. Data imbalances like this were demonstrated in exercises and videos showed that chosing poor shard keys or pre-splitting on the wrong ranges can cause these kind of problems in a cluster.

Chunk migration

The cluster has a balancer, but this is not much more than a cooperation of the different processes in the cluster. A mongos process notices a reason to split or balance chunks out and instructs the primary node in the shard to do so. In case of splitting the chunk it’s merely some small administrative steps to update metadata, the more interesting part is the migration process between nodes.

A video showed the current steps in the migration process, but the details might change unannounced. A mongos process triggers the migration of a chunk. It’s current owner and destination are notified of the migration. The current owner takes a few steps to migrate the data, the destination takes a few to add the data and finally the current owner finishes by deleting the data in the chunk. In general not much can go wrong, both origin and destination need to agree the chunk is migrated before queries will be routed to the destination. Until that happens they will still go to the origin shard.

Orphaned data

Migrations can fail. A shard could contain data it isn’t supposed to have according to the config database but could return in particular situations. This data is considered to be orphaned data.

To be more specific, on the receiving end data comes in but is called orphaned until the data is successfully migrated and the receiving shard is marked as the owner of the chunk. While on the sending end data is orphaned once the shard is not the owner but the data isn’t deleted by the background process that finishes the migration.

When querying a shard, the chunk manager on the primary node will filter out any data from orphaned chunks, but secondary nodes are not aware of data being in an orphaned state. Any requests to those secondary nodes might return that orphaned data. Because only the node active as primary during the migration knows which data is in it’s chunks, after fail-over the new primary might not be aware of the data being orphaned. Deletion of orphaned data is not carried on by the new primary, so it will stay on the shard until explicitly removed.

As far as I could tell sharding and it’s configuration metadata are still an abstraction layer and the core functionality of any mongod process is to return available data when queried. Only the primary of a shard has a filter that will leave out data from chunks it doesn’t own, which means a secondary node might return orphaned data depending on the read preference of the client.

Recommended practice is to always connect to mongos processes instead of directly to any mongod in a sharded cluster. This will avoid most problems of orphaned data reaching the client, unless the read preference allows reading from secondary nodes. I didn’t get to test if I could make orphaned data clash with data from the owner of a chunk, it seems possible in exotic scenarios. Currently there are no official tools to detect and clean orphaned data, but one unofficial mongo shell script was mentioned with precautionary warnings.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s