On August 12th the first M101JS (MongoDB for Node.js developers) course started. Previously I completed both the M101J (MongoDB for Java developers) and M102 (MongoDB for DBAs) courses available through 10gen’s education site.
When I finish this course I will have refreshed my MongoDB knowledge and gained some experience with Node.js in the form of a blog application. I decided to document my progress in (at least) weekly summaries. This will be part three of seven. You can read part two here.
For this series of blog posts I will structure each post into an introduction (you just passed it) and two sections. The first section, Lectures, will summarize what I learned from the videos and related examples. In the second section, Homework, I will mention what I learned from practicing the homework assignments and anything I might have done extra because of it.
The topic of the third week is schema design. A well-performing application needs a good schema design. Poor decisions might lead to many round trips to the database or even introduce inconsistencies in the data.
The lectures started off with comparisons to relational databases, because that is a field most people will be familiar with. When you use a relational database you get transactions, foreign key constraints and joins, features that are very desirable when dealing with data. Transactions allow for transitions from one consistent state to another. Foreign key constraints enforce valid references between data across several tables, which prevents orphaned data. Joins allow you to aggregate data from several tables into one presentation. These features don’t have a direct mapping in MongoDB.
Transactions in MongoDB
MongoDB guarantees atomic operations within a single document. Your schema design will influence how you access and modify your data. This is maybe best explained with an example.
Let’s borrow from the course’s pet project, the blog. In the design, the document for a blog post contains an array with all its comments. If the programmer always pulls and pushes the full array after changing it, there might be data loss. There might be a moderator removing comments while somebody else is trying to post a new comment. These actions might interfere with each other: a removed comment could reappear, or a new comment might not be added. One way to handle this properly is by using MongoDB’s array modification operators to push to or pull from the comments array.
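As a sketch, the two competing writes can each be expressed as a single atomic array operation instead of replacing the whole array. The helpers below only build the update documents (the collection layout and field names are assumptions based on the course’s blog schema):

```javascript
// Sketch: build atomic update documents for the blog's comments array.
// Field names (comments, author, body) are assumptions based on the course schema.

function pushCommentUpdate(comment) {
  // $push appends one element atomically; no read-modify-write of the array.
  return { $push: { comments: comment } };
}

function pullCommentUpdate(author, body) {
  // $pull removes matching elements atomically, so a concurrent $push
  // of a different comment cannot be lost.
  return { $pull: { comments: { author: author, body: body } } };
}

var addOp = pushCommentUpdate({ author: 'alice', body: 'Nice post!' });
var removeOp = pullCommentUpdate('spammer', 'Buy cheap meds');

console.log(JSON.stringify(addOp));
console.log(JSON.stringify(removeOp));
```

In a real application each of these would be passed as the update document of a single `collection.update(...)` call, so the moderator’s removal and the visitor’s new comment never overwrite each other’s version of the array.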
Foreign keys and joins in MongoDB
It’s not (yet) possible to put constraints on data similar to how foreign keys work in relational databases. I think enforcing them might even slow down database interaction with heavily distributed data (the sharding lectures, which deal with distributed data, are part of next week). This means that the application is responsible for ensuring this kind of consistency. However, because MongoDB is document based, there is a good alternative to using foreign keys.
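Since the database won’t enforce the reference for you, the usual pattern is to store another document’s _id and resolve it in application code. A minimal sketch, with plain arrays standing in for collections and illustrative field names:

```javascript
// Sketch: manual references instead of foreign keys. Plain arrays stand in
// for collections here so the example runs without a database.
var cities = [{ _id: 1, name: 'Amsterdam' }];
var inhabitants = [{ _id: 100, name: 'Joe', city_id: 1 }];

// Nothing stops you from inserting an inhabitant whose city_id matches no
// city; checking that is the application's job.

// Application-side "join": resolve the reference with a second lookup.
function cityOf(inhabitant) {
  return cities.filter(function (c) {
    return c._id === inhabitant.city_id;
  })[0];
}

console.log(cityOf(inhabitants[0]).name); // → 'Amsterdam'
```

With a real driver the lookup would be a second `findOne` on the cities collection, which is exactly the extra round trip that embedding avoids.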
Why and what to embed
The video lectures refreshed my memory on why and what to embed. Again using terminology drawn from his relational database background, Andrew explained how to weigh the pros and cons of embedding data in different situations.
The main benefit of embedding, which is basically denormalizing your data, is that you only need to retrieve one document to get all the data. This saves round trips to the database when collecting the data you want to display and, on the database side, reduces the number of I/O seek operations needed to find the right data. However, putting everything in one big document has drawbacks too.
For one, the document might get too big (surpassing the 16 MB limit), which means it can’t be stored at all. Big documents take up a lot of space in general. MongoDB keeps data in memory for faster access, and large documents make that memory usage more expensive.
Another big drawback to humongous documents is data duplication: embedding everything denormalizes all your data, which naturally leads to lots of duplicates. That data takes up space and is hard to update consistently. MongoDB’s atomicity guarantees apply to a single document; for consistent updates across many documents you would need to lock the collection, which is not desirable (and perhaps not possible) in production environments.
There is also a danger in unpredictable document sizes. This wasn’t part of the lecture proper, but it was mentioned briefly as a con of embedding. MongoDB stores all data for a collection in one file and tries to estimate how much space to allocate for new documents. If a document grows larger than the space reserved for it, MongoDB has to move it to newly allocated space in that collection file. The old space becomes available for another document that might fit there. If lots of these migrations happen, you can end up with a lot of holes in the collection’s file. This is a burden on performance, because extra I/O seek operations may be needed to find your data.
When to embed
You might wonder when to embed and what the alternative is. It truly depends on how you want to use the data. That point kept coming back: you model your schema based on your usage patterns. Still, there are some general pointers based on the relations you would use in a relational model.
Most of the time it’s smart to embed data when the relation between two things is one-to-one. It saves round trips, but may waste memory if you don’t use all of the data all of the time.
If only a few items relate to one parent (in the blog project, for instance, a post is assumed to have at most a few comments), you can embed those items inside the parent document. Embedding is probably not a good fit when the relationship is closer to a city and its inhabitants; in that case you probably want each inhabitant document to reference the unique id of the city document.
Finally, in a many-to-many relation you should always link through references to avoid duplicating data. Embed only when it doesn’t duplicate data, or when the application can guarantee the duplicates will eventually be consistent throughout the database.
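The two shapes described above can be sketched side by side as document literals (all names here are illustrative, not taken from the course data):

```javascript
// One-to-few: embed the comments directly in the post document.
// One find() returns the post and everything needed to render it.
var post = {
  _id: 'post1',
  title: 'Schema design',
  comments: [{ author: 'alice', body: 'Nice post!' }]
};

// One-to-many (city/inhabitants): store a reference to the city's _id
// instead of embedding thousands of inhabitants in one city document.
var city = { _id: 'ams', name: 'Amsterdam' };
var inhabitant = { _id: 'p1', name: 'Joe', city_id: 'ams' };

console.log(post.comments.length, inhabitant.city_id === city._id);
```

The first shape is read with a single query; the second needs a follow-up query on the referenced _id but keeps each document small and duplication-free.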
Modeling rich data with MongoDB
It’s easier to model rich data through embedding. This is explained in the lecture on how to represent tree structures in the database. There is a tutorial available here; it’s more elaborate than what the video lecture showed.
In the lecture the example of categories and sub-categories is used. An online store might have different categories: book (1), video (2), music (3). In books there are the categories fiction (20) and non-fiction (21). The fiction category has horror (45), science fiction (60) and perhaps detective (99) novels. You want to find all the ancestors of the science fiction category in the right order to display a breadcrumb.
In a relational database there might be an extra table with pairs linking each category to an ancestor. This means you have to start with the ancestor of the lowest category (science fiction) and iterate your way up. Finally, you reverse the array you built to get the correct order for a breadcrumb.
In MongoDB you can store ancestors in an array inside the document. All you have to do is query for the array and arrange it properly. This is much shorter, since you only need one trip to get the complete array.
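A runnable sketch of that idea, using the category names and ids from the lecture’s example (the exact document shape, with ancestor names stored alongside their ids, is my assumption; the tutorial also shows a variant that stores ids only):

```javascript
// Sketch: each category document carries its ancestors, ordered from the
// root down to the immediate parent.
var scifi = {
  _id: 60,
  name: 'Science Fiction',
  ancestors: [
    { _id: 1, name: 'Book' },
    { _id: 20, name: 'Fiction' }
  ]
};

// One document fetch gives the whole breadcrumb; no iterating up a tree
// and no reversing, because the array is already in root-first order.
function breadcrumb(category) {
  return category.ancestors
    .map(function (a) { return a.name; })
    .concat(category.name)
    .join(' > ');
}

console.log(breadcrumb(scifi)); // → 'Book > Fiction > Science Fiction'
```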
Homework

The third week’s homework consisted of two assignments to update the blog project and one standalone assignment to change an array inside documents.
The first assignment was to remove the lower of the two scores of type ‘homework’. Other types, like quizzes or exams, weren’t supposed to be touched.
There were two problems I had to overcome in this assignment. One was properly removing the lowest score from the array of scores; the other was stopping the application once it was done. Knowing when to close the database connection was apparently still a problem for me. I hacked my way to a solution, often re-importing the data when an error interrupted the application halfway through the modification process.
The API gives two options for updating documents. One is the save method on a Collection object, which does a full document replace with upsert. The documentation mentions it is “not recommended for efficiency”, and it turns out I had best avoid it altogether, because it’s what kept crashing my application. The better option is the regular update method on a collection, which gives you more control over exactly what you are updating instead of doing a risky full document replacement that might even fail.
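A sketch of the targeted-update approach for this assignment: find the lowest ‘homework’ score in application code, then build a $pull update that removes exactly that entry. Only the update document is built here, so no database is needed; the field names (scores, type, score) match the course dataset:

```javascript
// Find the lowest score of type 'homework' in a student document.
function lowestHomework(doc) {
  return doc.scores
    .filter(function (s) { return s.type === 'homework'; })
    .reduce(function (min, s) { return s.score < min.score ? s : min; });
}

// Build the $pull update that removes only that homework entry, instead
// of replacing the whole document with save().
function removeLowestUpdate(doc) {
  var lowest = lowestHomework(doc);
  return { $pull: { scores: { type: 'homework', score: lowest.score } } };
}

var student = {
  _id: 1,
  scores: [
    { type: 'exam', score: 90 },
    { type: 'homework', score: 25 },
    { type: 'homework', score: 70 }
  ]
};

console.log(JSON.stringify(removeLowestUpdate(student)));
```

In the real homework the result would be passed to something like `collection.update({ _id: doc._id }, removeLowestUpdate(doc), callback)`.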
When do you close the connection? Simple: when you stop using it. In my Node.js application I wanted to end with the checking function given in the homework assignment, which prints the value you need to pass the assignment. That seemed the ideal place to close the connection, so I put the db.close() at the end of the callback for that checking function.
The next step was to figure out when to call that check function. At first I put it after the find(…) call, whose callback modified each document. But callbacks run asynchronously, so putting the check after the loop just meant the database connection might close at any time: before, during, or after the loop’s modifications.
The trick is to check whether the current document in the loop is null, which signifies the cursor is exhausted and the loop won’t run again. I can imagine that closing right after seeing that null could still cut off asynchronous callbacks that haven’t finished, but in practice it worked.
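The stopping condition can be illustrated with a small simulation. Here fakeCursorEach and fakeDb are stand-ins I made up for the driver’s cursor.each and the db object, so the sketch runs without MongoDB; the 1.x-era driver really does invoke the each callback one final time with a null document:

```javascript
// Stand-in for cursor.each: invoke the callback once per document,
// then once more with doc === null to signal "no more documents".
function fakeCursorEach(docs, callback) {
  docs.forEach(function (doc) { callback(null, doc); });
  callback(null, null);
}

// Stand-in for the db object, recording whether close() was called.
var fakeDb = { closed: false, close: function () { this.closed = true; } };
var processed = [];

fakeCursorEach([{ _id: 1 }, { _id: 2 }], function (err, doc) {
  if (doc === null) {
    fakeDb.close(); // last callback: the safe-looking point to close
    return;
  }
  processed.push(doc._id); // in the real homework, update the document here
});
```

As noted above, with the real asynchronous driver the updates issued inside the loop may still be in flight when the null arrives, so this is a pragmatic stopping point rather than a bulletproof one.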
The blog assignments were easy to finish; no need to worry about callbacks, you just look up the proper syntax for the calls you make and you’re done. I did find out that throughout the MongoDB Node.js driver there are options objects with a deprecated ‘safe’ option. This is somehow not reflected in the documentation, but it is mentioned in the API reference.
The ‘safe’ option was there to make sure the callback fired only after the operation completed; otherwise it might be called immediately while the database action was still running. In the case of an insert you really need to wait, since the resulting object might still be undefined even if the call eventually succeeds. The ‘safe’ option has now been replaced by the write-concern option ‘w’, which is more in line with other drivers.
Week four is all about performance: how do you keep queries fast, or make them even faster? We will learn how to measure and improve the performance of interactions with MongoDB, and hopefully also a bit about keeping the Node.js application from becoming the bottleneck.