MongoDB 3 Succinctly®
by Zoran Maksimovic

CHAPTER 1

MongoDB Overview


MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling. MongoDB is available for free under the GNU Affero General Public License (AGPL), and a commercial license is also available as part of the company’s commercial offering. In this book, we will discuss the functionalities offered by the free version.

MongoDB is one of many implementations of the so-called NoSQL databases, and it’s currently one of the biggest players in this segment of the market.

NoSQL and document databases

If you are new to the NoSQL world, we need to introduce a few concepts.

Broadly speaking, there are three categories of databases:

As developers, we are typically more familiar with relational databases, such as Microsoft SQL Server, Oracle, MySQL, and Postgres, and with the way those databases organize data in a tabular format. It is also true that, historically, relational databases have been the most widely used, especially in the corporate world.

NoSQL (originally referring to "non SQL," "non-relational," or "not only SQL") is another type of database that offers a mechanism for storing and retrieving data, and it usually handles data in a different way than relational databases.

NoSQL databases exist in order to solve particular problems for particular domains, and are not a “silver bullet” for any kind of issue, as they have their pros and cons. Some of the main problems that NoSQL databases try to solve are issues of scalability and quantity of data.

With regard to the CAP theorem, NoSQL databases often compromise consistency in favor of availability and partition tolerance. In the NoSQL world, “eventual consistency” is often accepted in order to achieve speed and scalability.

Scalability

One of the advantages of NoSQL databases is their support for horizontal scalability (or scaling out), which is also available in relational systems but is more limited and expensive there. Horizontal scalability means that we can expand the capacity of the system by adding more servers (nodes). The performance is then almost linearly proportional to the number of nodes that are part of the system.

Horizontal scalability differs from vertical scalability (scaling up), where we handle more data by upgrading the server itself: adding more memory, disk space, CPU, and so on.

Scaling out is generally the cheaper and more flexible choice because it uses regular commodity hardware, while scaling up is typically much more expensive because the cost of the hardware tends to exponentially increase as it becomes more sophisticated, and in the end its expansion has more limitations.

Figure 1: Horizontal Scalability

Implementations

NoSQL databases include a wide variety of implementations (typically not encompassing the tabular format) that were developed in response to a rise in the volume of data. Listed here are the various flavors:

  • Document databases pair each key with a complex data structure known as a document (MongoDB, Couchbase Server, CouchDB, RavenDB, and others).
  • Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or "key"), together with its value (DynamoDB, Windows Azure Table Storage, Riak, Redis, LevelDB, Dynomite).
  • Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.
  • Graph stores are used to store information about networks, such as social connections. Graph stores include Neo4J and HyperGraphDB.

NoSQL: What is missing?

Compared to an RDBMS, NoSQL databases usually offer little or no support for the following:

  • JOINs (INNER, OUTER, etc.): Data is accessed at the document level, and therefore the links between objects have to be handled at the application level.
  • Complex transactions: NoSQL databases often offer eventual consistency, and typically work on single items rather than on batches of updates.
  • Constraints: Constraints are not implemented at the database level, but at the application level.

Database structure

MongoDB, as we mentioned previously, is a document database, and it’s quite simple when it comes to data representation. The database in its simplest form consists of two items: the document, which contains data, and the collection, which is a container of documents.

Documents

A document is a data structure composed of field and value pairs. A document is basically a JSON object that MongoDB stores on disk in a binary (BSON) format.

Figure 2 shows an example of a document representing a user. It is no different from any other JSON representation, so you should be familiar with the format. As we are going to see later, some conventions are used, such as the _id field, which is the primary key of this document; in that sense, <User1> is simply the value of the primary key.

Figure 2: Example of a very simple document.
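To make the idea of field and value pairs concrete, here is a minimal sketch of such a document written as a Python dictionary. Only the _id field and the User1 value come from the text; the remaining field names and values are hypothetical and serve as illustration only.

```python
import json

# Hypothetical user document; apart from _id, all fields are illustrative.
user = {
    "_id": "User1",                                   # primary key of the document
    "name": "Ada Example",
    "email": "ada@example.com",
    "tags": ["admin", "editor"],                      # a value can be an array...
    "address": {"city": "Zurich", "country": "CH"},   # ...or a nested document
}

# A document that holds only JSON-compatible values survives a JSON round trip.
as_json = json.dumps(user)
restored = json.loads(as_json)
print(restored["_id"], restored["address"]["city"])
```

The nested address value hints at why documents are more flexible than flat table rows: a single field can carry a whole structure.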

BSON is a binary-encoded serialization of JSON. However, BSON supports more data types than JSON (for example, the Date type), and it can be compared to Google’s ProtoBuf. (You can find more information about the BSON format in the BSON specification.)
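The lack of a date type in JSON is easy to observe from Python. The sketch below uses only the standard json module (it does not encode BSON) and a hypothetical event document: serializing a datetime value fails, so a JSON-only application has to fall back to a convention such as an ISO 8601 string, whereas BSON can carry the date natively.

```python
import json
from datetime import datetime, timezone

# Hypothetical document containing a date value.
event = {"_id": "Event1", "created": datetime(2017, 1, 1, tzinfo=timezone.utc)}

# Plain JSON has no date type, so the standard encoder rejects the datetime.
try:
    json.dumps(event)
    json_accepts_datetime = True
except TypeError:
    json_accepts_datetime = False

# A common workaround is to store the date as an ISO 8601 string instead.
as_text = json.dumps({**event, "created": event["created"].isoformat()})
print(json_accepts_datetime, as_text)
```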

Tip: In the RDBMS world, we can think of a document as representing a “record” of a table.

There is one hard limit on document size: a maximum of 16 megabytes. This limit hasn’t changed, even in the latest version of the database. That makes sense, as the maximum size limitation ensures that a single document cannot use an excessive amount of RAM or bandwidth. To store documents larger than the maximum size, MongoDB provides the GridFS API, which will be discussed in Chapter 10.
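An application can guard against the limit before it attempts to store a document. The following is a rough sketch only: it measures the UTF-8 JSON encoding as a stand-in for the real BSON size, which will differ somewhat, and the helper names are our own.

```python
import json

MAX_DOCUMENT_BYTES = 16 * 1024 * 1024  # the 16 MB limit described above


def approx_size_bytes(document):
    """Rough estimate: length of the UTF-8 JSON encoding (not the BSON size)."""
    return len(json.dumps(document).encode("utf-8"))


def fits_in_one_document(document):
    """Return True when the estimated size is within the 16 MB limit."""
    return approx_size_bytes(document) <= MAX_DOCUMENT_BYTES


small = {"_id": "User1", "bio": "x" * 1_000}
large = {"_id": "User2", "blob": "x" * (17 * 1024 * 1024)}
print(fits_in_one_document(small), fits_in_one_document(large))
```

A document that fails such a check is a candidate for GridFS or for splitting into several smaller documents.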

Collections

MongoDB stores documents in collections. A collection can be seen as analogous to a table in an RDBMS. Every document in the collection, unless otherwise specified, has an _id automatically assigned by the database. One thing to note is that a collection is not like a table, in which the set of columns (attributes) has to be predefined; collections are schema-less, so a collection can contain any kind of content. Although it is technically possible, it is not very practical to keep disparate sets of data in one collection, except in some very particular use cases (data collection, logs, etc.). Typically, objects are serialized at the application level and then stored in the database. Therefore, even though the schema is not enforced, some sort of control over the data in a collection will exist.
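The schema-less nature of a collection can be sketched with an ordinary Python list. This is an in-memory illustration, not MongoDB code: the insert helper is hypothetical, and it uses a random UUID as a stand-in for the ObjectId value that MongoDB would assign.

```python
import uuid


def insert(collection, document):
    """Store a copy of the document, assigning an _id when none is given."""
    stored = dict(document)
    if "_id" not in stored:
        # Stand-in for the automatically assigned identifier (an assumption).
        stored["_id"] = uuid.uuid4().hex
    collection.append(stored)
    return stored["_id"]


users = []
insert(users, {"_id": "User1", "name": "Ada"})
# A document with different fields and no _id is accepted just as well.
generated_id = insert(users, {"name": "Grace", "age": 42})
print(len(users), generated_id)
```

Nothing stops the two documents from having different fields; any consistency between them comes from the application code that builds them.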

Thinking in documents

One of the biggest and most fundamental differences between relational databases and MongoDB is data modeling and the way to represent the structure of the data.

Figure 3: MongoDB data structure organization.

In MongoDB, data relationships can be represented either by embedded documents or by references. References largely correspond to the use of foreign keys in an RDBMS; MongoDB does support joins, but that support is quite limited.

Technically, there is a way to join two collections by using the $lookup stage of the aggregation framework, or via LINQ queries (which use the aggregation framework underneath in order to construct queries).
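To give an idea of what such a join looks like, the sketch below shows the shape of a $lookup stage (with its from, localField, foreignField, and as fields) as a Python dictionary, followed by a simplified in-memory emulation of its behavior. The collection and field names are hypothetical, and the emulate_lookup helper is our own approximation for illustration; it does not talk to a MongoDB server.

```python
# Shape of a $lookup stage; collection and field names are hypothetical.
lookup_stage = {
    "$lookup": {
        "from": "addresses",        # collection to join with
        "localField": "_id",        # field in the input documents
        "foreignField": "user_id",  # field in the "from" collection
        "as": "addresses",          # output array field
    }
}


def emulate_lookup(local_docs, foreign_docs, spec):
    """Simplified emulation: attach all matching foreign documents as an array."""
    result = []
    for doc in local_docs:
        matches = [
            foreign for foreign in foreign_docs
            if foreign.get(spec["foreignField"]) == doc.get(spec["localField"])
        ]
        # Documents without matches are kept with an empty array (left outer join).
        result.append({**doc, spec["as"]: matches})
    return result


users = [{"_id": "User1", "name": "Ada"}, {"_id": "User2", "name": "Grace"}]
addresses = [{"_id": "Addr1", "user_id": "User1", "city": "Zurich"}]
joined = emulate_lookup(users, addresses, lookup_stage["$lookup"])
print(joined)
```

Note that the user without an address still appears in the result, which is what distinguishes a left outer join from an inner join.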

Referencing documents

Referencing documents can be seen as the standard way of normalizing data in an RDBMS, where tables are linked by a foreign key. MongoDB, in this sense, is not any different.

In a nutshell, by normalizing data into individual collections, we are able to link the data in a very efficient manner by using the primary key (as a foreign key).

Let’s consider the example shown in Figure 4, where we have a user document linked to an address and to a contact. We can see how the user_id (primary key of the user document) is used to link the documents together.

When we link collections in this way, we are forced to issue multiple queries in order to retrieve the information, as there is no general equivalent of the joins we would use in an RDBMS. (However, there is the $lookup aggregation stage, which acts as a LEFT OUTER JOIN and was introduced in version 3.2 of MongoDB.)

Figure 4: Referencing documents.
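The multiple-query pattern can be sketched in plain Python, with in-memory lists standing in for the three collections. The data and the load_user_profile helper are hypothetical; each lookup in the function corresponds to one query that an application would send to the database.

```python
# In-memory stand-ins for three collections; all data is illustrative.
users = [{"_id": "User1", "name": "Ada"}]
addresses = [{"_id": "Addr1", "user_id": "User1", "city": "Zurich"}]
contacts = [{"_id": "Cont1", "user_id": "User1", "phone": "555-0100"}]


def load_user_profile(user_id):
    """Assemble a profile at the application level, one 'query' per collection."""
    user = next(u for u in users if u["_id"] == user_id)                # query 1
    user_addresses = [a for a in addresses if a["user_id"] == user_id]  # query 2
    user_contacts = [c for c in contacts if c["user_id"] == user_id]    # query 3
    return {"user": user, "addresses": user_addresses, "contacts": user_contacts}


profile = load_user_profile("User1")
print(profile)
```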

Embedding documents

By embedding documents, we are able to combine all the content into one document.

The same example can be represented simply, as shown in Figure 5. MongoDB offers a way to update the address or contact information directly, but this also means issuing an update to a document.

Figure 5: Embedding documents.
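As a sketch of the embedded form, the same user can be written as one dictionary with nested address and contact values. The set_path helper below is hypothetical; it mimics, in memory, the effect of an update that addresses a nested field with dot notation, such as a $set on "address.city".

```python
# Hypothetical user document with the address and contact embedded.
user = {
    "_id": "User1",
    "name": "Ada",
    "address": {"city": "Zurich", "country": "CH"},
    "contact": {"phone": "555-0100"},
}


def set_path(document, dotted_path, value):
    """Set a nested field addressed by a dotted path such as 'address.city'."""
    *parents, leaf = dotted_path.split(".")
    target = document
    for key in parents:
        target = target.setdefault(key, {})  # create missing levels on the way
    target[leaf] = value


set_path(user, "address.city", "Geneva")
print(user["address"])
```

Everything about the user now travels in a single read or write, at the cost of touching the user document whenever the address or contact changes.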

Document design strategy

As we have seen, there are mainly two ways of linking the documents. However, the need for one or the other would have to be carefully weighed, as it obviously can have some side effects. Here are some recommendations to follow:

  • Embed as much as possible: The document database should eliminate quite a lot of joins, and therefore, the option we have is to put as much as possible in a single document. This way, the advantage is that saving and retrieving a document is atomic and very fast. There is no need to normalize data. Therefore, embed as much as possible, especially the data that is not being used by other documents.
  • Normalize: Normalize data that can be referred to from multiple places into its own collection. This means creating reusable collections (for example, country or user). It is more efficient to keep values that would otherwise be duplicated in only one place.
  • Document size: The maximum document size in MongoDB is 16 MB. The limit is imposed mainly in order to ensure that a single document cannot use an excessive amount of RAM or bandwidth. This is quite a large quantity of text data (just think how much data is usually displayed on a single web page). In most cases, this limit is not a problem; however, it’s good to keep it in mind and avoid premature optimizations.
  • Complex data structures and queries: MongoDB can store arbitrarily deep nested data structures, but cannot search them efficiently. If your data forms a tree, forest, or graph, you effectively need to store each node and its edges in a separate document.
  • Consistency: MongoDB makes a trade-off between efficiency and consistency. The rule is that changes to a single document are atomic, while updates to multiple documents should never be assumed to be atomic. When designing the schema, consider how to keep your data consistent. Generally, the more that you keep in a document, the better, as stated in the first point of this list.
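The recommendation about trees and graphs can be illustrated with a small sketch in which every node is its own document and carries a reference to its parent. The field name parent and both helper functions are assumptions made for this example; the list stands in for a collection.

```python
# One document per node; the "parent" field is a hypothetical naming choice.
nodes = [
    {"_id": "root", "parent": None},
    {"_id": "a", "parent": "root"},
    {"_id": "b", "parent": "root"},
    {"_id": "a1", "parent": "a"},
]


def children_of(node_id):
    """Return the ids of the direct children (one lookup on the parent field)."""
    return [node["_id"] for node in nodes if node["parent"] == node_id]


def descendants_of(node_id):
    """Walk the tree iteratively, issuing one children lookup per visited node."""
    found = []
    stack = [node_id]
    while stack:
        current = stack.pop()
        for child in children_of(current):
            found.append(child)
            stack.append(child)
    return found


print(children_of("root"), sorted(descendants_of("root")))
```

Because each node is a small, flat document, a lookup on the parent field stays simple, whereas the same search inside one deeply nested document would require walking the whole structure.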

Pluggable storage engine

As modern applications need to support a variety of workloads with different price and performance profiles—from low-latency, in-memory read-and-write applications, to real-time analytics—MongoDB started offering support for pluggable storage engines to achieve the goal of having the same programming API model, but with different implementations.

At the time of writing, MongoDB supports the following engines:

  • MongoDB built-in engine: MMAPv1 engine, which is an improved version of the engine used in prior MongoDB releases.
  • MongoDB built-in default engine: The new WiredTiger storage engine, which provides significant benefits in terms of lower storage costs (better compression), greater hardware utilization, higher throughput, and more predictable performance than the MMAPv1 engine. Some benchmarks show 7-10x higher performance with this engine.
  • MongoDB engine (only enterprise edition): The in-memory storage engine designed to serve ultra-high throughput.
  • Facebook is supporting MongoRocks, a MongoDB storage engine based on Facebook’s RocksDB-embedded database project.

With these choices in mind, developers can choose the appropriate engine based on their application needs.

Sharding

We have seen that one of the advantages of the NoSQL database is the ability to scale horizontally, and the technique used in MongoDB is sharding.

Figure 6: Sharding.

Sharding is a type of database partitioning that separates large databases into smaller, faster, more easily managed parts (data shards).

In other words, instead of having to run a huge database on one server, MongoDB offers the ability to separate the load and partition (divide) the data into smaller chunks that could run independently on their own. When writing and reading data from the database, the MongoDB engine will make sure the data gets collected or distributed to the nodes in question accordingly.
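The routing idea can be sketched in a few lines of Python. This is a conceptual illustration under our own assumptions, not MongoDB's actual algorithm: the shard names are hypothetical, and a SHA-256 hash of the chosen key simply decides which shard receives each document.

```python
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]  # hypothetical shard names


def shard_for(shard_key_value):
    """Pick a shard deterministically from a hash of the shard key value."""
    digest = hashlib.sha256(str(shard_key_value).encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


def route_documents(documents, shard_key):
    """Distribute documents across shards according to their shard key."""
    placement = {name: [] for name in SHARDS}
    for doc in documents:
        placement[shard_for(doc[shard_key])].append(doc)
    return placement


docs = [{"_id": f"User{i}"} for i in range(1, 10)]
placement = route_documents(docs, "_id")
print({name: len(items) for name, items in placement.items()})
```

Because the same key always hashes to the same shard, a read for a given key only needs to visit the one node that holds it.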

Conclusion

In this chapter, we have seen what MongoDB is and how it compares to a relational database. We looked at the database structure and at how data is organized within the database. The emphasis has been placed on the fact that documents are quite different from the usual tabular form.

In the end, we saw how to install the database and which tools come as part of the database. We are now ready to start using the database and its features.
