Azure Cosmos DB and DocumentDB Succinctly®
by Ed Freitas

CHAPTER 1

DocumentDB Basics


Introduction

DocumentDB, as its name implies, stores data as "documents," which are actually JSON objects.

In a relational database, records are stored as rows in a table with specific columns, using a defined schema. In a document store, however, records are documents, and each one contains its own properties, schema-free.

To understand the differences between a relational and a document database, let’s consider the following table.

Table 1-a: Main Differences between a Relational and Document Database

| Relational Database | Document Database |
| --- | --- |
| Rows | Documents |
| Columns and data types | Properties |
| Strongly typed and defined schemas | Schema-free |
| Highly normalized | Mostly denormalized |
| Robust and mature | Simple and lean |
| Scales vertically (more hardware) | Scales out, horizontally |

A relational database table has a fixed structure, and in order to make changes to a table, such as adding a new column, changing a specific column to allow NULL values, or changing a data type, it is necessary to modify the table’s schema, which can create side effects for the existing data already stored in the table.

Because DocumentDB is schema-free, it is not subject to the constraints that relational databases are. In today’s world, where application and software development are associated with frequently evolving data schemas, document databases are a great match.

Let’s look at how some example data would be stored in a traditional relational table and how it would be stored in a document database.

Table 1-b: Example Data Stored in a Relational Database Table

| FirstName | LastName | Age | Active |
| --- | --- | --- | --- |
| John | Doe | 44 | Active |
| Cristiano | Ronaldo | 31 | NULL |
| Peter | Pan | 67 | Retired |

As we can see, the data is structured in a tabular format with all records having the same number of columns. Note that I use an Age column for simplicity; in a real database I'd use a Date-of-Birth column.

However, consider the record for Cristiano Ronaldo. This record has values in the FirstName, LastName, and Age columns, but the Active column doesn’t have a value; therefore, it is set to NULL. The other records have values in all their columns.

Although the tabular format can accommodate records that do not necessarily have values in the same number of columns (by using NULL values in columns that do not have a value), it still requires those records to have that column, since the schema enforces it.

If we were to accommodate this same data in a nonrelational document database such as DocumentDB, it would look as follows.

Table 1-c: Example Data Stored in a Nonrelational Document Database

| Document | Document Data |
| --- | --- |
| 1 | { "FirstName": "John", "LastName": "Doe", "Age": 44, "Active": "Active" } |
| 2 | { "FirstName": "Cristiano", "LastName": "Ronaldo", "Age": 31 } |
| 3 | { "FirstName": "Peter", "LastName": "Pan", "Age": 67, "Active": "Retired" } |

For illustrative purposes only, the data in the previous table is shown in a tabular format, which resembles the format a relational database table uses. However, a document database does not store the data in a tabular format, but instead as a binary representation of the JSON data.

What is known as a table in a relational database context is called a collection of documents in a document database or, simply put, a collection.

Notice the main differences between relational and nonrelational document databases: in a document database, records are called documents, and they store values only for the properties actually in use (properties correspond to columns in a relational table). Therefore, NULL properties do not need to be represented within documents, as they are not enforced by a schema.
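The difference can be sketched in a few lines of Python. This is only an illustration using the data from Table 1-b; the helper name row_to_document is made up for this example, and plain Python dictionaries stand in for JSON documents.

```python
# Rows from Table 1-b: every row carries every column,
# so a missing value has to be stored as NULL (None in Python).
COLUMNS = ("FirstName", "LastName", "Age", "Active")
rows = [
    ("John", "Doe", 44, "Active"),
    ("Cristiano", "Ronaldo", 31, None),
    ("Peter", "Pan", 67, "Retired"),
]


def row_to_document(row):
    """Build a document that keeps only the properties that have a value."""
    return {name: value for name, value in zip(COLUMNS, row) if value is not None}


# Hypothetical helper applied to each row; the second document has no "Active" property.
documents = [row_to_document(row) for row in rows]
```

The second document simply omits the Active property instead of carrying a NULL placeholder, which matches Table 1-c.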

Furthermore, a nonrelational document database allows documents to store subproperties, which cannot be stored explicitly within a relational table. An example of this would be replacing the FirstName and LastName properties with a FullName property with two subproperties called First and Last.

Table 1-d: A Document with Subproperties

| Document | Document Data |
| --- | --- |
| 2 | { "FullName": { "First": "Cristiano", "Last": "Ronaldo" }, "Age": 31 } |

This example shows that, by having subproperties, the document no longer shares the same structure as the other documents in the same collection, yet a document database accommodates this without difficulty.

This means that each document can have properties and subproperties that are not necessarily present in other documents, which allows for a high degree of flexibility and denormalization of the data stored.

Master-detail nested properties

In nonrelational databases, master-detail relationships are defined by having nested properties. In a relational database, the only way to achieve this is by having a master-detail relationship between two tables, sharing a common field between them.

Let’s consider this example of a master-detail relationship within a relational database.

Table 1-e: A Typical Master-Detail Relational Database Relationship

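| FirstName | LastName | Age | Active | PlayerId |
| --- | --- | --- | --- | --- |
| John | Doe | 44 | Active | 1 |
| Cristiano | Ronaldo | 31 | NULL | 2 |
| Peter | Pan | 67 | Retired | 3 |

| PlayerId | TeamName |
| --- | --- |
| 2 | Real Madrid |
| 2 | Manchester United |
| 1 | Mars Galactic Soccer |
| 3 | Fantasy FC |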

In a relational database, like the example above, the only way to indicate that Cristiano Ronaldo has played with both Real Madrid and Manchester United is by having a second (detail) table that contains the names of both teams and adding an extra column to the master table with a unique PlayerId that identifies the record.

This PlayerId column will also exist on the detail table and it will be responsible for linking the detail records with the corresponding record on the master table. This is a standard master-detail relational database relationship.

In a document database, this master-detail relationship is represented as a property with nested properties. Let’s explore how this would look for the record of the player Cristiano Ronaldo.

Table 1-f: A Document with Master-Detail Nested Properties

| Document | Document Data |
| --- | --- |
| 2 | { "FirstName": "Cristiano", "LastName": "Ronaldo", "Age": 31, "Teams": [ { "Team": "Real Madrid" }, { "Team": "Manchester United" } ] } |

We can see that the teams that would be stored in the detail table in a traditional relational database are now stored within an array called Teams, in which each team is a nested JSON object that contains a subproperty called Team.

Basically, the master-detail relationship is now described as an array of JSON objects. However, just because the master-detail relationship can be easily described as an array of JSON objects in this case, doesn’t mean it always has to be described as such.
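As a rough sketch of this mapping, the following Python snippet folds the master and detail tables from Table 1-e into nested documents shaped like Table 1-f. The function name embed_teams is invented for this illustration, and dropping PlayerId and NULL values is an assumption made to match the document shown above.

```python
# Master table from Table 1-e (NULL is represented as None).
players = [
    {"PlayerId": 1, "FirstName": "John", "LastName": "Doe", "Age": 44, "Active": "Active"},
    {"PlayerId": 2, "FirstName": "Cristiano", "LastName": "Ronaldo", "Age": 31, "Active": None},
    {"PlayerId": 3, "FirstName": "Peter", "LastName": "Pan", "Age": 67, "Active": "Retired"},
]

# Detail table from Table 1-e: (PlayerId, TeamName).
teams = [
    (2, "Real Madrid"),
    (2, "Manchester United"),
    (1, "Mars Galactic Soccer"),
    (3, "Fantasy FC"),
]


def embed_teams(players, teams):
    """Return one document per player with the detail rows nested in a Teams array."""
    documents = []
    for player in players:
        # Keep only properties that have a value; the join key is no longer needed.
        doc = {k: v for k, v in player.items() if k != "PlayerId" and v is not None}
        doc["Teams"] = [{"Team": name} for pid, name in teams if pid == player["PlayerId"]]
        documents.append(doc)
    return documents


documents = embed_teams(players, teams)
```

Here the shared PlayerId column disappears, because the detail rows now live inside the document they belong to.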

When to use a document database

The power and beauty of nonrelational databases is that you are free to implement and represent the relation between master-detail records in any way that best suits your application and business requirements.

It is also not uncommon in document databases to repeat some data so that each document has the data it needs without having to locate other documents. If the data repeats itself too much, then you may choose to organize the repeating data in different documents, similar to a traditional relational database. In any case, you are free to organize the structure of your JSON documents to what works best for your requirements and applications.

In essence, a document database gives you the freedom to model your data in a way that best fits your needs. 

Despite this flexibility, it’s important to understand that document databases are most suitable when working with data that can be organized into rich hierarchical documents that can be almost entirely self-contained.

If you find yourself modelling a database containing many related documents or documents with a flat structure, this is a sign that a document database is probably not the best option for your application.

When you need a database that scales, a document database is a great option. The primary reasons document databases can scale out are that they don’t impose complex or rigid rules on the data and they are simple and lean by design.

On the other hand, relational databases are better suited for handling complex requirements that do not necessarily have to scale out.

Why DocumentDB?

DocumentDB is Microsoft’s highly scalable document database API that runs on Azure Cosmos DB. Although it has all the characteristics of a typical document database, it also has features that are not available on any other document database. Let’s explore these features.

With DocumentDB, unlike other document databases where you explicitly need to define indexes, all properties are automatically indexed as soon as the document has been added to the database. This allows you to search any property within the document’s hierarchy, however deeply nested it might be.
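To get an intuition for what indexing every property means, here is a small Python sketch that walks each document and records every property path it finds. It is a conceptual illustration only: the path notation (such as /Teams/[]/Team) and the functions iter_paths, build_index, and lookup are assumptions made for this example, not a description of how the service builds its index internally.

```python
from collections import defaultdict


def iter_paths(value, prefix=""):
    """Yield (path, scalar) pairs for every property, however deeply nested."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from iter_paths(child, f"{prefix}/{key}")
    elif isinstance(value, list):
        for child in value:
            yield from iter_paths(child, f"{prefix}/[]")
    else:
        yield prefix, value


def build_index(documents):
    """Map each (path, value) pair to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for path, scalar in iter_paths(doc):
            index[(path, scalar)].add(doc_id)
    return index


def lookup(index, path, value):
    """Return the ids of documents whose property at `path` equals `value`."""
    return index.get((path, value), set())


documents = {
    1: {"FirstName": "John", "LastName": "Doe", "Age": 44, "Active": "Active"},
    2: {
        "FullName": {"First": "Cristiano", "Last": "Ronaldo"},
        "Age": 31,
        "Teams": [{"Team": "Real Madrid"}, {"Team": "Manchester United"}],
    },
}
index = build_index(documents)
```

With such an index, a lookup on a nested path like /Teams/[]/Team is no different from a lookup on a top-level property like /Age.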

Furthermore, documents are searchable using a special flavor of SQL that anyone with SQL experience can easily grasp and relate to in an intuitive way—this is DocumentDB’s very own SQL dialect.

Because DocumentDB runs on Azure Cosmos DB, it provides a server-side environment on which you can run JavaScript code that can update multiple documents with full transactional processing. This is a great and easy way to ensure data consistency among multiple documents.

Also, DocumentDB allows tunable performance for your application’s requirements, such as enhancing throughput, indexing, and consistency. Throughput can be scaled up or down instantly by changing the performance tier through the Azure Portal.

Although DocumentDB automatically indexes every property, you can still fine-tune the system to exclude any properties or documents that do not need to be indexed, which could even help to improve performance in very specific scenarios.

Even though DocumentDB supports traditional strong as well as eventual consistency (a slightly different form associated with distributed data systems), it also provides three additional options to give you greater control over the tradeoffs between performance and consistency.

All this functionality is nicely packaged as a fully managed and massively scalable Platform-as-a-Service (PaaS) solution that is very easy to set up and get started with. There’s nothing to install, no operating system or updates to manage, and no replicas to set up.

Through the Azure Portal, you can get up and running with DocumentDB in a matter of minutes by using a browser and having an Azure subscription. Take the necessary time to set up your Azure subscription properly.

Sounds exciting, so let’s explore some of these features with some additional details.

Rich SQL flavored queries

One of the best features of DocumentDB is that its native querying language is very similar to SQL. For those doing development with .NET, there’s a LINQ provider as well.

Although DocumentDB’s flavored queries are written in SQL, they are deeply rooted in JSON and JavaScript semantics. They allow you to query over hierarchical nested data and arrays within documents, and also to shape custom projections from the results of your queries. Let’s have a look at an example.

Code Listing 1-a: A DocumentDB SQL Flavored Query

-- SQL flavored query.
SELECT ch.First, ch.Last
FROM Players AS p
JOIN ch IN p.children
WHERE p.Age = 31

In this SQL-flavored query, we are doing a few things. The JOIN clause tells DocumentDB to iterate over the elements of the children array nested within each document in the Players collection.

The WHERE clause filters by checking the documents that have an Age property value equal to 31. Notice that dotted notation is used to refer to the properties within the document. The dotted notation can nest as far down as the document’s hierarchy goes.

Because the SELECT clause names two properties instead of using SELECT *, DocumentDB projects a new JSON object that contains only the properties being queried, instead of returning the whole document.
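The behavior of the three clauses can be mimicked with ordinary Python loops over in-memory documents. This is only a sketch of the query semantics described above, not of how the engine executes it; the sample documents and the function name run_query are invented for illustration.

```python
# Hypothetical sample documents; only the first two have a children array.
players = [
    {
        "FirstName": "Cristiano",
        "Age": 31,
        "children": [
            {"First": "Ana", "Last": "Silva"},
            {"First": "Rui", "Last": "Silva"},
        ],
    },
    {"FirstName": "John", "Age": 44, "children": [{"First": "Jane", "Last": "Doe"}]},
    {"FirstName": "Sam", "Age": 31},  # no children, so it contributes no results
]


def run_query(players):
    """Mimic: SELECT ch.First, ch.Last FROM Players AS p JOIN ch IN p.children WHERE p.Age = 31"""
    results = []
    for p in players:                      # FROM Players AS p
        for ch in p.get("children", []):   # JOIN ch IN p.children
            if p.get("Age") == 31:         # WHERE p.Age = 31
                # SELECT ch.First, ch.Last: project a new object with only these properties.
                results.append({k: ch[k] for k in ("First", "Last") if k in ch})
    return results


results = run_query(players)
```

Each result is a new two-property object built from one element of a children array, rather than a copy of the whole player document.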

Once a document is inserted into DocumentDB, it is pretty much searchable instantly, as it is automatically indexed. This indexing behavior can be tuned. However, there’s usually no need.

Client-side development

DocumentDB provides various SDKs, which allow easy integration from your preferred development platform. There are SDKs for several of the most common development platforms, such as .NET, Node.js, JavaScript, Java, and Python.

Like other platforms, DocumentDB also provides a REST/HTTP API, which can be used as long as the headers of the HTTP request contain valid authentication information and the request points to the right DocumentDB HTTP resource.

Working with the REST/HTTP API is the most primitive way to interact with DocumentDB; it can become very tedious when you need to focus most of your attention on the logic of your application. Therefore, we will focus on using the .NET SDK in this e-book when doing client-side development.

If an SDK isn’t available for your development platform, you can use the REST/HTTP API. However, you should also contact Azure support, let them know what platform you are using, and provide them your feedback. Hopefully, they can take your opinion into account and include an SDK for your development platform in their roadmap. The Azure team is very active and committed to the platform, so if there isn’t an SDK for your platform, it is likely because nobody has asked for it yet.

Server-side development

DocumentDB provides a server-side sandbox environment that lets you execute logic inside the database, where the data resides. Server-side logic in DocumentDB can be wrapped up into stored procedures, triggers, and user-defined functions (UDFs), which will be familiar to anyone who has worked with relational database systems like Oracle and SQL Server.

However, there’s a subtle difference between server-side logic written in DocumentDB and server-side logic written using traditional relational databases. In DocumentDB, server-side logic is written in JavaScript instead of SQL, which makes it a perfect companion to deal with JSON objects.

DocumentDB embraces JavaScript as a sort of modern-day SQL by supporting the transactional script execution natively inside the database engine.

Because DocumentDB is a fully hosted service, it cannot allow scripts that perform poorly to run indefinitely, as this could risk the integrity of the whole service. Therefore, it enforces a paradigm known as bounded execution, which basically determines how much time your logic can run before timing out.

All server-side logic execution is fully transactional, which means that if you update some documents and an error occurs, or one of your scripts times out due to bounded execution before it actually completes, then all updates up to that moment are rolled back automatically. If your server-side code completes successfully, then all updates are guaranteed to commit together. This makes DocumentDB an even more compelling option.
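The all-or-nothing behavior can be sketched in Python as follows. This is a conceptual model under stated assumptions, not the DocumentDB programming model: real server-side logic is written in JavaScript, and bounded execution limits running time, whereas this sketch uses a hypothetical max_operations budget as a stand-in. The names run_transaction and BoundedExecutionError are made up for the example.

```python
import copy


class BoundedExecutionError(Exception):
    """Raised when the hypothetical operation budget is used up."""


def run_transaction(collection, script, max_operations=100):
    """Run script(update) against a private copy; commit only if it finishes cleanly."""
    working = copy.deepcopy(collection)  # updates go to a private copy first
    budget = {"remaining": max_operations}

    def update(doc_id, changes):
        if budget["remaining"] <= 0:
            raise BoundedExecutionError("script exceeded its budget")
        budget["remaining"] -= 1
        working[doc_id].update(changes)

    try:
        script(update)
    except Exception:
        return False  # roll back: the original collection was never touched
    collection.clear()
    collection.update(working)  # commit: all updates become visible together
    return True


collection = {1: {"LastName": "Doe", "Age": 44}, 2: {"LastName": "Ronaldo", "Age": 31}}


def update_then_fail(update):
    update(1, {"Age": 45})
    raise RuntimeError("simulated failure after the first update")


committed = run_transaction(collection, update_then_fail)
```

Because the script fails after its first update, committed is False and document 1 still has its original Age value.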

Scalability

With NoSQL databases, scalability is key to success and DocumentDB delivers. It is already the back-end of choice for services like Xbox and Office OneNote, which rely on DocumentDB for databases containing tens of terabytes of JSON documents, over a million users, and an uptime availability of 99.9%.

To give you a sense of how much DocumentDB can scale: it can grow as much as you can afford or to the end of the available hardware in Azure’s data centers, whichever limit you reach first. This is a profound statement and a clear indication of what DocumentDB can handle and how much it can scale. It also serves as the foundation used for the creation of Cosmos DB.

In simple terms, DocumentDB can massively scale to hundreds of terabytes and even petabytes through thousands of nodes. It can scale up and down and also out.

It scales up and down seamlessly by adjusting a combination of computing power and storage capacity.

DocumentDB is also able to scale out by adding more collections. A collection of documents can essentially be seen as a scale unit. If your database grows beyond 10 GB, then you can scale out by simply adding more collections and then partitioning your data across multiple collections.
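One simple way to partition data across several collections is to hash a key from each document and use the hash to pick a collection. The sketch below assumes this hash-based strategy purely for illustration; the collection names and the function pick_collection are hypothetical, and other strategies (such as range or lookup partitioning) are equally possible.

```python
import hashlib


def pick_collection(partition_key, collection_names):
    """Deterministically map a partition key to one of the available collections."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(collection_names)
    return collection_names[bucket]


# Hypothetical collection names used as scale units.
collection_names = ["players-0", "players-1", "players-2"]
target = pick_collection("Ronaldo", collection_names)
```

The same key always maps to the same collection, so a document can be located again without searching every collection. Note that with this naive modulo scheme, adding a collection changes the mapping for most keys, so existing data would need to be redistributed.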

Consistency

Another important feature that makes DocumentDB great is its ability to tune consistency. There’s usually a tradeoff between performance and consistency.

Basically, strong consistency slows down reads and writes and eventual consistency doesn’t always return the most current data. With strong consistency, you get consistent query results as writers make changes to the database, but you pay a price in performance, as all queries must wait until all replicas have been updated with the latest changes, which obviously slows things down a bit.

Conversely, eventual consistency gives you the best performance, but you cannot fully rely on query results. They might return data that is not entirely consistent with what other users might be updating, given that not all replicas are necessarily up to date.

DocumentDB supports both of these consistency methods and three additional methods that fall somewhat in the middle of strong and eventual consistencies. These three additional methods are called bounded staleness, session, and consistent prefix.

Bounded staleness lets you tolerate slightly stale query results by guaranteeing that reads lag behind writes by no more than a specified bound, expressed as a period of time or a number of versions.

Session consistency, which is actually the default consistency method used within DocumentDB, can be thought of as a hybrid experience. Writers are guaranteed to have strong consistency for data that they have themselves written, while everyone else operates with eventual consistency.
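The idea behind session consistency can be modeled with a toy store that has a primary copy and one lagging replica. This is a deliberately simplified sketch, not the service’s replication protocol: the Store class, its methods, and the use of a version number as a session token are all assumptions made for illustration.

```python
class Store:
    """A primary copy plus one replica that lags until replicate() is called."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.primary_version = 0
        self.replica_version = 0

    def write(self, key, value):
        self.primary[key] = value
        self.primary_version += 1
        return self.primary_version  # acts as the writer's session token

    def replicate(self):
        self.replica = dict(self.primary)
        self.replica_version = self.primary_version

    def read(self, key, session_token=0):
        # Serve from the replica only if it has caught up with this session's writes.
        if self.replica_version >= session_token:
            return self.replica.get(key)
        return self.primary.get(key)


store = Store()
token = store.write("player:2", {"Age": 31})
own_read = store.read("player:2", session_token=token)  # the writer sees its own write
other_read = store.read("player:2")                     # another client may see stale data
store.replicate()
later_read = store.read("player:2")                     # the replica has now caught up
```

The writer reads its own write immediately, while a client without the session token sees nothing until replication catches up.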

Consistent prefix guarantees that reads never see writes out of order and that, in the absence of any further writes, the replicas within the group eventually converge. Azure Cosmos DB accounts that are configured with consistent prefix consistency can associate any number of Azure regions with their Azure Cosmos DB account.

DocumentDB allows consistency to be tuned and changed, giving you the flexibility to work with the approach that best suits your business requirements and needs.

More information about tunable data consistency levels in Azure Cosmos DB using the DocumentDB API can be found in the Azure Cosmos DB documentation.

Costs

Before the introduction of Cosmos DB, when DocumentDB was offered as a standalone service, costs were based on pre-set tiers and collections.

In DocumentDB prior to Cosmos DB, a collection was not only a unit of scale, but also directly related to costs and pricing. You would pay per collection, and each collection had a storage capacity of up to 10 GB.

With the introduction of Azure Cosmos DB, the old pricing scheme based on the combination of collections and tiers, which was not fully elastic, became obsolete.

The old DocumentDB S1, S2, and S3 performance levels did not offer the flexibility that DocumentDB API collections now offer going forward with Cosmos DB. This was because in the S1, S2, and S3 performance levels, both the throughput and storage capacity were pre-set and did not offer elasticity.

Azure Cosmos DB now offers the ability to customize your throughput and storage, offering you much more flexibility to scale as your application needs change. More information on this topic can be found in the Azure Cosmos DB documentation.

Summary

In this chapter we’ve quickly learned about some of the most prominent NoSQL document database characteristics and specifically about DocumentDB and how it compares to traditional relational databases.

In particular, we focused on describing how DocumentDB is designed from the ground up to scale out and work with schema-free hierarchical JSON documents, unlike traditional tabular database tables that require a schema and complex joins to piece together.

We also quickly described how DocumentDB enables instant searching across documents and how every property is automatically indexed when a document is added to a collection, allowing you to quickly and easily query them using a familiar SQL-like syntax.

Furthermore, we discussed how client and server-side programming are possible through various platform-specific SDKs and using JavaScript with full transactional processing.

Finally, we described how DocumentDB permits tunable consistency and supports elastic scaling. We also talked briefly about costs.

This sets the stage for the following chapters and gives us a solid starting point and a high-level understanding of what DocumentDB can allow us to accomplish. In the following chapters, we’ll focus on diving deeper into each of DocumentDB’s features and exploring each one of them by writing some code. Thanks for reading!
