left-icon

MongoDB 3 Succinctly®
by Zoran Maksimovic

Previous
Chapter

of
A
A
A

CHAPTER 5

Data Retrieval

Data Retrieval


We briefly mentioned how to retrieve data in the previous chapter, in order to support operations such as create, delete, and update. MongoDB is quite rich when it comes to data retrieval, as you will see.

Querying a collection

The starting point for the data retrieval is the db.<collection>.find() method. By just quickly using the help() method, we can see that the find() method offers several other modifiers for performing any kind of query over a collection.

It is possible to sort and apply the filters to the search itself. The following image shows the modifiers the find() contains:

Find() method.

Figure 26: Find() method.

By definition, the queries are issued on a single collection; therefore, joining to other tables, as we would expect at this stage, is not possible. In order to achieve this, we need to either use the aggregation or the MapReduce mechanism, which will be discussed in the coming chapters.

In order to perform a query, we would usually need to specify a filter, unless we are retrieving all the documents of a collection. MongoDB uses some comparison operators that really come in handy for data comparison and matching:

Table 9: Query comparison operators

Name

Description

$eq

Matches values that are equal to a specified value.

$gt

Matches values that are greater than a specified value.

$gte

Matches values that are greater than or equal to a specified value.

$lt

Matches values that are less than a specified value.

$lte

Matches values that are less than or equal to a specified value.

$ne

Matches all values that are not equal to a specified value.

$in

Matches any of the values specified in an array.

$nin

Matches none of the values specified in an array.

And in addition to these, we have some logical operators, such as:

Table 10: Logical operators

Name

Description

$or

Joins query clauses with a logical OR; returns all documents that match the conditions of either clause.

$and

Joins query clauses with a logical AND; returns all documents that match both the conditions.

$not

Joins query clauses with a logical AND; returns all documents that match the conditions of both clauses.

$nor

Joins query clauses with a logical NOR; returns all documents that fail to match both clauses.

For additional information, you may check the reference card directly on the MongoDB website.

Using the method find() without any parameter will simply return anything available in the given collection. While this is useful in some simple scenarios, the tendency is to search for specific information by matching some of the attributes (fields) of a document to be returned. This is achieved by specifying the <query filter>, the first parameter of the method.

The basic signature of the find() method accepts two parameters. Both parameters are optional:

Code Listing 28: Collection find method signature

db.collection.find(<query filter> , <projection>)

Query filter: Represents the filter that is going to be applied while searching for data (exactly the same thing as the WHERE clause in the RDBMS database).

Projection: Provides a mechanism of specifying which data we want to return back. If we only want to return a specific field(s), this is the place to specify it.

As an example, if we were to retrieve users whose firstname and lastname match a given value, we would use the following notation (the formatting on multiple lines is simply for better readability):

Code Listing 29: Usage of an AND operator on a query

db.users.find(

{

     $and: [

              { firstname: "john" },

              { lastname : "doe" }

           ]

 });

Square brackets would contain an array of arguments. As demonstrated in Figure 27, we can see that the query returned the document that matches the values we specified.

Searching for firstname and lastname.

Figure 27: Searching for firstname and lastname.

At the same time, if we were to combine the $and and $or, the following query is made:

Code Listing 30: Usage of AND and OR on a query

db.users.find(

{

     $and: [

              { firstname: "john" },

              { lastname : "doe" }

           ],

     $or:  [

              { _id : 1 }

           ]

});

When this runs in the MongoDB shell, the result is as follows:

Specifying the filter containing AND and OR.

Figure 28: Specifying the filter containing AND and OR.

At the same time, if we want to use some comparison operators, the basic usage is to place the operator inside the curly brackets together with the value in question:

Code Listing 31: Searching for a user where the id is greater than 1

db.users.find( { _id : {$gt: 1} });

This query searches for items where the _id value is greater than 1.

Table 11: Examples of query filters

Name

Description

{lastname : “doe”}

Docs in which lastname is equal to doe, or an array containing the value doe.

{age: 10, lastname: “doe”}

Docs in which age is equal to 10 and lastname is equal to doe.

{age: {$gt: 10}}

Docs in which age is greater than 10. Also available: $lt (<), $gte (>=), $lte (<=), and $ne (!=).

{lastname: {$in: [“doe”, “foo”]}}

Docs in which lastname is either doe or foo.

{a: {$all: [10, “hello”]}}

Docs in which a is an array containing both 10 and hello.

{“name.lastname": “doe”}.

Docs in which name is an embedded document with lastname equal to doe.

{a: {$elemMatch: {b: 1, c: 2}}}

Docs in which a is an array that contains an element with both b equal to 1 and c equal to 2.

{$or: [{a: 1}, {b: 2}]}

Docs in which a is 1 or b is 2.

{a: /^m/}

Docs in which a begins with the letter m. One can also use the regex operator: {a: {$regex: “^m”}}.

{a: {$mod: [10, 1]}}

Docs in which a remainder after division with 10 is 1.

{a: {$type: 2}}

Docs in which a is a string. (See bsonspec.org for more.) { $text: { $search: “hello” } } Docs that contain hello on a text search. Requires a text index.

Projections

Let’s imagine a situation in which we have a document with a very large number of attributes (fields), but we only need to return one or two. This can be done by specifying the projection parameter in the find() method.

In the this chapter, we have seen how to filter for some data, and we have also seen that all the attributes get returned.

The projection is actually very simple; we simply need to specify the list of attributes that we want to be included or excluded in the output. And this works as follows:

In order to return only the firstname attribute from the query, we will specify the following:

Code Listing 32: Finding all users and returning the firstname only

db.users.find( { },  {_id: 0, firstname: 1} );

The number 1 in this case actually means true or include, and 0 means false or exclude.

The _id field will be always returned, unless explicitly excluded from the query.

Returning only the firstname from the query.

Figure 29: Returning only the firstname from the query.

Sorting

In order to sort out the result, we can use the sort() modifier on top of the find(). This looks like the following:

Code Listing 33: Specifying the sorting

db.users.find({}, {}).sort( { _id : -1} );

The sort() modifier accepts a list of fields on which we can assign two values, negative (-1) or positive (1), where:

  • Negative (-1): descending order
  • Positive (1): ascending order

Sorting by descending or ascending order.

Figure 30: Sorting by descending or ascending order.

Limiting the output

In order to limit the number of documents being returned, we can use the limit() modifier. By specifying the value on the limit(), only that number of documents will be returned.

Code Listing 34: Limiting the number of documents returned by a query

db.users.find({}, {}).limit( 10 );

In this example, only the first 10 documents will be returned.

Cursor

When invoking the find() method, a cursor is being returned, and it can be used to iterate through documents quite easily, as demonstrated in Code Listing 35.

The query returns all the documents, and we are using the cursor to implement some specific logic at a document level. Our example here is used purely to demonstrate few things:

  • We can iterate through documents.
  • We can instantiate a single document.
  • We can access document internal values.

Code Listing 35: Iterating through the cursor

var cursor = db.users.find({}, {});

while (cursor.hasNext()) {

    var document = cursor.next();

   

    if(document) {

        var firstname = document.firstname;

        print (tojson(firstname));

    }

};

In the MongoDB shell, we can see that “John” and “Mike” are returned as part of the query.

cursor in MongoDB shell.

Figure 31: cursor in MongoDB shell.

The same can be achieved by using the forEach() function that will iterate through all the documents. It makes it a bit easier to code, but is substantially the same as the previous technique.

Code Listing 36: Using forEach on a cursor

var cursor = db.users.find({}, {});

cursor.forEach( function(document) { 

        var firstname = document.firstname;

        print (tojson(firstname));

});

Aggregations

What we saw in the previous chapter was mainly related to retrieving documents from a given collection by filtering and projecting. What we haven’t seen is the way to aggregate data; and this is simply because the find() method doesn’t offer a way to do this.

The MongoDB documentation defines aggregations as follows: Aggregations are operations that process data records and return computed results.

Therefore, MongoDB offers a way to perform operations (computations) and to group values together in order to return a single result. With this functionality, we are able to perform any kind of analytic tasks on the data available in the database.

If you are coming from the RDBMS world, this is pretty much corresponding to the way of using operations around GROUP BY (sum(), avg(), count(), etc.).

MongoDB offers three ways to achieve this:

  • Aggregation pipeline
  • Map-reduce function
  • Single-purpose aggregation methods

The aggregation pipeline

The MongoDB Aggregation Framework offers a way of processing documents in various stages (as a pipeline). Each stage transforms the documents as they are passing through the pipeline.

When executing a pipeline, MongoDB pipes operators into each other. If you are familiar with the Linux concept of a pipe, then that’s the closest analogy. In a nutshell, this means that the output of an operator becomes the input of the following operator. The result of each operator is a new collection of documents. For instance, one possible pipeline would be as follows:

Figure 32: Aggregation pipeline example.

We can add as many operators in the pipeline as we like; we can also add the same operator more than once, and at a different position in the pipeline.

Figure 33: Aggregation example with repeating operators.

When it comes to implementation, MongoDB offers the aggregate() function that accepts the list of stages to be applied. The application of stages happens as a sequence is passed to the method. The method signature is as follows:

Code Listing 37: Aggregate method signature.

db.collection.aggregate([ { <stage 1> }, { <stage 2>}, … ]);

Possible stages are described in the following table from the MongoDB documentation, so let’s dive into more details:

Table 12: Pipeline operators

Operator

Description

SQL equivalent

$project

Changes each document in the stream, such as by adding new fields or removing existing fields. For each input document, outputs one document.

SELECT

$match

Filters the document stream to allow only matching documents to pass unmodified into the next pipeline stage. $match uses standard MongoDB queries. For each input document, outputs either one document (a match) or zero documents (no match).

WHERE

$redact

Reshapes each document in the stream by restricting the content for each document based on information stored in the documents themselves. Incorporates the functionality of $project and $match. Can be used to implement field-level redaction. For each input document, outputs either one document or zero documents.

N/A

$limit

Passes the first n documents unmodified to the pipeline where n is the specified limit. For each input document, outputs either one document (for the first n documents) or zero documents (after the first n documents).

LIMIT or TOP xxx

$skip

Skips the first n documents where n is the specified skip number, and passes the remaining documents unmodified to the pipeline. For each input document, outputs either zero documents (for the first n documents) or one document (if after the first documents).

LIMIT ..OFFSET

$unwind

Deconstructs an array field from the input documents to output a document for each element. Each output document replaces the array with an element value. For each input document, outputs n documents where n is the number of array elements and can be zero for an empty array.

N/A

$group

Groups input documents by a specified identifier expression and applies the accumulator expression(s), if specified, to each group. Consumes all input documents and outputs one document per each distinct group. The output documents only contain the identifier field and, if specified, accumulated fields.

GROUP BY

$sample

Randomly selects the specified number of documents from its input.

N/A

$sort

Reorders the document stream by a specified sort key. Only the order changes; the documents remain unmodified. For each input document, outputs one document.

ORDER BY

$geoNear

Returns an ordered stream of documents based on the proximity to a geospatial point. Incorporates the functionality of $match$sort, and $limit for geospatial data. The output documents include an additional distance field and can include a location identifier field.

N/A

$lookup

Performs a left-outer join to another collection in the same database to filter in documents from the “joined” collection for processing.

JOIN

$out

Writes the resulting documents of the aggregation pipeline to a collection. To use the $out stage, it must be the last stage in the pipeline.

SELECT

$indexStats

Returns statistics regarding the use of each index for the collection.

EXPLAIN

To demonstrate the usage of the aggregation, we need to have some data to work against; therefore, we will be working on a collection of books, which will be in a very simple format: title, author, number of pages, and language.

So, let’s populate the database with some sample data:

Code Listing 38: Simple data to be used for aggregation

db.books.insert(

[

   {_id: 1, title: "Anna Karenina", author: "Leo Tolstoy", pages : 500, language: "russian"},

   {_id: 2, title: "Madame Bovary", author: "Gustave Flaubert", pages : 450, language: "french"},

   {_id: 3, title: "War and Peace", author: "Leo Tolstoy", pages : 470, language: "russian"},

   {_id: 4, title: "The Great Gatsby", author: "F. Scott Fitzgerald", pages : 300, language: "english"},

   {_id: 5, title: "Hamlet", author: "William Shakespeare", pages : 150, language: "english"}

]);

Let’s try to return the book with the largest number of pages, per language.

In order to achieve this, one of the strategies would be to:

  1. Group the documents by language (in blue), by specifying the item on which the grouping will be performed.
  2. Use the $max operator to find out which book has the largest number of pages (in red) for that given language.

Code Listing 39: Data aggregation (grouping)

db.books.aggregate(

   [

     {

       $group:

         {

           _id: {language: "$language"},

           pages: { $max: "$pages" }

         }

     }

   ]

);

One thing to note in this query is that we are using the $field notation to refer to the value of the field being processed ( $max : "$pages").

Execution returns the result shown in Figure 34.

Data aggregation (grouping).

Figure 34: Data aggregation (grouping).

In order to apply the stages concept described at the beginning of the chapter, let’s simply add one more requirement: let’s make the previous calculation, but only apply it to the English language. That means we are not interested in knowing anything about languages other than English.

We can rewrite the query and divide it into two stages:

  1. $match: We are filtering out all the items we don’t want to perform the calculation on. That means we limit our data to only “english language” before starting the calculation.
  2. $group: Everything is as before; the only difference is that the input to this stage would be consisting of books only written in English.

Now, the query looks as follows:

Code Listing 40: Data aggregation (grouping) with filtering ($match)

db.books.aggregate(

     [

          {

              $match : { language : "english" }

          },

          {

             $group:

               {

                  _id: {language: "$language"},

                  pages: { $max: "$pages" },

                  title: { $first: "$title"}

               }

          },

     ]

);

As shown in the MongoDB shell result, we can only see the values for english. All the rest is not visible.

Aggregated Pipeline

Figure 35: Aggregated Pipeline: group books by prefiltering collection.

The interesting point in the query above is the order of the operations we used. We could have certainly moved the $match part to be after the $group, in which case the result would be exactly the same. What would change is the processing of the data between stages. For instance, it makes sense to filter as much as possible before performing calculations, as the amount of work to be done by the engine would be smaller than if we only filtered the result at the end. This part about efficiency should be always considered when writing queries.

MapReduce

MapReduce is yet another way of aggregating data, and it is typically used for processing a large amount of data in order to obtain a condensed view of the same.

Every MongoDB collection has a mapReduce() command as part of it, with the following signature:

Code Listing 41: MapReduce method’s signature

db.<collection>.mapReduce (

    mapFunction,

    reduceFunction,

    {

        <out>,

        <query>,

        <sort>,

        <limit>,

        <finalize>,

        <scope>,

        <jsMode>,

        <verbose>,

        <bypassDocumentValidation>

    }

);

Map and reduce are actually JavaScript functions, which we have to define.

Map

The map function is responsible for transforming each input document into zero or more documents. The context of the map function is the collection itself; therefore, you can use this within the function. Emit() is responsible for creating the output. The map function can be also seen as a mechanism of a GROUP BY in the RDBM world. The main goal is to return values normalized and grouped by a common key, which can be any of the properties of the collection, and a set of values that belong to this key.

Code Listing 42: Basic map function

var map = function(){

  /* emit values for each document */

  emit(key, <values>);

}

For Microsoft.NET developers, the output of the map function can be seen as IDictionary<object, IList<object>>, or a key associated with a set of values.

We can call emit() more than once if we want, or do any logic to manipulate the data we want to group.

Note: If an item contains only one record after the map() command is executed, MongoDB won’t perform any reduce function, as it’s already considered to be reduced.

Reduce

The reduce function simply gathers results and does something with them, such as reducing or grouping the items based on the same key, and doing something with the values, such as calculating the sum or quantity. The basic signature is as follows:

Code Listing 43: Basic reduce function definition

var reduce = function(key, value){

  /* reduce emitted values into result */

  return {result1: one, result2: two};

}

If we execute the same example as we have done previously with the aggregation pipeline (to get the books with the largest number of pages), then the mapReduce would look similar to the following:

Code Listing 44: MapReduce full example

/* 1. defining the map function */

var map = function() {  

     emit(this.language,

     {

          pages: this.pages,

          title: this.title

     });  

};

/* 2. defining the reduce function */

var reduce = function(key, values) {

     

     var max = values[0].pages;

     var title = values[0].title;

     

     values.forEach(function(value) {

          if(value.pages > max){

               max = value.pages;

               title = value.title;

          }

     });

     return  {pages: max, title: title} ;

};

 

/* 3. calling the map reduce against the books collection a*/

db.books.mapReduce(map, reduce, {out: { reduce:"biggest_books" }});

 

/* 4. Retrieving the result */

db.biggest_books.find();

The output of this query is as follows, and it’s pretty much the same as the previously obtained one:

Code Listing 45: MapReduce data returned (result)

{ "_id" : "english", "value" : { "pages" : 300, "title" : "The Great Gatsby" } }

{ "_id" : "french",  "value" : { "pages" : 450, "title" : "Madame Bovary"    } }

{ "_id" : "russian", "value" : { "pages" : 500, "title" : "Anna Karenina"    } }

About the sequence of the execution:

  1. In step 1, we define the map function. For every row in the books table, the function will return an object that will contain the key = language itself, and the values associated will be the book title and the number of pages.

Code Listing 46: Example of data returned

{ english } => [

                   { pages: 300, title: "The Great Gatsby"},

                   { pages: 150, title: "Hamlet"}

               ]

{ russian } => [

                   { pages: 500, title: "Anna Karenina"},   

                    { pages: 470, title: "War and Peace"}

               ]

{ french } =>  [    {pages: 450, title: "Madame Bovary"} ]

  1. In step 2, we define a reduce function. The reduce function’s responsibility is to loop through the array of values and find out which number of pages has the biggest value. In order to return the values of the title, we have to map the title as well on every loop cycle.
  2. Finally, we call the mapReduce() function, where the functions declared previously are set as arguments. One interesting aspect of this query is that it outputs the result in a new table called biggest_books.

As part of the output options, we could also choose to return the result as a cursor, in which case we would have something like the following:

Code Listing 47: MapReduce with cursor

 var values = db.books.mapReduce(map, reduce, {out: { inline: 1 }} );

print(tojson(values));

Once we have the values cursor filled in, we can iterate through values as we did in the previous chapter where we discussed cursors.

For extensive information about MapReduce in MongoDB, see the official documentation.

Single-purpose aggregation operations

MongoDB also offers some more aggregation commands that can be directly executed against a collection.

Table 13: "Simple" aggregation commands

Name

Description

count

Counts the number of documents that match the query; optional parameters are: limit, skip, hint, maxTimeMS.

distinct

Displays the distinct values found for a specific key in a collection.

group

Groups documents in a collection by the specified key and performs simple aggregation.

Count

Count is a very simple function to execute against a collection. By just calling the count() function, we get back the result.

Code Listing 48: Count method signature

db.books.count(query, options);

As we can see in Figure 36, we can also specify the query that acts as a data filter.

Counting books with or without query.

Figure 36: Counting books with or without query.

There are also a few options that can be specified, such a limit, skip, and maxTimeMS (max time in milliseconds), that could help us further specify the options of the count.

Keep in mind that count() can also be used on a cursor after we use the find() function.

Using count() after searching for data (cursor).

Figure 37: Using count() after searching for data (cursor).

Distinct

Distinct will return the list of distinct values as specified in the field parameter. Query can also be used to further specify the data worked against.

Code Listing 49: Distinct method signature

db.books.distinct(field, query);

Returning distinct values from a collection.

Figure 38: Returning distinct values from a collection.

Group

The group method groups documents in a collection by the specified keys and performs simple aggregation functions, such as computing counts and sums. The method is analogous to a SELECT <...> GROUP BY statement in SQL. The group() method returns an array.

The signature of the method is as follows:

Code Listing 50: Group method signature

db.<collection>.group ( { key, reduce, initial [, keyf] [, cond] [, finalize]})

Let’s explain the most useful parts of the method:

Table 14: Group method parameters

Name

Description

key

Represents the field or fields to group. This will be used as a key of the group, to which all of the other values will be referenced.

reduce

An aggregation function that operates on documents during the grouping operation. These functions may return a sum or a count. The function takes two arguments: the current document, and an aggregation result document for that group.

initial

Initializes the aggregation result document.

keyf

Optional. Alternative to the key field. Specifies a function that creates a “key object” for use as the grouping key. Use keyf instead of key to group by calculated fields rather than existing document fields.

cond

Optional. Contains the query if we want to work only against a subset of data.

finalize

Optional. Before the result is returned, it can perform formatting of the data. It’s the last method executed within the query.

Let’s see the example of if we were to calculate the sum of all of the pages per given language:

Code Listing 51: Grouping example

db.books.group (

{

    key : {language: 1 },

    reduce : function( currentDocument, result) {

        result.total +=  currentDocument.pages;

    },

    initial: {total: 0}

});

As executed in the shell, it looks like the following:

Grouping by language and summing up the number of pages.

Figure 39: Grouping by language and summing up the number of pages.

Conclusion

In this chapter, we have briefly seen how MongoDB is rich with functionalities when it comes to querying and aggregating data. For the sake of brevity, we haven’t seen all the cases, as this probably would require a document on its own. However, I hope that with this the reader can at least identify the elements that might be needed. For further information, see either the official documentation or the myriad of blog posts that explain some corner cases.

Scroll To Top
Disclaimer
DISCLAIMER: Web reader is currently in beta. Please report any issues through our support system. PDF and Kindle format files are also available for download.

Previous

Next



You are one step away from downloading ebooks from the Succinctly® series premier collection!
A confirmation has been sent to your email address. Please check and confirm your email subscription to complete the download.