MongoDB Data Model

The MongoDB data model consists of three main elements: collections, documents, and fields.

Tara Modeling Ontology

A collection is the top-level model element. Each model can have one or more collections. Collections are analogous to tables in a relational database, and the documents each collection contains are analogous to records. Collections model one or more of the concepts (e.g., account, user, order, publisher, book) the data is based on.

Documents are JSON-like data structures containing fields that have values of different types (e.g., String, Date, Timestamp, Array, Double, Boolean). A value can also be another document, or an array of documents, embedded in the document. Documents in a collection can have different structures. In practice, however, most collections are highly homogeneous.

Fields are analogous to columns in a relational database. The field/value pairs (better known as key/value pairs) make up the document's structure.

This is an example of a book collection's document from the MongoDB documentation site:

book = {
    _id: 123456789,
    title: "MongoDB: The Definitive Guide",
    author: [ "Kristina Chodorow", "Mike Dirolf" ],
    published_date: ISODate("2010-09-24"),
    pages: 216,
    language: "English",
    publisher_id: "oreilly",
    available: 3,
    checkout: [ { by: "joe", date: ISODate("2012-10-15") } ]
}

The author field is an array of strings containing the names of the book's authors, and the checkout field is an array of documents containing the checkout details. The other fields hold basic data types such as strings, integers, and dates.
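To make the structure concrete, here is a minimal sketch in plain JavaScript that reads the array field and the embedded documents of a trimmed-down version of the book document above. It assumes a plain object with no database connection, and new Date(...) stands in for the shell's ISODate(...) helper.

```javascript
// Trimmed-down copy of the book document, as a plain JavaScript object.
// Assumption: new Date(...) stands in for the shell's ISODate(...) helper.
const book = {
  _id: 123456789,
  title: "MongoDB: The Definitive Guide",
  author: ["Kristina Chodorow", "Mike Dirolf"],
  checkout: [{ by: "joe", date: new Date("2012-10-15") }]
};

// author is an array of strings, so elements are read by index.
const firstAuthor = book.author[0];

// checkout is an array of embedded documents; each one has its own fields.
const borrowers = book.checkout.map((entry) => entry.by);

console.log(firstAuthor);
console.log(borrowers);
```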

Model Types

When creating MongoDB data models, besides knowing the internal details of how the MongoDB database engine works, there are a few other factors that should be considered first:

These factors strongly affect what type of model you should create. There are several types of MongoDB models you can create: embedding, referencing, and hybrid models.

Other factors can also affect your decision about the type of model to create. These are mostly operational factors, and they are documented on the Operational Factors and Data Models page of the MongoDB documentation site.

The key question is:

You will need to consider performance, complexity and flexibility of your solution in order to come up with the most appropriate model.

Embedding Model (De-normalization)

The embedding model de-normalizes data, which means that two or more related pieces of data are stored in a single document. Generally, embedding provides better read performance, since the data can be retrieved in a single database operation. In other words, embedding supports locality. If your application frequently accesses related data objects together, the best performance can be achieved by putting them in a single document, which is what the embedding model does.
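As an illustration of locality, the following sketch shows a hypothetical order document with its line items embedded. The document shape, field names, and values are invented for this example; the point is only that everything needed to total the order arrives in one document.

```javascript
// Hypothetical order document with its line items embedded (de-normalized).
// Prices are kept in integer cents to avoid floating-point rounding.
const order = {
  _id: 1001,
  customer: "joe",
  items: [
    { sku: "book-123456789", quantity: 1, price_cents: 3999 },
    { sku: "bookmark-7", quantity: 2, price_cents: 150 }
  ]
};

// All the data is in this one document, so a single read
// (in the shell, something like db.orders.findOne({ _id: 1001 }))
// is enough to compute the total.
const totalCents = order.items.reduce(
  (sum, item) => sum + item.quantity * item.price_cents,
  0
);

console.log(totalCents);
```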

MongoDB guarantees atomicity for operations on a single document. If fields have to be modified together, the simplest way to guarantee atomicity is to embed all of them in a single document. MongoDB did not originally support multi-document transactions (they were added in version 4.0 for replica sets and in version 4.2 for sharded clusters). Distributed transactions and distributed join operations are two of the main challenges of distributed database design. By initially leaving these features out, MongoDB was able to implement a highly scalable and efficient sharding solution.
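The checkout scenario from the book example illustrates single-document atomicity: decrementing the available count and recording the checkout should happen together. The sketch below only builds the filter and update documents as plain JavaScript objects; the collection name books and the borrower name are assumptions made for the example.

```javascript
// Filter: match the book only if at least one copy is still available.
const filter = { _id: 123456789, available: { $gt: 0 } };

// Update: both changes target the same document, so they are applied together.
const update = {
  $inc: { available: -1 },
  $push: { checkout: { by: "abc", date: new Date() } }
};

// In the shell this would be sent as one operation, for example:
//   db.books.updateOne(filter, update)
// Assumption: the collection is called "books".
console.log(JSON.stringify(filter), Object.keys(update));
```

Because the available count and the checkout array live in the same document, no second write is needed that could fail independently of the first.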

Embedding has disadvantages as well. If we keep embedding related data in a document, or constantly update that data, the document may grow after it is created, which can lead to data fragmentation. In addition, document size is capped by the maximum BSON document size, which is 16 MB. For larger documents, consider using GridFS.

Furthermore, the larger the documents, the fewer of them fit in RAM, and the more likely the server is to page fault when retrieving them. Page faults lead to random disk I/O, which can significantly slow down the system.
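The growth effect can be observed with a rough sketch like the one below. It assumes Node.js (for Buffer) and uses the length of the JSON text as a stand-in for the real BSON size, which it is not; the helper name approxBytes and the sample document are invented for the example.

```javascript
const MAX_BSON_BYTES = 16 * 1024 * 1024; // the 16 MB document limit

// Rough estimate only: JSON text length is not the BSON size,
// but it is enough to watch a document grow.
function approxBytes(doc) {
  return Buffer.byteLength(JSON.stringify(doc), "utf8");
}

// Hypothetical book document with an initially empty embedded array.
const growingBook = { _id: 1, title: "Example", checkout: [] };
const before = approxBytes(growingBook);

// Every checkout appended to the embedded array makes the document larger.
for (let i = 0; i < 1000; i++) {
  growingBook.checkout.push({ by: "user" + i, date: "2012-10-15" });
}
const after = approxBytes(growingBook);

console.log(before, after, after < MAX_BSON_BYTES);
```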

Referencing Model (Normalization)

The referencing model normalizes data by storing references between documents to indicate a relationship between the data stored in each of them. Generally, referencing should be used when embedding would result in extensive data duplication or data fragmentation (the extra storage can also push a document toward the maximum document size) with minimal performance advantages, or even with negative performance implications; to increase query flexibility if your application queries data in many different ways, or if you do not know in advance the patterns in which data may be queried; to model many-to-many relationships; and to model large hierarchical data sets (e.g., tree structures).

Referencing requires more round trips to the server, because the referenced documents have to be fetched with additional queries.
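The extra round trip can be sketched with two in-memory arrays standing in for a publishers and a books collection. The arrays, the second book, and the publisher's name field are assumptions made for the example; only the publisher_id reference comes from the book document shown earlier.

```javascript
// In-memory stand-ins for two collections (no real database is involved).
const publishers = [
  { _id: "oreilly", name: "O'Reilly Media" }
];
const books = [
  { _id: 123456789, title: "MongoDB: The Definitive Guide", publisher_id: "oreilly" },
  { _id: 987654321, title: "Another Book", publisher_id: "oreilly" }
];

// Round trip 1: fetch the book.
const book = books.find((b) => b._id === 123456789);

// Round trip 2: follow the reference to fetch the publisher.
const publisher = publishers.find((p) => p._id === book.publisher_id);

console.log(book.title, "is published by", publisher.name);
```

The publisher details are stored once and shared by both books, at the cost of the second lookup.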

Hybrid Model

The hybrid model is a combination of the embedding and referencing models. It is usually used when neither the embedding nor the referencing model is the best choice on its own, but their combination gives the most balanced model.
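The book document shown earlier can itself be read as a small hybrid: checkout details are embedded, while the publisher is referenced through publisher_id. The sketch below makes the difference in access visible; the Map standing in for a publishers collection and the publisher's name field are assumptions.

```javascript
// Stand-in for a publishers collection, keyed by _id (assumption).
const publisherById = new Map([
  ["oreilly", { _id: "oreilly", name: "O'Reilly Media" }]
]);

const hybridBook = {
  _id: 123456789,
  title: "MongoDB: The Definitive Guide",
  publisher_id: "oreilly",                                  // referenced
  checkout: [{ by: "joe", date: new Date("2012-10-15") }]   // embedded
};

// Embedded data is available as soon as the book has been read...
const lastBorrower = hybridBook.checkout[hybridBook.checkout.length - 1].by;

// ...while referenced data needs a second lookup.
const publisherName = publisherById.get(hybridBook.publisher_id).name;

console.log(lastBorrower, publisherName);
```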

Polymorphic Schemas

MongoDB does not enforce a common structure for all documents in a collection. Documents in a collection can therefore have different structures, although this is generally not recommended.

However, applications evolve over time, and the document structure of the collections they use has to be updated. This means that at some point documents in the same collection can have different structures, and the application has to handle each of them. Alternatively, you can fully migrate the collection to the latest document structure, so that the same application code can manage every document.

Keep in mind that, because MongoDB does not enforce a schema, the document structure (including field names) is stored with every document, which increases storage usage. In particular, keep field names reasonably short, since they add to the overall storage used by the collection.
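One way an application might cope with mixed structures is to normalize documents as it reads them. The sketch below is only an illustration: the schema_version field, the switch from a single author string to an authors array, and the function name normalizeBook are all hypothetical.

```javascript
// Hypothetical evolution: older documents store one author string,
// newer ones store an authors array and carry schema_version: 2.
function normalizeBook(doc) {
  if (doc.schema_version === 2) {
    return doc; // already in the latest structure
  }
  const { author, ...rest } = doc;
  return {
    ...rest,
    authors: author === undefined ? [] : [author],
    schema_version: 2
  };
}

const oldDoc = { _id: 1, title: "Old structure", author: "Jane Doe" };
const newDoc = { _id: 2, title: "New structure", authors: ["A", "B"], schema_version: 2 };

// Application code downstream only has to understand the latest structure.
const normalized = [oldDoc, newDoc].map(normalizeBook);
console.log(normalized);
```

Writing the normalized documents back to the collection would amount to the full migration described above.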
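The cost of long field names can be estimated with a short sketch. It relies on the fact that BSON stores each field name as a null-terminated string inside every document; the two sample documents and the helper name fieldNameBytes are invented for the example, and Node.js is assumed for Buffer.

```javascript
// Bytes spent on top-level field names in one document:
// each name is stored as its UTF-8 bytes plus a terminating null byte.
function fieldNameBytes(doc) {
  return Object.keys(doc).reduce(
    (total, name) => total + Buffer.byteLength(name, "utf8") + 1,
    0
  );
}

// Two hypothetical documents holding the same values under different names.
const verbose = {
  publication_date_of_first_edition: "2010-09-24",
  number_of_pages_in_book: 216
};
const compact = { pub_date: "2010-09-24", pages: 216 };

const savedPerDocument = fieldNameBytes(verbose) - fieldNameBytes(compact);

// The saving repeats in every document of the collection.
console.log(savedPerDocument, savedPerDocument * 1000000);
```

Shorter names trade some readability for storage, so the saving only matters for collections with very many documents.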