Data Modeling Adviser for MongoDB

The main data-related features of MongoDB are:

  • Flexible schema - no need to determine and declare structural data elements before inserting data
  • Document-based data model - data model concepts include: collections, documents, fields, references
  • Other than basic structural rules, document structure is not enforced, which makes it possible to map documents to any object (entity)
  • While it is not required, a uniform document structure within a single collection is recommended
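  • Atomicity - operations that create or change data (e.g., insert, update, delete) are atomic at the level of a single document. Multi-document transactions were added only in MongoDB 4.0, so data models have traditionally been designed around document-level atomicity
  • MongoDB does not support relational-style joins in ordinary queries (the aggregation pipeline gained the $lookup stage in version 3.2)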
  • MongoDB is designed to take care of many performance aspects by itself, but it leaves it to designers and developers to work out the best possible data model for the specific application patterns.
  • It may come as a surprise to some people, but the biggest impact on performance comes from how well the data model fits the needs (application patterns) of the specific application.

MongoDB data models created in Daprota M2 include:

  • Collections (analogous to RDBMS tables)
  • Documents (analogous to RDBMS table rows)
  • Fields (analogous to RDBMS table columns)
  • References (analogous to RDBMS foreign keys)


Document View


Collection

  • A Collection is a top-level model element.
  • Collections are analogous to tables in a relational database.
  • Each collection contains documents that are analogous to records in the relational database.
  • Collections model one or more concepts (e.g., account, user, order, publisher, book, etc.) the data is based on.

Document

  • Documents are JSON-like data structures containing fields that have values of different types (e.g., String, Date, Timestamp, Array, Double, Boolean, etc.).
  • A value can also be another document, an array of scalar values, or an array of documents.
  • Documents can have different structures in a collection. In practice, however, most collections are highly homogeneous.
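
As a rough illustration of the value types listed above, the sketch below builds two documents as plain Python dictionaries. The collection and all field names are hypothetical and are not taken from an M2 sample model.

```python
from datetime import datetime, timezone

# Hypothetical "book" document; every field name is illustrative only.
book = {
    "_id": 1,
    "title": "Sample Title",                                  # String
    "published": datetime(2014, 5, 1, tzinfo=timezone.utc),   # Date
    "price": 29.99,                                           # Double
    "in_print": True,                                         # Boolean
    "tags": ["database", "nosql"],                            # array of scalar values
    "publisher": {"name": "Example Press", "city": "Toronto"},  # embedded document
    "reviews": [                                              # array of documents
        {"reviewer": "ann", "rating": 5},
        {"reviewer": "bob", "rating": 4},
    ],
}

# A second document in the same collection may have a different structure.
pamphlet = {"_id": 2, "title": "Release Notes", "pages": 12}

books = [book, pamphlet]
# Fields the two documents have in common.
shared_fields = set(book) & set(pamphlet)
```

Only the fields in shared_fields can be relied on by code that reads every document in this collection, which is one reason uniform structure is recommended.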

Reference

MongoDB resolves relationships by either

  • embedding related documents or
  • referencing related documents.

The embedding approach resolves relationships by storing all related data (documents) in a single document.

The referencing approach resolves relationships through references that point to related documents.

Embedding is equivalent to de-normalization while referencing is equivalent to normalization in data modeling.

A referenced document can be in the same collection, in a separate collection in the same database, or in another database.

Two types of references are supported:

  • Manual Reference
    • Manual References are used to reference documents either in the same collection or in a separate collection in the same database.
  • DBRef
    • Database references are references from one document to another using the value of the referenced (parent) document's _id field, its collection name, and the database name.
    • While MongoDB allows a DBRef without a database name, M2 models require one. A Manual Reference in an M2 model must already specify the collection name for the model to be complete, so a DBRef without the database name would be the same as a Manual Reference from the M2 model's point of view.
    • The database name in a DBRef is therefore more of an implementation aspect of the model; it is what makes the DBRef definition complete and distinguishes it from a Manual Reference in M2.
    • Unless you have a firm reason for using a DBRef, use Manual References.
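
The difference between the two reference types can be sketched with plain Python dictionaries. The DBRef layout ($ref, $id, $db) follows the MongoDB convention; the in-memory databases structure, the helper functions and all database, collection and field names are assumptions made for this illustration.

```python
# In-memory stand-in for two databases, each holding named collections
# keyed by _id. All names here are hypothetical.
databases = {
    "shop": {
        "publishers": {10: {"_id": 10, "name": "Example Press"}},
    },
    "archive": {
        "publishers": {77: {"_id": 77, "name": "Old Press"}},
    },
}

# Manual reference: only the _id of the referenced document is stored.
# The target collection is known from the model, not from the data.
book_manual = {"_id": 1, "title": "Sample", "publisher_id": 10}

# DBRef: the reference itself carries collection name, _id and database name.
book_dbref = {
    "_id": 2,
    "title": "Sample 2",
    "publisher": {"$ref": "publishers", "$id": 77, "$db": "archive"},
}


def resolve_manual(db_name, collection, ref_id):
    """The caller supplies database and collection; the document supplies the _id."""
    return databases[db_name][collection][ref_id]


def resolve_dbref(ref):
    """Everything needed to locate the target is inside the reference."""
    return databases[ref["$db"]][ref["$ref"]][ref["$id"]]


manual_target = resolve_manual("shop", "publishers", book_manual["publisher_id"])
dbref_target = resolve_dbref(book_dbref["publisher"])
```

The manual reference is smaller and simpler, which fits the advice to prefer it unless the target lives in another database.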


References


MongoDB does not support relational-style joins in ordinary queries. For some data models it is fine to use embedded documents (a denormalized model), but in other cases referencing documents (a normalized model) is the better choice.

There are several types of MongoDB data models you can create:

  • Embedding Model
  • Referencing Model
  • Hybrid Model that combines embedding and referencing models.


Embedding and Referencing Model


The embedding model enables de-normalization of data, which means that two or more related pieces of data are stored in a single document. It is frequently chosen for "contains" and "one-to-many" relationships between entities (documents).

The referencing model enables normalization of data by storing references between documents to indicate a relationship between the data stored in each document.

The hybrid model combines the embedding and referencing models. It is usually used when neither the embedding nor the referencing model alone is the best choice but their combination gives the most balanced model.
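
A hybrid document might look like the following sketch, in which a book keeps a reference to its publisher and also embeds a copy of the one publisher field that is read together with the book. The field names and the book_listing helper are hypothetical.

```python
# Referenced publisher document (normalized, single source of truth).
publisher = {
    "_id": 10,
    "name": "Example Press",
    "address": "1 Main St",
    "phone": "555-0100",
}

# Hybrid book document: a reference plus a small embedded copy.
book = {
    "_id": 1,
    "title": "Sample",
    "publisher_id": publisher["_id"],                   # referencing
    "publisher_summary": {"name": publisher["name"]},   # embedding (denormalized copy)
}


def book_listing(doc):
    """Render a listing line without fetching the publisher document."""
    return f'{doc["title"]} ({doc["publisher_summary"]["name"]})'


listing = book_listing(book)
```

The trade-off is that a change to the publisher name must also be applied to every embedded copy, so this shape suits fields that are read often and changed rarely.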

The key considerations when modeling data that will be stored and managed in MongoDB are:
  • Read and write operations
    • When designing a data model for MongoDB, it is important to know your application patterns, because they show how data will be created and used. With that understanding you are in a better position to apply the data model design patterns that best fit the patterns of your application. The main questions are:
      • How will your data grow and change over time?
      • What is the read/write ratio?
      • What kinds of queries will your application perform?
      • Are there any concurrency-related constraints you should look at?
  • Document growth.
    • Documents can grow when new fields are added to them, when new elements are added to their array fields, or when they are frequently updated. MongoDB has a document size limit of 16 MB. MongoDB will move documents to accommodate their new space requirements. Document moves are generally slow and can also fragment the space in the file where the document's collection resides.

      MongoDB also stores BSON documents as a sequence of fields and values. When MongoDB writes to a document or updates a field in a document, it has to read the document sequentially to reach the fields to be updated or to add a field with its value. If the document has many fields, this sequential access takes longer. If you are dealing with a large number of documents during some application operations, the time spent just moving through documents to access specific fields or to add fields can add up considerably.

      If an array field is indexed, one document in the collection is responsible for a separate entry in that index for each and every element in its array. So inserting or deleting a document with a 100-element array, if that array is indexed, is like inserting or deleting 100 documents in terms of the amount of indexing work required. The BSON data format also manipulates documents with a linear memory scan, so finding elements at the very end of a large array takes a long time, and most operations dealing with such a document are slow.
  • Atomicity
    • Operations that create or change data (e.g., insert, update, delete) are atomic at the document level. If fields have to be modified together, all of them have to be embedded in a single document in order to guarantee atomicity. MongoDB did not support multi-document transactions before version 4.0. Distributed transactions and distributed join operations are two main challenges associated with distributed database design. By not depending on these features, MongoDB has been able to implement a highly scalable and efficient sharding solution.
  • Sharding
    • Design the data model to include a field that will be used exclusively as a shard key, to enable balanced distribution of data with good support for write scaling and query isolation. First analyze your application's read and write operations to get a full understanding of its writing and data retrieval patterns. See the blog post "Selecting a MongoDB Shard Key" for more details.

You will need to consider performance, complexity and flexibility of your solution in order to come up with the most appropriate model.

The embedding model enables de-normalization of data, which means that two or more related pieces of data are stored in a single document. It is frequently chosen for "contains" and "one-to-many" relationships between entities (documents).

Generally, embedding provides better read operation performance since data can be retrieved in a single database operation. In other words, embedding supports locality. If your application frequently accesses related data objects, the best performance can be achieved by putting them in a single document which is supported by the embedding model.

You should embed if

  • There is a one-to-one relationship between the two documents
    This means that there is no redundancy between the documents, and embedding one document into the other is a natural and efficient way to implement their relationship. The result is an easy-to-query structure that also guarantees consistency when data in the embedded documents is updated or removed.
  • Data from two or more documents are frequently fetched together
    By embedding one document into another, the query performance will be improved since all data will be stored in the single document. In other words, embedding supports locality.
  • The document that will be embedded is not a key document
    If a document is a key document, it means that it is referenced by many other documents. Instead of embedding them, it is more efficient and less error prone to reference key documents. Otherwise documents that are not key documents should be embedded.
  • The child document in a parent-child relationship is a dependent document
    A document is independent if it can be found using only its own fields. Otherwise the document is dependent. The dependent document should be embedded in its "parent" document.
  • Data does not change or does not grow much
    If we keep embedding related data in a document or constantly update this data, the document may grow after its creation. This can lead to data fragmentation and slower database performance, since MongoDB has to move the document to a location where enough space exists to accommodate it. In addition, the size limit for documents in MongoDB is the maximum BSON document size, which is 16 MB. All this means that embedding makes sense if documents do not grow much after their initial creation.
  • Related documents have similar volatility
    If a document that is a candidate for embedding has similar volatility to its "parent" document (update, insert, and delete rates are similar), then the document should be embedded. Otherwise the referencing approach should be used.

Examples
Embedded One-to-Many
Embedded One-to-One
Pre-Aggregated Report
Product Catalog
Tara Modeling Ontology Embedded

The referencing model enables normalization of data by storing references between documents to indicate a relationship between the data stored in each document.

Main features of the referencing model include:

  • No duplication of data;
  • Only one document change is required to change data;
  • No joins. Retrieving data from multiple documents requires multiple queries to be issued by the application;
  • It is a better choice
    • when embedding would result in extensive data duplication (without providing a significant read performance advantage) and/or data fragmentation as embedded documents grow;
    • for representing more complex "many-to-many" relationships;
    • for modeling large hierarchical data sets;
    • for fast writes.
  • Keep in mind that using references requires more round trips to the server.
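
The extra round trips mentioned above can be sketched as an application-level join. The in-memory collections, the find_one helper and the query counter are stand-ins invented for this example; they only mimic the shape of two separate database queries.

```python
# Hypothetical in-memory collections keyed by _id.
publishers = {10: {"_id": 10, "name": "Example Press"}}
books = {
    1: {"_id": 1, "title": "A", "publisher_id": 10},
    2: {"_id": 2, "title": "B", "publisher_id": 10},
}

queries = 0  # counts simulated round trips to the server


def find_one(collection, _id):
    """Stand-in for a single query by _id."""
    global queries
    queries += 1
    return collection[_id]


def book_with_publisher(book_id):
    """Application-level join: one query per collection involved."""
    book = find_one(books, book_id)                          # first round trip
    publisher = find_one(publishers, book["publisher_id"])   # second round trip
    return {**book, "publisher": publisher}


result = book_with_publisher(1)
```

With an embedded publisher the same read would need one query; the referencing model trades that read cost for a single place to update the publisher.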

These are additional considerations for making design decisions with regards to references:

  • The document that will be referenced is a key document
    If a document is a key document, it means that it is referenced by many other documents. It is more efficient and less error prone to reference key documents.
  • The child document in a parent-child relationship is an independent document
    A document is independent if it can be found using only its own fields. Otherwise the document is dependent. The independent documents should be referenced.
  • Data changes or grows much
    If we keep embedding related data in a document or constantly update data in the document, the document may grow after its creation. This can lead to data fragmentation and slower database performance, since MongoDB has to move the document to a location where enough space exists to accommodate it. All this means that the related documents should be referenced.
  • Related documents do not have similar volatility
    If related documents do not have similar volatility (update, insert, and delete rates are not similar), then the referencing model should be applied.

These are typical types of relationships that are usually implemented via references:

  • Referenced One-to-N where N is greater than 1
  • Referenced M-to-N where both M and N are greater than 1

Examples
Referenced One-to-Many V1
Referenced One-to-Many V2
Tree with Child References
Tree with an Array of Ancestors
Inventory Management
Process

Design an embedding model where N-side documents will be embedded via an array defined as a field in the "one"-side document.
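
A minimal sketch of this design, assuming a hypothetical person document whose addresses are embedded as an array field; the people dictionary stands in for a collection.

```python
# One-side document with its N-side documents embedded as an array field.
person = {
    "_id": 1,
    "name": "Ann",
    "addresses": [
        {"street": "1 Main St", "city": "Toronto"},
        {"street": "9 Lake Rd", "city": "Ottawa"},
    ],
}

# Hypothetical in-memory collection keyed by _id.
people = {person["_id"]: person}


def cities_for(person_id):
    """A single lookup returns the parent and all embedded data."""
    return [address["city"] for address in people[person_id]["addresses"]]


cities = cities_for(1)

# Removing the one-side document removes its embedded addresses with it.
del people[1]
```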

Examples

Basic One-to-Few
Embedded One-to-Many

Advantages
  • You do not have to perform a separate query for embedded data; you can get all data (parent and embedded) with a single query
  • If we remove the one-side document from MongoDB, the removal is atomic and includes its embedded (dependent) documents. For example, when we remove a person document, all of its embedded addresses are removed with it.
Disadvantages
  • You cannot access embedded data as stand-alone entities
  • Embedded data can be duplicated across one-side documents
  • If embedded data is duplicated across one-side documents, data integrity is not guaranteed when the embedded data has to be updated in more than one one-side document. For example, if a street name changes in an embedded address document and many people live at this address, all one-side person documents with this address have to be updated. If the system fails in the middle of this update, inconsistent data is left behind: some people have the updated address, while people whose documents were not updated before the failure still have the old one.

Design a referencing model where a field in the one-side document is an array of references to N-side referenced documents.
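
A minimal sketch of this design, assuming a hypothetical publisher document that holds an array of book _id values. The books dictionary stands in for a separate collection; against a real server the second step would typically be one query that matches all the referenced _id values at once.

```python
# N-side documents stored on their own (hypothetical collection keyed by _id).
books = {
    1: {"_id": 1, "title": "A"},
    2: {"_id": 2, "title": "B"},
    3: {"_id": 3, "title": "C"},
}

# One-side document with an array of references to N-side documents.
publisher = {"_id": 10, "name": "Example Press", "book_ids": [1, 3]}


def books_of(pub):
    """Second query: fetch the referenced documents by their _id values."""
    return [books[book_id] for book_id in pub["book_ids"]]


titles = [book["title"] for book in books_of(publisher)]
```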

Examples

Basic One-to-Many

Advantages
  • Referenced documents can be easily searched and updated independently
  • Multiple one-side documents can reference the same N-side documents, so the model also supports M-to-N referencing
Disadvantages
  • A second query has to be performed in order to get the referenced documents' data
Additional Optimizations

If your application patterns allow it, you can further optimize the model with bi-directional (two-way) referencing, in which the N-side referenced documents also hold a manual reference to the one-side document. Keep in mind that in this case you pay the price of not having atomic updates.
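
The cost of two-way referencing can be sketched as follows. The people and tasks collections and the assign function are hypothetical; the point is that one logical change touches several documents, and each of those writes is atomic only on its own.

```python
# One-side documents hold an array of references to the N side.
people = {
    1: {"_id": 1, "name": "Ann", "task_ids": []},
    2: {"_id": 2, "name": "Bob", "task_ids": []},
}
# N-side documents hold a manual reference back to the one side.
tasks = {100: {"_id": 100, "title": "Write report", "owner_id": None}}


def assign(task_id, person_id):
    """Reassign a task. Each line below is a write to a different document,
    so a failure between them would leave the two directions out of sync."""
    task = tasks[task_id]
    old_owner = task["owner_id"]
    if old_owner is not None:
        people[old_owner]["task_ids"].remove(task_id)   # write to old one-side document
    people[person_id]["task_ids"].append(task_id)       # write to new one-side document
    task["owner_id"] = person_id                        # write to N-side document


assign(100, 1)
assign(100, 2)
```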

Examples

One-to-Many (Bi-directional)

Again, depending on your application patterns, you may also wish to de-normalize your model on either the "N" side or the "One" side of its relationships. This eliminates the need to perform application-level joins in some cases. The following sample models demonstrate these techniques:

This kind of denormalization helps when there is a high ratio of reads to updates. Otherwise it could be counterproductive.

Go with "parent" referencing.
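
With parent referencing, each N-side document stores the _id of its one-side document, and the one-side document stores nothing about its children. The sketch below assumes hypothetical host and log message documents and an in-memory list in place of a collection.

```python
# One-side documents: no array of children, so they never grow.
hosts = {1: {"_id": 1, "name": "web-01"}}

# N-side documents: each one references its parent through host_id.
log_messages = [
    {"_id": i, "host_id": 1, "text": f"event {i}"}
    for i in range(5)
]


def messages_for(host_id, limit):
    """Return the last `limit` messages that reference the given host."""
    matching = [m for m in log_messages if m["host_id"] == host_id]
    return matching[-limit:]


recent = messages_for(1, 2)
```

Because the parent holds no array, its size stays constant no matter how many child documents are added.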

Examples

Basic One-to-Squillions

If needed, you can also denormalize the "Basic One-to-Squillions" example. You can either put information about the "one" side into the "squillions" side (sample model Basic One-to-Squillions (Denormalizing One-Side)), or put information from the "squillions" side into the "one" side (sample model Basic One-to-Squillions (Denormalizing Squillions-Side)).

There are more than a few hundred fields in the embedded document (the critical number depends on the size of the fields).

Transform the embedded document into one whose fields are themselves documents, each grouping a subset of the original fields according to the application logic. This creates an intra-document hierarchy.

Why do we have to do this?

MongoDB has a document size limit of 16 MB. MongoDB also stores BSON documents as a sequence of fields and values. When MongoDB writes to a document or updates a field in a document, it has to read the document sequentially to reach the fields to be updated or to add a field with its value. If the document has many fields, this sequential access to the document's fields takes longer. If you are dealing with a large number of documents during some application operations, the time spent just moving through documents to access specific fields or to add fields can add up considerably.

Examples

Pre-Aggregated Report shows how the 1440-field Minute embedded document was transformed into an embedded document of 24 sub-document fields where minutes are grouped by hour.
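
The transformation described for Pre-Aggregated Report can be sketched as follows. The key layout (minute of day as a string, then hour and minute within the hour) is an assumption made for the illustration and may differ from the sample model.

```python
# Flat layout: one field per minute of the day (1440 fields).
flat_minutes = {str(minute): 0 for minute in range(1440)}
flat_minutes["125"] = 7   # minute 125 of the day, i.e. 02:05


def group_by_hour(flat):
    """Regroup a flat minute-of-day document into 24 hour sub-documents,
    each holding 60 minute fields."""
    grouped = {}
    for key, value in flat.items():
        hour, minute = divmod(int(key), 60)
        grouped.setdefault(str(hour), {})[str(minute)] = value
    return grouped


hourly = group_by_hour(flat_minutes)
value_at_0205 = hourly["2"]["5"]
```

Under the sequential-access argument above, reaching a given minute in the grouped layout means passing over at most 23 hour fields and 59 minute fields, rather than up to 1439 fields in the flat layout.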

There are several hundred array elements (the critical number depends on the size of the elements).

Split the array into smaller arrays and use "to-be-continued" ("continuation") documents to hold the continued arrays.
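
One possible way to lay out continuation documents is sketched below. The "items" and "next" field names, the _id scheme and the chunk size are assumptions made for this example; the Continuation Document sample model may link its documents differently.

```python
def split_into_continuations(base_id, items, chunk_size):
    """Return documents that each hold at most chunk_size items and point
    to the following document through a hypothetical 'next' field."""
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)] or [[]]
    docs = []
    for n, chunk in enumerate(chunks):
        docs.append({
            "_id": f"{base_id}-{n}",
            "items": chunk,
            "next": f"{base_id}-{n + 1}" if n + 1 < len(chunks) else None,
        })
    return docs


def read_all(docs_by_id, first_id):
    """Follow the 'next' references and collect every item in order."""
    items, current = [], first_id
    while current is not None:
        doc = docs_by_id[current]
        items.extend(doc["items"])
        current = doc["next"]
    return items


docs = split_into_continuations("inbox", list(range(250)), 100)
by_id = {doc["_id"]: doc for doc in docs}
all_items = read_all(by_id, "inbox-0")
```

Each document stays small, at the price of one extra read per continuation when the whole array is needed.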

Why do we have to do this?

MongoDB has a document size limit of 16 MB. MongoDB also stores BSON documents as a sequence of fields and values. When MongoDB writes to a document or updates a field in a document, it has to read the document sequentially to reach the fields to be updated or to add a field with its value. If the document has many fields, this sequential access to the document fields takes longer. If you are dealing with a large number of documents during some application operations, the time spent just moving through documents to access specific fields or to add fields can add up considerably.

Examples

Continuation Document

Design a referencing model.

Why do we have to do this?

MongoDB will move documents to accommodate the new space requirements of an enlarged array. Document moves are generally slow (because every index must be updated) and can also fragment the space in the file where the document's collection resides. If the array field is indexed, one document in the collection is responsible for a separate entry in that index for each and every element in its array. So inserting or deleting a document with a 100-element array, if that array is indexed, is like inserting or deleting 100 documents in terms of the amount of indexing work required. The BSON data format manipulates documents with a linear memory scan, so finding elements at the very end of a large array takes a long time, and most operations dealing with such a document are slow.
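
The indexing cost described above can be approximated with a small counting helper. This is only a model of the statement "one index entry per array element"; it does not inspect a real index, and the function and field names are hypothetical.

```python
def index_entries(doc, field):
    """Approximate how many entries one document contributes to an index
    on `field`: one per array element, or one for a non-array value."""
    value = doc.get(field)
    if isinstance(value, list):
        return len(value)
    return 1


small = {"_id": 1, "tags": ["a"]}
large = {"_id": 2, "tags": [f"t{i}" for i in range(100)]}

small_cost = index_entries(small, "tags")
large_cost = index_entries(large, "tags")
```

In this model, inserting or deleting the large document costs as much index work as 100 single-entry documents, which is the motivation for moving large arrays into referenced documents.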

Design a referencing model.

Why do we have to do this?

Large arrays incur relatively high CPU overhead, which slows down insert, update, and query operations on array elements.

Model a document with fields for the common set of attributes plus a separate embedded document that holds the fields specific to each type of object you want to represent.
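
A minimal sketch of this shape, assuming hypothetical product types and field names (the Product Catalog sample model may use different ones): every document shares the same top-level fields, and the type-specific attributes live in an embedded "details" document.

```python
products = [
    {"_id": 1, "type": "book", "title": "Sample", "price": 20.0,
     "details": {"author": "Ann", "pages": 300}},       # book-specific fields
    {"_id": 2, "type": "album", "title": "Sample LP", "price": 15.0,
     "details": {"artist": "Bob", "tracks": 12}},       # album-specific fields
]

# Top-level fields shared by every product type.
COMMON_FIELDS = {"_id", "type", "title", "price", "details"}


def cheaper_than(items, limit):
    """A query on common fields works across all product types."""
    return [p["title"] for p in items if p["price"] < limit]


bargains = cheaper_than(products, 18)
```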

Examples

Product Catalog