MongoDB "Thinking in Documents" webinar by Mike Friedman

In notebook:
Article Notes
Created at:
2016-03-12
Updated:
2016-03-12
Tags:
JavaScript libraries backend
Webinar: Back to Basics 1: Thinking in Documents | MongoDB

Mike Friedman

Agenda

  • What is a Record?
  • Core Concepts
  • What is an Entity?
  • Associating Entities
  • General Recommendations

Why?

He argues that "All application development is Schema Design".
The software developer's first job is to design the data model, even if the application never stores any data.
"Success in development comes down to proper data structure."
This holds whether the data is in memory or on disk.

What is a record?

Key -> Value

Key -> Blob

A single key is associated with a single value, which is a blob. The storage system (db) has no knowledge of what is inside this blob. It's an opaque chunk of data.
We can only query on the key. It also means that the blob can only be replaced as a whole, not updated in place (the data store has no knowledge of the contents of the blob).
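A minimal sketch of the key -> blob idea, using a plain JavaScript Map as a stand-in for the store. The key name and the replacement title are made up for illustration; the point is only that the store sees an opaque string, so changing one field means rewriting the whole blob.

```javascript
// Hypothetical in-memory key -> blob store; it never looks inside the blob.
const store = new Map();

// The value is serialized into an opaque string before it is stored.
store.set(
  "contact:2",
  JSON.stringify({ name: "Steven Jobs", title: "New Product Development" })
);

// Lookup is possible by key only.
const blob = store.get("contact:2");

// To change a single field, the application has to read the blob, decode it,
// modify it, and write the whole thing back. The store cannot do this itself.
const decoded = JSON.parse(blob);
decoded.title = "Example Title"; // illustrative value
store.set("contact:2", JSON.stringify(decoded));

console.log(JSON.parse(store.get("contact:2")).title);
```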

Relational

Tables with rows and columns. You can query by any column.

Two-dimensional storage (tuples). Each field contains a single value and you can query on any field.
Very structured schema, you can do in-place updates - update any cell.
Disadvantage: normalization requires lots of tables, indexes and joins, and leads to poor data locality.
On a highly distributed system this can become a bottleneck.

Document model

N-dimensional storage - we can have as many dimensions as we want. Each field can contain 0, 1 or many values, or embedded values (other documents or arrays).
In MongoDB we can query on any field at any level (including embedded values).
I can look documents up by reaching down into any level of nesting. The deeply nested structure is transparent to queries.
The schema is flexible - not every document has to look the same, although in general we want them to look similar. We don't have to redundantly represent every piece of data in every document where it doesn't apply.

In-place updates are possible - including updating a single value in an embedded document
Optimal data locality - we can get all the data we need from the database without doing multiple queries (connections) or joins
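A rough sketch of what an in-place update of one embedded value amounts to, written against a plain JavaScript object rather than a real database. setPath is a hypothetical helper and the new street is an illustrative value. In the MongoDB shell the same idea is expressed with the $set operator and a dotted field name such as "address.street".

```javascript
// Set a value at a dot-separated path inside a nested object,
// leaving every other field untouched (hypothetical helper).
function setPath(doc, path, value) {
  const keys = path.split(".");
  let node = doc;
  for (const key of keys.slice(0, -1)) {
    if (typeof node[key] !== "object" || node[key] === null) {
      node[key] = {}; // create missing intermediate documents
    }
    node = node[key];
  }
  node[keys[keys.length - 1]] = value;
  return doc;
}

const contact = {
  _id: 2,
  name: "Steven Jobs",
  address: { street: "10260 Bandley Dr", city: "Cupertino" },
};

// Only the street changes; name and city stay as they were.
setPath(contact, "address.street", "1 Example Way"); // illustrative value
console.log(contact.address);
```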

Core Concepts

Basic review of relational databases.
Traditional schema design focuses primarily on data storage: what are the atomic, single pieces of data that we can isolate and put in one place so that there's no duplication?
In document-based design the focus is on how the data is used: what questions are my users going to have that I will have to answer in an efficient way?

To compare:

Traditional: What answers do I have?
Document based: What questions do I have?

It's these questions that guide the design of the schema. For the same data there might be multiple correct answers, depending on how the data will be used. In relational databases there is usually one correct data design.

Schema Design is Flexible

  • Choices for schema design
  • Each record can have different fields (documents in a collection don't have to be exactly the same). I can use a certain field in only 10% of all the documents in the collection
  • Easy to evolve as needed. The DB doesn't force any schema, you can just simply add new fields.

Two Building Blocks of Document Schema Design 

1 - Arrays

An ordered list of individual items. Items can be numbers, etc., or embedded arrays. Similar to JSON, but with a broader range of types.

In general we want to use arrays when we have multiple values per field. Every field in a document can be:
  • absent
  • null
  • single value
  • array of many values
Arrays can be indexed, and we can query on what's in the array.
> Tell MongoDB to find the document that has a certain value in an array
It can be extremely powerful.
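A small sketch of the "find the document that has a certain value in an array" idea, again on plain objects. The tags field and matchesField helper are assumptions for illustration. The four sample documents cover the four states listed above (array, single value, null, absent). In the MongoDB shell, a filter like { tags: "work" } matches both a scalar field equal to "work" and an array field containing it.

```javascript
// Return true when the field equals the value, or is an array containing it
// (hypothetical helper mimicking equality matching on array fields).
function matchesField(doc, field, value) {
  const fieldValue = doc[field];
  if (Array.isArray(fieldValue)) {
    return fieldValue.includes(value);
  }
  return fieldValue === value;
}

const contacts = [
  { _id: 1, tags: ["work", "friends"] }, // array of many values
  { _id: 2, tags: "work" },              // single value
  { _id: 3, tags: null },                // null
  { _id: 4 },                            // field absent
];

const found = contacts
  .filter((doc) => matchesField(doc, "tags", "work"))
  .map((doc) => doc._id);

console.log(found); // [ 1, 2 ]
```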

2- Embedded Documents

Any field can in fact be another document. These are not foreign keys. The document is not somewhere else on the disk but embedded. You can retrieve the whole document with a single read without doing any extra joins or queries.
Use it when we want to apply structure to something that is relational.

You can query any field at any level (can be indexed). "Find me a document that contains a document that contains a key that has a certain value"
You can do it at any depth (like 12 levels deep)
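To make "a document that contains a document that contains a key that has a certain value" concrete, here is a sketch that resolves a dot-separated path on plain objects. getPath and the company/address sample data are hypothetical, and unlike MongoDB this simplified helper does not look inside arrays along the path. The shell uses the same dotted notation in filters, e.g. { "company.address.city": "Cupertino" }.

```javascript
// Walk a dot-separated path down through nested documents.
// Returns undefined as soon as a level is missing (hypothetical helper).
function getPath(doc, path) {
  return path.split(".").reduce(
    (node, key) =>
      node !== null && typeof node === "object" ? node[key] : undefined,
    doc
  );
}

const docs = [
  { _id: 1, company: { name: "Example Co", address: { city: "Cupertino" } } },
  { _id: 2, company: { name: "Other Co", address: { city: "Elsewhere" } } },
  { _id: 3, company: "Cupertino" }, // no embedded document at this level
];

const hits = docs
  .filter((doc) => getPath(doc, "company.address.city") === "Cupertino")
  .map((doc) => doc._id);

console.log(hits); // [ 1 ]
```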

What is an Entity?

It's an object in your model that has associations with other entities.

Referencing (Relational)  =>  Embedding (Document)
has_one                   =>  embeds_one (1:1 relationship)
belongs_to                =>  embedded_in (1:n, a foreign key that points to another)
has_many                  =>  embeds_many (a single record associated with many records)
has_and_belongs_to_many   =>

Let's model something

Business Card

1. Referencing
Two objects: Contacts and Addresses
Contacts:
{ _id: 2, name: "Steven Jobs", title: "New Product Development", ..., address_id: 1 }
Addresses:
{ _id: 1, street: "10260 Bandley Dr", city: "Cupertino", ... }
_id: required in every document
address_id: foreign key that links to an address
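A sketch of what reading a business card costs with referencing, using two in-memory arrays as stand-ins for the two collections. The fields elided with "..." in the notes are simply left out here. The point is the second lookup that follows the address_id.

```javascript
// Stand-ins for the two collections.
const contacts = [
  { _id: 2, name: "Steven Jobs", title: "New Product Development", address_id: 1 },
];
const addresses = [
  { _id: 1, street: "10260 Bandley Dr", city: "Cupertino" },
];

// Lookup 1: fetch the contact.
const contact = contacts.find((c) => c._id === 2);

// Lookup 2: follow the foreign key to fetch the address.
const address = addresses.find((a) => a._id === contact.address_id);

// The application stitches the two results together itself.
const card = { ...contact, address };
console.log(card.name, "-", card.address.city);
```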

2. Embedding

Address object is embedded in Contact
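The same business card as a sketch with the address embedded (elided fields omitted again). A single lookup returns everything, and there is no address_id to follow.

```javascript
// One collection; the address lives inside the contact document.
const contacts = [
  {
    _id: 2,
    name: "Steven Jobs",
    title: "New Product Development",
    address: { street: "10260 Bandley Dr", city: "Cupertino" },
  },
];

// A single lookup is enough to render the whole card.
const card = contacts.find((c) => c._id === 2);
console.log(card.name, "-", card.address.street, card.address.city);
```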

Differences between the two

Embedded doc has excellent data locality 
What if several people share the same address (two employees of the same company)? With embedding I would need to update the address in several places. That is not a problem with the referencing (relational) approach.

Depends on the application you are developing.

Schema Flexibility

It's easy to add an extra field, such as an email address, to the document for a new business card.
In a relational design it would be more complicated (a new column, with lots of extra null-value cells)
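A sketch of that flexibility on plain objects: only the new card carries an email field, and the existing documents are left as they are. Names and the email address are placeholders.

```javascript
// Existing cards have no email field at all (not even a null placeholder).
const cards = [
  { _id: 1, name: "First Contact" },
  { _id: 2, name: "Steven Jobs", title: "New Product Development" },
];

// A new card simply includes the extra field; nothing else has to change.
cards.push({ _id: 3, name: "New Contact", email: "new.contact@example.com" });

// Documents with and without the field coexist in the same collection.
const withEmail = cards.filter((c) => "email" in c).map((c) => c._id);
console.log(withEmail); // [ 3 ]
```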

More detailed example

Address Book

  • What questions do I have? (use cases)
  • What are my entities?
  • What are my associations?
Typical address book entities and relationships:

Contacts (name, company, title)
Groups (name) n:n (with Contacts – one contact can belong to n groups, and one group can contain n contacts)
Twitters (name, location, web, bio) 1:1
Thumbnails, low-res image (mime_type, data) 1:1
Portraits, high-res image (mime_type, data) 1:1
Addresses (type, street, city) 1:n
Phones (type, number) 1:n
Email (type, address) 1:n

Schema

1:1 relationships first Twitter – Contact

Relational style: track the other entity with either a twitter_id or a contact_id (only one of these; keeping the relationship on both sides is redundant, and both references must then be updated for consistency)
Embedded: Embed the twitter info in Contact (mark in the design with a "1" that we're expecting one element)

Analysing the embedded solution

  • When we retrieve the Contact we have all of the information
  • embeds_one
  • can query or index on embedded field
  • exceptional case: storing the high resolution portrait image would not be efficient (we don't want to get it every time)

1:n relationship Phones – Contact

Relational style: in Contact create a phone_ids: [] array (not possible cleanly in true RDB) or add a contact_id in the Phone entity.
Embedded: array field embedded

Contact embeds multiple phones. No additional data duplication, can query or index on any field
Exceptional case: max doc size 16MB. 
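A sketch of the embedded 1:n layout for phones, with a lookup on a field inside the array elements. Names and numbers are placeholders. In the MongoDB shell a comparable filter would use the dotted form { "phones.type": "work" }.

```javascript
// Each contact embeds its phones as an array of small documents.
const contacts = [
  {
    _id: 1,
    name: "First Contact",
    phones: [
      { type: "home", number: "555-0100" },
      { type: "work", number: "555-0101" },
    ],
  },
  {
    _id: 2,
    name: "Second Contact",
    phones: [{ type: "home", number: "555-0102" }],
  },
];

// Find contacts that have at least one phone of type "work".
const withWorkPhone = contacts
  .filter((c) => c.phones.some((p) => p.type === "work"))
  .map((c) => c._id);

console.log(withWorkPhone); // [ 1 ]
```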

n:n relationship Groups – Contact

Relational style: A join table (in real RDB). In MongoDB, use arrays (each group has an array of _ids). 
Embedded style: It depends on how you're expecting to look up elements. By contact or by group, then embed accordingly. 
E.g. you often want to send emails to a Group (embed Contacts in the Group)
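A sketch of the array-of-_ids approach for Groups and Contacts, here with the ids stored on the contact side. The field name group_ids and both helper functions are assumptions for illustration. It shows that the relationship can be read in either direction, which is why the expected lookup direction decides where the array (or the embedded copy) should live.

```javascript
// Groups and contacts as separate collections, linked by an array of _ids.
const groups = [
  { _id: 10, name: "Family" },
  { _id: 11, name: "Work" },
];
const contacts = [
  { _id: 1, name: "First Contact", group_ids: [10, 11] },
  { _id: 2, name: "Second Contact", group_ids: [11] },
];

// Look up by group: which contacts belong to this group?
function contactsInGroup(groupId) {
  return contacts.filter((c) => c.group_ids.includes(groupId));
}

// Look up by contact: which groups does this contact belong to?
function groupsOfContact(contact) {
  return groups.filter((g) => contact.group_ids.includes(g._id));
}

console.log(contactsInGroup(11).map((c) => c.name));
console.log(groupsOfContact(contacts[1]).map((g) => g.name));
```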

Exceptional cases: the 16MB limit and scaling issues (? not clear from the presentation, around the 31-minute mark)

What we have so far with these solutions

Contacts embeds: Twitter, Thumbnail, Addresses, Phones, Emails
References: Portraits, and Groups (array of group _ids)
He shows an example of how this would look as a MongoDB document (JSON formatted).
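The slide itself is not reproduced in these notes, so the following is only a reconstruction of the shape described above, not the presenter's actual example. Every value is a placeholder, and the field names portrait_id and group_ids are assumptions. Field lists follow the entities noted earlier.

```javascript
// Hypothetical Contact document: embeds Twitter, Thumbnail, Addresses,
// Phones and Emails; references the Portrait and the Groups by _id.
const contact = {
  _id: 1,
  name: "Example Name",
  company: "Example Co",
  title: "Example Title",
  twitter: { name: "example", location: "Somewhere", web: "example.com", bio: "..." },
  thumbnail: { mime_type: "image/jpeg", data: "<small binary blob>" },
  addresses: [{ type: "work", street: "1 Example Way", city: "Example City" }],
  phones: [{ type: "work", number: "555-0100" }],
  emails: [{ type: "work", address: "name@example.com" }],
  portrait_id: 100,    // reference: the large image stays in its own collection
  group_ids: [10, 11], // reference: array of group _ids
};

console.log(JSON.stringify(contact, null, 2));
```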

Working Set

90/10 rule: Keep 90% of the data I access most often in memory. Reduce working set by:
  • reference large, bulk data, e.g. portrait photo
  • reference less-used data instead of embedding

General Recommendations

Legacy Migration

Migrating from relational systems:
  • just copy the data (works but doesn't take advantage of MongoDB)
Or migrate and adapt your schema:
  1. start with 1:1 associations first
  2. then 1:n 
  3. then n:n
New Software Application? Embed by default!
Only reference for the reasons given earlier (the 90/10 rule)

Embedding over Referencing

Embedding is a bit like pre-joined data. BSON (Binary JSON) document operations are easy for the server. 

Embed (90/10 rule)
Decide which object is the child and which is the parent (which one is viewed more often in the context of the other)
Embedding will always be faster than multiple fetches!

Reference advantages
  • scaling (large files)
  • better consistency with "many to many" associations with duplicated data

It's all about your application

There is no one right answer. Schema design should depend on what your users are going to do and how they are going to behave. 

Questions

- MongoDB vs Cassandra?
Both non-relational DBs. Cassandra multiple-key value schema

- Joining two collections
No join in MongoDB. It's two round-trips when you want to do an equivalent of a join.

- 16MB limit is going to change?
No plans to change, but there are workarounds: GridFS, a pseudo-file system.

- _ids in an embedded array
possible

- Oracle to MongoDB export
Through CSV, JSON or XML, then the MongoDB import tool