MongoDB Thinking in Documents webinar by Mike Friedman
Webinar: Back to Basics 1: Thinking in Documents | MongoDB
Mike Friedman
Agenda
- What is a Record?
- Core Concepts
- What is an Entity?
- Associating Entities
- General Recommendations
Why?
He argues that "All application development is Schema Design".
The software developer's first job is to design the data model, even if the data is never stored.
"Success in development comes down to proper data structure."
This holds whether the data is in memory or on disk.
What is a record?
Key -> Value
Key -> Blob
A single key is associated with a single value, which is a blob. The storage system (db) has no knowledge of what is inside this blob. It's an opaque chunk of data.
We can only query on the key. Because the data store has no knowledge of the contents of the blob, the blob can only be replaced as a whole, not updated in part.
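To make the replace-only behaviour concrete, here is a minimal sketch. It uses a plain Python dict as a stand-in for a key-value store (an assumption for illustration, not any particular product), and the helper names `put` and `get` and the new title value are hypothetical.

```python
import json

# A plain dict stands in for a key-value store; values are opaque bytes.
store = {}

def put(key, record):
    """Serialize the whole record and store it as one blob."""
    store[key] = json.dumps(record).encode("utf-8")

def get(key):
    """Fetch the blob by key and decode it on the client side."""
    return json.loads(store[key].decode("utf-8"))

put("contact:2", {"name": "Steven Jobs", "title": "New Product Development"})

# The store cannot change one field inside the blob, so an "update" is:
# read the blob, modify it on the client, then replace the whole value.
record = get("contact:2")
record["title"] = "Chairman"  # hypothetical new value
put("contact:2", record)

# Lookup works only by key. Asking for "all records with a given title"
# would mean reading and decoding every blob on the client.
print(get("contact:2")["title"])
```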
Relational
Tables with rows and columns. You can query by any column.
Two-dimensional storage (tuples). Each field contains a single value and you can query on any field.
Very structured schema, you can do in-place updates - update any cell.
Disadvantage: Normalization requires lots of tables, indexes and joins, and gives poor locality.
On a highly distributed system this can become a bottleneck.
Document model
N-dimensional storage - we can have as many dimensions as we want. Each field can contain 0, 1 or many values, or embedded values (other documents or arrays).
In MongoDB we can query on any field at any level (including embedded values).
I can look documents up by reaching down into them to any depth. The deeply nested structure is transparent.
The schema is flexible - not every document has to look the same. In general we want them to look similar. We don't have to redundantly represent every piece of data in every document where it doesn't apply.
In-place updates are possible, including updating a single value in an embedded document.
Optimal data locality - we can get all the data we need from the database without doing multiple queries (connections) or joins.
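As a rough illustration of updating a single value inside an embedded document, the sketch below works on a plain Python dict rather than a live database. The helper `set_path` is hypothetical; it mimics addressing a nested field by a dotted path, which in MongoDB roughly corresponds to an update using `$set` with dot notation. The `type` value is made up for the example.

```python
def set_path(document, dotted_path, value):
    """Set one value inside nested dicts, addressed by a dotted path."""
    *parents, leaf = dotted_path.split(".")
    node = document
    for key in parents:
        # Walk down, creating intermediate documents if they are missing.
        node = node.setdefault(key, {})
    node[leaf] = value

contact = {
    "_id": 2,
    "name": "Steven Jobs",
    "address": {"street": "10260 Bandley Dr", "city": "Cupertino"},
}

# Only the one embedded field changes; the rest of the document is untouched.
set_path(contact, "address.type", "work")  # "work" is an illustrative value
print(contact["address"])
```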
Core Concepts
Basic review of relational databases.
Traditional schema design focuses primarily on data storage: What are the atomic single pieces of data that we can isolate and put in one place so there's no duplication?
In document-based design the focus is on how the data is used: What questions are my users going to have that I will have to answer in an efficient way?
To compare:
Traditional: What answers do I have?
Document based: What questions do I have?
It's these questions that guide the design of the schema. For the same data there might be multiple correct answers, based on how the data will be used. In relational databases there is usually one correct data design.
Schema Design is Flexible
- Choices for schema design
- Each record can have different fields (the documents in a collection don't have to be exactly the same). A certain field can be used in only 10% of the documents in the collection
- Easy to evolve as needed. The DB doesn't force any schema, you can just simply add new fields.
Two Building Blocks of Document Schema Design
1 - Arrays
An ordered list of individual items. Items can be numbers, etc., or embedded arrays. Similar to JSON, with a broader range of types.
In general we want to use arrays when we have multiple values per field. Every field in a document can be:
- absent
- null
- single value
- array of many values
> Tell MongoDB to find the document that has a certain value in an array
It can be extremely powerful.
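The four states of a field, and the "find the document that has a certain value in an array" idea, can be sketched in plain Python. This is only an approximation of the matching behaviour, not MongoDB's query engine; the `tags` field and the helper `field_matches` are hypothetical names for the example.

```python
def field_matches(document, field, wanted):
    """True if the field equals `wanted` or is a list containing it."""
    value = document.get(field)  # absent and null both come back as None
    if isinstance(value, list):
        return wanted in value
    return value is not None and value == wanted

docs = [
    {"_id": 1, "name": "Contact A"},                            # field absent
    {"_id": 2, "name": "Contact B", "tags": None},              # null
    {"_id": 3, "name": "Contact C", "tags": "friend"},          # single value
    {"_id": 4, "name": "Contact D", "tags": ["friend", "work"]} # many values
]

# The same test works whether the field holds one value or an array.
matches = [d["_id"] for d in docs if field_matches(d, "tags", "friend")]
print(matches)  # [3, 4]
```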
2- Embedded Documents
Any field can in fact be another document. These are not foreign keys. The document is not somewhere else on the disk but embedded. You can retrieve the document with a single read without doing any extra joins or queries.
Use it when we want to apply structure to something that is relational.
You can query any field at any level (can be indexed). "Find me a document that contains a document that contains a key that has a certain value"
You can do it at any depth (like 12 levels deep)
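A small sketch of "find me a document that contains a document that contains a key that has a certain value", again over plain Python dicts. The helpers `values_at` and `find` are hypothetical and only approximate dotted-path matching (they also fan out over arrays of embedded documents); the company and phone fields are made up for the example.

```python
def values_at(node, keys):
    """Yield every value reachable by following `keys`, fanning out over lists."""
    if isinstance(node, list):
        for item in node:
            yield from values_at(item, keys)
    elif not keys:
        yield node
    elif isinstance(node, dict) and keys[0] in node:
        yield from values_at(node[keys[0]], keys[1:])

def find(collection, dotted_path, wanted):
    """Return the documents where some value at `dotted_path` equals `wanted`."""
    keys = dotted_path.split(".")
    return [doc for doc in collection if wanted in values_at(doc, keys)]

contacts = [
    {"_id": 1, "company": {"name": "Example Co", "address": {"city": "Cupertino"}}},
    {"_id": 2, "company": {"name": "Other Co", "address": {"city": "Austin"}}},
    {"_id": 3, "phones": [{"type": "work"}, {"type": "home"}]},
]

# Three levels deep through embedded documents.
print([d["_id"] for d in find(contacts, "company.address.city", "Cupertino")])  # [1]
# Through an array of embedded documents.
print([d["_id"] for d in find(contacts, "phones.type", "home")])  # [3]
```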
What is an Entity?
It's an object in your model that has associations with other entities.
Referencing (Relational) => Embedding (Document)
has_one => embeds_one (1:1 relationship)
belongs_to => embedded_in (1:n a foreign key that points to another)
has_many => embeds_many (a single record associated with many records)
has_and_belongs_to_many =>
Let's model something
Business Card
1. Referencing
two objects Contacts and Addresses
Contacts:
{ _id: 2, name:"Steven Jobs", title:"New Product Development", ..., address_id:1}
Addresses:
{_id: 1, street:"10260 Bandley Dr", city:"Cupertino",...}
_id: required in every document
address_id: foreign key to link an address
2. Embedding
Address object is embedded in Contact
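The two business card layouts can be compared side by side with plain Python dicts standing in for collections (an illustration only; the elided fields from the notes are left out, and the function names are hypothetical).

```python
# 1. Referencing: two collections, linked by address_id.
addresses = {1: {"_id": 1, "street": "10260 Bandley Dr", "city": "Cupertino"}}
contacts = {
    2: {"_id": 2, "name": "Steven Jobs", "title": "New Product Development",
        "address_id": 1},
}

def load_card_referenced(contact_id):
    contact = contacts[contact_id]              # first lookup
    address = addresses[contact["address_id"]]  # second lookup
    return contact["name"], address["city"]

# 2. Embedding: the address lives inside the contact document.
embedded_contacts = {
    2: {"_id": 2, "name": "Steven Jobs", "title": "New Product Development",
        "address": {"street": "10260 Bandley Dr", "city": "Cupertino"}},
}

def load_card_embedded(contact_id):
    contact = embedded_contacts[contact_id]     # single lookup
    return contact["name"], contact["address"]["city"]

print(load_card_referenced(2))
print(load_card_embedded(2))
```

Both calls return the same card data; the embedded version gets there with one read instead of two.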
Differences between the two
Embedded doc has excellent data locality
What if someone has the same address (two employees of the same company)? I would need to update the address in several places. Not a problem in relational.
Depends on the application you are developing.
Schema Flexibility
It is easy to add an extra field, such as an email address, to the document for a new business card.
In relational it would be more complicated (lots of extra null-value cells).
More detailed example
Address Book
- What questions do I have? (use cases)
- What are my entities?
- What are my associations?
Contacts (name, company, title)
Groups (name) n:n (with Contacts – one contact can belong to n groups, and one group can contain n contacts)
Twitters (name, location, web, bio) 1:1
Thumbnails small rez image (mime_type, data) 1:1
Portraits high rez image (mime_type, data) 1:1
Addresses (type, street, city) 1:n
Phones (type, number) 1:n
Email (type, address) 1:n
Schema
1:1 relationships first Twitter – Contact
Relational style: track the other with either a twitter_id or a contact_id (use one of these; a redundant relationship on both sides means both references must be updated for consistency).
Embedded: Embed the twitter info in Contact (mark in the design with a "1" that we're expecting one element)
Analysing the embedded solution
- When we retrieve the Contact we have all of the information
embeds_one
- can query or index on embedded field
- exceptional case: storing the high resolution portrait image would not be efficient (we don't want to get it every time)
1:n relationship Phones – Contact
Relational style: in Contact create a phone_ids: [] array (not possible cleanly in true RDB) or add a contact_id in the Phone entity.
Embedded: array field embedded
Contact embeds multiple phones. No additional data duplication, can query or index on any field
Exceptional case: max doc size 16MB.
n:n relationship Groups – Contact
Relational style: A join table (in real RDB). In MongoDB, use arrays (each group has an array of _ids).
Embedded style: It depends on how you're expecting to look up elements. By contact or by group, then embed accordingly.
E.g. you often want to send emails to a Group (embed Contacts in the Group)
Exceptional cases: 16MB and scaling issues (? not clear from the presentation, around the 31-minute mark)
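A sketch of the array-of-_ids approach for Groups and Contacts, using plain Python lists as stand-in collections. Here the array is kept on the contact side as a hypothetical `group_ids` field; keeping it on the group side instead works the same way in reverse. All names and values are made up for the example.

```python
groups = [
    {"_id": 10, "name": "Family"},
    {"_id": 11, "name": "Work"},
]
contacts = [
    {"_id": 1, "name": "Contact A", "group_ids": [10, 11]},
    {"_id": 2, "name": "Contact B", "group_ids": [11]},
]

def contacts_in_group(group_id):
    """One pass over contacts: match the id against each array."""
    return [c["name"] for c in contacts if group_id in c["group_ids"]]

def groups_of_contact(contact_id):
    """Two steps: read the contact, then fetch the groups it references."""
    contact = next(c for c in contacts if c["_id"] == contact_id)
    wanted = set(contact["group_ids"])
    return [g["name"] for g in groups if g["_id"] in wanted]

print(contacts_in_group(11))   # ['Contact A', 'Contact B']
print(groups_of_contact(1))    # ['Family', 'Work']
```

Which side holds the array depends on the lookup you expect most often, as noted above.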
What we have so far with these solutions
Contacts embeds: Twitter, Thumbnail, Addresses, Phones, Emails
References: Portraits, and Groups (array of group _ids)
Shows an example of how it would look as a MongoDB document, JSON formatted.
Working Set
90/10 rule: Keep 90% of the data I access most often in memory. Reduce working set by:
- reference large, bulk data, e.g. portrait photo
- reference less-used data instead of embedding
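To get a feel for why referencing bulk data shrinks the working set, the sketch below compares a contact with an embedded portrait against one that only references it. The sizes are made-up stand-ins, `portrait_id` is a hypothetical field name, and the JSON length is only a rough proxy for the real stored document size.

```python
import json

def approx_size(document):
    """Rough size proxy: length of the JSON text, not the real stored size."""
    return len(json.dumps(document))

thumbnail = {"mime_type": "image/png", "data": "A" * 2_000}   # small stand-in
portrait = {"mime_type": "image/png", "data": "A" * 200_000}  # large stand-in

# Portrait embedded: every read of the contact drags the big image along.
fat_contact = {"_id": 1, "name": "Contact A",
               "thumbnail": thumbnail, "portrait": portrait}

# Portrait referenced: the contact stays small; the image is fetched on demand.
portraits = {100: dict(portrait, _id=100)}
lean_contact = {"_id": 1, "name": "Contact A",
                "thumbnail": thumbnail, "portrait_id": 100}

print(approx_size(fat_contact), approx_size(lean_contact))
```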
General Recommendations
Legacy Migration
Migrating from relational systems:
- just copy the data (works but doesn't take advantage of MongoDB)
- start with 1:1 associations first
- then 1:n
- then n:n
Only reference for the reasons above (see the 90/10 rule).
Embedding over Referencing
Embedding is a bit like pre-joined data. BSON (Binary JSON) document operations are easy for the server.
Embed (90/10 rule)
Decide which object is the child and which is the parent (which one is viewed more often in the context of the other)
Embedding will always be faster than multiple fetches!
Reference advantages
- scaling (large files)
- better consistency with "many to many" associations with duplicated data
It's all about your application
There is no one right answer. Schema design should depend on what your users are going to do and how they are going to behave.
Questions
- MongoDB vs Cassandra?
Both non-relational DBs. Cassandra multiple-key value schema
- Joining two collections
No join in MongoDB. It's two round-trips when you want to do an equivalent of a join.
- Is the 16MB limit going to change?
No plans to change it, but there are workarounds: GridFS, a pseudo-file system.
- _ids in an embedded array
possible
- Oracle to MongoDB export
Through CSV, JSON or XML, then the MongoDB import tool