Introduction to NoSQL Document Databases

Document databases are intended for semi-structured data. Implementations differ widely in architecture and functionality but all coalesce around the central abstract concept of a Document (apparently inspired by the venerable Lotus Notes).

A common analogy of a Document is that of the DOM. DOM elements are composed of identifiers in the form of a label and an associated object.

The distinction between a Document and a collection of Keys and Values aggregated into a View, is subjective and prone to semantic emphasis. From a developers perspective, the distinction becomes irrelevant once abstracted behind an API or Query Language.

Like KV stores, a query of a Document Database is considered a View. Views are optimized via cached indexing. Both need to be periodically refreshed and rebuild. View are analogous to Dimensions in a traditional relational database context.

The concept of a Document removes any logical distinction between the document and its meta data. Meta data is promoted to a first class object in the DOM analogy.

Document objects typically conform to a schema, though the schema is rarely explicit. A schema may be unique to each Document in the store. The rigidity of schema is not  determined by the Document database but by the designer and the shape and uniformity of the data. Document Databases are schema-agnostic rather than schema-less.

Documents are portable and interoperable via XML, JSON, YAML, BSON. The Document Database typically supports familiar document schemas like, DTD, JATS, HTML, XSD but doesn’t strictly enforce them.

Individual documents are identified and accessed via a unique key. Typically a URI path expressed as a string. Keys are indexed for performance like any other Database. This URI tree is not dissimilar to LDAP DNs.

Documents may be organized into Collections, Buckets or Hierarchies depending on the semantics of the implementation. In some implementations, documents can contain hierarchical Collections of sub-Documents. This suits continuous, small, volatile read and write operations.

Distinguishing a Document Database from a Persistent Object Database or an XML Database is not easy. All can legitimately be assigned to the NoSQL class of Databases. To complicate matters, many relational databases now support exotic data types like XML, GeoJSON and executable blobs. The differentiator is subtle differences in emphasis in CAP Theorem and ACID compliance.

Document Databases compromise Consistency in favor or Availability and Partition tolerance and and are generally not ACID compliant (with exceptions like OrientDB and DocumentDB).

Selecting a Document Database over a Relational Database is guided by application specific CAP and ACID requirements. Relational databases excel at Consistency and if this is an overriding priority along with ACID compliance than Document Databases are probably not a fit.

Each Document Database implementation places a subtly different emphasis on Availability and Partition tolerance. Eventual Consistency is a hallmark of distributed Document Databases. The lack of guaranteed immediate consistency, implies these are Document data stores rather than databases. This remains a largely semantic debate.

The favored means of accessing Document Databases are RESTful APIs. Familiar CRUD transactions remain and existing proxy and load balancing systems can be exploited. While the REST API is not universal it is increasingly popular.

CouchDB and MongoDB provide language specific APIs, abstracting TCP connection strings and session negotiation as a convenience for developers. RavenDB offers a .Net client APIClearly, these APIs are written for software developers rather than operational IT DBAs and Reporting Analysts. The developer friendly REST only option, remains a barrier to enterprise adoption. With time this will change as demand for additional operational and integration interfaces gain traction.

While Relational Databases are capable of storing shaped data like trees, nodes and vertices, it is cumbersome at best and querying them usually requires precomputed Dimensions and Indexes. Document databases are well suited to heavily nested, hierarchical datasets involving nodes of factors, vectors, arrays and blobs with rich expressive vertices.

CouchDB and RavenDB store Documents as JSON objects. MongoDB extends this to Binary JSON (BSON) allowing binary serialization. JSON is favored for the ease with which an object can be transposed, eliminating the need for object-relational mappers to translate between relational schemas and hierarchical class and object schema (ORMs like ActiveRecord and DataMapper).

Pivoting a relational table typically involves expensive Union operations in ANSI compliant SQL, or implementation specific extensions like those found in TSQL. This friction is eliminated when it can be performed natively via the API after declaring the attributes of the class or object to return.

In cases like Microsoft’s hosted Azure DocumentDB, JavaScript is the query language and hosted in the same memory space as the database. Subsequently Triggers and Stored Procedures written in JavaScript become synonymous with functions and execute in the same scope as the database session. This guarantees ACID for all operations but traditional separation of concerns embodied in patterns like MVC are not enforced in this implementation. While cloud hosted, the developer experience is of a tightly integrated, embedded, in memory Document database.

Amazon SimpleDB offers a proprietary but SQL-looking query language that will be familiar to DBAs and Web developers. It’s a simple data store with no ambition to embed business logic in triggers or stored procs. This slimmed down approach will appeal web, mobile and MVC advocates.

CouchDB offers Update Handlers and Change Notifiers for trigger and stored proc functionality. These server side Erlang functions are accessed via the REST API. CouchDB gained early developer acceptance and more recently credibility, thanks to the governance of the Apache Foundation. Many of the original development team have moved to Couchbase, forking their original work. Their goal is to replace performance sensitive Erlang subsystems with native C and C++ , while maintaining Memcached and CouchDB compatibility. While still Apache licensed it is clearly targeting the enterprise with Couchbase Server.

Unlike SQL query optimizers, CouchDB exploits Erlangs concurrency capabilities aiming for uniform performance at scale rather than maximizing performance of each query.

Touring-complete languages, DSLs and Domain Driven Development practices guide the planning of Domain Classes. These easily translate to Documents and Collections.

For example, a patient Document (or Class) might contain a Collection of Lab Documents (or Attributes) with patient specific results (or Objects). Getter and Setter methods translate via the API to GET and PUT requests and then to Document transactions as reads and writes. The conceptual translation is elegant.

Joining a group of patient Documents that have had a particular Lab test, implies existence of relationships and foreign keys. Each Document Database implementation offers different guidance on the various language specific patterns to facilitate this. MongoDB offers a pattern called Array of Ancestors to enable these sorts of Joins. The developer investment in learning these database specific patterns is non-trivial and beyond the reach of most DBAs.

Such joins always raise concerns about repeating data and Relational Databases normalize to mitigate this. In Document Databases, denormalizing is useful, acceptable and even encouraged in places.