Introduction to NoSQL Document Databases

Document databases are intended for semi-structured data. Implementations differ widely in architecture and functionality, but all coalesce around the central abstraction of a Document (apparently inspired by the venerable Lotus Notes).

A common analogy for a Document is the DOM: DOM elements are composed of an identifying label and an associated object.

The distinction between a Document and a collection of Keys and Values aggregated into a View is subjective and prone to semantic emphasis. From a developer's perspective, the distinction becomes irrelevant once abstracted behind an API or query language.

Like KV stores, a query of a Document Database is considered a View. Views are optimized via cached indexing; both need to be periodically refreshed and rebuilt. Views are analogous to Dimensions in a traditional relational database context.
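A map-style view can be sketched in a few lines: a map function emits key/value rows for each document, and those rows are cached as an index that must be rebuilt as documents change. A minimal Python sketch follows; the names are illustrative, not any product's API.

```python
# A minimal sketch of a document-store "view": a map function is run
# over every document and its emitted rows are cached as an index.
# All names here are illustrative, not any particular product's API.

def map_by_type(doc):
    # Emit one (key, value) row per document, keyed on a field.
    if "type" in doc:
        yield (doc["type"], doc["_id"])

def build_view(docs, map_fn):
    """Materialize the view index; a real store rebuilds this
    incrementally (or lazily) as documents change."""
    index = {}
    for doc in docs:
        for key, value in map_fn(doc):
            index.setdefault(key, []).append(value)
    return index

docs = [
    {"_id": "p1", "type": "patient"},
    {"_id": "l1", "type": "lab"},
    {"_id": "p2", "type": "patient"},
]

view = build_view(docs, map_by_type)
print(view["patient"])  # ['p1', 'p2']
```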

The concept of a Document removes any logical distinction between the document and its metadata. Metadata is promoted to a first-class object in the DOM analogy.

Document objects typically conform to a schema, though the schema is rarely explicit. A schema may be unique to each Document in the store. The rigidity of the schema is determined not by the Document database but by the designer and the shape and uniformity of the data. Document Databases are schema-agnostic rather than schema-less.
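The schema-agnostic point is easy to see with two documents of different shapes living in the same collection. A Python sketch, with illustrative field names:

```python
# Two documents in the same collection with different shapes: the
# store does not object, and any structural rules are the designer's.
collection = [
    {"_id": "pat-001", "name": "Ada", "allergies": ["penicillin"]},
    {"_id": "pat-002", "name": "Grace", "insurer": {"plan": "HMO", "tier": 2}},
]

# Queries must therefore tolerate absent fields.
with_allergies = [d for d in collection if "allergies" in d]
print([d["_id"] for d in with_allergies])  # ['pat-001']
```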

Documents are portable and interoperable via XML, JSON, YAML and BSON. The Document Database typically supports familiar document schemas like DTD, JATS, HTML and XSD, but doesn't strictly enforce them.

Individual documents are identified and accessed via a unique key, typically a URI path expressed as a string. Keys are indexed for performance, as in any other database. This URI tree is not dissimilar to LDAP DNs.

Documents may be organized into Collections, Buckets or Hierarchies depending on the semantics of the implementation. In some implementations, documents can contain hierarchical Collections of sub-Documents. This suits continuous, small, volatile read and write operations.

Distinguishing a Document Database from a Persistent Object Database or an XML Database is not easy. All can legitimately be assigned to the NoSQL class of Databases. To complicate matters, many relational databases now support exotic data types like XML, GeoJSON and executable blobs. The differentiators are subtle differences in emphasis on the CAP theorem and ACID compliance.

Document Databases compromise Consistency in favor of Availability and Partition tolerance, and are generally not ACID compliant (with exceptions like OrientDB and DocumentDB).

Selecting a Document Database over a Relational Database is guided by application-specific CAP and ACID requirements. Relational databases excel at Consistency, and if this is an overriding priority along with ACID compliance, then Document Databases are probably not a fit.

Each Document Database implementation places a subtly different emphasis on Availability and Partition tolerance. Eventual Consistency is a hallmark of distributed Document Databases. The lack of guaranteed immediate consistency implies these are Document data stores rather than databases, though this remains a largely semantic debate.

The favored means of accessing Document Databases is a RESTful API. Familiar CRUD transactions remain, and existing proxy and load-balancing systems can be exploited. While the REST API is not universal, it is increasingly popular.
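The CRUD-to-REST mapping can be sketched as a pure function. The URL shape below follows CouchDB's `/db/doc_id` convention, but treat the specifics as illustrative:

```python
def crud_request(op, base, db, doc_id=None):
    """Map a CRUD operation onto an HTTP method and URL
    (CouchDB-style /db/doc_id paths; illustrative only)."""
    url = f"{base}/{db}" + (f"/{doc_id}" if doc_id else "")
    method = {
        "create": "PUT",    # PUT with a chosen id (POST lets the server pick one)
        "read":   "GET",
        "update": "PUT",    # full-document replace
        "delete": "DELETE",
    }[op]
    return method, url

print(crud_request("read", "http://localhost:5984", "patients", "pat-001"))
# ('GET', 'http://localhost:5984/patients/pat-001')
```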

CouchDB and MongoDB provide language-specific APIs, abstracting TCP connection strings and session negotiation as a convenience for developers. RavenDB offers a .NET client API. Clearly, these APIs are written for software developers rather than operational IT DBAs and Reporting Analysts. The developer-friendly, REST-only option remains a barrier to enterprise adoption. With time this will change as demand for additional operational and integration interfaces gains traction.

While Relational Databases are capable of storing shaped data like trees, nodes and vertices, doing so is cumbersome at best, and querying them usually requires precomputed Dimensions and Indexes. Document databases are well suited to heavily nested, hierarchical datasets involving nodes of factors, vectors, arrays and blobs with rich, expressive vertices.

CouchDB and RavenDB store Documents as JSON objects. MongoDB extends this to Binary JSON (BSON), allowing binary serialization. JSON is favored for the ease with which an object can be transposed, eliminating the need for object-relational mappers (ORMs like ActiveRecord and DataMapper) to translate between relational schemas and hierarchical class and object schemas.
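The ORM point is visible in a few lines: a native object graph serializes straight to a JSON document and back, with no relational mapping layer in between.

```python
import json

# A nested object graph serializes directly to a JSON document --
# no object-relational mapping layer required.
patient = {
    "_id": "pat-001",
    "name": "Ada",
    "labs": [{"test": "HbA1c", "value": 5.4}],
}

doc = json.dumps(patient)          # object -> document
restored = json.loads(doc)         # document -> object

print(restored == patient)  # True
```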

Pivoting a relational table typically involves expensive Union operations in ANSI-compliant SQL, or implementation-specific extensions like those found in T-SQL. This friction is eliminated when the pivot can be performed natively via the API after declaring the attributes of the class or object to return.
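For comparison, here is that reshaping done over relational rows in application code; grouping rows into one nested structure per key is the pivot a document store returns natively (column names are illustrative):

```python
from collections import defaultdict

# Relational rows: (patient_id, test, value). Pivoting these into one
# nested object per patient is what a document store returns natively.
rows = [
    ("pat-001", "HbA1c", 5.4),
    ("pat-001", "LDL", 99.0),
    ("pat-002", "HbA1c", 6.1),
]

pivoted = defaultdict(dict)
for patient_id, test, value in rows:
    pivoted[patient_id][test] = value

print(dict(pivoted)["pat-001"])  # {'HbA1c': 5.4, 'LDL': 99.0}
```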

In cases like Microsoft's hosted Azure DocumentDB, JavaScript is the query language and is hosted in the same memory space as the database. Consequently, Triggers and Stored Procedures written in JavaScript become synonymous with functions and execute in the same scope as the database session. This guarantees ACID for all operations, but the traditional separation of concerns embodied in patterns like MVC is not enforced in this implementation. While cloud hosted, the developer experience is of a tightly integrated, embedded, in-memory Document database.

Amazon SimpleDB offers a proprietary but SQL-looking query language that will be familiar to DBAs and Web developers. It's a simple data store with no ambition to embed business logic in triggers or stored procs. This slimmed-down approach will appeal to web, mobile and MVC advocates.

CouchDB offers Update Handlers and Change Notifiers for trigger and stored-proc functionality. These server-side Erlang functions are accessed via the REST API. CouchDB gained early developer acceptance and, more recently, credibility, thanks to the governance of the Apache Foundation. Many of the original development team have moved to Couchbase, forking their original work. Their goal is to replace performance-sensitive Erlang subsystems with native C and C++, while maintaining Memcached and CouchDB compatibility. While still Apache licensed, it is clearly targeting the enterprise with Couchbase Server.

Unlike SQL query optimizers, CouchDB exploits Erlang's concurrency capabilities, aiming for uniform performance at scale rather than maximizing the performance of each query.

Turing-complete languages, DSLs and Domain-Driven Design practices guide the planning of Domain Classes. These translate easily to Documents and Collections.

For example, a patient Document (or Class) might contain a Collection of Lab Documents (or Attributes) with patient specific results (or Objects). Getter and Setter methods translate via the API to GET and PUT requests and then to Document transactions as reads and writes. The conceptual translation is elegant.
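The translation can be sketched as a thin class whose accessors back onto GET and PUT requests, here against an in-memory dict standing in for the store; the names are hypothetical:

```python
# A thin sketch of the getter/setter -> GET/PUT translation, backed by
# an in-memory dict standing in for the document store. Names are
# hypothetical; a real client would issue HTTP requests instead.
store = {}

class PatientDoc:
    def __init__(self, patient_id):
        self.path = f"/patients/{patient_id}"   # the document's key

    def put(self, doc):                          # setter -> PUT /patients/<id>
        store[self.path] = doc

    def get(self):                               # getter -> GET /patients/<id>
        return store[self.path]

p = PatientDoc("pat-001")
p.put({"name": "Ada", "labs": [{"test": "HbA1c", "value": 5.4}]})
print(p.get()["labs"][0]["test"])  # HbA1c
```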

Joining a group of patient Documents that have had a particular Lab test implies the existence of relationships and foreign keys. Each Document Database implementation offers different guidance on the various language-specific patterns to facilitate this. MongoDB offers a pattern called Array of Ancestors to enable these sorts of Joins. The developer investment in learning these database-specific patterns is non-trivial and beyond the reach of most DBAs.
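The Array of Ancestors pattern stores, on each node, the full list of its ancestors, so a tree query becomes a single indexed membership test (in MongoDB itself, roughly find({"ancestors": x})). A Python sketch of the idea, with illustrative document ids:

```python
# Array of Ancestors: each document carries the full path of its
# ancestors, so tree queries become flat membership tests.
docs = [
    {"_id": "cardiology", "ancestors": []},
    {"_id": "pat-001", "ancestors": ["cardiology"]},
    {"_id": "lab-hba1c-1", "ancestors": ["cardiology", "pat-001"]},
    {"_id": "lab-ldl-1", "ancestors": ["cardiology", "pat-001"]},
]

def descendants_of(node_id):
    # In MongoDB this is db.docs.find({"ancestors": node_id}),
    # answered from an index on the ancestors array.
    return [d["_id"] for d in docs if node_id in d["ancestors"]]

print(descendants_of("pat-001"))  # ['lab-hba1c-1', 'lab-ldl-1']
```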

Such joins always raise concerns about repeating data, and Relational Databases normalize to mitigate this. In Document Databases, denormalizing is useful, acceptable and even encouraged in places.
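A small sketch of denormalization in practice: each patient document embeds a copy of the lab's descriptive fields rather than referencing a shared labs table, duplicating data on write in exchange for single-fetch reads. Field names are illustrative.

```python
# Denormalized: each document embeds the lab's descriptive fields,
# duplicating them across patients in exchange for single-fetch reads.
patients = [
    {"_id": "pat-001",
     "labs": [{"test": "HbA1c", "units": "%", "value": 5.4}]},
    {"_id": "pat-002",
     "labs": [{"test": "HbA1c", "units": "%", "value": 6.1}]},
]

# One read returns everything needed to display the result -- no join.
lab = patients[0]["labs"][0]
print(f'{lab["test"]}: {lab["value"]}{lab["units"]}')  # HbA1c: 5.4%
```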

Is NoSQL worth the attention of Enterprise IT?

Formulating a definition of NoSQL that ages gracefully is challenging. Current usage encompasses the NoSQL technology set, the wider data lake movement and the NoSQL community with the inevitable inner circle of technologists and thought leaders.

As with all potentially disruptive technologies there’s also an emerging anti-NoSQL movement masquerading as the NewSQL movement. It includes its own tech celebrities, threatened interests and established and emerging product lines.

A short and simple definition a CIO can take to a budget review board involves using terms like Big Data and Data Science. If we're going to use these terms, we're best served by describing NoSQL as a solution set for a specific class of business problem.

Mainstream IT observers perceive NoSQL as a nascent technology ecosystem. Even technical, well informed analysts struggle to stay abreast of the myriad technical differentiators; many of which are semantic reinterpretations of the same thing. However, all agree this is a fragmented and rapidly evolving market, with a few high profile early adopters and many vociferous evangelists.

Distinguishing between genuine pioneers and fast-followers depends largely on professional relationships and insider knowledge. So the wider jury is still out on whether this technology has the potential to disrupt and ultimately transform mainstream enterprise IT departments. Most independent commentators suggest Business Intelligence and Data Science will be the eventual beneficiaries of the NoSQL movement.

There is no doubting the deep thought leadership and advanced technology embodied in NoSQL. Mass market acceptance is waiting for a sustained and verifiable track record of solved business problems or undisputed attribution of incremental sales or accretive revenues.

NoSQL is generally characterized as an emerging class of data stores intended for problems ill-suited to Relational Database Management Systems (RDBMS). NoSQL data stores were conceived to address a problem space that did not emerge until the advent of the massively connected internet age.

What is this problem space?

This set of problems is characterized as involving large volumes of very diverse data. The data is typically nested, hierarchical and highly interconnected, meaning the nature of the connections holds as much importance as the objects themselves, if not more. Metadata becomes indistinguishable from transactional data. This data lends itself to inferential statistical analysis and machine learning.

Holding this sort of data in a columnar or tabular form (tables of records and fields) is cumbersome and inefficient. RDBMS are best suited to transactional rows of records, and in particular OLTP use cases. Collecting, aggregating and synthesizing information from relationships stored in this form is complex. In the RDBMS realm this gives rise to very complex joins and brittle SQL queries that take hours if not days to run. Tuning indexes and optimizing query plans only adds to code complexity and the subsequent maintenance overhead. This sort of reporting query quickly takes on a batch nature and bleeds into OLAP use cases.

Data types in these heavily nested, hierarchical datasets might include factors, vectors and even associative arrays. These unorthodox data types may be part of long chains of evolving relationships and patterns of determinants that are not known when the data store is established, essentially making conventional schema design an almost impossible challenge for application developers.

So what sort of data could possibly look like this?

A health care record would be considered hierarchical with the patient at the root. Credit card transaction data with multiple card holders (say, husband and wife) would have the account number as the root node and the transactions of each spouse as a nested sub-tree. Law enforcement agencies build networks of criminal associations and activities to identify or infer connections. Telecom carriers associate customers with preferred contacts and call patterns to optimize pricing models. Amazon tracks the purchasing habits of “Customers Like You” to up-sell your shopping experience. And ultimately, Facebook can tell their advertisers what color underwear a particular demographic likes to wear on a Monday in January.

It’s important to note that insights like this come not from the NoSQL data store technology itself but the inferential statistics and analysis they enable. They significantly reduce the Extract, Transform, Load (ETL) burden on data miners and statistical analysts.

What are the advantages?

The most obvious advantage of NoSQL data stores is the schema-agnostic tree architecture, unbounded horizontally and vertically, making it versatile and promiscuous. A conceptual comparison to LDAP directory forests is reasonable and, some would suggest, suspiciously so.

Hierarchical, flat-file data stores have been in common use since the 1950s (e.g. MUMPS, IBM VSAM). The contemporary differentiator is the surrounding ecosystem of modern query and programming languages, in particular the suite of HTML5 standards and specifically JavaScript (ECMAScript) and JSON.

A mature sub-industry exists to bridge the gap between serious developers (and their serious C-like programming languages) and the query languages favored by Administrators and Reporting Analysts trained in SQL-like scripting and querying.

Object-Relational Mappers and Persistent Object Caches exist to make relational data stores appear and behave in an Object Oriented fashion. This abstraction layer of frameworks and libraries is unnecessary when NoSQL queries return native JSON or XML objects; the former favored by younger, web-oriented developers and the latter by the older ‘curly bracket’ ({}) crowd (C, C++, C# and Java).

Herein lies an important division between two communities: the software engineers, and the ops-oriented DBAs and Reporting Analysts. The former seek the familiarity of an Object Oriented programmer's paradigm, the latter the relative simplicity of function-oriented scripting and query languages. Big Data Scientists sit firmly in the middle, happy to be free of conventional constraints like schema and willing to pay the price by working in an object-oriented paradigm (such as R for Statistical Computing).

How to choose?

Selecting a NoSQL solution is daunting. There are a large number of well-supported Open Source projects, many sponsored by credible and generous commercial patrons. There are strictly proprietary, commercially licensed product offerings from innovative startups to the major household names. There are hybrids: freemium offerings from professional service firms consulting on their preferred Open Source project.

A systematic evaluation of the technology set, concepts, architectures, approaches and patterns is a non-trivial but viable proposition. It comes down to a systems engineering study evaluating a couple of dozen key “Products” and their commercial risk (maturity, support, roadmap etc.).

Broadly speaking, NoSQL data stores fall into two classes: Graph Stores (GS) and Key-Value Stores (KV). Document Stores can be considered a sub-type of the Key-Value Store.

A third class, called Wide Column Stores, aligns more closely with the NewSQL movement, which I'll treat as out of scope for now.

Choosing between GS and KV depends entirely on the nature of the business problem and the shape of the data involved. However, selecting a good GS or KV involves considering some common factors, namely maturity and penetration.

Maturity would seem to be subjective for such a new class of technology. NoSQL projects can be evaluated on a number of factors that can be weighted and normalized to reflect the nature of the business problem or enterprise culture.

  1. size of the developer community
  2. number and frequency of code commits
  3. number and stability of prior releases
  4. number of open issues and closure rate
  5. credibility or availability of a product road map
  6. structure of project governance and decision making
  7. query language accessibility

Most of these items can be gleaned from the project GitHub repository and StackOverflow activity.
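The weighting and normalizing step might look like this; the factor names follow the list above, and the weights are placeholders to be tuned to the enterprise:

```python
# Weighted, normalized maturity score for a candidate project.
# Raw scores are on an arbitrary 0-10 scale; the weights are
# placeholders reflecting what a given enterprise cares about.
weights = {"community": 3, "commit_rate": 2, "releases": 2,
           "issue_closure": 1, "roadmap": 1, "governance": 2,
           "query_language": 1}

def maturity(raw_scores):
    total = sum(weights[k] * raw_scores[k] for k in weights)
    max_total = sum(w * 10 for w in weights.values())
    return total / max_total          # normalized to 0..1

candidate = {"community": 8, "commit_rate": 7, "releases": 6,
             "issue_closure": 5, "roadmap": 4, "governance": 9,
             "query_language": 7}
print(round(maturity(candidate), 3))  # 0.7
```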

Enterprise penetration is largely driven by the type of release cycle. Fixed Schedule release cycles versus Fixed Feature release cycles can significantly influence the perceived level of adoption risk.

Health Care Population Health

Populations are connected. There are no isolated pieces of information, but rich, connected domains. Any citizen can be related in any number of ways: family ties, lifestyle preferences, hereditary predispositions, medical conditions, purchasing preferences. The strength and patterns of these relationships are the primary determinant of a population's overall health.

A citizen predisposed to lousy health is likely to have a better quality of life if part of a healthy population; and the inverse holds true (an untested, personal hypothesis!).

Collecting, aggregating and synthesizing information from millions of relationships is complex. Performing this on sub-populations is a task ill suited to traditional tools and approaches.

The emergence of Network Graph[^1] stores has been driven by Social Network Analysis (SNA) to solve problems in Public Health, Law Enforcement, Retailing, Economics and Social Sciences. Little attention has been focused on Population Health in a private, accountable health care context.

Challenged with managing the health of a population, could we conceive of all the necessary data points and store them in a tabular datastore like an Excel spreadsheet? Do all these data points exist in the EMR?

Does this challenge involve long chains of evolving relationships and patterns of interaction and determinants? Might these include contextual, non-clinical factors? These would seem to be essential considerations in an epidemiologic or risk analysis of a population.

The patient population is increasingly heterogeneous, and increasing social and geographic mobility fuels patient churn.

So what?

The vendors circling hungrily above all health provider systems have yet to offer a credible Population Health Management solution (personal opinion!). Credibility would seem to come from any two of the following:

  1. Domain thought leadership
  2. Advanced, breakthrough technology
  3. Quantitative track record of success

Clearly many millions of dollars have been invested in product and service development and this has convinced many health systems to place an investment bet on them. All involve considerable barriers to entry with few guarantees of ROI. These early customers are ‘fast followers’ of emerging trends and heralded as progressive leaders by vendors seeking out other followers.

While no clear solution has emerged, the Population Health mandate and Quality Measures continue closing in on us. The clock continues to tick on our window for informed discussion and decision making.

If no vendor steps forward with a compelling, credible, affordable solution to our problem, what do Clinical Intelligence Analysts do? And even if one does, how do we hedge our risk on them?

Plan & Prepare

As we plan, we are forced to consider our definition of “Population Health” (the knowledge domain) as this informs our “Management” strategy (tools, processes, targets).

Health System Administrators are legitimately inclined towards a business-first definition focused on CMS Quality Outcome Measures.

A business-first definition would encourage optimization of our operating model, consistent with general high-volume, low-margin service businesses. The evolution of service industries suggests adoption of multi-class service models to accommodate population outliers (Hotels, Airlines, Banking, Automotive and Retail offer some insight).

An alternative, viable option is an academic, clinical, science or mission-first definition. This less obvious choice would be a broader superset of the business-first definition.

Such a choice would encourage consideration of network and psychographic factors of the population. This would seem necessary to effect the behavior changes that evidence suggests are needed for improvement in longer-term health outcomes.

If a Health System chooses a definition broader than business-first, it will plan, prepare and prioritize differently. Subsequently, needs shift from near-time reporting to forecasting, anticipating, reasoning and prediction. These are skills and competencies that will need to be developed across the continuum of care, irrespective of facility size.

Interactive Activity Heat Maps with Meditech Data


Cal-Heatmaps with CSV data

A simple set of multivariate, time-series heat maps using cal-heatmap, d3.js and data from the Meditech Dirty Repository.

Code is now at the GitHub Repo (I’ve abandoned the infuriating and frankly futile process of trying to embed and format code blocks here).

The trailing seven days of data is illustrated by day and summarized by hour of each day.

Each cell represents an hour and the roll-over tool tip summarizes the activity for that hour.

The data source is a series of flat files (CSV) exported from the Meditech DR. You can automate this CSV Export with some PowerShell.


Save the PowerShell script (the code is in the GitHub Repo) with a .PS1 file extension and schedule it to execute with either Windows Task Scheduler or a SQL Server Agent Job.

If you want to connect the visualization to a live query and feed the visualization directly from the DR, see this post.

GitHub Setup for Windows in Corporate Environments

These steps are necessary to get Git working with GitHub from behind a corporate firewall. These steps are for a Microsoft Windows centric environment.

Installing the Windows Git Package

Download the Windows installer, then run the installer with the recommended default settings. You’ll need sufficient privileges to install applications on your desktop (condolences if you don’t).


Configure Git Proxy Settings

Locate the .gitconfig file in your user's home directory and open it with a text editor. If you have roaming profiles, it can be challenging to find, as it may be mapped to a shared network drive.

You can locate the file by identifying the home directory with the SET command in the DOS Command Window. In this example it happens to be on a network share mapped to the V: drive.

Make a note of your system Proxy Server settings here.


Looking at my .gitconfig file, there are six data points needed to configure Git to pass through my Proxy Server.

You can do this configuration from the command line with git config --global, but it feels simpler to just edit it here with your text editor.


  1. Your NTLM or Windows domain followed by a backslash (\), which has to be escaped, so two backslashes (\\)
  2. Your Windows Domain Username followed by a colon (:)
  3. Your Windows Domain Password (remember to escape any special characters) followed by an at symbol (@)
  4. Your Proxy Server address or hostname followed by a colon (:)
  5. Your Proxy Server TCP port number
  6. This turns off SSL Certificate verification as this proxy substitutes signed certificates with its own self-signed version. Presumably in order to allow admins to inspect what you’re transmitting!

String it all together and you get something like
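Assembled, the [http] section looks something like this, where CORP, jdoe, Pa55word and proxy.example.com:8080 are placeholders for your own six values above:

```ini
[http]
    # DOMAIN\\user:password@proxyhost:port -- all placeholder values
    proxy = http://CORP\\jdoe:Pa55word@proxy.example.com:8080
    # Item 6: skip certificate verification for the intercepting proxy
    sslVerify = false
```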


Repeat these settings for the https directive below.

Setup a GitHub Repository

Go back to GitHub and create a new Repository. I’ve called mine Git_Setup.


Once it has been created, be sure to select the HTTP tab. If you fail to do this, the instructions will assume you can perform outbound SSH from your host. Typical corporate Proxy and Firewall settings will deny this. So be sure you understand: you will be using Git over HTTPS through your Proxy Server.

You should then be able to perform the instructions identified in the red box from your DOS Command Window.


Seems to work for me…