Is NoSQL worth the attention of Enterprise IT?

Formulating a definition of NoSQL that ages gracefully is challenging. Current usage encompasses the NoSQL technology set, the wider data lake movement and the NoSQL community, with its inevitable inner circle of technologists and thought leaders.

As with all potentially disruptive technologies, there’s also an emerging anti-NoSQL movement masquerading as the NewSQL movement. It comes with its own tech celebrities, threatened interests, and established and emerging product lines.

A short, simple definition that a CIO can take to a budget review board inevitably involves terms like Big Data and Data Science. If we’re going to use those terms, we’re best served by describing NoSQL as a solution set for a specific class of business problem.

Mainstream IT observers perceive NoSQL as a nascent technology ecosystem. Even technical, well-informed analysts struggle to stay abreast of the myriad technical differentiators, many of which are semantic reinterpretations of the same thing. However, all agree this is a fragmented and rapidly evolving market, with a few high-profile early adopters and many vociferous evangelists.

Distinguishing between genuine pioneers and fast followers depends largely on professional relationships and insider knowledge, so the wider jury is still out on whether this technology has the potential to disrupt and ultimately transform mainstream enterprise IT departments. Most independent commentators suggest Business Intelligence and Data Science will be the eventual beneficiaries of the NoSQL movement.

There is no doubting the deep thought leadership and advanced technology embodied in NoSQL. Mass-market acceptance, however, awaits a sustained and verifiable track record of solved business problems, or undisputed attribution of incremental sales or accretive revenue.

NoSQL is generally characterized as an emerging class of data stores intended for problems ill-suited to Relational Database Management Systems (RDBMS). NoSQL data stores were conceived to address a problem space that did not emerge until the advent of the massively connected internet age.

What is this problem space?

This set of problems is characterized by large volumes of very diverse data. The data is typically nested, hierarchical and highly interconnected, meaning the nature of the connections holds as much importance, if not more, than the objects themselves. Metadata becomes indistinguishable from transactional data. Data of this kind lends itself to inferential statistical analysis and machine learning.

Holding this sort of data in a columnar or tabular form (tables of records and fields) is cumbersome and inefficient. RDBMSs are best suited to transactional rows of records and, in particular, OLTP use cases. Collecting, aggregating and synthesizing information from relationships stored in this form is complex. In the RDBMS realm it gives rise to very complex joins and brittle SQL queries that take hours if not days to run. Tuning indexes and optimizing query plans only adds to code complexity and the subsequent maintenance overhead. This sort of reporting query quickly takes on a batch nature and bleeds into OLAP use cases.
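
As a rough illustration (the entities and field names here are hypothetical, not drawn from any particular product), the same logical record is shown below in relational and document form. Rebuilding the nested structure from the relational rows is, in effect, a hand-written join that grows with every level of nesting:

```typescript
// Relational form: one logical "customer" is split across three tables,
// linked by foreign keys that every reporting query must join back together.
interface CustomerRow { customerId: string; name: string }
interface OrderRow    { orderId: string; customerId: string; placedOn: string }
interface ItemRow     { itemId: string; orderId: string; sku: string; qty: number }

// Document form: the same information kept as a single nested record,
// fetched in one read with no joins.
interface CustomerDoc {
  customerId: string;
  name: string;
  orders: Array<{ placedOn: string; items: Array<{ sku: string; qty: number }> }>;
}

// Reassembling the document from the rows mirrors what a multi-table SQL join
// has to do; each additional level of nesting adds another join.
function assemble(customers: CustomerRow[], orders: OrderRow[], items: ItemRow[]): CustomerDoc[] {
  return customers.map(c => ({
    customerId: c.customerId,
    name: c.name,
    orders: orders
      .filter(o => o.customerId === c.customerId)
      .map(o => ({
        placedOn: o.placedOn,
        items: items
          .filter(i => i.orderId === o.orderId)
          .map(i => ({ sku: i.sku, qty: i.qty })),
      })),
  }));
}
```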

Data types in these heavily nested, hierarchical datasets might include factors, vectors and even associative arrays. These unorthodox data types may be part of long chains of evolving relationships, and patterns of determinants that are not known when the data store is established, essentially making conventional schema design an almost impossible challenge for application developers.
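
A brief sketch of why a fixed schema struggles here (the records and field names are invented for illustration): two documents in the same logical collection carry different attributes and different relationship types, which a schemaless store accepts without a migration:

```typescript
// Two records in the same collection with different shapes. In an RDBMS this
// would force either sparse nullable columns or a new table per variation;
// a schemaless store simply accepts both documents as they are.
type FlexibleRecord = { id: string; [attribute: string]: unknown };

const patientA: FlexibleRecord = {
  id: "p-001",
  allergies: ["penicillin"],               // a vector of values
  riskFactors: { smoker: true, bmi: 31 },  // an associative array
};

const patientB: FlexibleRecord = {
  id: "p-002",
  genomicMarkers: ["BRCA1"],               // a field patientA never declared
  referredBy: "p-001",                     // a relationship added after go-live
};
```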

So what sort of data could possibly look like this?

A healthcare record would be considered hierarchical, with the patient at the root. Credit card transaction data with multiple card holders (say, husband and wife) would have the account number as the root node and the transactions of each spouse as a nested subtree. Law enforcement agencies build networks of criminal associations and activities to identify or infer connections. Telecom carriers associate customers with preferred contacts and call patterns to optimize pricing models. Amazon tracks the purchasing habits of “Customers Like You” to up-sell your shopping experience. And ultimately, Facebook can tell its advertisers what color underwear a particular demographic likes to wear on a Monday in January.
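
The credit-card case, for example, might be stored as a single document with the account number at the root and each cardholder’s transactions as a nested subtree (a sketch only; the field names are illustrative):

```typescript
// One account document: the account number is the root, and each cardholder's
// transactions form a nested subtree beneath it.
const account = {
  accountNumber: "4111-XXXX-XXXX-1234",
  holders: [
    {
      name: "A. Smith",
      transactions: [
        { date: "2014-01-06", merchant: "Grocer", amount: 82.17 },
        { date: "2014-01-07", merchant: "Fuel",   amount: 45.0 },
      ],
    },
    {
      name: "B. Smith",
      transactions: [
        { date: "2014-01-06", merchant: "Cafe",   amount: 12.5 },
      ],
    },
  ],
};

// The hierarchy is traversed directly, with no joins.
console.log(account.holders[0].transactions.length);
```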

It’s important to note that insights like these come not from the NoSQL data store technology itself but from the inferential statistics and analysis it enables. What these stores do is significantly reduce the Extract, Transform, Load (ETL) burden on data miners and statistical analysts.

What are the advantages?

The most obvious advantage of NoSQL data stores is the schemaless tree architecture, unbounded both horizontally and vertically, which makes it versatile and promiscuous. A conceptual comparison to LDAP directory forests is reasonable and, some would suggest, suspiciously so.

Hierarchical, flat-file data stores have been in common use since the 1950s (see MUMPS, IBM VSAM). The contemporary differentiator is the surrounding ecosystem of modern query and programming languages, in particular the suite of HTML5 standards and, specifically, JavaScript (ECMAScript) and JSON.

A mature sub-industry exists to bridge the gap between serious developers (and their serious C-like programming languages) and the query languages favored by Administrators and Reporting Analysts trained in SQL-like scripting and querying.

Object-Relational Mappers (ORMs) and Persistent Object Caches exist to make relational data stores appear and behave in an object-oriented fashion. This abstraction layer of frameworks and libraries is unnecessary when NoSQL queries return native JSON or XML objects, the former favored by younger, web-oriented developers and the latter by the older ‘curly bracket’ ({}) crowd (C, C++, C# and Java).
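
As a rough sketch of the difference (assuming the official MongoDB Node.js driver and a locally running instance; the database, collection and field names are hypothetical), a document store hands the application the nested object directly, with no mapping layer in between:

```typescript
import { MongoClient } from "mongodb";

// With an ORM, relational rows must be mapped into entity classes before use.
// With a document store, the query result is already the nested object the
// application wants; no mapping layer is required.
async function loadPatient(patientId: string) {
  const client = new MongoClient("mongodb://localhost:27017");
  try {
    await client.connect();
    const db = client.db("clinic");
    // findOne returns the stored document as a plain, JSON-like object.
    const patient = await db.collection("patients").findOne({ patientId });
    return patient; // nested encounters, prescriptions, etc. come back intact
  } finally {
    await client.close();
  }
}
```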

Herein lies an important division between two communities: the software engineers on one side and the ops-oriented DBAs and Reporting Analysts on the other. The former seek the familiarity of an object-oriented programming paradigm, the latter the relative simplicity of function-oriented scripting and query languages. Big Data Scientists sit firmly in the middle, happy to be free of conventional constraints like schema and willing to pay the price by working in an object-oriented paradigm (such as R for Statistical Computing).

How to choose?

Selecting a NoSQL solution is daunting. There is a large number of well-supported Open Source projects, many sponsored by credible and generous commercial patrons. There are strictly proprietary, commercially licensed product offerings from innovative startups through to the major household names. And there are hybrids: freemium offerings from professional services firms consulting on their preferred Open Source project.

A systematic evaluation of the technology set, concepts, architectures, approaches and patterns is a non-trivial but viable proposition. It comes down to a systems engineering study evaluating a couple of dozen key “Products” and their commercial risk (maturity, support, roadmap, etc.).

Broadly speaking, NoSQL data stores fall into two classes: Graph Stores (GS) and Key-Value Stores (KV). Document Stores can be considered a sub-type of the Key-Value Store.

A third class, Wide Column Stores, aligns more closely with the NewSQL movement, which I’ll treat as out of scope for now.
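
To make the distinction concrete, here is a minimal in-memory sketch (plain TypeScript maps, not any particular product): a key-value store answers “fetch the value filed under this key”, while a graph store answers “follow the relationships outward from this node”.

```typescript
// Key-value style: one opaque value per key, retrieved by exact key lookup.
const kvStore = new Map<string, object>();
kvStore.set("user:42", { name: "Ada", plan: "pro" });
const user = kvStore.get("user:42"); // no notion of relationships
console.log(user);

// Graph style: nodes plus explicit edges; queries traverse the edges.
const edges = new Map<string, string[]>([
  ["ada",   ["bob", "carol"]],
  ["bob",   ["dave"]],
  ["carol", []],
  ["dave",  []],
]);

// "Friends of friends" is a two-hop traversal rather than a join.
function friendsOfFriends(person: string): string[] {
  const direct = edges.get(person) ?? [];
  return direct
    .flatMap(friend => edges.get(friend) ?? [])
    .filter(candidate => candidate !== person && !direct.includes(candidate));
}

console.log(friendsOfFriends("ada")); // ["dave"]
```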

Choosing between GS and KV depends entirely on the nature of the business problem and the shape of the data involved. However, selecting a good GS or KV involves considering some common factors, namely maturity and penetration.

Maturity would seem to be subjective for such a new class of technology. Nevertheless, NoSQL projects can be evaluated on a number of factors that can be weighted and normalized to reflect the nature of the business problem or the enterprise culture (a simple weighted scorecard along these lines is sketched after the list below):

  1. size of the developer community
  2. number and frequency of code commits
  3. number and stability of prior releases
  4. number of open issues and closure rate
  5. credibility or availability of a product road map
  6. structure of project governance and decision making
  7. query language accessibility

Most of these items can be gleaned from a project’s GitHub repository and Stack Overflow activity.
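
One way to operationalize this is a simple weighted scorecard, sketched below. The weights and raw scores are invented for illustration; a real evaluation would substitute figures gathered from the repositories and issue trackers:

```typescript
// A minimal weighted-and-normalized scorecard for comparing project maturity.
// Each factor gets a weight reflecting the enterprise's priorities and a raw
// score (here 0-10) gathered from the project's repository and issue tracker.
interface Factor { name: string; weight: number; score: number }

function maturityScore(factors: Factor[]): number {
  const totalWeight = factors.reduce((sum, f) => sum + f.weight, 0);
  // Normalize so the result falls between 0 and 1 regardless of the weights used.
  return factors.reduce((sum, f) => sum + f.weight * (f.score / 10), 0) / totalWeight;
}

// Hypothetical scores for a single candidate project.
const candidate: Factor[] = [
  { name: "developer community size",      weight: 3, score: 8 },
  { name: "commit frequency",              weight: 2, score: 7 },
  { name: "release stability",             weight: 3, score: 6 },
  { name: "issue closure rate",            weight: 2, score: 5 },
  { name: "roadmap credibility",           weight: 2, score: 4 },
  { name: "governance structure",          weight: 1, score: 6 },
  { name: "query language accessibility",  weight: 2, score: 9 },
];

console.log(maturityScore(candidate).toFixed(2)); // 0.65 for these sample numbers
```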

Enterprise penetration is largely driven by the type of release cycle: a fixed-schedule release cycle versus a fixed-feature release cycle can significantly influence the perceived level of adoption risk.
