A Conversation on NoSQL
Alongside the major movements around Service-Oriented Architecture (SOA), related areas of Enterprise Architecture, and Cloud Computing, there are some major changes in the way that people deal with and interact with data. In particular, the data explosion that we’ve talked about in the context of our ZapThink 2020 vision results not only from the massive and continuing use of data in the enterprise, but also from the explosion of data creation and consumption in the general public sphere.
With user-generated content sites churning out all forms of rich media content, social networks and blogs generating entire Library of Congress-sized works on a daily basis, and constant streams of data emanating from devices of all shapes and sizes, the exponential growth of data’s footprint on the enterprise will only continue unabated. Can today’s storage architectures, technologies, and infrastructures deal with this unending growth?
In the face of this imminent Crisis Point, a number of smart thinkers, innovative companies, and technologists are in the midst of recreating the way that data are used, consumed, stored, and managed. In this ZapFlash, we’ll dig into one such approach: the so-called “No SQL” data movement.
The Wikipedia entry on NoSQL defines it as follows:
“NoSQL is a broad class of database management systems that differ from classic relational database management systems (RDBMSes) in some significant ways. These data stores may not require fixed table schemas, and usually avoid join operations and typically scale horizontally. Academics and papers typically refer to these databases as structured storage, a term that would include classic relational databases as a subset.”
I’ve had an opportunity to sit down with a local expert on the topic, William Shulman, who has provided some high-level insight into the No SQL approach.
ZapThink: What’s the big hubbub about No SQL? What’s wrong with “Yes SQL”?
Shulman: The industry has faced two main challenges with traditional RDBMS technologies:
- The horizontal scaling problem: Traditional RDBMSes are not easy to scale horizontally (i.e., by clustering many machines to increase data throughput). They were simply not architected from the ground up with horizontal scalability in mind. Making the problem harder is the SQL query model itself. Because SQL has the concept of joins, any RDBMS clustering technology will ultimately run up against the “distributed join problem” (i.e., executing joins across nodes). This is inherently complex, and solving it in a generic way that also performs well is very difficult.
- The “Object / Relational impedance mismatch problem”: Most languages used to work with these data are object-oriented. Therefore the shape of your data in your programming model is very different from its shape when stored in a relational database. A very large percentage of our time as programmers is traditionally spent changing the shape of our data from object form to relational form, and back, when writing an application. Some NoSQL solutions aim to solve this problem as well as the scalability problem by offering storage models that more closely mirror how data is represented in our programs. Document stores and JSON object stores in particular aim to do this.
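As a rough illustration of the mismatch Shulman describes, the Python sketch below saves the same object two ways: flattened into relational rows and reassembled on read, and stored whole as a single JSON document. The Order and LineItem classes, the table layout, and the plain dict standing in for a document store are all assumptions made for this example; they do not come from the interview or from any particular NoSQL product.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class LineItem:
    sku: str
    quantity: int


@dataclass
class Order:
    order_id: int
    customer: str
    items: List[LineItem] = field(default_factory=list)


def create_schema(conn: sqlite3.Connection) -> None:
    """Hypothetical two-table layout: one row per order, one row per line item."""
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
    conn.execute("CREATE TABLE line_items (order_id INTEGER, sku TEXT, quantity INTEGER)")


def save_relational(conn: sqlite3.Connection, order: Order) -> None:
    """Object form -> relational form: split the object across two tables."""
    conn.execute("INSERT INTO orders VALUES (?, ?)", (order.order_id, order.customer))
    conn.executemany(
        "INSERT INTO line_items VALUES (?, ?, ?)",
        [(order.order_id, item.sku, item.quantity) for item in order.items],
    )


def load_relational(conn: sqlite3.Connection, order_id: int) -> Order:
    """Relational form -> object form: reassemble the object from both tables."""
    header = conn.execute(
        "SELECT customer FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    if header is None:
        raise KeyError(order_id)
    rows = conn.execute(
        "SELECT sku, quantity FROM line_items WHERE order_id = ? ORDER BY rowid",
        (order_id,),
    ).fetchall()
    return Order(order_id, header[0], [LineItem(sku, qty) for sku, qty in rows])


def save_document(store: Dict[int, str], order: Order) -> None:
    """Document form: the nested object is serialized as-is under one key."""
    store[order.order_id] = json.dumps(asdict(order))


def load_document(store: Dict[int, str], order_id: int) -> Order:
    doc = json.loads(store[order_id])
    return Order(doc["order_id"], doc["customer"], [LineItem(**i) for i in doc["items"]])


if __name__ == "__main__":
    order = Order(1, "Ada", [LineItem("A-100", 2), LineItem("B-200", 1)])

    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    save_relational(conn, order)
    assert load_relational(conn, 1) == order

    store: Dict[int, str] = {}
    save_document(store, order)
    assert load_document(store, 1) == order
```

Both paths round-trip the same object, but the relational path needs a schema plus hand-written code to take the object apart and put it back together, while the document path keeps the nested shape the program already uses.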
ZapThink: What does this have to do with Enterprise Architecture and SOA and Cloud?
Shulman: Cloud virtualization technology is really a great enabler for a lot of these NoSQL solutions. Most of these systems are distributed database systems, and so having technology that makes it easier to deploy and manage the large number of hosts that back the cluster is huge. Setting up and maintaining these systems without cloud virtualization technologies would be a lot more challenging.
ZapThink: Is this more than just a technology / infrastructure movement?
Shulman: I think it is. The database sits at the bottom of the stack. I think that we will find that changing that layer will eventually trigger changes in the stack elements that sit on top of the database, bubbling upwards through the entire app stack.
ZapThink: What’s going on in the vendor landscape around No SQL?
Shulman: The database world has been pretty stagnant and boring for at least a decade, and so the fact that so many new database technologies have emerged over the last several years is pretty amazing. Almost everyone I know who has been programming since the dawn of the “modern era” (as defined by the modern web application stack) has used a relational database to store application data. For a long time the RDBMS was the only game in town. That is changing. A database renaissance is taking place and it is pretty exciting. It is also confusing. With so many different “NoSQL” technologies out there, it can be difficult to get your head around how they differ and which is best suited for any particular task.
Interestingly, almost all the NoSQL technologies out there are open source. I think that is more a reflection of the overall infrastructure software landscape than of NoSQL in particular. InfoQ lists a great summary of the various technology approaches. Each approach represents a very different take on NoSQL and on the problems it best solves.
ZapThink: Where do you see this all heading?
Shulman: I think the days of one-data-store-fits-all-problems are over. [Enterprises] are now realizing that a single data storage method and approach is not always the best fit and that they can (and should) employ different data storage technologies for different problems.
The ZapThink Take
In our recent “BASE Jumping in the Cloud” ZapFlash, we touched on some of these issues that are relevant to the No SQL approach:
“…in the Cloud we need a different way of thinking about consistency and reliability. Instead of ACID, we need BASE (catchy, eh?). BASE stands for Basic Availability (supports partial failures without leading to a total system failure), Soft-state (any change in state must be maintained through periodic refreshment), and Eventual consistency (the data will be consistent after a set amount of time passes since an update). BASE has been around for several years and actually predates the notion of Cloud computing; in fact, it underlies the telco world’s notion of “best effort” reliability that applies to the mobile phone infrastructure. But today, understanding the principles of BASE is essential to understanding how to architect applications for the Cloud.”
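To make “eventual consistency” a little more concrete, here is a minimal Python sketch of two replicas that accept writes independently and converge only after they exchange data. The Replica class, the sync function, and the caller-supplied version numbers with a last-write-wins rule are simplifying assumptions for illustration; real data stores typically rely on timestamps, vector clocks, or similar mechanisms.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Replica:
    """One node holding its own in-memory copy of the data (hypothetical)."""
    name: str
    # key -> (version, value); the entry with the higher version wins
    data: Dict[str, Tuple[int, str]] = field(default_factory=dict)

    def write(self, key: str, value: str, version: int) -> None:
        """Accept a write only if it is newer than what this replica holds."""
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key: str) -> Optional[str]:
        """Return this replica's current view, which may be stale."""
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a: Replica, b: Replica) -> None:
    """Exchange entries in both directions, keeping the newest version of each key."""
    for key, (version, value) in list(a.data.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.data.items()):
        a.write(key, value, version)


if __name__ == "__main__":
    east, west = Replica("east"), Replica("west")

    east.write("cart", "1 item", version=1)
    sync(east, west)

    # A newer write lands on one replica only.
    east.write("cart", "2 items", version=2)
    assert west.read("cart") == "1 item"   # stale read: the replicas disagree

    # After the replicas exchange data, they agree again.
    sync(east, west)
    assert west.read("cart") == "2 items"
```

The window between the second write and the second sync is the period during which readers can see different answers depending on which replica they reach. That trade is what BASE accepts in exchange for staying available.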
But even more than just dealing with the technical and infrastructural issue of data that the Data Explosion Crisis Point represents, we are faced with a larger governance-related challenge. In our “Christmas Day Bomber, Moore’s Law, and Enterprise IT” ZapFlash, we noted:
“…while the quantity and complexity of information in any enterprise grows exponentially, the human ability to deal with that information at best grows linearly. No matter where you put the two curves, eventually the one overtakes the other at the governance crisis point, leading to the “governance crisis point problem”: eventually, human activities are unable to deal with the quantity and complexity of information.”
Good Enterprise Architects need to have a lot of tools in their tool belt to deal with the various needs of the enterprise. In addition, they need to have the strategic and tactical imperatives of the business in mind on the one hand, while fully understanding the technological capabilities and limitations of the IT organization on the other. Clearly, dogma serves no purpose here, regardless of your particular opinions on any one architectural or infrastructural approach. As a result, enterprise architects, who will undoubtedly need to grapple with the ongoing data explosion crisis, will need to understand the nature and scope of how data creation, consumption, and management are happening in the enterprise, as well as the increasing set of approaches and technologies coming to market to solve those challenges.