Thursday, April 30, 2009

Non-Relational Databases in the Enterprise

There has been a lot of talk recently about “key-value” and “document-oriented” databases. These non-relational databases have become necessary for web applications. They allow for fast writes and can scale out for systems that don't need the rigid structure of a relational db and its querying abilities. There's plenty of information on-line about them. This is a good list and write up here: Anti-RDBMS: A list of distributed key-value stores

It's hard to tell at this point which ones will still be actively developed and used a few years from now. I would assume that the Apache projects have as good a chance as any of them.

I'm interested generally in how these systems can be used inside the enterprise or for non-web applications. Now these systems are built for semi-structured data (key-value) and there is plenty of this kind of data in enterprise systems. Often this data seems somehow extra or may have a variable nature. A good example of this is the properties of a file (author, subject, date created, etc.). This kind of data can be found in lots of existing relational databases in tables that have a foreign key and, not surprisingly, columns usually called “key” and “value.” I've seen these kinds of tables in lots of systems. The important thing to realize is that the data does not need to be used in a query – it does not need to appear in a SQL where-clause. So really there is no need to keep it in the relational database, except for the fact that you want to persist the data in a secure way.

Another option for this data, of course, has been to use XML files. In this kind of solution you would probably have to rely on organizing the information using certain directory and file names. The file would most likely be named with the foreign key. Then you would have to write the code to manage those files, which at the least means a component to read / write the XML files.

But the cost of keeping this data either in a relational database or in XML files ends up being high because you have to consider availability and integrity. For both of these solutions that usually ends up meaning a cluster set-up at the “front” with a RAID array for storage and a somewhat complicated back-up processes.

Cost seems to make the non-relational database systems particularly attractive for the enterprise. The non-relational databases have been specifically developed with the idea that you can use cheap hardware to scale them out. They are distributed systems and rely on different replication schemes to keep copies of the data on a certain minimum number of machines at all times to ensure that the data is always available. People generally seem to feel comfortable with the same data existing on at least 3 machines. These machines theoretically do not need to be much more powerful than a regular desktop machine. Start adding a few more machines and your capacity and savings should really start to add up.

Of course there would be training and switch-over costs, but your programmers will be happy to work on the new technology. For a large company that has many internal, proprietary systems, there is probably a lot of money to save by creating one of these clusters and consolidating all that semi-structured data into it. Save the expensive storage for the highly structured, transactional data.


Ordering for Trees in SQL

Here's an article I posted on Code Project a while back. I still think it's a nifty technique for a very specific situation where you are storing a tree in SQL using an adjacency list model and need depth-wise ordering for it.

Depth-wise Ordering for Trees in SQL