Sunday, January 30, 2011
What is BigData?
In practice, when people talk about it, I believe they are referring to two specific things: (1) the new set of data management and processing technologies built on distributed or parallel computing; and (2) an exploratory Business Intelligence activity that uses machine learning and statistics to extract knowledge from the data.
Sunday, October 11, 2009
Distributed Computing Models
Client-Server, 3-tier, N-tier - The processing is distributed through the use of layers. There are layers for UI, for business logic, for data storage, etc. These are generally data-driven applications.
Clustered - A set of machines acts as one. There are usually shared data stores, and the presence of multiple machines is effectively transparent to clients. This is used for things like load balancing or fault tolerance.
Peer-to-Peer - These systems are decentralized and used for applications like file sharing or instant messaging. In practice these systems need some centralization at least for user management.
Grid - These are systems where the processing is split up so that many machines can work in parallel. These are becoming the most popular because they are necessary for big data systems.
So when you think of distributed systems, there really seem to be 4 concepts: layers, unified, decentralized, and parallel.
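To make the "parallel" idea concrete, here is a toy sketch in Java: a word count split across a fixed thread pool, with each worker standing in for a grid node. The sample data, the three-way split, and the class name are all made up for illustration; a real grid would spread the partitions across machines, not threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelWordCount {
    public static void main(String[] args) throws Exception {
        // Stand-ins for data partitions that a grid would place on separate machines.
        List<String> partitions = List.of("a b c", "a a", "b c c c");

        ExecutorService pool = Executors.newFixedThreadPool(partitions.size()); // one worker per partition
        List<Future<Integer>> pieces = new ArrayList<>();
        for (String part : partitions) {
            pieces.add(pool.submit(() -> part.split("\\s+").length)); // each worker counts its own slice
        }

        int total = 0;
        for (Future<Integer> piece : pieces) {
            total += piece.get(); // gather the partial counts
        }
        pool.shutdown();
        System.out.println("total words: " + total); // prints 9
    }
}
```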
Let me know if you think I'm missing something.
Saturday, October 3, 2009
Hadoop World 2009
The first time that I really took notice of Hadoop was early last year. It's amazing to see how much ground it's covered since then. At the conference there was a whole track devoted to applications. There was the usual crowd of niche companies using it, but there were also presentations by VISA, JP Morgan Chase, eBay, and other big names. A lot of people are using it in conjunction with Lucene.
What's becoming clear to me is that Hadoop is becoming THE platform for data analysis and processing. There are other systems out there that handle large data sets, most of them based in some way on a relational database and incorporating MapReduce and a distributed architecture, but none of them seem to have the flexibility of Hadoop. For example, there is a range of useful applications that can be built on just HDFS (the Hadoop Distributed File System).
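To illustrate that last point, here is a minimal sketch of a program that talks only to HDFS, with no MapReduce job involved. It assumes the Hadoop client libraries are on the classpath and that the default configuration points at a running cluster; the file path and contents are made up.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOnly {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hdfs-demo.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("stored straight into HDFS, no MapReduce involved");
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}
```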
Tuesday, July 7, 2009
Speaking engagement at ITARC New York
Thursday, April 30, 2009
Non-Relational Databases in the Enterprise
It's hard to tell at this point which of these systems will still be actively developed and used a few years from now. I would assume that the Apache projects have as good a chance as any.
I'm interested generally in how these systems can be used inside the enterprise, or for non-web applications. These systems are built for semi-structured (key-value) data, and there is plenty of this kind of data in enterprise systems. Often this data is somehow supplemental, or varies from record to record. A good example is the properties of a file (author, subject, date created, etc.). This kind of data can be found in lots of existing relational databases, in tables that have a foreign key and, not surprisingly, columns usually called “key” and “value.” I've seen these kinds of tables in lots of systems. The important thing to realize is that the data does not need to be used in a query; it never has to appear in a SQL WHERE clause. So there is really no need to keep it in the relational database, except that you want the data persisted in a safe, durable way.
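To put a shape on it, here is a minimal in-memory sketch of that "property bag keyed by a foreign key" pattern. The class name, the doc-42 key, and the property names are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Each record is addressed by the foreign key of the row it describes and
// holds an arbitrary bag of name/value pairs. Nothing here is ever queried
// with a WHERE clause, so nothing here needs a relational schema.
public class PropertyBagStore {
    private final Map<String, Map<String, String>> store = new HashMap<>();

    public void put(String foreignKey, String name, String value) {
        store.computeIfAbsent(foreignKey, k -> new HashMap<>()).put(name, value);
    }

    public String get(String foreignKey, String name) {
        Map<String, String> bag = store.get(foreignKey);
        return bag == null ? null : bag.get(name);
    }

    public static void main(String[] args) {
        PropertyBagStore props = new PropertyBagStore();
        props.put("doc-42", "author", "J. Smith");
        props.put("doc-42", "subject", "Q3 budget");
        System.out.println(props.get("doc-42", "author"));
    }
}
```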
Another option for this data, of course, has been to use XML files. In this kind of solution you would probably rely on a convention of directory and file names to organize the information, with each file most likely named after the foreign key. Then you would have to write the code to manage those files, which at the least means a component to read and write the XML files.
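A minimal sketch of what that component might look like, using java.util.Properties (which ships with an XML serialization via storeToXML and loadFromXML). The one-file-per-foreign-key layout is the convention described above; the class name is made up.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Properties;

public class XmlPropertyFiles {
    private final File dir;

    public XmlPropertyFiles(File dir) { this.dir = dir; }

    // One XML file per foreign key, e.g. dir/doc-42.xml
    public void save(String foreignKey, Properties props) throws IOException {
        try (OutputStream out = new FileOutputStream(new File(dir, foreignKey + ".xml"))) {
            props.storeToXML(out, "properties for " + foreignKey);
        }
    }

    public Properties load(String foreignKey) throws IOException {
        Properties props = new Properties();
        try (InputStream in = new FileInputStream(new File(dir, foreignKey + ".xml"))) {
            props.loadFromXML(in);
        }
        return props;
    }
}
```

Either way, you end up writing and maintaining this plumbing yourself.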
But the cost of keeping this data either in a relational database or in XML files ends up being high, because you have to consider availability and integrity. For both of these solutions that usually means a cluster set-up at the “front” with a RAID array for storage, plus a somewhat complicated back-up process.
Cost seems to make the non-relational database systems particularly attractive for the enterprise. The non-relational databases have been specifically developed with the idea that you can use cheap hardware to scale them out. They are distributed systems and rely on different replication schemes to keep copies of the data on a certain minimum number of machines at all times to ensure that the data is always available. People generally seem to feel comfortable with the same data existing on at least 3 machines. These machines theoretically do not need to be much more powerful than a regular desktop machine. Start adding a few more machines and your capacity and savings should really start to add up.
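Here is a deliberately naive sketch of that replication idea, with all the "machines" living in one process as plain maps. Every write goes to a fixed number of replicas (three, matching the rule of thumb above); failure detection, re-replication, and consistency protocols are all glossed over, and the class and key names are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ReplicatedStore {
    // A "node" is just a map here; in a real system it would be a separate cheap machine.
    static class Node {
        final Map<String, String> data = new ConcurrentHashMap<>();
    }

    private final List<Node> replicas = new ArrayList<>();

    public ReplicatedStore(int replicationFactor) {
        for (int i = 0; i < replicationFactor; i++) replicas.add(new Node());
    }

    // Every write lands on every replica, so the value survives the loss of
    // replicationFactor - 1 copies.
    public void put(String key, String value) {
        for (Node n : replicas) n.data.put(key, value);
    }

    // Any replica can answer a read.
    public String get(String key) {
        return replicas.get(0).data.get(key);
    }

    public static void main(String[] args) {
        ReplicatedStore store = new ReplicatedStore(3); // the "at least 3 machines" rule of thumb
        store.put("doc-42", "some semi-structured payload");
        System.out.println(store.get("doc-42"));
    }
}
```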
Of course there would be training and switch-over costs, but your programmers will be happy to work on the new technology. For a large company that has many internal, proprietary systems, there is probably a lot of money to save by creating one of these clusters and consolidating all that semi-structured data into it. Save the expensive storage for the highly structured, transactional data.