
Sunday, January 30, 2011

What is BigData?

At some point quantity becomes quality. Add a few more of something and suddenly it becomes something different. For data we have crossed that threshold again over the last decade. Whether that point is defined in terabytes or petabytes is in itself relatively unimportant. What is important is that people have had to manage this qualitatively different amount of data, and in order to do so they have created new technologies, new techniques, and new ways of thinking about and analyzing the data. This, in turn, has created new opportunities. This new paradigm is known as BigData.

In practice, when people speak about BigData, I believe they are referring to two specific things: (1) a new set of data management and processing technologies built on distributed or parallel computing; and (2) an exploratory Business Intelligence activity that uses machine learning and statistics to extract knowledge from the data.

Sunday, October 11, 2009

Distributed Computing Models

There are a number of general models for distributed computing. The terms are often used interchangeably, and many systems seem to fall in between these models or to combine them. Nevertheless I think it is useful to make distinctions and define the models this way:

Client-Server, 3-tier, N-tier - The processing is distributed through the use of layers. There are layers for UI, for business logic, for data storage, etc. These are generally data-driven applications.

Clustered - A set of machines act as one. There are usually shared data stores and the multiple machines are effectively transparent. This is used for things like load-balancing or fault tolerance.

Peer-to-Peer - These systems are decentralized and used for applications like file sharing or instant messaging. In practice these systems need some centralization at least for user management.

Grid - These are systems where the processing is split up so that many machines can work in parallel (see the sketch after this list). They are becoming the most popular because they are necessary for big data systems.

So when you think of distributed systems, there really seem to be 4 concepts: layers, unified, decentralized, and parallel.
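To make the grid model concrete, here is a minimal single-process sketch in Java: the thread pool stands in for a pool of worker machines, and the chunks stand in for the data splits a real grid would distribute across them. The names and the per-chunk work are my own illustration, not any particular framework's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class GridSketch {
    public static void main(String[] args) throws Exception {
        // The "grid": a pool of workers; on a real grid each would be a machine.
        ExecutorService workers = Executors.newFixedThreadPool(4);

        // The data, split into independent chunks that can be processed in parallel.
        List<String> chunks = List.of("part-1", "part-2", "part-3", "part-4");

        // Fan out: each worker processes one chunk.
        List<Future<Integer>> partials = new ArrayList<>();
        for (String chunk : chunks) {
            partials.add(workers.submit(() -> process(chunk)));
        }

        // Gather: combine the partial results into one answer.
        int total = 0;
        for (Future<Integer> partial : partials) {
            total += partial.get();
        }
        System.out.println("Combined result: " + total);
        workers.shutdown();
    }

    // Stand-in for real per-chunk work (parsing, counting, aggregating, etc.).
    static int process(String chunk) {
        return chunk.length();
    }
}
```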

Let me know if you think I'm missing something.

Saturday, October 3, 2009

Hadoop World 2009

I went to Hadoop World: NYC 2009 on Friday, October 2. It was organized by Cloudera, the company that provides professional support and training for Hadoop. (Amr Awadallah, their CTO, sent me a discount code - Thanks Amr!)

The first time I really took notice of Hadoop was early last year. It's amazing to see how much ground it's covered since then. At the conference there was a whole track devoted to applications. There was the usual bunch of niche companies using it, but also presentations by VISA, JP Morgan Chase, eBay, and other big names. A lot of people are using it in conjunction with Lucene.

What's becoming clear to me is that Hadoop is becoming THE platform for data analysis and processing. There are other systems out there that handle large data sets; most of them are based in some way on a relational database and incorporate MapReduce and a distributed architecture, but none of them seem to have the flexibility of Hadoop. There is a range of useful applications, for example, that can be built using just HDFS (the Hadoop Distributed File System).
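As a sketch of what building on just HDFS looks like, Hadoop's FileSystem API lets a client create and read files on the cluster directly, with no MapReduce job involved. The namenode address and path below are placeholders, not a real configuration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; normally this comes from core-site.xml.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address

        FileSystem fs = FileSystem.get(conf);

        // Write a file into the distributed file system.
        Path file = new Path("/data/example.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("stored redundantly across the cluster");
        }

        // Read it back from whichever datanodes hold the blocks.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}
```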

Tuesday, July 7, 2009

Speaking engagement at ITARC New York

IASA's IT Architect Regional Conference (ITARC NYC) is being held in New York on October 12-14. It is a great few days with people who are passionate about software architecture. This year I will be speaking about Distributed Computing in the Enterprise. You can read the description here.

Thursday, April 30, 2009

Non-Relational Databases in the Enterprise

There has been a lot of talk recently about “key-value” and “document-oriented” databases. These non-relational databases have become necessary for web applications: they allow for fast writes and can scale out for systems that don't need the rigid structure of a relational db and its querying abilities. There's plenty of information online about them; a good list and write-up is here: Anti-RDBMS: A list of distributed key-value stores

It's hard to tell at this point which ones will still be actively developed and used a few years from now. I would assume that the Apache projects have as good a chance as any of them.

I'm interested more generally in how these systems can be used inside the enterprise or for non-web applications. These systems are built for semi-structured (key-value) data, and there is plenty of this kind of data in enterprise systems. Often this data is somehow extra, or variable in nature. A good example is the properties of a file (author, subject, date created, etc.). This kind of data can be found in lots of existing relational databases, in tables that have a foreign key and, not surprisingly, columns usually called “key” and “value.” I've seen these kinds of tables in lots of systems. The important thing to realize is that the data does not need to be used in a query: it does not need to appear in a SQL where-clause. So there is really no need to keep it in the relational database, except that you want to persist the data in a secure way.
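Here is a minimal sketch of what moving that data into a key-value store could look like. The KeyValueStore interface and the in-memory implementation are purely illustrative stand-ins for a real store's client API, but the shape is the point: the foreign key becomes the lookup key, and nothing ever needs a where-clause.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical minimal key-value interface; real stores expose something
// similar: put and get by key, with no query language in between.
interface KeyValueStore {
    void put(String key, Map<String, String> value);
    Map<String, String> get(String key);
}

public class FilePropertiesExample {
    public static void main(String[] args) {
        KeyValueStore store = new InMemoryStore(); // stand-in for a real client

        // The foreign key from the relational side becomes the lookup key.
        String fileId = "file-42";

        // The "extra" semi-structured data that never appears in a where-clause.
        store.put(fileId, Map.of(
                "author", "J. Smith",
                "subject", "Q3 report",
                "dateCreated", "2009-04-30"));

        System.out.println(store.get(fileId).get("author"));
    }

    // Toy in-memory implementation just so the sketch runs end to end.
    static class InMemoryStore implements KeyValueStore {
        private final Map<String, Map<String, String>> data = new ConcurrentHashMap<>();
        public void put(String key, Map<String, String> value) { data.put(key, value); }
        public Map<String, String> get(String key) { return data.get(key); }
    }
}
```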

Another option for this data, of course, has been to use XML files. In this kind of solution you would probably have to rely on organizing the information using particular directory and file names; the file would most likely be named with the foreign key. Then you would have to write the code to manage those files, which at the least means a component to read/write the XML.
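For comparison, that read/write component can be roughly this small if you lean on java.util.Properties, which reads and writes a simple XML format out of the box. The file name follows the foreign-key convention above; the values are made up for illustration:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Properties;

public class XmlPropertiesExample {
    public static void main(String[] args) throws Exception {
        // File named with the foreign key, per the naming convention.
        String path = "file-42.xml";

        Properties props = new Properties();
        props.setProperty("author", "J. Smith");
        props.setProperty("subject", "Q3 report");
        props.setProperty("dateCreated", "2009-04-30");

        // Write the key-value pairs out as XML.
        try (FileOutputStream out = new FileOutputStream(path)) {
            props.storeToXML(out, "properties for file-42");
        }

        // Read them back.
        Properties loaded = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            loaded.loadFromXML(in);
        }
        System.out.println(loaded.getProperty("author"));
    }
}
```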

But the cost of keeping this data either in a relational database or in XML files ends up being high, because you have to consider availability and integrity. For both of these solutions that usually ends up meaning a cluster set-up at the “front” with a RAID array for storage and a somewhat complicated backup process.

Cost seems to make the non-relational database systems particularly attractive for the enterprise. The non-relational databases have been specifically developed with the idea that you can use cheap hardware to scale them out. They are distributed systems and rely on different replication schemes to keep copies of the data on a certain minimum number of machines at all times to ensure that the data is always available. People generally seem to feel comfortable with the same data existing on at least 3 machines. These machines theoretically do not need to be much more powerful than a regular desktop machine. Start adding a few more machines and your capacity and savings should really start to add up.
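A toy sketch of that replication idea, with the usual factor of three: each write goes to three nodes, so a read still succeeds when any single cheap machine is down. The Node class and the hashing placement are purely illustrative, not any particular store's scheme.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ReplicationSketch {
    static final int REPLICATION_FACTOR = 3; // the "same data on 3 machines" rule

    // Illustrative stand-in for a storage node on a cheap commodity box.
    static class Node {
        final Map<String, String> disk = new ConcurrentHashMap<>();
    }

    private final List<Node> cluster;

    ReplicationSketch(List<Node> cluster) {
        this.cluster = cluster;
    }

    // Write to REPLICATION_FACTOR consecutive nodes, chosen by simple hashing.
    void put(String key, String value) {
        int start = Math.floorMod(key.hashCode(), cluster.size());
        for (int i = 0; i < REPLICATION_FACTOR; i++) {
            cluster.get((start + i) % cluster.size()).disk.put(key, value);
        }
    }

    // Read from the first replica that answers, so one dead node is harmless.
    String get(String key) {
        int start = Math.floorMod(key.hashCode(), cluster.size());
        for (int i = 0; i < REPLICATION_FACTOR; i++) {
            String value = cluster.get((start + i) % cluster.size()).disk.get(key);
            if (value != null) return value;
        }
        return null;
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(new Node(), new Node(), new Node(), new Node(), new Node());
        ReplicationSketch store = new ReplicationSketch(nodes);
        store.put("file-42", "semi-structured metadata");
        System.out.println(store.get("file-42"));
    }
}
```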

Of course there would be training and switch-over costs, but your programmers will be happy to work on the new technology. For a large company with many internal, proprietary systems, there is probably a lot of money to be saved by creating one of these clusters and consolidating all that semi-structured data into it. Save the expensive storage for the highly structured, transactional data.