Thursday, December 17, 2009

The Daily Scrum or Morning Meeting

A scrum or morning meeting is the best way to get control of a development process. The morning is the most natural time of day for everyone to synchronize, before anyone has started working. Not first thing in the morning, though, because people need time to have some breakfast, review new email, read the news, and get settled in; about half an hour after arrival should do it. The other good thing about having it in the morning is that it ensures everyone gets to work on time, which gets to the essence of how a scrum works: peer pressure. When someone is late to the meeting, it is not just clear to their manager, it is clear to the whole team. They are not letting their manager down; they are letting the team down.

During the meeting, each member of the team goes over what they have accomplished since the day before and what they are working on now. According to strict Scrum principles, the meeting is not supposed to become a status meeting run by the project manager (technically the role is supposed to be called a scrum master, but I don't like that name). Personally, I like having the project manager run the meeting; I prefer the direction it gives. And you still get the peer pressure because everyone is listening.

Sunday, December 13, 2009

Using Humans in Your System Design

A mistake a lot of people make when creating software is trying to automate everything. When you are making software for a business, you are generally automating a process, and there is often a drive to make software that automates the whole process. People get into an all-or-nothing frame of mind that usually ends with nothing getting deployed or, even worse, with software that tries to do it all, fails, disrupts the business, and wastes a lot more money.

There are many strategies and methodologies for dealing with this situation. Agile is probably the most popular and successful. One of the main tenets of Agile is to deliver business value as soon as possible, and in most situations you can't deliver value quickly by programming everything. If you can't program everything, you are going to have to learn how to design humans into your system. This is something a lot of software people seem uncomfortable with, because they view the system they are building purely as a piece of software and think that automation is the ideal. But full automation is not necessarily the goal; creating business value is. (I'll have more to say on this in a later post.)




Saturday, October 24, 2009

Computer Forensics Paper

Forensic Focus, the leading computer forensics community site, has posted one of my papers, titled "Simple Steganography on NTFS when using the NSRL." It's a relatively simple idea, but an important one for computer forensics investigators who use the NSRL (National Software Reference Library from NIST). For those of you who aren't familiar with it, the NSRL contains the hashes of millions of files from operating systems and applications. Those hashes are used to identify known files and filter them out, reducing the set of files that has to be examined during an investigation. This is standard practice in computer forensics, and the paper describes some steganography that you have to look out for when relying on it.
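To make the filtering idea concrete, here is a minimal sketch (my own, not from the paper) of hash-based filtering: hash every file under a directory and skip anything whose hash appears in a known-file set, such as one built from the NSRL. The paths and the known_hashes value are made up for illustration.

import hashlib
import os

def sha1_of_file(path, chunk_size=65536):
    """Compute the SHA-1 of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def files_to_examine(root_dir, known_hashes):
    """Yield files whose hashes are NOT in the known-file set."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if sha1_of_file(path) not in known_hashes:
                yield path

# Hypothetical usage: known_hashes would really be loaded from an NSRL hash set.
known_hashes = {"da39a3ee5e6b4b0d3255bfef95601890afd80709"}
for path in files_to_examine("/evidence/image_mount", known_hashes):
    print(path)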

Tuesday, October 13, 2009

Using Logic Puzzles in Interviews

I have never been a fan of using logic puzzles in the hiring process. The practice apparently originated at Microsoft. The case for using them is that technology is always changing, so you don't want to test anyone on a current technology, but rather on their general logic abilities, which will tell you whether they will continue to be useful to the company when it has to adopt new technologies. This line of reasoning is compelling, and it seems that a lot of tech companies have accepted it.

Personally, I don't buy the argument. For one thing, I don't see why a puzzle necessarily tests someone's logic abilities better than a programming test, which gives someone a formal way to reason - the programming language. It also does not follow that someone who can solve logic puzzles well is going to be any good at programming, which has all kinds of constraints and doesn't rely on clever tricks. I think it's far better to give someone a real programming problem that you have at work and see how they go about solving it. Because it's a real work problem, you should be familiar with the details and some of the approaches to solving it, and from that familiarity you should be able to tell more about someone's ability to reason and think logically than from a puzzle that has no context.

As for whether a potential programmer will be able to move on to other technologies or languages, it makes more sense to me to see how well they understand the concepts behind what they are doing and to get a sense of what kind of person they are. When it comes to learning a new technology, what you need more than general logical ability is genuine curiosity and motivation.

Sunday, October 11, 2009

Distributed Computing Models

There are a number of general models for distributed computing. A lot of the terms are used interchangeably, and there are plenty of systems that fall in between these models or combine them. Nevertheless, I think it is useful to make distinctions and define the different models this way:

Client-Server, 3-tier, N-tier - The processing is distributed through the use of layers. There are layers for UI, for business logic, for data storage, etc. These are generally data-driven applications.

Clustered - A set of machines acts as one. There are usually shared data stores, and the presence of multiple machines is effectively transparent to clients. This is used for things like load balancing or fault tolerance.

Peer-to-Peer - These systems are decentralized and used for applications like file sharing or instant messaging. In practice these systems need some centralization at least for user management.

Grid - These are systems where the processing is split up so that many machines can work in parallel. These are becoming the most popular because they are necessary for big data systems.

So when you think of distributed systems, there really seem to be 4 concepts: layers, unified, decentralized, and parallel.
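As a rough illustration of the parallel idea behind the grid model, here is a minimal sketch (my own, not tied to any particular grid framework) that splits a job into chunks and hands them to multiple workers. It runs on one machine with worker processes, just to show the splitting that a grid applies across many machines.

from multiprocessing import Pool

def process_chunk(chunk):
    """Stand-in for real work: sum a chunk of numbers."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1000000))
    # Split the input into chunks so several workers can run in parallel.
    chunk_size = 100000
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with Pool(processes=4) as pool:
        partial_results = pool.map(process_chunk, chunks)
    # Combine the partial results into the final answer.
    print(sum(partial_results))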

Let me know if you think I'm missing something.

Saturday, October 3, 2009

Hadoop World 2009

I went to Hadoop World: NYC 2009 on Friday, October 2. It was organized by Cloudera, the company that provides professional support and training for Hadoop. (Amr Awadallah, their CTO, sent me a discount code - Thanks Amr!)

The first time I really took notice of Hadoop was early last year. It's amazing to see how much ground it has covered since then. At the conference there was a whole track devoted to applications. There were the usual niche companies using it, but also presentations by VISA, JP Morgan Chase, eBay, and other big names. A lot of people are using it in conjunction with Lucene.

What's becoming clear to me is that Hadoop is becoming THE platform for data analysis and processing. There are other systems out there for handling large data sets, most of them based in some way on a relational database and incorporating MapReduce and a distributed architecture, but none of them seem to have the flexibility of Hadoop. A range of useful applications, for example, can be built using just HDFS (the Hadoop Distributed File System).

Monday, August 24, 2009

Finance company uses daily scrum

In an article on BlackRock, the biggest asset manager in the world with $3 trillion under management, it was revealed that one of their management techniques for keeping the company in sync and feeling "small" is a mandatory daily morning meeting at 8am. Everyone involved gives a short, one-minute presentation, apparently on whatever they are working on. The article doesn't get into specifics about the meeting, but it sounds very similar to a daily scrum. Check out the article in Fortune.

Tuesday, July 7, 2009

Speaking engagement at ITARC New York

IASA's IT Architect Regional Conference is being held in New York on October 12-14. ITARC NYC. This is a great few days with people who are passionate about software architecture. This year I will be speaking about Distributed Computing in the Enterprise. You can read the description here.

Monday, June 15, 2009

Tech Hiring Process

Having a well-defined hiring process for technical employees is the essential first step toward building a great development organization. That isn't a new or surprising thing to say, but what exactly the process should look like may not be so obvious, especially to people new to hiring for technical positions.

Here is a basic structure that you can use to make your own process:

1. Initial screening email
2. Phone screen
3. Written test
4. Interviews

The initial screening email just covers basic issues: whether the person can legally work in the US, whether they are really looking for full-time work, whether they would need to relocate, and so on. You'd be surprised how many people send out their resumes without really reading the details of the position. Also be sure to ask about salary expectations or last salary earned if you don't put a range in the initial ad. You need to know where someone is with salary before even talking to them, because you do not want to be surprised later on. It can also be a bad sign if someone does not want to answer a question about salary or just says that it is negotiable; in that situation, you are more than likely dealing with someone who is not that serious. Good people know what they want and aren't bashful about saying where they are with pay.

The phone screen should be short, about 20 minutes, with 4-5 technical questions that cover the basic knowledge required for the position. Ask the same questions of everyone so you can compare the answers. You should also ask a question about what they're currently working on to get a sense of how they communicate. This stage should weed out a lot of candidates.

The written test should make the candidate do something they would face on the job. There are differences of opinion here, but I believe that simulating real work problems that come up in the position is the way to go, even to the extent of giving them a recent problem or issue faced by your team. When the candidate finishes the test, you can go over it with them and get an understanding of how they tackle problems. I think this gives you the clearest picture of what the person would be like if they came to work for you because, really, you have given them a little work to do.

The final interviews are with team members, and I think it's a good idea to prep them with questions. They can ask whatever they want, but make sure they have a set of prepared questions available so they don't have to worry about coming up with their own. Personally, I don't like the brain teasers.

Thursday, April 30, 2009

Non-Relational Databases in the Enterprise

There has been a lot of talk recently about “key-value” and “document-oriented” databases. These non-relational databases have become essential for large-scale web applications: they allow for fast writes and can scale out for systems that don't need the rigid structure of a relational db and its querying abilities. There's plenty of information online about them; this is a good list and write-up: Anti-RDBMS: A list of distributed key-value stores

It's hard to tell at this point which ones will still be actively developed and used a few years from now. I would assume that the Apache projects have as good a chance as any of them.

I'm interested more generally in how these systems can be used inside the enterprise, or for non-web applications. These systems are built for semi-structured (key-value) data, and there is plenty of that kind of data in enterprise systems. Often this data is supplementary or variable in nature. A good example is the properties of a file (author, subject, date created, etc.). This kind of data can be found in lots of existing relational databases, in tables that have a foreign key and, not surprisingly, columns usually called “key” and “value.” I've seen these kinds of tables in lots of systems. The important thing to realize is that the data does not need to be used in a query – it does not need to appear in a SQL where-clause. So there is really no need to keep it in the relational database, except that you want to persist the data in a safe, durable way.
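As a rough sketch of what that data looks like outside the relational database, here is how those file properties might be kept in a simple key-value store. I'm using Python's built-in shelve module purely as a stand-in for a real distributed key-value store; the file id and property names are made up.

import shelve

# Open (or create) a small key-value store on disk.
with shelve.open("file_properties") as store:
    # The key is the file's id (the foreign key from the relational schema);
    # the value is the whole bag of semi-structured properties.
    store["file:1042"] = {
        "author": "J. Smith",
        "subject": "Q3 budget",
        "date_created": "2009-04-12",
    }

# Later, look the properties up by id - no SQL where-clause needed.
with shelve.open("file_properties") as store:
    props = store.get("file:1042", {})
    print(props.get("author"))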

Another option for this data, of course, has been to use XML files. In that kind of solution you would probably have to rely on organizing the information through directory and file names; the file would most likely be named with the foreign key. Then you would have to write the code to manage those files, which at the very least means a component to read and write the XML.
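For comparison, here is a minimal sketch of the XML-file approach, with each file named after the foreign key. The element names and directory layout are just illustrative, not from any particular system.

import os
import xml.etree.ElementTree as ET

PROPS_DIR = "file_properties_xml"

def write_properties(file_id, properties):
    """Write a file's properties to <PROPS_DIR>/<file_id>.xml."""
    root = ET.Element("properties")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property", name=name)
        prop.text = value
    os.makedirs(PROPS_DIR, exist_ok=True)
    ET.ElementTree(root).write(os.path.join(PROPS_DIR, f"{file_id}.xml"))

def read_properties(file_id):
    """Read a file's properties back into a dict."""
    tree = ET.parse(os.path.join(PROPS_DIR, f"{file_id}.xml"))
    return {p.get("name"): p.text for p in tree.findall("property")}

write_properties("1042", {"author": "J. Smith", "subject": "Q3 budget"})
print(read_properties("1042"))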

But the cost of keeping this data either in a relational database or in XML files ends up being high, because you have to consider availability and integrity. For both of these solutions, that usually means a cluster set-up at the "front" with a RAID array for storage and somewhat complicated back-up processes.

Cost seems to make the non-relational database systems particularly attractive for the enterprise. The non-relational databases have been specifically developed with the idea that you can use cheap hardware to scale them out. They are distributed systems and rely on different replication schemes to keep copies of the data on a certain minimum number of machines at all times to ensure that the data is always available. People generally seem to feel comfortable with the same data existing on at least 3 machines. These machines theoretically do not need to be much more powerful than a regular desktop machine. Start adding a few more machines and your capacity and savings should really start to add up.
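To illustrate the replication idea, here is a minimal sketch (my own, not any particular product's scheme) that maps a key to three nodes so that every value lives on at least three machines. The node names are made up.

import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]
REPLICAS = 3  # keep each value on at least 3 machines

def nodes_for_key(key, nodes=NODES, replicas=REPLICAS):
    """Pick `replicas` consecutive nodes, starting from a hash of the key."""
    start = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

print(nodes_for_key("file:1042"))  # e.g. ['node-c', 'node-d', 'node-e']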

Of course there would be training and switch-over costs, but your programmers will be happy to work on the new technology. For a large company that has many internal, proprietary systems, there is probably a lot of money to save by creating one of these clusters and consolidating all that semi-structured data into it. Save the expensive storage for the highly structured, transactional data.


Ordering for Trees in SQL

Here's an article I posted on Code Project a while back. I still think it's a nifty technique for a very specific situation where you are storing a tree in SQL using an adjacency list model and need depth-wise ordering for it.

Depth-wise Ordering for Trees in SQL
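For anyone unfamiliar with the terms, here is a small sketch (in Python, not the SQL technique from the article) of what an adjacency list looks like and what depth-wise ordering means: each node stores its parent's id, and the ordering lists a node before all of its descendants.

# Adjacency list: each row is (id, parent_id, name); parent_id None = root.
rows = [
    (1, None, "root"),
    (2, 1, "chapter 1"),
    (3, 1, "chapter 2"),
    (4, 2, "section 1.1"),
    (5, 2, "section 1.2"),
]

def depth_wise_order(rows):
    """Return node names in depth-first order starting from the root."""
    children = {}
    names = {}
    for node_id, parent_id, name in rows:
        names[node_id] = name
        children.setdefault(parent_id, []).append(node_id)

    def walk(node_id):
        yield names[node_id]
        for child in children.get(node_id, []):
            yield from walk(child)

    root_id = children[None][0]
    return list(walk(root_id))

print(depth_wise_order(rows))
# ['root', 'chapter 1', 'section 1.1', 'section 1.2', 'chapter 2']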