Wednesday, June 22, 2011

Buzzwords Keynote - Part 2

In the first part of the talk, I made the case that Apache Hadoop has lots of head-room in terms of performance.  This translates into lots of opportunity, both for open source developers to make Hadoop itself better and for companies to build products that derive from Hadoop but improve on it in various ways.

The $S$ score

In honor of Steve Jobs, whose highest praise is reputedly to say "that doesn't suck", I proposed an $S$ score whose best possible value is zero but which is negative for all real systems.  For a batch data-processing system like Hadoop, I proposed that a good definition of $S$ is the log base 10 of the ratio of the actual performance to the performance implied by hardware limits.
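Written out as a formula, the definition looks roughly like this.  The symbols $P_{\text{actual}}$ and $P_{\text{hw}}$ are my own shorthand here (not notation from the talk) for the measured throughput and the throughput the hardware limits would allow, both in the same units:

$$S = \log_{10} \frac{P_{\text{actual}}}{P_{\text{hw}}}, \qquad P_{\text{actual}} \le P_{\text{hw}} \;\Rightarrow\; S \le 0$$

So a system running at one hundredth of what the hardware allows has $S = \log_{10}(1/100) = -2$.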

Not surprisingly, the overall score for Hadoop comes out somewhere between $-5$ and $-2$ depending on the workload (i.e. Hadoop runs programs somewhere between 100 and 100,000 times slower than the hardware would allow).  For some aspects, Hadoop's $S$ score can be as good as $-0.5$, but generally there are multiple choke-points and some of these are additive.  This is hardly news and isn't even a mark of discredit to Hadoop, since the developers of Hadoop have always prized getting things to work, and to work at scale, above getting things to work within an iota of the best the hardware can do at a particular scale.  Another factor that drives $S$ down for Hadoop is the fact that the hardware we use has changed dramatically over the 6-7 year life of Hadoop.
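To make the arithmetic concrete, here is a minimal sketch of how such a score could be computed and how additive choke-points combine.  The function names and all of the throughput numbers are hypothetical, chosen only for illustration, and the sketch assumes the choke-points compound multiplicatively (so that their log scores add), which the text above says is only true of some of them.

```python
import math


def s_score(actual, hardware_limit):
    """Return log10(actual / hardware_limit); 0 is perfect, real systems are negative."""
    if actual <= 0 or hardware_limit <= 0:
        raise ValueError("throughputs must be positive")
    return math.log10(actual / hardware_limit)


def combined_score(stage_scores):
    """Assumption: slowdowns multiply, so the corresponding log scores simply add."""
    return sum(stage_scores)


def slowdown(score):
    """Convert a score back into a 'times slower than the hardware allows' factor."""
    return 10.0 ** (-score)


# Hypothetical choke-points: (measured throughput, hardware-limited throughput),
# each pair in the same arbitrary units.
stages = {
    "disk": (30.0, 100.0),
    "network": (10.0, 100.0),
    "serialization": (1.0, 100.0),
}

scores = {name: s_score(actual, limit) for name, (actual, limit) in stages.items()}
total = combined_score(scores.values())

for name, score in scores.items():
    print(f"{name}: S = {score:.2f}")
print(f"combined: S = {total:.2f}, about {slowdown(total):.0f}x slower than the hardware allows")
```

The point of the log scale is visible in the last line: individual scores that each look tolerable add up to a combined score that corresponds to a very large slowdown.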

In defining the value of $S$ for current Hadoop versions, I don't mean to include algorithm changes.  Michael Stonebraker has become a bit famous for running down Hadoop for not doing database-like things with database-like algorithms, but I would like to stick to the question of how fast Hadoop could do what Hadoop is normally and currently used to do.

The key conclusion is that such a low $S$, combined with high demand for Hadoop-like computation, represents a lot of opportunity.  Part of that opportunity is for the open source community to make things better.  Part of it is for commercial companies to make money.  The latter kind of opportunity is what is going to shake up the currently cozy Hadoop community the most.
