Friday, July 30, 2010

The new grool

A few years ago, I built a prototype system I called grool in Groovy to simplify map-reduce programming.  My goal was to use Groovy to handle control flow and define operations but to execute large programs using Hadoop.

Grool foundered on the difficulty of transporting closures over the network.  I used a clever trick to avoid the problem, but it depended on executing the same script multiple times with a different meaning each time.  That is a difficult concept to convey to users, and the result was that grool never worked well for ordinary folks.

A bit later, the guys at Google developed FlumeJava, which has many of the same goals and many of the same benefits that grool was intended to provide.  In Java, however, transporting functional objects is paradoxically much simpler than in Groovy.  The difference is entirely because Java is statically compiled, so stateless functional classes can be referred to by name in different JVMs that have access to the same jar.
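To make the "refer to it by name" point concrete, here is a minimal sketch in plain Java.  It is not FlumeJava or Plume code; the class and method names (ByNameDemo, ToUpper, encode, decode) are made up for illustration, and it assumes the function class has no fields and a public no-argument constructor.  Both "sides" run in one JVM here, but the only thing passed between them is a string.

```java
import java.util.function.Function;

public class ByNameDemo {
    /** A stateless function: no fields, implicit public no-arg constructor. */
    public static class ToUpper implements Function<String, String> {
        @Override
        public String apply(String s) {
            return s.toUpperCase();
        }
    }

    /** "Sender" side: all that has to cross the wire is the class name. */
    public static String encode(Function<String, String> f) {
        return f.getClass().getName();
    }

    /** "Receiver" side: load the class from the shared jar and instantiate it. */
    @SuppressWarnings("unchecked")
    public static Function<String, String> decode(String className)
            throws ReflectiveOperationException {
        Class<?> c = Class.forName(className);
        return (Function<String, String>) c.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        String wire = encode(new ToUpper());        // just a String
        Function<String, String> f = decode(wire);  // rebuilt from the name
        System.out.println(wire + " -> " + f.apply("plume"));
    }
}
```

A Groovy closure has no such stable, pre-compiled name to hand to the other JVM, which is roughly where grool got into trouble.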

FlumeJava also provides an optimizer, which is made possible because FlumeJava uses lazy evaluation.  This makes FlumeJava programs nearly as efficient as well-optimized hand-written programs.  Other systems like Pig and Cascading are able to rewrite their logical plans, but Pig especially has problems because it has no real access to a Turing-complete language.
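Why lazy evaluation helps an optimizer can be shown with a toy example.  The sketch below is my own illustration, not the FlumeJava API: the hypothetical Lazy class only records each map step, and nothing runs until run() is called.  Because the whole chain is known by then, the steps can be fused into a single pass over the data instead of building an intermediate list per step.  A real optimizer does far more than this, of course.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class LazyPlanDemo {
    /** A deferred collection: it records what to do rather than doing it. */
    public static final class Lazy<S, T> {
        private final List<S> source;
        private final Function<S, T> plan;  // all recorded steps, composed

        private Lazy(List<S> source, Function<S, T> plan) {
            this.source = source;
            this.plan = plan;
        }

        public static <S> Lazy<S, S> of(List<S> source) {
            return new Lazy<>(source, Function.identity());
        }

        /** Records a step; no data is touched here. */
        public <U> Lazy<S, U> map(Function<T, U> f) {
            return new Lazy<>(source, plan.andThen(f));
        }

        /** Executes the fused plan in one pass over the source. */
        public List<T> run() {
            List<T> out = new ArrayList<>(source.size());
            for (S item : source) {
                out.add(plan.apply(item));
            }
            return out;
        }
    }

    public static void main(String[] args) {
        Lazy<String, Integer> plan = Lazy.of(Arrays.asList(" a ", "bb", " ccc"))
                .map(String::trim)
                .map(String::length);
        System.out.println(plan.run());  // prints [1, 2, 3]
    }
}
```

An eager version would have to materialize the trimmed strings before computing their lengths, since it never sees the second step coming.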

In addition, FlumeJava makes some interesting API design choices.

All in all, Flume-like systems are definitely worth playing with.  To make that easier, I just implemented an eager, sequential version of an approximate clone of FlumeJava that I call Plume.  The name is a presumptuous one, anticipating that if all goes well, we will be able to build a community and bring Plume into Apache, where it would be Apache Plume.  There it would provide some redress for the clear fauna bias in software names.

Check it out at http://wiki.github.com/tdunning/Plume/