Tuesday, March 24, 2009

What would really help map-reduce programs?

David Hall is touting a Scala package that he calls SMR (Scala for Map Reduce).

I built something like this quite some time ago using Groovy. I called the resulting Groovy map-reduce facade grool because it was pretty thin. For toy programs, the results were even more impressive than what SMR achieved. For instance, here is a fully Hadoop-compatible map-reduce version of word count:

wc = mapReduce(
    {key, line, out, report -> line.split().each {w -> out.collect(w, 1)}},
    {w, counts, out, report -> out.collect(w, counts.sum())}
)

In this example, wc is a function that can then be applied to various inputs as in

wc(["this is input text", "for the word-counter"])


wc(new File("/local/input/file"))

or even

wc(functionUsingMapReduce(new HdfsFile("/existing/file")))
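Under the hood, the semantics those two closures implement are simple. The following is a minimal sketch in plain Python (no Hadoop, and the `emit` callback is just illustrative of grool's `out.collect`): run the mapper over each record, group the emitted (key, value) pairs by key, and run the reducer over each group.

```python
from collections import defaultdict

def map_reduce(mapper, reducer, records):
    """Minimal in-memory map-reduce: map each (key, value) record,
    group emitted pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        mapper(key, value, lambda k, v: groups[k].append(v))
    results = {}
    for k, values in groups.items():
        reducer(k, values, lambda k2, v2: results.update({k2: v2}))
    return results

# word count as two small functions, mirroring the grool closures above
wc = lambda lines: map_reduce(
    lambda key, line, emit: [emit(w, 1) for w in line.split()],
    lambda w, counts, emit: emit(w, sum(counts)),
    enumerate(lines))

print(wc(["this is input text", "for the word-counter"]))
```

This is only the local, single-process shape of the computation; grool's actual trick was shipping the closures to the right machines so the same two functions ran as a real Hadoop job.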

In spite of how pretty these tiny examples are, I eventually quit using it for day-to-day map-reduce coding.

The problem with things like grool that attempt to make writing map-reduce programs easier is that they don't really solve the problems you face in writing map-reduce programs. Grool does make word count much shorter (5-7 lines of code versus >200). It does involve a very cool hack so that the closures passed as arguments are available as map and reduce objects on the right machines. But when I used grool for production code, I found that it didn't solve the right problem.

The big problems involved in map-reduce programming are

  • map-reduce is just a tad too low-level for most problems.

  • debugging programs written using a facade like this is harder than for programs written using raw java + map-reduce.

  • the boilerplate of map-reduce isn't proportional to problem size, so large programs show much less improvement in grool than small programs do.

What I would much rather have is something that provides much higher-level semantics, such as those provided by Pig, but with better integration into real programming languages. The better integration I need goes in two directions. First, I want to be able to, for instance, compute every date in a date range, use those dates to form input paths, and then run a computation on those input paths. Pig can't do that except via a very ugly argument-substitution hack.
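That first kind of integration is trivial in a general-purpose language. Here is a sketch in Python; the `/logs/YYYY/MM/DD` path layout is a made-up example, and the job the paths would feed is left hypothetical:

```python
from datetime import date, timedelta

def date_paths(start, end, template="/logs/{:%Y/%m/%d}"):
    """Yield one input path per day in [start, end], using a
    hypothetical /logs/YYYY/MM/DD directory layout."""
    d = start
    while d <= end:
        yield template.format(d)
        d += timedelta(days=1)

paths = list(date_paths(date(2009, 3, 22), date(2009, 3, 24)))
# ['/logs/2009/03/22', '/logs/2009/03/23', '/logs/2009/03/24']
# these paths could then be handed to a map-reduce job as its inputs
```

In Pig this loop has to live outside the script, with the dates pasted in through parameter substitution; in a real language the path computation and the job submission are just ordinary code.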

    A second kind of better integration is what Chris Wenzel has been pushing for some time with Cascading. I would like it if I could write programs in what might be called PL/Pig (by analogy with PL/SQL) that would then be tightly integrated into a Cascading flow. Ideally, the PL/Pig program would expose its execution plan to Cascading for global optimization and scheduling.