Tuesday, March 25, 2008

Hello world for map-reduce

There has been a surge of interest lately in map-reduce based parallel computing from many quarters. The public, non-Google side of that interest is largely focused around an open source system of software known as Hadoop.

Lots of people are doing some very cool things with Hadoop. Yahoo is now running key parts of their web indexing infrastructure using Hadoop. Academics have described very impressive results in large scale machine translation developments. Facebook uses Hadoop for key parts of their metrics pipeline. IBM has developed a functional programming system on Hadoop. Imitators have been spawned.

But one key ingredient of any new computing technology is still missing.

What Hadoop has been missing to date is a good way to write a hello-world application in just a few lines of code. It has often been said that the ability to do simple things simply is a critical capability for a new technology. Hadoop, and map-reduce programming in general, does not lack a good hello-world example; the word-counting example provides that. What is missing is the ability to write this program in just a few lines of code. The word-count example that comes with the Hadoop distribution is over a hundred lines of code, most of which are impossible to motivate for a new user.

The basic idea is simple: the map phase tokenizes input lines and emits words as keys with (initial) counts of 1. The reduce phase takes all of the counts for a single word and sums them up. As a refinement, we can use a combiner to pre-reduce counts within the output of a single mapper.
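To make those phases concrete before any Hadoop machinery gets involved, here is a minimal plain-Groovy sketch of the same idea (no Hadoop at all; the variable names are purely illustrative). The map step emits (word, 1) pairs, the grouping step stands in for the shuffle, and the reduce step sums the counts; a combiner would simply apply the same summation to each mapper's local output before the shuffle.
  def lines = ["this is input text", "for the word-counter"]

  // map: emit a [word, 1] pair for every token
  def pairs = []
  lines.each { line -> line.split().each { w -> pairs << [w, 1] } }

  // group: gather the pairs for each word (this is the shuffle on a real cluster)
  def grouped = pairs.groupBy { it[0] }

  // reduce: sum the counts for each word; a combiner would do the same
  // summation on each mapper's local output before the data moves
  def counts = [:]
  grouped.each { w, ps -> counts[w] = ps.collect { it[1] }.sum() }

  println counts   // [this:1, is:1, input:1, text:1, for:1, the:1, word-counter:1]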

So why is this more than 5 lines of code? The answer is that running a real-world map-reduce cluster has some real complexity if you want the best possible results. Another reason is that hiding complex implementations in Java requires lots of noise tokens. It isn't that Java is bad; it just isn't good at hiding complexity from naive users.

My answer to this problem is to move to Groovy, a language whose raison d'être is specifically to hide the cruft in Java programs. Groovy isn't Java or an extension of Java, but Groovy programs inter-call with Java programs completely transparently and compile to the same byte codes.
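As a tiny illustration of that interoperability, ordinary Groovy code can use Java classes such as java.io.File and java.util.HashSet directly, with no bridging layer (the file name here is just a placeholder for this sketch):
  // plain Groovy calling ordinary Java classes; no wrappers or glue needed
  def seen = new java.util.HashSet()
  new File("words.txt").eachLine { line ->
    line.split().each { w -> seen.add(w) }
  }
  println "${seen.size()} distinct words"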

So what should word-count look like? I think that it should look like functional programming. Our mapper should look like a function, as should our reducer. They should be combined to produce a map-reduce program which should itself be a function that takes inputs and produces outputs. Those inputs should be any kind of input that we might like to use, including lists of strings (for testing), local files, and files stored in a distributed data store.

So our program really ought to look like this:
  wc = mapReduce(
    {key, line, out, report -> line.split().each {w -> out.collect(w, 1)}},
    {w, counts, out, report -> out.collect(w, counts.sum())}
  )
and we should be able to use this program in a lot of different ways:
  wc(["this is input text", "for the word-counter"])
wc(new File("/local/input/file")
wc(findInterestingTextUsingMapReduce(new HdfsFile("/existing/distributed/file")
and we should be able to access the output very simply:
  wc(["this is input text"]).eachLine {
println it
}
Subject to some limitations in Groovy 1.5.4, this is exactly what my new system does. I call this new system grool because it is pretty thin porridge right now.

Send me email for a copy. I will get it onto a standard kind of open-source repository soon(ish), but you can try it sooner.

2 comments:

Ken Weiner said...

What is the status of this great suggestion? I saw https://issues.apache.org/jira/browse/HADOOP-2781 but couldn't tell if progress is being made. Is any help needed? Where can I get the current grool code?

Ted Dunning ... apparently Bayesian said...

Well, the status is a bit sad. I did the code while at Veoh amid parallel efforts using Cascading, Pig and Oracle for extraction of metric data from large event logs.

In the end, it was clear that writing map-reduce programs for this kind of data reduction task was too low level and that Cascading and Pig had substantial advantages in terms of code complexity. At least as importantly, the effort to maintain an internal system like grool, and the difficulty of debugging imperative programs in a parallel environment with only a half-baked solution like grool, were clearly pretty large.

The upshot is that grool is still private and completely unmaintained. I will send email back to Veoh to see if I can get you a copy, though.