Surprise and Coincidence

Wednesday, December 17, 2008

Students learn what they need, not what is assigned

A friend recently wrote about her experience with the currently fashionable style of teaching where students are expected to reinvent mathematics.

She wrote:

So much attention is given as to why - and to having students figure out algorithms themselves. Some teachers feel this is a waste of time - time that could be used for the students to be practicing the algorithms and becoming more fluent.

I think that this is largely a waste as well. We cannot expect students of all abilities to recapitulate in 6 years the history of more than 2000 thousand years of thought by the best minds the earth has ever seen. To imagine that they can is simply ludicrous. To imagine that what they can invent in this short time and that they don't need the benefit of those great thoughts is similarly ludicrous.

At the risk of sounding self-contradictory, I also think that when the student owns the task of learning, they will learn enormously better than when the task is imposed. This does not imply that the student discover everything, however. It merely implies that they need to discover the need for the things that they learn. Students who need to learn something can learn from almost any source, even from something as currently unfashionable as, say, sitting quietly thorugh traditional lecture.

The last time I taught in the classroom was as a member of a two-person teaching team teaching a software engineering class on machine level programming. In the past, this had been done by lecture and assignment and was truly a stunningly boring class. On the first day, I turned the structure of the class upside-down and assigned the entire final exam. This consisted of a single question in the form of a task (to build a robot that would drive around as fast as possible following a line on the floor). I then passed out soldering irons, computer components and kits of lego parts and told them to get to work. For the record, I had never tried to build such a 'bot myself.

This tactic resulted, as you would expect, in panic. The students complained that they didn't know how to solder, that they didn't know anything about the computer I had given them and that they didn't know how to build robots. I told them that they would have to learn all this and much more and that I would try to help them find out all this information, but they would have to tell me what they wanted to learn. Early on, the questions were about soldering. Over time, the questions became more and more sophisticated. At the beginning of the class, we had a list of lectures that we wanted to give, but we held them back until somebody asked a related question. At that point we would have a vote among the class whether they would like to have a lecture on the subject or would rather continue work on the robots.

By the end of the semester, I was getting complaints from the department because my students were (voluntarily) spending so much time on my class that they were neglecting their other classes. Some were spending 40 hours or more in the computer lab and many had built remarkable contraptions little related to the impending exam. This enthusiasm translated into perfect line-following performance on sample lines.

But ...

What I hadn't told the students was that the final would (for the first time) involve a line that crossed itself. Essentially all of the robots would fail on this line because they hadn't been designed or programmed to deal with that case. The real exam was whether they could deal with the unexpected and redesign and reprogram their robots during the 3 hour exam period to succeed on the harder problem.

All students succeeded.

Moreover two thirds of the class came back a week after the official final to repeat it in front of a friend who flew down to see how the class had worked out. I think, but do not have much data to support the assertion that it is rare for students to ask to be allowed to put off summer vacation so that they can repeat a final exam.

The point of it all was that these students could learn vastly more than was expected of them if they just wanted to. They could learn this material almost effortlessly in a traditional setting where somebody (me) wrote illegibly on the blackboard on difficult and abstract topics. But previous classes with exactly the same lectures given by better teachers than me had failed abysmally. I think the difference was that my students felt that they absolutely needed to learn what was in the lecture ... indeed, they had to ask for the lecture before I would give it. They also felt very much the owners of their task.

In the end, the students felt that they had invented almost everything they needed. In fact, I had spoon-fed fed them almost all of the key material. I told them how to make motors turn, how to make lights turn on, how to assemble and load software and how to design a software project. That didn't matter because they didn't remember that. What they remembered was that they initiated their learning of all of the material. They owned the class. I was their assistant, not their master.

And they learned. A lot.

Tuesday, December 16, 2008

So what about the question of why two negatives multiplied give a positive?

The short answer is that if this were not so, the world would not exist.

That is, the fundamental reason that multiplication works that way is because otherwise the definition would be broken. By broken, I mean that it would lead you to nonsensical conclusions if you followed it as far as you could. Any definition multiplication in a system with negative numbers has to work pretty much that way.

On Tue, Dec 16, 2008 at 6:42 AM, wrote:

Why does a negative number times a negative number equal a positive number?
How can you show that using a manipulative?

Let's take the concrete (but abstract) example of arithmetic on a 7 hour clock. Here is the addition table:

+ 0 1 2 3 4 5 6
0 0 1 2 3 4 5 6
1 1 2 3 4 5 6 0
2 2 3 4 5 6 0 1
3 3 4 5 6 0 1 2
4 4 5 6 0 1 2 3
5 5 6 0 1 2 3 4
6 6 0 1 2 3 4 5

OK. Now we can build multiplication on top of that. I will use * instead of x to indicate multiplication because I don't have really good fonts.

* 0 1 2 3 4 5 6
0 0 0 0 0 0 0 0
1 0 1 2 3 4 5 6
2 0 2 4 6 1 3 5
3 0 3 6 2 5 1 4
4 0 4 1 5 2 6 3
5 0 5 3 1 6 4 2
6 0 6 5 4 3 2 1

OK. Not much to see here except that in the multiplication every row has all of the integers. Moreover, every row has these integers in a different order. This is a consequence of the fact that 7 is a prime number. It doesn't happen that way on the 12 hour clock.

But let's look again at the addition table. Note that 1 + 6 = 0 and that 2 + 5 = 0. Notice also that 0 only appears once in each row. That means that we can consider 6 to be a way of writing -1. Or perhaps -1 is a way to write 6. This lets us solve addition problems such as 3 + x = 5 by adding 4 (which is -3) to both sides.

But what happens when we multiply 3 * (-2)? Do we get -6?

Well, -2 = 5 so this should be the same as 3 * 5 which is 1. But 1 is also -6 !

And all of our properties like the distributive law should still work.

Thus 3 * (-2) + 3 * 4 = 1 + 5 = 6 should be 3 * (-2 + 4) which is 3 * 2 = 6. So that works.

What about (-3) * (-2) = 6?

Well, this should be (and is) the same as 4 * 5 = 6.

That's cool.

In a very limited and geeky dull kind of way.

What is more cools is that this is all just a very concrete example of how the laws or arithmetic imply a certain kind of order. We could test this out for all the different kinds of arithmetic that have the properties that we like about integers. I don't mind starting infinite tasks, but I do mind having to finish them.

SO

Let's do much better by reasoning from those properties directly.

For instance, assume that we have the following:

an additive unit, 0 such that x + 0 = 0 + x = x and thus 0 * x = 0
an additive inverse, -x = 0 - x
left distributive law, (a + b) * x = a * x + b * x
right distributive law, x * (a + b) = x * a + x * b

Now take x * (0 - y). This has to be equal to (x * 0) - (x * y) = 0 - (x * y) = - (x * y)

Or (0 - x) * (0 - y) = 0 - x * (0 - y) = 0 - (0 - x * y) = x * y

This means that ANY kind of arithmetic where multiplication and addition have an additive inverse and left and right distributive laws will follow the pattern that (-x) * (-y) = x * y

OR

if that system doesn't have that, then it won't have one or all of the properties mentioned.

This works for a spinning globe, for quantum mechanics, for clocks and rubik's cubes.

And THAT is why abstraction is cool. But not why it is useful. That comes next.

Mathematics as an article of faith

A friend asked me a wonderful question recently.

She asked:
> I feel like math is so wonderful and useful to a certain point. (Like sixth grade.)
> Then it seems abstract... I can't see the purpose.

She was very much right that mathematics becomes abstract at that point.

And that is very much the point of it. It allows us to think abstractly. It allows us to find patterns in many different things that work the same way. That allows us to think a problem through in mathematically form and then recognize that form again and again.

One good example is modular arithmetic. We could call it clock arithmetic and talk about integral hours that wrap around the 12 hour clock. We could build an addition table that tells us everything there is to know about that kind of arithmetic. We could extend that addition table to be multiplication by repeated multiplication.

That is all well and good. But it only tells us about clock arithmetic on a 12 hour clock. It doesn't tell us as it stands about clock arithmetic on an 11 or 13 or 24 hour clock. It doesn't tell us about arithmetic on a clock where the time isn't just an integer, but can be between the hours. Maybe we don't care about those things because we don't have to tell time on a 17 hour clock very often.

But what about the fact that the original arithmetic that we started with on our 12 hour clock also works just fine for .... music on a western scale.

Should we spend the same amount of time reinventing all of what we learned about the 12 hour clock when it is exactly the same as for musical scales? Or should we re-use that knowledge?

Well, to re-use that knowledge, we have to stop a moment and erase all the places where we originally said "clock" or "hour" or "one day later" and replace them with the integers from 0 to 11 inclusive and replace "move clockwise one hour" with "add one, reduce by removing 12's". This abstracts the original system which can make it harder to learn, but it also makes it much more useful because we don't have to learn it again and again.

But what if somebody asks about pentatonic scales? Or about microtones? Or the 35 note scales that were talked about in the sixties?

Well, if we had started by abstracting away the number of hours on the clock and abstracting away whether the numbers were integers or real numbers, then we would be able to answer those questions instantly.

And if we had abstracted the idea of rotation a bit more, we would see that rotation around a circle can be generalized into rotation around a sphere or even more complex things. Now our clock arithmetic can solve navigational problems and get us home on a dark night.

But with just a bit more extension, that same clock arithmetic (now with 4 hours and 9 dimensions) could solve a Rubik's cube with about 5 minutes of thought.

None of the details here matter. For instance, the fact that 12 is not prime plays a big role in the nature of the arithmetic that we get on clocks, but that isn't the point of all of this.

Now, this abstraction is really frustrating because you can't always point to something and say that it is what you are talking about. In fact, the point of abstraction is that you aren't talking about something specific. This is exactly what makes it hard to teach abstraction using manipulatives. Manipulatives are all about grounding intuition in naive physics. Abstraction is all about UNgrounding your intuitions.

The point is that mathematics is all about abstraction. It is just a mechanism so that we can repeat what once appeared the work of genius without having ourselves to be genii.

So is it useful for most people to understand these wheels within wheels that turn behind the world we see?

To me, yes, it is useful and beautiful and wondrous. But I can't speak for others.

I do think that anybody with a spiritual tendency is turning away from the hand of god if they choose to not see all the kinds of order in the world.

Tuesday, November 4, 2008

natural language processing blog: Using machine learning to answer scientific questions

Try this paper for some pretty interesting results on regression in a large variable space with interactions:

http://ideas.repec.org/p/wop/pennin/01-05.html

Saturday, July 5, 2008

Death penalty and data modeling

This article just appeared in the "International Journal of Law and Information Technology". It purports to show that the death penalty is arbitrary because whether or not somebody who is sentenced to death can be predicted without considering any substantive characteristic of the crime in question. If true, the conclusion certainly would indicate that the death penalty really is applied based more on who was sentenced rather than what they did.

Unfortunately, there are a number of defects in the analysis given in that paper that me somewhat leery of taking the results at face value. Some of these problems represent insufficient detail in the article, others indicate a lack of rigor in the data analysis itself.

First, the authors include without definition the following three variables:

7. Third most serious capital offence
8. Second most serious capital offence
9. First most serious capital offence

These variables sound related to the capital crime for which the prisoner was sentenced. Without definition, we cannot judge. Moreover, without knowing how these variables were encoded, we cannot tell whether the model was able to use this input or not.

Secondly, the authors include the following four time based variables:

15. Month of conviction for capital offense
16. Year of conviction for capital offense
17. Month of sentence for capital offense
18. Year of sentence for capital offense

I understand what these variables represent, but am curious how the authors encoded them. Normally in a model like this, time would be encoded as a continuous linear variable from some relatively recent epoch. Separating a time variable like this often leads to problems if you are looking for recency effects. The other common use of separated month variables is to look for seasonality affects. Typically, though, this would be done using 1 of (n-1) encoding which you clearly did not use given how few inputs you have.

While the encoding of these variables doesn't really bear much on the validity of the results, it does make it unlikely that the model could have used these variables for the purpose intended.

Thirdly, the issue of encoding also comes up the geographical variable:

2. State

Encoded as a single variable, this is almost certainly an integer code. This is a very poor encoding of state (1 of 50 encoding would be much better, several variables with different resolution would be even better).

Regarding the learning results themselves, the authors apparently did not assess which input variables were responsible for their results. They should have used any of the standard techniques such as step-wise variable selection or input randomization. They should have also measured how much value the non-linear classifier that they used mattered to the results. Typically, this would be done by comparing the results available from a linear classifier such as a ridged logistic regression. With the encodings in use by these authors, such a straight-forward comparison is probably not viable.

The authors also do not appear to have done anything to determine whether there is a target leak in the data. This can occur when there is some apparently independent input which is accidentally highly correlated with the desired output. In this case, it is possible that one state is responsible for a large majority of the executions. This might make the output more predictable without providing much ammunition for the original argument.

Finally, the authors do not have any reference in their article about public availability of their data. Without making their data easily available and given how incredibly easy it is for novices to screw up a modeling effort, these results should be considered no more than slightly provocative.

Thursday, July 3, 2008

Why the long tail isn't as long as expected

Several bloggers and authors are finding out (belatedly relative to people involved in the industry) that the economic returns in systems that are supposedly governed by long-tail distributions are unexpectedly concentrated in the highly popular items. See here for a blogger's commentary on this article.

In practice, the long tail model does predict certain kinds of consumption very well. I speak from experience analyzing view data at Veoh where, except for the very few top titles, a unit power law described number of views pretty well. This means that the number of views from sparsely watched videos is surprisingly large, adding to the woes of anybody trying to review submissions.

Economically speaking, though, you have to factor in a few kinds of friction. These are reasonably modeled as per unit sale costs and per unit of stock costs. The per unit sale costs don't affect the distribution of revenue and profit, but the per stock unit costs definitely do. If you draw the classic Zipf curve and assume that revenue is proportional to consumption for all items, then a per stock unit cost offsets the curve vertically. The resulting profit curve tells you pretty much immediately how things will fall out.

The graph to the right illustrates. The solid line is the classic long-tail distribution and the dashed lines represent different per stock unit costs. Wherever the solid line is above the dashed line, the model implies positive returns; where it is below, it implies loss. Clearly, the lower the stock unit costs, the deeper into the distribution you can go and still make money.

If we assume that you will only be stocking items that have non-negative return, then the percentage of profit that comes from a particular part of the distribution will vary with the threshold. For high thresholds, almost all of the profit will be from mega-hits but for very low thresholds, a much larger percentage will come from low consumption items.

This sort of situation is shown in this second graph which shows the percentage of total profit achieved for different thresholds and ranks. The concentration of profit in the low rank items for high thresholds is quite clear.

A good example of a per stock unit cost is given in p2p download schemes. The first few downloads of each item have to be paid for by the original source while additional downloads make use of sunk network resources that apply minimal marginal cost back to the source. For systems like bit-torrent, you need more than a hundred downloads before you get to high p2p efficiency which makes the system useful for mainstream content and less useful for body and tail content. The Veoh p2p system has a much lower threshold and is thus useful far down into the tail. Either system, though, leads to an economic return different from the theoretical zero-cost long-tail model.

None of these observations is earth-shaking and, as far as I know these are all common wisdom among anybody trying to make money out of the long tail. Why this is news to the Harvard Business Review is the real mystery to me.

Wednesday, April 16, 2008

Reading email

And you thought that would be easy.

If you have ever written a program to receive email (not just pick it up from a POP server), you know just how painful that can be. If you wrote the program in Java, you know this even better.

The world just got better:

http://subethasmtp.tigris.org/

This package is just as simple as reading email should be. Your program decides whether to receive the email and then it gets it. The interface is tiny and it looks like they handle enough of the standards to really work.

Whew.

Tuesday, April 15, 2008

A random walk from eigenvectors to parallel page rank

Power law algorithms are ideal for computing things like page rank. That isn't obvious, however.

The basic outline idea is that hub and authority style algorithms are intimately related to eigenvector or singular value decompositions (depending on whether the links are symmetrical). This also means that there is a close relationship to asymptotic beahavior of random walks on the graph. That probably still isn't obvious, so let's dig in.

If you represent the linkage in the web by a matrix that has columns representing source page and rows representing the target page and with a 1 where-ever the source page has a link pointing to the target page, then if you start with a vector with a single non-zero element equal to 1 as a representation of a single page, then multiplying by the linkage matrix will give you a vector with 1 in the positions corresponding to the pages the original page linked to. If you multiply again, you get all the pages that you can get to in two steps from the original page.

Mathematically, if we call the original vector x and the linkage matrix A, the pages that x links to are just Ax. The pages that are two steps from x are A(Ax) = A² x.

The eigenvector decomposition of A is just a way of writing A as a product of three matrices:

A = U S U'

U' is the transpose of U, and U has the special property that U'U = I (it is called ortho-normal because of this).

S is a diagonal matrix.

There is lots of deep mathematical machinery and beautiful symmetry available here, but for now we can just take this as given.

The set of pages n steps from x are

x_n = Aⁿ x = (U S U')ⁿ x = (U S U')^n-2 (U S U') (U S U') x
= (U S U')^n-2 (U S (U'U) S U') x = (U S U')^n-2 (U S² U') x
= U Sⁿ U' x

This is really cool because Sⁿ can be computed by just taking each diagonal element and raising it to a power.

Eigenvector decompositions have other, really deep connections. For instance, if you take the elements of S (call the i-th one s_i) then

$\displaystyle\sum$ _i s_iⁿ

is the number of paths that are n steps long.

Connected (or nearly connected) clusters of pages can also be derived from the eigenvector decomposition. This is the basis of so-called spectral clustering. For some very impressive examples of spectral clustering see this paper.

So eigenvectors are cool. But how can we compute them? And how can we compute them on BIG graphs in parallel?

First, note that if Aⁿ = U Sⁿ U' and if some of the s_i are bigger than others, the big ones will quickly dominate the others. That is pretty quickly, Aⁿ $\approx$ u₁ s₁ⁿ u₁'. This means that we can compute an approximation of u₁ by just doing Aⁿ x where x is some random vector. Moreover, we can compute u₂ by starting with a different random vector and iterating the same way, but with an additional step where we forbid the result from going towards u₁. With just a few additional wrinkles, this gives us what is called the Lanczos algorithm. Golub and van Loan's excellent book Matrix Computations gives a lot of information on these algorithms.

The cool thing here is that our random vector can represent a single page and we can approximate the final result by following links. Following links is just a (human-readable) way of saying sparse matrix multiplication. If we do this multiplication against lots of different random starting points, we can quickly build parallel algorithms to compute things like page rank.

Words at random, carefully chosen

On comp.ai, Dmitry Kazakov reiterated the lonely cry of a frequentist against statistical natural language. This cry has been repeated many times over the years by many people who cannot abide the treatment of documents and language as if they were random.

Let's examine the situation more carefully.

On Apr 14, 5:28 am, "Dmitry A. Kazakov" wrote:
> ... It cannot be probability because the document is obviously not random ...

The statement "It cannot be probability ..." is essentially a tautology. It should read, "We cannot use the word probability to describe our state of knowledge because we have implicitly accepted the assumption that probability cannot be used to describe our state of knowledge".

The fact that an object has been constructed in its present state by non-random processes outside our ken is no different as far as we can tell than if the object were constructed at random (note that random does not equal uniform). What if the document were, in fact, written using the I Ching (as Philip K Dick is reputed to have written "The Man in the High Castle")? Is it reasonable to describe the text as having been randomly generated now that we know that?

Take the canonical and over-worked example of the coin being flipped. Before the coin is flipped a reasonable observer who knows the physics of the situation and who trusts the flipper would declare the probability of heads to be 100%. After the coin is flipped, but before it is revealed, the situation is actually no different. Yes, the coin now has a state whereas before the coin was only going to have a state, but, in fact, the only real difference is that the physics has become somewhat simpler, the most important factor in our answering the question of the probability has not changed. We still do not know the outcome.

Moreover, if the person flipping the coin looks at the coin, that does not and cannot change our answer.

When WE look at the coin, however, we now suddenly, miraculously declare that the probability is now 100% that the coin has come up heads. Nothing has changed physically, but our estimate has changed dramatically.

Moreover, if we now examine the coin and find that it has two heads, our previous answer of 50% is still valid in the original context. If we were to repeat the experiment, our correct interpretation is to give 100% as the probability before the flip. The only difference is our state of knowledge.

So philosophically speaking, probability is a statement of knowledge.

Moreover, by de Finetti's famous theorem, even if this philosophical argument is bogus, the mathematics all works our AS IF there were an underlying distribution on the parameters of the system. That means that we can profitably use this philosophical argument AS IF it were true.

The upshot is that even if you are a frequentist in your heart of hearts, it will still pay to behave as if you were a Bayesian. And I, as a Bayesian, will be able to behave as if you were rational because I will not know your secret.

Friday, March 28, 2008

How Not to Recommend

A friend of mine pointed out The Netflix Prize: 300 Days Later this morning. The article points out how difficult it has been to make significant improvements on the Netflix rating prediction problem.

This article reinforces a lot of my own feelings. I continue to believe that recommendations based on ratings are inherently flawed. Among other things

predicting ratings from ratings is hard
predicting real user preferences from ratings is even harder
most people don’t rate things very much, if at all (makes #1 even harder)

All of these factors are eased considerably if you just get rid of the idea that you have to predict ratings. Observing real behavior and then predicting that same behavior is vastly easier.

Wednesday, March 26, 2008

The Power of Meta

Evolutionary algorithms are oohhh so trendy, but they often provide very poor convergence rates. This occurs particularly when the population is in the neighborhood of a good solution, but the mutations being produced are far from that neighborhood because the mutation rate is too high.

Decreasing the mutation rate often doesn't help because then it takes forever to find the right neighborhood.

Changing the mutation rate externally can work, but you will usually get the annealing schedule wrong and have too much or too little mutation at some point.

One of the most effective ways to fix these problems is the technique known as meta-mutation. With meta-mutation, each member of the population has data about how dramatically it should be mutated. Then, when you are far from a good solution, members that take big steps can be big winners and their descendants will also take big steps. When you get close to a good solution, members with large mutation rates will step far from the best solution and will be mostly unable to improve the solution, but some descendants will have smaller mutation rates and these will able to find improved solutions. These calmer descendants will quickly dominate the population because they will be able to produce significant improvements.

Unfortunately, meta-mutation introduces the new problem of how you should mutate the mutation rate. (danger ... recursion alert)

I describe some methods that avoid all of these problems in my 1998 paper Recorded Step Mutation for Faster Convergence. The techniques are very simple and very effective. The paper is available from arxiv.

The real trick is that you can have the mutation rate encode the meta-mutation rate as well. One approach using a fixed distribution that is scaled by the current mutation rate. Another is to use the last step determine a directional component of the new mutation rate. Together, these approaches can improve convergence rate and final accuracy by many orders of magnitude relative to other approaches. Self-similar meta-mutation alone gives results that are nearly identical in symmetric Gaussian bowl problems to analytically determined optimal annealing schedules. When combined with directional mutation, results are near optimal for Gaussian bowls with low rank cross correlations.

Tuesday, March 25, 2008

Hello world for map-reduce

There has been a surge of interest lately in map-reduce based parallel from many quarters. The public, non-google side of the that interest is largely focussed around an open source system of software known as Hadoop.

Lots of people are doing some very cool things with Hadoop. Yahoo is now running key parts of their web indexing infrastructure using Hadoop. Academics have described very impressive results in large scale machine translation developments. Facebook uses Hadoop for key parts of their metrics pipeline. IBM has developed a functional programming system on Hadoop. Imitators have been spawned.

But one key ingredient of any new computing technology is still missing.

What Hadoop has been missing to date is a good way to write a hello-world application in just a few lines of code. It has often been said that the ability to do simple things simply is a critical capability for a new technology. Hadoop, and map-reduce programming in general, does not lack a good hello-world example; the word-counting example provides that. What is missing is the ability to write this program in just a few lines of code. The word-count example that comes with the Hadoop distribution is over a hundred lines of code, most of which are impossible to motivate for a new user.

The basic idea is simple, the map phase tokenizes input lines and emits words as keys with (initial) counts of 1. The reduce phase takes all of the counts for a single word and sums them up. As a refinement, we can use a combiner to pre-reduce counts within the output of a single mapper.

So why is this more than 5 lines of code? The answer is that running a real world map-reduce cluster has some real complexity if you want the best possible results. Another reason is that in Java hiding complex implementations requires lots of noise tokens. It isn't that Java is bad, it is just that it isn't good for hiding complexity from naive users.

My answer to this problem is to move to Groovy, a language whose raison d'etre is specifically to hide the cruft in Java programs. Groovy isn't Java or an extension of Java, but Groovy programs inter-call with Java programs completely transparently and compile to the same byte codes.

So what should word-count look like? I think that it should look like functional programming. Our mapper should look like a function as should our reducer. They should be combined to produce a map-reduce program which should itself be a function that takes inputs and produces outputs. Those inputs should be any kind of input that we might like to use including lists of strings (for testing), local files and files stored in a distributed data store.

So our program really ought to look like this:

  wc = mapReduce(
  {key, line, out, report -> line.split().each {w -> out.collect(w,1)}},
  {w, counts, out, report -> out.collect(w, counts.sum())}
 )

and we should be able to use this program in a lot of different ways:

  wc(["this is input text", "for the word-counter"])
 wc(new File("/local/input/file")
 wc(findInterestingTextUsingMapReduce(new HdfsFile("/existing/distributed/file")

and we should be able to access the output very simply:

  wc(["this is input text"]).eachLine {
     println it
 }

Subject to some limitations in Groovy 1.5.4, this is exactly what my new system does. I call this new system grool because it is pretty thin porridge right now.

Send me email for a copy. I will get it onto a standard kind of open-source repository soon(ish), but you can try it sooner.

Friday, March 21, 2008

Some years ago, I wrote a simple paper, Accurate Methods for the Statistics of Surprise and Coincidence that has since seen quite a history of citation and use by others. The basic method was applied to get highly accurate language identification, species identification, author identification, document classification, musical recommendations, insurance risk assessment and video recommendations. It was eventually the core of my doctoral dissertation (ask me for a copy) and has been cited hundreds of times.

What hasn't been well described in the past is just how simple the method really is. In most of the applications above, it is important to find useful features in large amounts of data. The method at the heart of all of these algorithms is to use a score to analyze counts of events, particularly counts of when events occur together. The counts that you usually have in these situations are the number of times two events have occurred together, the number of times that they have occurred with or without each other and the number of times anything has occurred.

To compute the score, it is handy to restate these counts slightly as the number of times the events occurred together (let's call this k_11), the number of times each has occurred without the other (let's call these k_12 and k_21) and the number of times something has been observed that was neither of these events (let's call this k_22). In this notation, I am using row or column 1 to be the event of interest and row or column 2 to be everything else.

Here is the table we need

	Event A	Everything but A
Event B	A and B together (k_11)	B, but not A (k_12)
Everything but B	A without B (k_21)	Neither A nor B (k_22)

Once you have these counts, the hard part is over. Computing the log-likelihood ratio score (also known as G²) is very simple,

LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))

where H is Shannon's entropy, computed as the sum of (k_ij / sum(k)) log (k_ij / sum(k)). In R, this function is defined as

H = function(k) {N = sum(k) ; return (sum(k/N * log(k/N + (k==0)))}

The trick with adding k==0 handles the case where some element of k is zero. Since we are multiplying by that same zero, we can drop those terms but it helps not to try to compute the log(0). The java or C versions of this (ask me for a copy) is almost as simple, but not quite as pretty.

So there you have it. Two lines of code and the world is your oyster.

Why the title?

What kind of title, you may ask, is "Parallel and Uncertain".

If you were to ask, I would answer, "descriptive".

My main interests lately have to do with parallel processing and (automated) reasoning in the fact of uncertainty. Thus, these blog posts will be largely about these topics.

What more can I say?

Surprise and Coincidence - musings from the long tail