Friday, March 21, 2008

Surprise and Coincidence

Some years ago, I wrote a simple paper, Accurate Methods for the Statistics of Surprise and Coincidence that has since seen quite a history of citation and use by others. The basic method was applied to get highly accurate language identification, species identification, author identification, document classification, musical recommendations, insurance risk assessment and video recommendations. It was eventually the core of my doctoral dissertation (ask me for a copy) and has been cited hundreds of times.

What hasn't been well described in the past is just how simple the method really is. In most of the applications above, it is important to find useful features in large amounts of data. The method at the heart of all of these algorithms is to use a score to analyze counts of events, particularly counts of when events occur together. The counts that you usually have in these situations are the number of times two events have occurred together, the number of times that they have occurred with or without each other and the number of times anything has occurred.

To compute the score, it is handy to restate these counts slightly as the number of times the events occurred together (let's call this k_11), the number of times each has occurred without the other (let's call these k_12 and k_21) and the number of times something has been observed that was neither of these events (let's call this k_22). In this notation, I am using row or column 1 to be the event of interest and row or column 2 to be everything else.

Here is the table we need



Event A Everything but A
Event B A and B together (k_11) B, but not A (k_12)
Everything but B A without B (k_21) Neither A nor B (k_22)

Once you have these counts, the hard part is over. Computing the log-likelihood ratio score (also known as G2) is very simple,

LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))

where H is Shannon's entropy, computed as the sum of (k_ij / sum(k)) log (k_ij / sum(k)). In R, this function is defined as

H = function(k) {N = sum(k) ; return (sum(k/N * log(k/N + (k==0)))}

The trick with adding k==0 handles the case where some element of k is zero. Since we are multiplying by that same zero, we can drop those terms but it helps not to try to compute the log(0). The java or C versions of this (ask me for a copy) is almost as simple, but not quite as pretty.

So there you have it. Two lines of code and the world is your oyster.








19 comments:

Dustin Boswell said...

Hi Ted, I'm using your likelihood test to find collocations in text. I see how to apply it to find bigrams, but what about trigrams, etc..?

Ted Dunning ... apparently Bayesian said...

There are several approaches. Probably the simplest is a multi-pass approach where you first find good bigrams, then re-interpret those as single tokens and find interesting bigrams where one component is itself composite. You can obviously repeat that any number of times.

The other approach is to build a log-likelihood ratio test based on an order-2 Markov model versus an order-1 Markov model. The Markov models in question predict a word based on the preceding two words or the preceding one word of context. In my dissertation, I do a complete derivation of this, but I now think that it would be easier to use the mutual information trick to see how much information gain there is for using the larger context.

I can send you a copy of my dissertation if you like. My gmail address is ted.dunning.

wataru kasai said...

Hi,
I tried your score in Mahout and it works practically very good for many recommendation tasks.
Yet, here I would like to discuss an irregular case potentially occurs.
In the case k1=9, k2=3000, n1=3000 n2=1000000(p1=p2=p=0.003), twoLogLambda converges to zero.
And around that, the score increases with decrease of k1(cross viewed count).
This behavior seems not so good. Are there any solution exist about such a case?

wataru kasai said...

Hi,
I tried your log-likelihood score in Mahout and it works practically very good for many recommendation tasks.
Yet, here I would like to discuss an irregular case potentially occurs.
In the case k1=9, k2=3000, n1=3000 n2=1000000(p1=p2=p=0.003), twoLogLambda converges to zero.
And around that, the score increases with decrease of k1(cross viewed count).
This behavior seems not so good and it might be caused by the model used in the test formula. Are there any solution or discussion already exist about this problem?

Ted Dunning ... apparently Bayesian said...

Wataru

In that case the score should be zero since the frequencies are identical. Any deviations will cause a positive score.

If you want the score to go above or below zero according to whether the value of k1 is above or below the expected value relative to k2 then you can use the variant of the score that I call the signed root log likelihood ratio. The basic idea is that you take the square root and apply a sign according to whether k1 is above or below the expected value. Mahout has an implementation of this.

wataru kasai said...

Thank you for your reply!
I would like to study about the theory of your signed-root LLR, cause I can't catch its meaning at a glance. But it seems work well & I try it.

Ted Dunning ... apparently Bayesian said...

The theory is actually fairly simple. A random variable distributed as $\chi^2$ will have a square root that is normally distributed except for sign. The restoration is the sign here is symmetrical only in the asymptotic case but for practical applications that doesn't matter.

Mat Kelcey said...

Hi Ted,

Can you elaborate a bit on the trigram comment? I don't quite get it..

For bigrams I've used in the past

mutual_info(t1,t2) = (p(t1)*p(t2)) / p(t1,t2)

Are you saying for trigrams it's

mutual_info(t1,t2,t3) = (p(t1,t2)*p(t2,t3)) / p(t1,t2,t3)

I think I don't quite get the "one component is itself composite." comment.

Cheers!
Mat

Ted Dunning ... apparently Bayesian said...

Mat,

Your definition of mutual information is a bit off and is very subject to errors with small counts. I would worry a bit. If you use the correct definition then I think that multiplying by the number of samples is a good thing since that gives you a measure of anomaly rather than association. For a threshold, anomaly is really what you are after ... other mechanisms are usually used for weighting anyway.

Regarding trigrams, one of the easiest implementations is to simply make a pass looking for significant bigrams. Then, considering those bigrams as single words, make another pass looking for bigrams that include bigrams from the first pass. These composite bigrams are really trigrams. You can repeat this process as you like.

Mathematically, this is a bit off, but the implementation easy really outweighs the values of orthodoxy in this case.

Ed said...

Hi Ted,

Thanks for the blog post and paper -- it's great that you've gone through the trouble of explaining the usefulness of LLR as a similarity score.

I'm having some practical problems with using item similarities when the number of (binary) coratings for different pairs of items varies over several orders of magnitude.

Since LLR (or signed-sqrt-LLR) is a test statistic, its value for each item pair scales linearly (or as the square root) with the number of coratings N.

The problem is the following: for each item I am picking out some number of 'most similar items'. If I use the statistic that depends on N, the 'most similar items' are items that have a ton of coratings but aren't necessarily all that similar. If I normalize so that they don't depend on N, then the 'most similar items' tend to be those with small numbers of coratings, since those similarity scores will be very noisy.

What I'm doing right now is an awful kludge where I'm 'partially' normalizing by some arbitrary fractional power of N and seeing what looks the best by eye. This is dumb for a bajillion reasons.

I bet this is a standard problem and that there is a standard way of dealing with it, especially within the context of a simple recommendation system, but I'm not a statistician or ML expert and have no idea what it is. Thoughts?

Thanks!
Ed

Ted Dunning ... apparently Bayesian said...

Ed,

Thanks for the kind words.

The problem that you are encountering is that LLR is purely a test of association rather than a measure of same. It is best used in a thresholding style that gates the use of other association measures. In that usage, it provides a stop against the over-fitting that simple measures of association tend to exhibit.

As an example, I have used it very effectively to determine whether TF-IDF scores should be used in a document classifier. The TF-IDF scores were not really even a measure of association with the desired topic, but when gated by LLR, the results were stunningly good.

melbic said...

Dear Ted,

I'm currently writing my bachelor thesis with the mahout framework. In the documentation I need some theory, therefor I'm writing about the LogLikelihood-Similarity.
In your Java implementation your describing the the loglikelihood ratio as:
LLR = 2.0 * (matrixEntropy - rowEntropy - columnEntropy)
but here you're describing it as
LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
Why the sum(k)?

Ted Dunning ... apparently Bayesian said...

The quantity

2 * (matrixEntropy - rowEntropy - columnEntropy)

is just mutual information. To get LLR, you need to multiply by the total count.

Mutual information is useful in some applications, but it is a measure of association, not a measure of anomaly. It can be large when there is nothing more than coincidence.

Samir Bajaj said...

(Newbie question)

So the LLR is the measure of similarity (between two "events")? If so, how does the occurrence of events translate to similarity between two items, say two movies?

Thanks.

Ted Dunning ... apparently Bayesian said...

@Samir,

Not quite. The LLR indicates where a significant similarity might exist. The magnitude of the LLR does not indicate the strength of the similarity per se.

I recommend using large LLR as a mask and then estimating the strength of similarity by other means.

Samir Bajaj said...

Thanks...but that's exactly what I'm trying to figure out.

If LLR isn't a measure of similarity, then what exactly does Mahout's implementation of LogLikelihoodSimilarity do?

I've been trying to get my head around this concept for a while, and your blog post is the closest I get to understanding the relationship between LLR and similarity...but you say that they are not quite the same thing.

I appreciate your patience in responding to my questions -- I've been looking for a connection between LLR and similarity, but while there are a lot of tutorials on LLR, there's nothing that ties that concept to that of similarity between two items.

Thanks again.

Ted Dunning ... apparently Bayesian said...

@Samir,

What the Mahout method does is use LLR to find interesting connections and then it sets the similarity for those pairs to 1 and all else to 0.

It may also weight be inverse frequency. I haven't looked lately and it may do that.

sunnydayal vanambathina said...

HI, iam working on Speech enhancement.I want to compare results by using Method A and Method B.Which method is said to be good for waht value of LLR?(EXAMPLE:If LLR of mthod A is greater than Method B???)

Ted Dunning ... apparently Bayesian said...

sunnydayal vanambathina said...

HI, iam working on Speech enhancement.I want to compare results by using Method A and Method B.Which method is said to be good for waht value of LLR?(EXAMPLE:If LLR of mthod A is greater than Method B???)


I am not at all clear about what you will be doing. Generally, tasks like speech recognition have their own figures of merit such as perplexity. With speech enhancement, I would guess that you have more subjective metrics.

How did you plan to use LLR?