Friday, March 21, 2008

Surprise and Coincidence

Some years ago, I wrote a simple paper, Accurate Methods for the Statistics of Surprise and Coincidence, that has since seen quite a history of citation and use by others. The basic method was applied to get highly accurate language identification, species identification, author identification, document classification, musical recommendations, insurance risk assessment, and video recommendations. It was eventually the core of my doctoral dissertation (ask me for a copy) and has been cited hundreds of times.

What hasn't been well described in the past is just how simple the method really is. In most of the applications above, it is important to find useful features in large amounts of data. The method at the heart of all of these algorithms is to use a score to analyze counts of events, particularly counts of when events occur together. The counts that you usually have in these situations are the number of times two events have occurred together, the number of times that each has occurred (with or without the other), and the number of times anything at all has occurred.

To compute the score, it is handy to restate these counts slightly as the number of times the events occurred together (let's call this k_11), the number of times each has occurred without the other (let's call these k_12 and k_21) and the number of times something has been observed that was neither of these events (let's call this k_22). In this notation, I am using row or column 1 to be the event of interest and row or column 2 to be everything else.

Here is the table we need:

                      Event A                    Everything but A
  Event B             A and B together (k_11)    B, but not A (k_12)
  Everything but B    A without B (k_21)         Neither A nor B (k_22)

Once you have these counts, the hard part is over. Computing the log-likelihood ratio score (also known as G2) is very simple,

LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))

where H is Shannon's entropy, computed as the sum of (k_ij / sum(k)) log (k_ij / sum(k)). (Note that the sign of H here is flipped relative to the usual convention, which is why the term in parentheses above comes out as the mutual information rather than its negative.) In R, this function is defined as

H = function(k) {N = sum(k); return(sum(k/N * log(k/N + (k == 0))))}

The trick with adding k==0 handles the case where some element of k is zero. Since we are multiplying by that same zero, we can drop those terms, but it helps not to try to compute log(0). The Java and C versions of this (ask me for a copy) are almost as simple, but not quite as pretty.

So there you have it. Two lines of code and the world is your oyster.
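For concreteness, here is the whole thing put together in R. The 2x2 table in the example is made up purely for illustration; any table with the layout above will do.

H = function(k) {
  N = sum(k)
  return(sum(k / N * log(k / N + (k == 0))))
}

llr = function(k) {
  2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))
}

# rows are event B / everything but B, columns are event A / everything but A
k = rbind(c(110,  2442),
          c(111, 29114))
llr(k)   # roughly 270, which is far too large to be a coincidence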

29 comments:

Dustin Boswell said...

Hi Ted, I'm using your likelihood test to find collocations in text. I see how to apply it to find bigrams, but what about trigrams, etc..?

Ted Dunning ... apparently Bayesian said...

There are several approaches. Probably the simplest is a multi-pass approach where you first find good bigrams, then re-interpret those as single tokens and find interesting bigrams where one component is itself composite. You can obviously repeat that any number of times.

The other approach is to build a log-likelihood ratio test based on an order-2 Markov model versus an order-1 Markov model. The Markov models in question predict a word based on the preceding two words or the preceding one word of context. In my dissertation, I do a complete derivation of this, but I now think that it would be easier to use the mutual information trick to see how much information gain there is for using the larger context.

I can send you a copy of my dissertation if you like. My gmail address is ted.dunning.

wataru kasai said...

Hi,
I tried your log-likelihood score in Mahout and it works very well in practice for many recommendation tasks.
Yet here I would like to discuss an irregular case that can potentially occur.
In the case k1=9, k2=3000, n1=3000, n2=1000000 (p1=p2=p=0.003), twoLogLambda converges to zero.
And around that point, the score increases as k1 (the cross-viewed count) decreases.
This behavior does not seem so good, and it might be caused by the model used in the test formula. Does any solution or discussion of this problem already exist?

Ted Dunning ... apparently Bayesian said...

Wataru

In that case the score should be zero since the frequencies are identical. Any deviations will cause a positive score.

If you want the score to go above or below zero according to whether the value of k1 is above or below the expected value relative to k2 then you can use the variant of the score that I call the signed root log likelihood ratio. The basic idea is that you take the square root and apply a sign according to whether k1 is above or below the expected value. Mahout has an implementation of this.
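As a rough sketch of that idea in R (an illustration only, not the Mahout code, and it assumes the llr() function sketched after the post above):

signedRootLlr = function(k) {
  score = sqrt(max(llr(k), 0))                    # max() guards against tiny negative rounding error
  expected = sum(k[1, ]) * sum(k[, 1]) / sum(k)   # expected k_11 if the events were independent
  if (k[1, 1] < expected) score = -score
  return(score)
}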

wataru kasai said...

Thank you for your reply!
I would like to study the theory of your signed-root LLR, because I can't catch its meaning at a glance. But it seems to work well, and I will try it.

Ted Dunning ... apparently Bayesian said...

The theory is actually fairly simple. A random variable distributed as $\chi^2$ will have a square root that is normally distributed except for sign. The restoration of the sign here is symmetrical only in the asymptotic case, but for practical applications that doesn't matter.

Mat Kelcey said...

Hi Ted,

Can you elaborate a bit on the trigram comment? I don't quite get it..

For bigrams I've used in the past

mutual_info(t1,t2) = (p(t1)*p(t2)) / p(t1,t2)

Are you saying for trigrams it's

mutual_info(t1,t2,t3) = (p(t1,t2)*p(t2,t3)) / p(t1,t2,t3)

I think I don't quite get the "one component is itself composite." comment.

Cheers!
Mat

Ted Dunning ... apparently Bayesian said...

Mat,

Your definition of mutual information is a bit off and is very subject to errors with small counts. I would worry a bit. If you use the correct definition then I think that multiplying by the number of samples is a good thing since that gives you a measure of anomaly rather than association. For a threshold, anomaly is really what you are after ... other mechanisms are usually used for weighting anyway.

Regarding trigrams, one of the easiest implementations is to simply make a pass looking for significant bigrams. Then, considering those bigrams as single words, make another pass looking for bigrams that include bigrams from the first pass. These composite bigrams are really trigrams. You can repeat this process as you like.

Mathematically, this is a bit off, but the ease of implementation really outweighs the value of orthodoxy in this case.
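As a toy sketch of one such merge pass in R (the token handling here is made up for illustration, assumes the llr() function sketched after the post, and is not the code from the paper or from Mahout):

mergePass = function(tokens, threshold = 20) {
  n = length(tokens)
  lefts = tokens[-n]
  rights = tokens[-1]
  total = n - 1                       # number of adjacent pairs observed

  pairCount  = table(paste(lefts, rights))
  leftCount  = table(lefts)
  rightCount = table(rights)

  # build the 2x2 table for the adjacent pair (l, r) and score it
  score = function(l, r) {
    k11 = pairCount[paste(l, r)]
    k12 = leftCount[l] - k11          # l followed by something other than r
    k21 = rightCount[r] - k11         # r preceded by something other than l
    k22 = total - k11 - k12 - k21     # all other adjacent pairs
    llr(rbind(c(k11, k12), c(k21, k22)))
  }

  out = character(0)
  i = 1
  while (i < n) {
    if (score(tokens[i], tokens[i + 1]) > threshold) {
      out = c(out, paste(tokens[i], tokens[i + 1], sep = "_"))
      i = i + 2                       # the pair becomes a single composite token
    } else {
      out = c(out, tokens[i])
      i = i + 1
    }
  }
  if (i == n) out = c(out, tokens[n]) # last token was not consumed by a merge
  out
}

Running two passes, as in mergePass(mergePass(tokens)), lets a composite token from the first pass pair with a neighbor in the second pass, which gives the trigram case. The greedy left-to-right merge and the threshold of 20 are simplifications.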

Ed said...

Hi Ted,

Thanks for the blog post and paper -- it's great that you've gone through the trouble of explaining the usefulness of LLR as a similarity score.

I'm having some practical problems with using item similarities when the number of (binary) coratings for different pairs of items varies over several orders of magnitude.

Since LLR (or signed-sqrt-LLR) is a test statistic, its value for each item pair scales linearly (or as the square root) with the number of coratings N.

The problem is the following: for each item I am picking out some number of 'most similar items'. If I use the statistic that depends on N, the 'most similar items' are items that have a ton of coratings but aren't necessarily all that similar. If I normalize so that they don't depend on N, then the 'most similar items' tend to be those with small numbers of coratings, since those similarity scores will be very noisy.

What I'm doing right now is an awful kludge where I'm 'partially' normalizing by some arbitrary fractional power of N and seeing what looks the best by eye. This is dumb for a bajillion reasons.

I bet this is a standard problem and that there is a standard way of dealing with it, especially within the context of a simple recommendation system, but I'm not a statistician or ML expert and have no idea what it is. Thoughts?

Thanks!
Ed

Ted Dunning ... apparently Bayesian said...

Ed,

Thanks for the kind words.

The problem that you are encountering is that LLR is purely a test of association rather than a measure of same. It is best used in a thresholding style that gates the use of other association measures. In that usage, it provides a stop against the over-fitting that simple measures of association tend to exhibit.

As an example, I have used it very effectively to determine whether TF-IDF scores should be used in a document classifier. The TF-IDF scores were not really even a measure of association with the desired topic, but when gated by LLR, the results were stunningly good.

melbic said...

Dear Ted,

I'm currently writing my bachelor's thesis with the Mahout framework. In the documentation I need some theory, therefore I'm writing about the LogLikelihood similarity.
In your Java implementation you describe the log-likelihood ratio as:
LLR = 2.0 * (matrixEntropy - rowEntropy - columnEntropy)
but here you're describing it as
LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
Why the sum(k)?

Ted Dunning ... apparently Bayesian said...

The quantity

2 * (matrixEntropy - rowEntropy - columnEntropy)

is just mutual information. To get LLR, you need to multiply by the total count.

Mutual information is useful in some applications, but it is a measure of association, not a measure of anomaly. It can be large when there is nothing more than coincidence.
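To make that concrete in terms of the H function sketched after the post (with its flipped sign), this is just:

mi = function(k) H(k) - H(rowSums(k)) - H(colSums(k))   # mutual information of the 2x2 table

# llr(k) == 2 * sum(k) * mi(k): the factor of 2 * N is what turns a measure of
# association into a measure of anomaly, since more data makes the same small
# association less plausibly a coincidence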

Samir Bajaj said...

(Newbie question)

So the LLR is the measure of similarity (between two "events")? If so, how does the occurrence of events translate to similarity between two items, say two movies?

Thanks.

Ted Dunning ... apparently Bayesian said...

@Samir,

Not quite. The LLR indicates where a significant similarity might exist. The magnitude of the LLR does not indicate the strength of the similarity per se.

I recommend using large LLR as a mask and then estimating the strength of similarity by other means.
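A minimal sketch of that masking idea in R (the threshold and the conditional-probability estimate here are arbitrary illustrations, and llr() is the function sketched after the post):

condProb = function(k) k[1, 1] / (k[1, 1] + k[2, 1])    # P(B | A), using the table layout from the post

maskedSimilarity = function(k, threshold = 20, similarity = condProb) {
  if (llr(k) > threshold) similarity(k) else 0          # keep the estimate only where LLR says it isn't coincidence
}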

Samir Bajaj said...

Thanks...but that's exactly what I'm trying to figure out.

If LLR isn't a measure of similarity, then what exactly does Mahout's implementation of LogLikelihoodSimilarity do?

I've been trying to get my head around this concept for a while, and your blog post is the closest I get to understanding the relationship between LLR and similarity...but you say that they are not quite the same thing.

I appreciate your patience in responding to my questions -- I've been looking for a connection between LLR and similarity, but while there are a lot of tutorials on LLR, there's nothing that ties that concept to that of similarity between two items.

Thanks again.

Ted Dunning ... apparently Bayesian said...

@Samir,

What the Mahout method does is use LLR to find interesting connections and then it sets the similarity for those pairs to 1 and all else to 0.

It may also weight by inverse frequency. I haven't looked lately, but it may do that.

sunnydayal vanambathina said...

Hi, I am working on speech enhancement. I want to compare results using Method A and Method B. Which method is said to be good for what value of LLR? (For example: if the LLR of Method A is greater than that of Method B?)

Ted Dunning ... apparently Bayesian said...

sunnydayal vanambathina said...

Hi, I am working on speech enhancement. I want to compare results using Method A and Method B. Which method is said to be good for what value of LLR? (For example: if the LLR of Method A is greater than that of Method B?)


I am not at all clear about what you will be doing. Generally, tasks like speech recognition have their own figures of merit such as perplexity. With speech enhancement, I would guess that you have more subjective metrics.

How did you plan to use LLR?

Vikas Kumar said...

Hi Ted,
Need a small input here. I am working with data similar to user-item pairs in a recommendation scenario, i.e.
u1, i1
u2, i2
u2, i1
etc.
My aim is to compare different similarity measures between each pair of items, i.e. (i1, i2), etc., and form a graph to perform clustering and see which measure does a better job of giving me meaningful clusters.
I already have normalized mutual information as a similarity measure for the graph. I want to check LLR for the same. As per your comments, the only modification I would require is multiplying the "total count" into MI. Is the "total count" the total number of users in the system, or is it the total count of users across i1 (event A) and i2 (event B)?

Ted Dunning ... apparently Bayesian said...

Vikas,

The total here is the total number of observations. What that total is can vary by application and it isn't clear from your description what it should be.

In a normal user x item cooccurrence problem, this total reduces to the number of occurrences of any item.

Vikas Kumar said...

Sorry, I see that part of my question was not posted well. The data I was referring to (in the 2nd line):
u1, i1
u2, i2
u2, i1
etc.
Hope this makes it clear.

I am not sure if I got it correct from your reply. But here is what I am understanding right now:

The "total count" as total number of users in i1 (event A) and i2 (event B). Since MI (mutual information) can be very high for two items that co-occured by chance, multiplying the total count would affect in removing those biases (occur by chance).
For example:
Case 1: i1 is liked by 70 users, and i2 is liked by 60 (of the same 70 users)
Case 2: i1 and i2 are liked by the same 2 users

MI(case 2) > MI(case 1), which is a wrong signal for similarity, whereas when both cases are multiplied by the "total count" the signal becomes more correct.

Let me know if I am understanding it right. Or if I am missing a point here.

Thanks

Dominik Hübner said...

Hey Ted,
I am currently finishing my master's thesis, building a recommendation engine that incorporates views and purchases of products.

Currently, I am using the raw cooccurrence of both events (e.g. (View^T * Purchases) * userViewVector) and get quite satisfying results.

Nevertheless, I wonder how to apply LLR to this problem. Let's say event A is the view of an item and event B is the purchase of that item. k11 would be the cooccurrence of both, k21 would be a view without a purchase afterwards, k12 might never happen since one must view an item before purchasing it, and k22 would be all observations without interactions related to this item. Is this correct?

Furthermore, if I now computed a matrix of LLR values indicating which cooccurrences might be meaningful, would it make sense to multiply the LLR value by the cooccurrence count to scale the cooccurrence according to its "confidence"?

Lars Schwabe said...

Hi,

I am puzzled by probably a minor point re a sign: I read

LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))

as

LLR = f(A,B) * ( H[A,B] - H[A] - H[B] )

where f(A,B) is a count-dependent scaling factor, which has been discussed here already, and H[A,B] and H[A] and H[B] are the entropy of the joint probability and marginals, respectively.

Here comes the question: Isn't the term in brackets actually the *negative* mutual information between A and B, because MI[A,B] = H[A] + H[B] - H[A,B]? Or are you using a different convention here? (See response from May 20, 2012 at 3:17 PM).

Clarification would help me a lot in better shaping my intuition of how to use LLR as the beautifully simple measure to detect and interpret similarities that it certainly is.

Ted Dunning ... apparently Bayesian said...

Lars,

Thanks for the comment.

The definition of H is sign inverted relative to the normal entropy. This makes what you say and what I was saying identical. Sorry about the confusing definition.

The net result is that LLR is 2N times MI and has the same sign. Where you put the minus signs is an implementation detail whose major effect is to confuse people when you change conventions.

Again, sorry to be confusing.

Maysam said...

can you share the C version with me?

Ted Dunning ... apparently Bayesian said...

See GitHub:

https://github.com/tdunning/ancient-stats

Maysam said...

Thank you very much