**Accurate Methods for the Statistics of Surprise and Coincidence**that has since seen quite a history of citation and use by others. The basic method was applied to get highly accurate language identification, species identification, author identification, document classification, musical recommendations, insurance risk assessment and video recommendations. It was eventually the core of my doctoral dissertation (ask me for a copy) and has been cited hundreds of times.

What hasn't been well described in the past is just how simple the method really is. In most of the applications above, it is important to find useful features in large amounts of data. The method at the heart of all of these algorithms is to use a score to analyze counts of events, particularly counts of when events occur together. The counts that you usually have in these situations are the number of times two events have occurred together, the number of times that they have occurred with or without each other and the number of times anything has occurred.

To compute the score, it is handy to restate these counts slightly as the number of times the events occurred together (let's call this k_11), the number of times each has occurred without the other (let's call these k_12 and k_21) and the number of times something has been observed that was neither of these events (let's call this k_22). In this notation, I am using row or column 1 to be the event of interest and row or column 2 to be everything else.

Here is the table we need

Event A | Everything but A | |

Event B | A and B together (k_11) | B, but not A (k_12) |

Everything but B | A without B (k_21) | Neither A nor B (k_22) |

Once you have these counts, the hard part is over. Computing the log-likelihood ratio score (also known as G

^{2}) is very simple,

LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))

where H is Shannon's entropy, computed as the sum of (k_ij / sum(k)) log (k_ij / sum(k)). In R, this function is defined as

H = function(k) {N = sum(k) ; return (sum(k/N * log(k/N + (k==0)))}

The trick with adding k==0 handles the case where some element of k is zero. Since we are multiplying by that same zero, we can drop those terms but it helps not to try to compute the log(0). The java or C versions of this (ask me for a copy) is almost as simple, but not quite as pretty.

So there you have it. Two lines of code and the world is your oyster.

## 56 comments:

Hi Ted, I'm using your likelihood test to find collocations in text. I see how to apply it to find bigrams, but what about trigrams, etc..?

There are several approaches. Probably the simplest is a multi-pass approach where you first find good bigrams, then re-interpret those as single tokens and find interesting bigrams where one component is itself composite. You can obviously repeat that any number of times.

The other approach is to build a log-likelihood ratio test based on an order-2 Markov model versus an order-1 Markov model. The Markov models in question predict a word based on the preceding two words or the preceding one word of context. In my dissertation, I do a complete derivation of this, but I now think that it would be easier to use the mutual information trick to see how much information gain there is for using the larger context.

I can send you a copy of my dissertation if you like. My gmail address is ted.dunning.

Hi,

I tried your score in Mahout and it works practically very good for many recommendation tasks.

Yet, here I would like to discuss an irregular case potentially occurs.

In the case k1=9, k2=3000, n1=3000 n2=1000000(p1=p2=p=0.003), twoLogLambda converges to zero.

And around that, the score increases with decrease of k1(cross viewed count).

This behavior seems not so good. Are there any solution exist about such a case?

Hi,

I tried your log-likelihood score in Mahout and it works practically very good for many recommendation tasks.

Yet, here I would like to discuss an irregular case potentially occurs.

In the case k1=9, k2=3000, n1=3000 n2=1000000(p1=p2=p=0.003), twoLogLambda converges to zero.

And around that, the score increases with decrease of k1(cross viewed count).

This behavior seems not so good and it might be caused by the model used in the test formula. Are there any solution or discussion already exist about this problem?

Wataru

In that case the score should be zero since the frequencies are identical. Any deviations will cause a positive score.

If you want the score to go above or below zero according to whether the value of k1 is above or below the expected value relative to k2 then you can use the variant of the score that I call the signed root log likelihood ratio. The basic idea is that you take the square root and apply a sign according to whether k1 is above or below the expected value. Mahout has an implementation of this.

Thank you for your reply!

I would like to study about the theory of your signed-root LLR, cause I can't catch its meaning at a glance. But it seems work well & I try it.

The theory is actually fairly simple. A random variable distributed as $\chi^2$ will have a square root that is normally distributed except for sign. The restoration is the sign here is symmetrical only in the asymptotic case but for practical applications that doesn't matter.

Hi Ted,

Can you elaborate a bit on the trigram comment? I don't quite get it..

For bigrams I've used in the past

mutual_info(t1,t2) = (p(t1)*p(t2)) / p(t1,t2)

Are you saying for trigrams it's

mutual_info(t1,t2,t3) = (p(t1,t2)*p(t2,t3)) / p(t1,t2,t3)

I think I don't quite get the "one component is itself composite." comment.

Cheers!

Mat

Mat,

Your definition of mutual information is a bit off and is very subject to errors with small counts. I would worry a bit. If you use the correct definition then I think that multiplying by the number of samples is a good thing since that gives you a measure of anomaly rather than association. For a threshold, anomaly is really what you are after ... other mechanisms are usually used for weighting anyway.

Regarding trigrams, one of the easiest implementations is to simply make a pass looking for significant bigrams. Then, considering those bigrams as single words, make another pass looking for bigrams that include bigrams from the first pass. These composite bigrams are really trigrams. You can repeat this process as you like.

Mathematically, this is a bit off, but the implementation easy really outweighs the values of orthodoxy in this case.

Hi Ted,

Thanks for the blog post and paper -- it's great that you've gone through the trouble of explaining the usefulness of LLR as a similarity score.

I'm having some practical problems with using item similarities when the number of (binary) coratings for different pairs of items varies over several orders of magnitude.

Since LLR (or signed-sqrt-LLR) is a test statistic, its value for each item pair scales linearly (or as the square root) with the number of coratings N.

The problem is the following: for each item I am picking out some number of 'most similar items'. If I use the statistic that depends on N, the 'most similar items' are items that have a ton of coratings but aren't necessarily all that similar. If I normalize so that they don't depend on N, then the 'most similar items' tend to be those with small numbers of coratings, since those similarity scores will be very noisy.

What I'm doing right now is an awful kludge where I'm 'partially' normalizing by some arbitrary fractional power of N and seeing what looks the best by eye. This is dumb for a bajillion reasons.

I bet this is a standard problem and that there is a standard way of dealing with it, especially within the context of a simple recommendation system, but I'm not a statistician or ML expert and have no idea what it is. Thoughts?

Thanks!

Ed

Ed,

Thanks for the kind words.

The problem that you are encountering is that LLR is purely a test of association rather than a measure of same. It is best used in a thresholding style that gates the use of other association measures. In that usage, it provides a stop against the over-fitting that simple measures of association tend to exhibit.

As an example, I have used it very effectively to determine whether TF-IDF scores should be used in a document classifier. The TF-IDF scores were not really even a measure of association with the desired topic, but when gated by LLR, the results were stunningly good.

Dear Ted,

I'm currently writing my bachelor thesis with the mahout framework. In the documentation I need some theory, therefor I'm writing about the LogLikelihood-Similarity.

In your Java implementation your describing the the loglikelihood ratio as:

LLR = 2.0 * (matrixEntropy - rowEntropy - columnEntropy)

but here you're describing it as

LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))

Why the sum(k)?

The quantity

2 * (matrixEntropy - rowEntropy - columnEntropy)

is just mutual information. To get LLR, you need to multiply by the total count.

Mutual information is useful in some applications, but it is a measure of association, not a measure of anomaly. It can be large when there is nothing more than coincidence.

(Newbie question)

So the LLR

isthe measure of similarity (between two "events")? If so, how does the occurrence of events translate to similarity between two items, say two movies?Thanks.

@Samir,

Not quite. The LLR indicates where a significant similarity might exist. The magnitude of the LLR does not indicate the strength of the similarity per se.

I recommend using large LLR as a mask and then estimating the strength of similarity by other means.

Thanks...but that's exactly what I'm trying to figure out.

If LLR isn't a measure of similarity, then what exactly does Mahout's implementation of LogLikelihoodSimilarity do?

I've been trying to get my head around this concept for a while, and your blog post is the closest I get to understanding the relationship between LLR and similarity...but you say that they are not quite the same thing.

I appreciate your patience in responding to my questions -- I've been looking for a connection between LLR and similarity, but while there are a lot of tutorials on LLR, there's nothing that ties that concept to that of similarity between two items.

Thanks again.

@Samir,

What the Mahout method does is use LLR to find interesting connections and then it sets the similarity for those pairs to 1 and all else to 0.

It may also weight be inverse frequency. I haven't looked lately and it may do that.

HI, iam working on Speech enhancement.I want to compare results by using Method A and Method B.Which method is said to be good for waht value of LLR?(EXAMPLE:If LLR of mthod A is greater than Method B???)

sunnydayal vanambathina said...HI, iam working on Speech enhancement.I want to compare results by using Method A and Method B.Which method is said to be good for waht value of LLR?(EXAMPLE:If LLR of mthod A is greater than Method B???)

I am not at all clear about what you will be doing. Generally, tasks like speech recognition have their own figures of merit such as perplexity. With speech enhancement, I would guess that you have more subjective metrics.

How did you plan to use LLR?

Hi Ted,

Need small input here. I am working with a data similar to user-item in a recommendation scenario i.e.

,

,

,

etc..

My aim is to compare the different similarity measures between each items i.e. , etc and form a graph to perform clusters and see which one is doing a better job of giving me meaningful clusters.

I already have normalized mutual information as similarity measure for the graph. I want to check LLR for the same. As per, your comments the only modification I would require is multiplying the "total count" into MI. Is the "total count" is total number of users in the system, or is this the the total count of users across i1 (event A) and i2 (event B)?

Hi Ted,

Need small input here. I am working with a data similar to user-item in a recommendation scenario i.e.

,

,

,

etc..

My aim is to compare the different similarity measures between each items i.e. , etc and form a graph to perform clusters and see which one is doing a better job of giving me meaningful clusters.

I already have normalized mutual information as similarity measure for the graph. I want to check LLR for the same. As per, your comments the only modification I would require is multiplying the "total count" into MI. Is the "total count" is total number of users in the system, or is this the the total count of users across i1 (event A) and i2 (event B)?

Vikas,

The total here is the total number of observations. What that total is can vary by application and it isn't clear from your description what it should be.

In a normal user x item cooccurrence problem, the question reduces to the number of occurrences of any item.

Sorry. I see that some of part of my question was not posted well. The data I was referring to (in 2nd line):

u1, i1

u2,i2

u2,i1

etc...

Hope this makes it clear.

I am not sure if I got it correct from your reply. But here is what I am understanding right now:

The "total count" as total number of users in i1 (event A) and i2 (event B). Since MI (mutual information) can be very high for two items that co-occured by chance, multiplying the total count would affect in removing those biases (occur by chance).

For example:

Case 1: i1 is liked by 70 users, and i2 is liked by 60 (of the same 70 users)

Case 2: i1 and i2 is liked by 2 same users

MI(case 2) > MI (case 1). Which is a wrong signal for similarity. Whereas when multiplied by "total count" to both cases then it becomes a more correct signal.

Let me know if I am understanding it right. Or if I am missing a point here.

Thanks

Sorry. I see that some of part of my question was not posted well. The data I was referring to (in 2nd line):

u1, i1

u2,i2

u2,i1

etc...

Hope this makes it clear.

I am not sure if I got it correct from your reply. But here is what I am understanding right now:

The "total count" as total number of users in i1 (event A) and i2 (event B). Since MI (mutual information) can be very high for two items that co-occured by chance, multiplying the total count would affect in removing those biases (occur by chance).

For example:

Case 1: i1 is liked by 70 users, and i2 is liked by 60 (of the same 70 users)

Case 2: i1 and i2 is liked by 2 same users

MI(case 2) > MI (case 1). Which is a wrong signal for similarity. Whereas when multiplied by "total count" to both cases then it becomes a more correct signal.

Let me know if I am understanding it right. Or if I am missing a point here.

Thanks

Hey Ted,

I am currently finishing my master thesis building a recommendation engine incorporating views and purchases of products.

Currently, I am using the raw cooccurrence of both events (e.g. (View^T * Purchases) * userViewVector) and get quite satisfying results.

Nevertheless, I wonder how to apply LLR to this problem. Lets say event A is the view of an item and event B is the purchase of this item. k11 would be the cooccurrence of both, k21 would be a view without a purchase afterwards, k12 might never happen since one must first view an item before a purchase and k22 would be all observations without interactions related to this item. Is this correct?

Furthermore, if I now computed a matrix of LLR values indicating which cooccurrence might be useful, would it be useful to multiply the LLR value with the cooccurrence to scale the cooccurrence according to its "confidence"?

Hi,

I am puzzled by probably a minor point re a sign: I read

LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))

as

LLR = f(A,B) * ( H[A,B] - H[A] - H[B] )

where f(A,B) is a count-dependent scaling factor, which has been discussed here already, and H[A,B] and H[A] and H[B] are the entropy of the joint probability and marginals, respectively.

Here comes the question: Isn't the term in brackets actually the *negative* mutual information between A and B, because MI[A,B] = H[A] + H[B] - H[A,B]? Or are you using a different convention here? (See response from May 20, 2012 at 3:17 PM).

Clarification would help me a lot in better shaping my intuition of how to use LLR as the beautifully simple measure to detect and interpret similarities that it certainly is.

Hi,

I am puzzled by probably a minor point re a sign: I read

LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))

as

LLR = f(A,B) * ( H[A,B] - H[A] - H[B] )

where f(A,B) is a count-dependent scaling factor, which has been discussed here already, and H[A,B] and H[A] and H[B] are the entropy of the joint probability and marginals, respectively.

Here comes the question: Isn't the term in brackets actually the *negative* mutual information between A and B, because MI[A,B] = H[A] + H[B] - H[A,B]? Or are you using a different convention here? (See response from May 20, 2012 at 3:17 PM).

Clarification would help me a lot in better shaping my intuition of how to use LLR as the beautifully simple measure to detect and interpret similarities that it certainly is.

Lars,

Thanks for the comment.

The definition of H is sign inverted relative to the normal entropy. This makes what you say and what I was saying identical. Sorry about the confusing definition.

The net result is that LLR is 2N times MI and has the same sign. Where you put the minus signs is an implementation detail whose major effect is to confuse people when you change conventions.

Again, sorry to be confusing.

can you share the C version with me?

See GitHub:

https://github.com/tdunning/ancient-stats

Thank you very much

Hi Ted,

I'm having difficulty in giving a probabilistic interpretation to LLR in the case of collaborative filtering. Can you state what do events A and B correspond to if we build a contingency table for two users in the context of collaborative filtering.

Thanks

Mario,

I would love to help, but I don't quite understand why a probabilistic interpretation of LLR would do you any good. The LLR stuff in terms of collaborative filtering is a totally deterministic way to get indicators which are then also used deterministically. Theoretical analysis is essentially worthless here because of the complex nature of the actual data.

Empirical analysis has been done by Schelter et al in

http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf

In any case, the A and B in this case do not refer to two users, but refer to two actions that users might do. The k_11 cell is the number of users who have done both A and B. k_12 is the number of users who have done A but not B and so on. The goal here is to find events which are linked by cooccurrence in user histories.

The paper above has an excellent description of the way to do this as well as some pragmatic things such as down-sampling.

You can ask questions like this on the Mahout mailing list or directly to my email address which you can find on many of my slide shows on slideshare.

Dear Ted

Thanks for sharing this, and pointing to the original article –both great!

I have a doubt. I perfectly see using the the LL for word-bigrams. But I have a question on how to use it on, say, books purchases. It looks to me the formula is not enough –I might have misunderstood something though.

I understand that if two rare words appear in the same bigram, the "signal" is clearly high. Alwaus. But for books purchase the situation is different. If two rare books appear on the shelves of two customers who bought the majority of books, the signal is not high anymore. At least in one case, the signal was low.

As we have no "bigrams" with books, unless we observe when book B is bought after book A, the formula should take in consideration also the amount of books bought by both users. Am I missing something?

Regards and thanks again!

Mario

Mario,

The idea of cooccurrence can be substantially expanded in the case of products. With words, you consider how many times each word occurs anywhere and how many times they occur in sequence. With items purchased, you typically consider how many times each item was purchased by any user as well as how often the two items are purchased by the same user. That, combined with the total number of users, gives you all four of the numbers you need to use the LLR test.

Mario,

The idea of cooccurrence can be substantially expanded in the case of products. With words, you consider how many times each word occurs anywhere and how many times they occur in sequence. With items purchased, you typically consider how many times each item was purchased by any user as well as how often the two items are purchased by the same user. That, combined with the total number of users, gives you all four of the numbers you need to use the LLR test.

Thanks for you answer Ted.

> how many times each item was purchased by any user

Clearly. The information provided by a rare book is higher than the information provided by Harry Potter.

> how often the two items are purchased by the same user

Still, the information provided by this number is not uniform. Let's take two books which were bought just once, by the a single user.

If such user only bought a dozen books in total, this indicates a high similarity between the two books.

If such user is a big customers, who bought 90% of all books sold on the site, this does not add much to the picture.

Still, in both cases, the weight added into the coocurrence matrix would be the same, whoever customer (big or small) bought the two books together.

In a real case, we have business accounts who would create similarity between two books, although it's clear that they have rare books in their shelves only because they own thousands of books. I'd then prefer a higher weight when the customer owns very few books. I might be wrong, but I saw the problem as the "two people with the same birthday" and computed the information according to a negative binomial distribution. If I have correctly understood, the LLR would not lead to the same result, correct?

Whatever the correctness of my approach above –thanks again for your answer, it was very much appreciated!

Mario

Hi Ted, How do you calculate log-liklihood score for unigram. Would you share the java implementation?

Ashkan,

I am not sure what you mean by your question.

What would you mean by computing LLR for a unigram.

Hi Ted,

I'm confused by this H(rowSums) and H(colSums), what are these?

Let's say that k_11=2, k_12=11, k_21=10 and k_22 = 400.

What would H(colSum) even be? H(12)+H(411)?

Ivana,

The R code should make this clear:

llr = function(k) {2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))}

H = function(k) {N = sum(k) ; return (sum(k/N * log(k/N + (k==0))))}

Here the functions colSums and rowSums return a vector of values. These are processed by H the same way the original matrix was processed because sum doesn't care if it gets a vector or matrix.

The implementation of the log likelihood ratio in Mahout is 2 * MI, where MI = (Matrix Entropy - Row Entropy - Column Entropy).

But in your blog post you say that it's 2 * N * MI.

Why Mahout implementation decided not to include the total count? Is because Mahout entropy calculation is not normalised?

Thanks,

Leo.-

Leonardo,

The Mahout implementation of MI is un-normalized and has the N factor already baked in. This avoids dividing by N and then multiplying by N again.

Fantastic algorithm. A few questions, which I'll pose one per comment: I need negative and positive, so I need expected value, as you noted. I have this formula:

ev <- function(k) {

# k[1,2] / sum(k) * k[2,1] / sum(k) * sum(k)

k[1,2] / k[2,2] * k[2,1] / k[2,2] * k[2,2]

}

When I run some data, I find empirically that this seems to be right:

[,1] [,2]

[1,] 13 100

[2,] 100 787

llr: 0.00528737 sum: 1000 ev: 12.70648

If I understand LLR, 0 means zero multiples of random chance, and I find that if I give it 13 in this context, that's pretty close to the expected value of 12.7.

Now, notice I commented out the line with the sum(k) formula. Intuitively I thought it would be that. I'm a CS guy, not a math guy, so my intuition is worth about 10 cents. But is my EV formula right? And why? Why is my denominator k[2,2] and not sum(k)?

Mostly want to ensure I have the EV formula right, but curious about any intuition why it is right if it is easy enough to explain.

Thanks!

Next question. LLR naturally gives weight to any strong deviation from chance. Here I present something close to LLR = 0, as well as two strong deviants (note per my other posting, I'm adding a sign using expected value):

[,1] [,2]

[1,] 1 100

[2,] 100 799

llr: -15.66694 sum: 1000 ev: 12.51564

[,1] [,2]

[1,] 13 100

[2,] 100 787

llr: 0.00528737 sum: 1000 ev: 12.70648

[,1] [,2]

[1,] 25 150

[2,] 50 775

llr: 11.98562 sum: 1000 ev: 9.677419

Now, I'm using this as a product recommendation engine. As a constraint of my business problem, I'm interested in recommending only common products, so does that mean I want the high positive LLR values only?

Seems to me, this is analogous to a one-tail versus two-tail problem. If I want *any* deviant event, I want either "tail". But if I want common products for high-volume sales, I want high positive values. Conversely, if I were doing something like fraud detection, I might want only the strong negative values--because I'm looking for rare deviations.

Thoughts? In short, if I'm recommending products for the masses, is it safe to just use high positive values?

Thanks!

Hi Ted, thanks for the terse post. Where could I get a copy of the java or C classes for these computations? My gmail is mulloymorrow.

Mulloy,

You can find the C code here:

https://github.com/tdunning/ancient-stats

and the Java code can be found in Mahout Math library. See

https://github.com/apache/mahout/blob/master/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java

Ted,

What value of rootLLR score can be taken as threshold for registering association between p1 and p2 ? I mean how to decide upon what should we not be excited about.

Ad,

It is hard to take a single number as a threshold because

a) you are probably doing lots of comparisons

b) you may have some pretty strong patterns that are not interesting.

What I usually do is take 3-5 as a firm threshold and then accept the top N scores for some convenient N. For recommendation engines, N = 100-300 is often useful.

Thanks a lot Ted, This is exact answer i was looking for. I am building all together a new concept in services marketplace using this technique of yours!

Ping me when it is working!

Hi Ted!

Is LLR chi square distributed with 1 degree of freedom? If so, than I can use tables for chi square distribution to select threshold for significant co-occurrences with desired level of confidence. Right?

The other question is about formula. In some articles I saw 2*log(MatrixEntropy - RowEntropy - ColumnEntropy) and you have 2*(MatrixEntropy - RowEntropy - ColumnEntropy). It doesn't matter for ranking, but matters for statistical analysis. Which one is right?

Thanks in advance!

Petr,

Yes, LLR is chi^2 distributed exactly as Pearson's test is distributed (asymptotically in both cases). For 2x2 tables, this gives one degree of freedom. With n x m tables, there will be (n-1) (m-1) degrees of freedom.

And yes, you can use standard tables. The only proviso is that the square root of LLR is commonly used with a sign that reflects whether the upper left entry in the contingency table is larger or smaller than expected. Root LLR is normally distributed, ideally.

The formula should be 2 N (MatrixEntropy - RowEntropy - ColumnEntropy), but you have to watch out because un-normalized implementations of entropy are sometimes used which incorporate the factor of N.

Can you point me to articles that use each formula? (especially if I have been inconsistent?)

In this article: http://wortschatz.uni-leipzig.de/~sbordag/papers/BordagMC08.pdf

I saw that formula L=(MatrixEntropy - RowEntropy - ColumnEntropy) and LLR = 2 log(L). It seems to be wrong.

If I'm building a recommender system than 'N' is total number of users, right? If N is big enough, like several millions, than almost every co-occurrence seems to be significant with high confidence since chi square distribution even with alpha=0.005 equals only 7.879 (http://sites.stat.psu.edu/~mga/401/tables/Chi-square-table.pdf). That is a little strange for me.

Regarding you comment about the increasing sensitivity of the test, yes, smaller and smaller effects become visible. Of course, as N increases, you also typically get more and more things that you are testing (more words, more word pairs, more products, whatever). That makes any traditional statistical interpretation quite difficult.

What I recommend is that you simply rank all the results you get rather than choosing a threshold from a table. The biggest ones will tend to be the most interesting.

Oh, I really missed that instead of k's I should use k/sum(k). mahout's spark-itemsimilarity uses plain k's.

Post a Comment