Friday, March 21, 2008

Surprise and Coincidence

Some years ago, I wrote a simple paper, Accurate Methods for the Statistics of Surprise and Coincidence that has since seen quite a history of citation and use by others. The basic method was applied to get highly accurate language identification, species identification, author identification, document classification, musical recommendations, insurance risk assessment and video recommendations. It was eventually the core of my doctoral dissertation (ask me for a copy) and has been cited hundreds of times.

What hasn't been well described in the past is just how simple the method really is. In most of the applications above, it is important to find useful features in large amounts of data. The method at the heart of all of these algorithms is to use a score to analyze counts of events, particularly counts of when events occur together. The counts that you usually have in these situations are the number of times two events have occurred together, the number of times that they have occurred with or without each other and the number of times anything has occurred.

To compute the score, it is handy to restate these counts slightly as the number of times the events occurred together (let's call this k_11), the number of times each has occurred without the other (let's call these k_12 and k_21) and the number of times something has been observed that was neither of these events (let's call this k_22). In this notation, I am using row or column 1 to be the event of interest and row or column 2 to be everything else.

Here is the table we need

 Event A Everything but A Event B A and B together (k_11) B, but not A (k_12) Everything but B A without B (k_21) Neither A nor B (k_22)

Once you have these counts, the hard part is over. Computing the log-likelihood ratio score (also known as G2) is very simple,

LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))

where H is Shannon's entropy, computed as the sum of (k_ij / sum(k)) log (k_ij / sum(k)). In R, this function is defined as

H = function(k) {N = sum(k) ; return (sum(k/N * log(k/N + (k==0)))}

The trick with adding k==0 handles the case where some element of k is zero. Since we are multiplying by that same zero, we can drop those terms but it helps not to try to compute the log(0). The java or C versions of this (ask me for a copy) is almost as simple, but not quite as pretty.

So there you have it. Two lines of code and the world is your oyster.

Dustin Boswell said...

Hi Ted, I'm using your likelihood test to find collocations in text. I see how to apply it to find bigrams, but what about trigrams, etc..?

Ted Dunning ... apparently Bayesian said...

There are several approaches. Probably the simplest is a multi-pass approach where you first find good bigrams, then re-interpret those as single tokens and find interesting bigrams where one component is itself composite. You can obviously repeat that any number of times.

The other approach is to build a log-likelihood ratio test based on an order-2 Markov model versus an order-1 Markov model. The Markov models in question predict a word based on the preceding two words or the preceding one word of context. In my dissertation, I do a complete derivation of this, but I now think that it would be easier to use the mutual information trick to see how much information gain there is for using the larger context.

I can send you a copy of my dissertation if you like. My gmail address is ted.dunning.

wataru kasai said...

Hi,
I tried your score in Mahout and it works practically very good for many recommendation tasks.
Yet, here I would like to discuss an irregular case potentially occurs.
In the case k1=9, k2=3000, n1=3000 n2=1000000(p1=p2=p=0.003), twoLogLambda converges to zero.
And around that, the score increases with decrease of k1(cross viewed count).
This behavior seems not so good. Are there any solution exist about such a case?

wataru kasai said...

Hi,
I tried your log-likelihood score in Mahout and it works practically very good for many recommendation tasks.
Yet, here I would like to discuss an irregular case potentially occurs.
In the case k1=9, k2=3000, n1=3000 n2=1000000(p1=p2=p=0.003), twoLogLambda converges to zero.
And around that, the score increases with decrease of k1(cross viewed count).
This behavior seems not so good and it might be caused by the model used in the test formula. Are there any solution or discussion already exist about this problem?

Ted Dunning ... apparently Bayesian said...

Wataru

In that case the score should be zero since the frequencies are identical. Any deviations will cause a positive score.

If you want the score to go above or below zero according to whether the value of k1 is above or below the expected value relative to k2 then you can use the variant of the score that I call the signed root log likelihood ratio. The basic idea is that you take the square root and apply a sign according to whether k1 is above or below the expected value. Mahout has an implementation of this.

wataru kasai said...

I would like to study about the theory of your signed-root LLR, cause I can't catch its meaning at a glance. But it seems work well & I try it.

Ted Dunning ... apparently Bayesian said...

The theory is actually fairly simple. A random variable distributed as $\chi^2$ will have a square root that is normally distributed except for sign. The restoration is the sign here is symmetrical only in the asymptotic case but for practical applications that doesn't matter.

Mat Kelcey said...

Hi Ted,

Can you elaborate a bit on the trigram comment? I don't quite get it..

For bigrams I've used in the past

mutual_info(t1,t2) = (p(t1)*p(t2)) / p(t1,t2)

Are you saying for trigrams it's

mutual_info(t1,t2,t3) = (p(t1,t2)*p(t2,t3)) / p(t1,t2,t3)

I think I don't quite get the "one component is itself composite." comment.

Cheers!
Mat

Ted Dunning ... apparently Bayesian said...

Mat,

Your definition of mutual information is a bit off and is very subject to errors with small counts. I would worry a bit. If you use the correct definition then I think that multiplying by the number of samples is a good thing since that gives you a measure of anomaly rather than association. For a threshold, anomaly is really what you are after ... other mechanisms are usually used for weighting anyway.

Regarding trigrams, one of the easiest implementations is to simply make a pass looking for significant bigrams. Then, considering those bigrams as single words, make another pass looking for bigrams that include bigrams from the first pass. These composite bigrams are really trigrams. You can repeat this process as you like.

Mathematically, this is a bit off, but the implementation easy really outweighs the values of orthodoxy in this case.

Ed said...

Hi Ted,

Thanks for the blog post and paper -- it's great that you've gone through the trouble of explaining the usefulness of LLR as a similarity score.

I'm having some practical problems with using item similarities when the number of (binary) coratings for different pairs of items varies over several orders of magnitude.

Since LLR (or signed-sqrt-LLR) is a test statistic, its value for each item pair scales linearly (or as the square root) with the number of coratings N.

The problem is the following: for each item I am picking out some number of 'most similar items'. If I use the statistic that depends on N, the 'most similar items' are items that have a ton of coratings but aren't necessarily all that similar. If I normalize so that they don't depend on N, then the 'most similar items' tend to be those with small numbers of coratings, since those similarity scores will be very noisy.

What I'm doing right now is an awful kludge where I'm 'partially' normalizing by some arbitrary fractional power of N and seeing what looks the best by eye. This is dumb for a bajillion reasons.

I bet this is a standard problem and that there is a standard way of dealing with it, especially within the context of a simple recommendation system, but I'm not a statistician or ML expert and have no idea what it is. Thoughts?

Thanks!
Ed

Ted Dunning ... apparently Bayesian said...

Ed,

Thanks for the kind words.

The problem that you are encountering is that LLR is purely a test of association rather than a measure of same. It is best used in a thresholding style that gates the use of other association measures. In that usage, it provides a stop against the over-fitting that simple measures of association tend to exhibit.

As an example, I have used it very effectively to determine whether TF-IDF scores should be used in a document classifier. The TF-IDF scores were not really even a measure of association with the desired topic, but when gated by LLR, the results were stunningly good.

melbic said...

Dear Ted,

I'm currently writing my bachelor thesis with the mahout framework. In the documentation I need some theory, therefor I'm writing about the LogLikelihood-Similarity.
LLR = 2.0 * (matrixEntropy - rowEntropy - columnEntropy)
but here you're describing it as
LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
Why the sum(k)?

Ted Dunning ... apparently Bayesian said...

The quantity

2 * (matrixEntropy - rowEntropy - columnEntropy)

is just mutual information. To get LLR, you need to multiply by the total count.

Mutual information is useful in some applications, but it is a measure of association, not a measure of anomaly. It can be large when there is nothing more than coincidence.

Samir Bajaj said...

(Newbie question)

So the LLR is the measure of similarity (between two "events")? If so, how does the occurrence of events translate to similarity between two items, say two movies?

Thanks.

Ted Dunning ... apparently Bayesian said...

@Samir,

Not quite. The LLR indicates where a significant similarity might exist. The magnitude of the LLR does not indicate the strength of the similarity per se.

I recommend using large LLR as a mask and then estimating the strength of similarity by other means.

Samir Bajaj said...

Thanks...but that's exactly what I'm trying to figure out.

If LLR isn't a measure of similarity, then what exactly does Mahout's implementation of LogLikelihoodSimilarity do?

I've been trying to get my head around this concept for a while, and your blog post is the closest I get to understanding the relationship between LLR and similarity...but you say that they are not quite the same thing.

I appreciate your patience in responding to my questions -- I've been looking for a connection between LLR and similarity, but while there are a lot of tutorials on LLR, there's nothing that ties that concept to that of similarity between two items.

Thanks again.

Ted Dunning ... apparently Bayesian said...

@Samir,

What the Mahout method does is use LLR to find interesting connections and then it sets the similarity for those pairs to 1 and all else to 0.

It may also weight be inverse frequency. I haven't looked lately and it may do that.

sunnydayal vanambathina said...

HI, iam working on Speech enhancement.I want to compare results by using Method A and Method B.Which method is said to be good for waht value of LLR?(EXAMPLE:If LLR of mthod A is greater than Method B???)

Ted Dunning ... apparently Bayesian said...

sunnydayal vanambathina said...

HI, iam working on Speech enhancement.I want to compare results by using Method A and Method B.Which method is said to be good for waht value of LLR?(EXAMPLE:If LLR of mthod A is greater than Method B???)

I am not at all clear about what you will be doing. Generally, tasks like speech recognition have their own figures of merit such as perplexity. With speech enhancement, I would guess that you have more subjective metrics.

How did you plan to use LLR?

Vikas Kumar said...

Hi Ted,
Need small input here. I am working with a data similar to user-item in a recommendation scenario i.e.
,
,
,
etc..
My aim is to compare the different similarity measures between each items i.e. , etc and form a graph to perform clusters and see which one is doing a better job of giving me meaningful clusters.
I already have normalized mutual information as similarity measure for the graph. I want to check LLR for the same. As per, your comments the only modification I would require is multiplying the "total count" into MI. Is the "total count" is total number of users in the system, or is this the the total count of users across i1 (event A) and i2 (event B)?

Vikas Kumar said...

Hi Ted,
Need small input here. I am working with a data similar to user-item in a recommendation scenario i.e.
,
,
,
etc..
My aim is to compare the different similarity measures between each items i.e. , etc and form a graph to perform clusters and see which one is doing a better job of giving me meaningful clusters.
I already have normalized mutual information as similarity measure for the graph. I want to check LLR for the same. As per, your comments the only modification I would require is multiplying the "total count" into MI. Is the "total count" is total number of users in the system, or is this the the total count of users across i1 (event A) and i2 (event B)?

Ted Dunning ... apparently Bayesian said...

Vikas,

The total here is the total number of observations. What that total is can vary by application and it isn't clear from your description what it should be.

In a normal user x item cooccurrence problem, the question reduces to the number of occurrences of any item.

Vikas Kumar said...

Sorry. I see that some of part of my question was not posted well. The data I was referring to (in 2nd line):
u1, i1
u2,i2
u2,i1
etc...
Hope this makes it clear.

I am not sure if I got it correct from your reply. But here is what I am understanding right now:

The "total count" as total number of users in i1 (event A) and i2 (event B). Since MI (mutual information) can be very high for two items that co-occured by chance, multiplying the total count would affect in removing those biases (occur by chance).
For example:
Case 1: i1 is liked by 70 users, and i2 is liked by 60 (of the same 70 users)
Case 2: i1 and i2 is liked by 2 same users

MI(case 2) > MI (case 1). Which is a wrong signal for similarity. Whereas when multiplied by "total count" to both cases then it becomes a more correct signal.

Let me know if I am understanding it right. Or if I am missing a point here.

Thanks

Vikas Kumar said...

Sorry. I see that some of part of my question was not posted well. The data I was referring to (in 2nd line):
u1, i1
u2,i2
u2,i1
etc...
Hope this makes it clear.

I am not sure if I got it correct from your reply. But here is what I am understanding right now:

The "total count" as total number of users in i1 (event A) and i2 (event B). Since MI (mutual information) can be very high for two items that co-occured by chance, multiplying the total count would affect in removing those biases (occur by chance).
For example:
Case 1: i1 is liked by 70 users, and i2 is liked by 60 (of the same 70 users)
Case 2: i1 and i2 is liked by 2 same users

MI(case 2) > MI (case 1). Which is a wrong signal for similarity. Whereas when multiplied by "total count" to both cases then it becomes a more correct signal.

Let me know if I am understanding it right. Or if I am missing a point here.

Thanks

Dominik HÃ¼bner said...

Hey Ted,
I am currently finishing my master thesis building a recommendation engine incorporating views and purchases of products.

Currently, I am using the raw cooccurrence of both events (e.g. (View^T * Purchases) * userViewVector) and get quite satisfying results.

Nevertheless, I wonder how to apply LLR to this problem. Lets say event A is the view of an item and event B is the purchase of this item. k11 would be the cooccurrence of both, k21 would be a view without a purchase afterwards, k12 might never happen since one must first view an item before a purchase and k22 would be all observations without interactions related to this item. Is this correct?

Furthermore, if I now computed a matrix of LLR values indicating which cooccurrence might be useful, would it be useful to multiply the LLR value with the cooccurrence to scale the cooccurrence according to its "confidence"?

Lars Schwabe said...

Hi,

I am puzzled by probably a minor point re a sign: I read

LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))

as

LLR = f(A,B) * ( H[A,B] - H[A] - H[B] )

where f(A,B) is a count-dependent scaling factor, which has been discussed here already, and H[A,B] and H[A] and H[B] are the entropy of the joint probability and marginals, respectively.

Here comes the question: Isn't the term in brackets actually the *negative* mutual information between A and B, because MI[A,B] = H[A] + H[B] - H[A,B]? Or are you using a different convention here? (See response from May 20, 2012 at 3:17 PM).

Clarification would help me a lot in better shaping my intuition of how to use LLR as the beautifully simple measure to detect and interpret similarities that it certainly is.

Lars Schwabe said...

Hi,

I am puzzled by probably a minor point re a sign: I read

LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))

as

LLR = f(A,B) * ( H[A,B] - H[A] - H[B] )

where f(A,B) is a count-dependent scaling factor, which has been discussed here already, and H[A,B] and H[A] and H[B] are the entropy of the joint probability and marginals, respectively.

Here comes the question: Isn't the term in brackets actually the *negative* mutual information between A and B, because MI[A,B] = H[A] + H[B] - H[A,B]? Or are you using a different convention here? (See response from May 20, 2012 at 3:17 PM).

Clarification would help me a lot in better shaping my intuition of how to use LLR as the beautifully simple measure to detect and interpret similarities that it certainly is.

Ted Dunning ... apparently Bayesian said...

Lars,

Thanks for the comment.

The definition of H is sign inverted relative to the normal entropy. This makes what you say and what I was saying identical. Sorry about the confusing definition.

The net result is that LLR is 2N times MI and has the same sign. Where you put the minus signs is an implementation detail whose major effect is to confuse people when you change conventions.

Again, sorry to be confusing.

Maysam said...

can you share the C version with me?

Ted Dunning ... apparently Bayesian said...

See GitHub:

https://github.com/tdunning/ancient-stats

Maysam said...

Thank you very much

Mario Levitin said...

Hi Ted,
I'm having difficulty in giving a probabilistic interpretation to LLR in the case of collaborative filtering. Can you state what do events A and B correspond to if we build a contingency table for two users in the context of collaborative filtering.

Thanks

Ted Dunning ... apparently Bayesian said...

Mario,

I would love to help, but I don't quite understand why a probabilistic interpretation of LLR would do you any good. The LLR stuff in terms of collaborative filtering is a totally deterministic way to get indicators which are then also used deterministically. Theoretical analysis is essentially worthless here because of the complex nature of the actual data.

Empirical analysis has been done by Schelter et al in

In any case, the A and B in this case do not refer to two users, but refer to two actions that users might do. The k_11 cell is the number of users who have done both A and B. k_12 is the number of users who have done A but not B and so on. The goal here is to find events which are linked by cooccurrence in user histories.

The paper above has an excellent description of the way to do this as well as some pragmatic things such as down-sampling.

You can ask questions like this on the Mahout mailing list or directly to my email address which you can find on many of my slide shows on slideshare.

Mario said...

Dear Ted

Thanks for sharing this, and pointing to the original article –both great!

I have a doubt. I perfectly see using the the LL for word-bigrams. But I have a question on how to use it on, say, books purchases. It looks to me the formula is not enough –I might have misunderstood something though.

I understand that if two rare words appear in the same bigram, the "signal" is clearly high. Alwaus. But for books purchase the situation is different. If two rare books appear on the shelves of two customers who bought the majority of books, the signal is not high anymore. At least in one case, the signal was low.

As we have no "bigrams" with books, unless we observe when book B is bought after book A, the formula should take in consideration also the amount of books bought by both users. Am I missing something?

Regards and thanks again!
Mario

Ted Dunning ... apparently Bayesian said...

Mario,

The idea of cooccurrence can be substantially expanded in the case of products. With words, you consider how many times each word occurs anywhere and how many times they occur in sequence. With items purchased, you typically consider how many times each item was purchased by any user as well as how often the two items are purchased by the same user. That, combined with the total number of users, gives you all four of the numbers you need to use the LLR test.

Ted Dunning ... apparently Bayesian said...

Mario,

The idea of cooccurrence can be substantially expanded in the case of products. With words, you consider how many times each word occurs anywhere and how many times they occur in sequence. With items purchased, you typically consider how many times each item was purchased by any user as well as how often the two items are purchased by the same user. That, combined with the total number of users, gives you all four of the numbers you need to use the LLR test.

Mario Alemi said...

> how many times each item was purchased by any user

Clearly. The information provided by a rare book is higher than the information provided by Harry Potter.

> how often the two items are purchased by the same user

Still, the information provided by this number is not uniform. Let's take two books which were bought just once, by the a single user.

If such user only bought a dozen books in total, this indicates a high similarity between the two books.

If such user is a big customers, who bought 90% of all books sold on the site, this does not add much to the picture.

Still, in both cases, the weight added into the coocurrence matrix would be the same, whoever customer (big or small) bought the two books together.

In a real case, we have business accounts who would create similarity between two books, although it's clear that they have rare books in their shelves only because they own thousands of books. I'd then prefer a higher weight when the customer owns very few books. I might be wrong, but I saw the problem as the "two people with the same birthday" and computed the information according to a negative binomial distribution. If I have correctly understood, the LLR would not lead to the same result, correct?

Whatever the correctness of my approach above –thanks again for your answer, it was very much appreciated!

Mario

Ashkan said...

Hi Ted, How do you calculate log-liklihood score for unigram. Would you share the java implementation?

Ted Dunning ... apparently Bayesian said...

Ashkan,

I am not sure what you mean by your question.

What would you mean by computing LLR for a unigram.

Ivana Corovc said...

Hi Ted,

I'm confused by this H(rowSums) and H(colSums), what are these?

Let's say that k_11=2, k_12=11, k_21=10 and k_22 = 400.

What would H(colSum) even be? H(12)+H(411)?

Ted Dunning ... apparently Bayesian said...

Ivana,

The R code should make this clear:

llr = function(k) {2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))}
H = function(k) {N = sum(k) ; return (sum(k/N * log(k/N + (k==0))))}

Here the functions colSums and rowSums return a vector of values. These are processed by H the same way the original matrix was processed because sum doesn't care if it gets a vector or matrix.

Leonardo FernÃ¡ndez SÃ¡nchez said...

The implementation of the log likelihood ratio in Mahout is 2 * MI, where MI = (Matrix Entropy - Row Entropy - Column Entropy).

But in your blog post you say that it's 2 * N * MI.

Why Mahout implementation decided not to include the total count? Is because Mahout entropy calculation is not normalised?

Thanks,
Leo.-

Ted Dunning ... apparently Bayesian said...

Leonardo,

The Mahout implementation of MI is un-normalized and has the N factor already baked in. This avoids dividing by N and then multiplying by N again.

Fantastic algorithm. A few questions, which I'll pose one per comment: I need negative and positive, so I need expected value, as you noted. I have this formula:

ev <- function(k) {

# k[1,2] / sum(k) * k[2,1] / sum(k) * sum(k)
k[1,2] / k[2,2] * k[2,1] / k[2,2] * k[2,2]

}

When I run some data, I find empirically that this seems to be right:

[,1] [,2]
[1,] 13 100
[2,] 100 787
llr: 0.00528737 sum: 1000 ev: 12.70648

If I understand LLR, 0 means zero multiples of random chance, and I find that if I give it 13 in this context, that's pretty close to the expected value of 12.7.

Now, notice I commented out the line with the sum(k) formula. Intuitively I thought it would be that. I'm a CS guy, not a math guy, so my intuition is worth about 10 cents. But is my EV formula right? And why? Why is my denominator k[2,2] and not sum(k)?

Mostly want to ensure I have the EV formula right, but curious about any intuition why it is right if it is easy enough to explain.

Thanks!

Next question. LLR naturally gives weight to any strong deviation from chance. Here I present something close to LLR = 0, as well as two strong deviants (note per my other posting, I'm adding a sign using expected value):

[,1] [,2]
[1,] 1 100
[2,] 100 799
llr: -15.66694 sum: 1000 ev: 12.51564

[,1] [,2]
[1,] 13 100
[2,] 100 787
llr: 0.00528737 sum: 1000 ev: 12.70648

[,1] [,2]
[1,] 25 150
[2,] 50 775
llr: 11.98562 sum: 1000 ev: 9.677419

Now, I'm using this as a product recommendation engine. As a constraint of my business problem, I'm interested in recommending only common products, so does that mean I want the high positive LLR values only?

Seems to me, this is analogous to a one-tail versus two-tail problem. If I want *any* deviant event, I want either "tail". But if I want common products for high-volume sales, I want high positive values. Conversely, if I were doing something like fraud detection, I might want only the strong negative values--because I'm looking for rare deviations.

Thoughts? In short, if I'm recommending products for the masses, is it safe to just use high positive values?

Thanks!

Mulloy Morrow said...

Hi Ted, thanks for the terse post. Where could I get a copy of the java or C classes for these computations? My gmail is mulloymorrow.

Ted Dunning ... apparently Bayesian said...

Mulloy,

You can find the C code here:

https://github.com/tdunning/ancient-stats

and the Java code can be found in Mahout Math library. See

https://github.com/apache/mahout/blob/master/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java

Ted,

What value of rootLLR score can be taken as threshold for registering association between p1 and p2 ? I mean how to decide upon what should we not be excited about.

Ted Dunning ... apparently Bayesian said...

It is hard to take a single number as a threshold because

a) you are probably doing lots of comparisons

b) you may have some pretty strong patterns that are not interesting.

What I usually do is take 3-5 as a firm threshold and then accept the top N scores for some convenient N. For recommendation engines, N = 100-300 is often useful.

Thanks a lot Ted, This is exact answer i was looking for. I am building all together a new concept in services marketplace using this technique of yours!

Ted Dunning ... apparently Bayesian said...

Ping me when it is working!

Shestov Petr said...

Hi Ted!
Is LLR chi square distributed with 1 degree of freedom? If so, than I can use tables for chi square distribution to select threshold for significant co-occurrences with desired level of confidence. Right?
The other question is about formula. In some articles I saw 2*log(MatrixEntropy - RowEntropy - ColumnEntropy) and you have 2*(MatrixEntropy - RowEntropy - ColumnEntropy). It doesn't matter for ranking, but matters for statistical analysis. Which one is right?

Ted Dunning ... apparently Bayesian said...

Petr,

Yes, LLR is chi^2 distributed exactly as Pearson's test is distributed (asymptotically in both cases). For 2x2 tables, this gives one degree of freedom. With n x m tables, there will be (n-1) (m-1) degrees of freedom.

And yes, you can use standard tables. The only proviso is that the square root of LLR is commonly used with a sign that reflects whether the upper left entry in the contingency table is larger or smaller than expected. Root LLR is normally distributed, ideally.

The formula should be 2 N (MatrixEntropy - RowEntropy - ColumnEntropy), but you have to watch out because un-normalized implementations of entropy are sometimes used which incorporate the factor of N.

Can you point me to articles that use each formula? (especially if I have been inconsistent?)

Shestov Petr said...

I saw that formula L=(MatrixEntropy - RowEntropy - ColumnEntropy) and LLR = 2 log(L). It seems to be wrong.

If I'm building a recommender system than 'N' is total number of users, right? If N is big enough, like several millions, than almost every co-occurrence seems to be significant with high confidence since chi square distribution even with alpha=0.005 equals only 7.879 (http://sites.stat.psu.edu/~mga/401/tables/Chi-square-table.pdf). That is a little strange for me.

Ted Dunning ... apparently Bayesian said...

Regarding you comment about the increasing sensitivity of the test, yes, smaller and smaller effects become visible. Of course, as N increases, you also typically get more and more things that you are testing (more words, more word pairs, more products, whatever). That makes any traditional statistical interpretation quite difficult.

What I recommend is that you simply rank all the results you get rather than choosing a threshold from a table. The biggest ones will tend to be the most interesting.

Shestov Petr said...

Oh, I really missed that instead of k's I should use k/sum(k). mahout's spark-itemsimilarity uses plain k's.

andrea said...

Hi Ted,

Thanks for the post. I am trying to apply this method to a movie recommandation system. Each movie has a rating between 1 and 5 so I think the matrik k in my case is:
k_11 = number of users 'like' movie A and B
k_12 = number of users 'like' movie A and 'dislike' B
k_11 = number of users 'dislike' movie A and 'like' B
k_11 = number of users 'dislike' movie A and 'dislike' B
where 'like' means a rating >3 and 'dislike' <=3.
What do you think?

Ted Dunning ... apparently Bayesian said...

Hey there Andrea,

Yes, you can do that.

I think that it is better to use "did not rate as having liked" as the opposite of "liked", however. In fact, the simple act of rating, regardless of value may be just as valuable a feature.

What I would recommend is that you try using this in the context of a multi-modal recommender. Check out Pat Ferrell's blog on the topic: http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/

andrea said...

Thanks so much. I have another quick question about predicting the ratings. Assuming that I build the matrix with all the log-likelihood ratios (llr) for the items then the so called recommendation vector r = h_p*llr where h_p is the history of the user. Unfortunately, the vector r does not contain predicted retings but it can be used to rank the items to recommend. Is there a way to extract predicted ratings from r?
Do you think that interpreting the llr entries as weights I can use weighted average to predict ratings? For example r_1 = (h_p*llr_{...,1})/sum(llr_{...,1} if h_p_i >0) where llr_{...,1} is the first column of the llr matrix and the if statement sum only the weights corresponding to ratings >0 (i.e. the user has seen the movies).
Thanks again.

Ted Dunning ... apparently Bayesian said...

Andrea,

I hope I am not too emphatic here, but why in the world would you really want to predict ratings? Prediction of a ratings is just a left-over from academic work in the 90's and has no real validity in the real world.

When you build a recommendation system in the world, there is only one goal. That is whether it made your users happy. Presenting them with content that they want to see and causing them to engage with what you showed them is what makes them happy. Your users could not possibly care less about whether you predicted what rating they might put on content.

Not only is prediction of the rating not useful, it is counter-productive because it is matching a very odd behavior (rating your content) that is done by a minute part of your audience (typically a few percent unless you force ratings). The results are, not surprisingly, very odd.

It is much better to try to measure whether users actually engage with content you offer. This doesn't mean rate it. It might not even mean consume. With videos, I have found 30 second watches to be good surrogates for engagement. With products, simple measures of product page engagement such as scrolling or click to view more is a good surrogate. My experience with clicks was that it teaches the recommender to spam users and my experience with ratings is that you get very little, very odd data that seems to have little to do with normal user behavior.

Bipul Kumar Bal said...

Hi Ted,

We are thinking of using your likelihood to find the similarity between two products(e-commerce).

While evaluating the scores we found out that for the scenario S1 the score is higher than the score for the scenario S2.

S1 : k11 = 1, k12 = 12636, k21 = 6979 and k22 = 1292420 //LLH score: 125.025
S2 : k11 = 101, k12 = 12586, k21 = 6929, k22 = 1292420 //LLH score: 14.79

By just looking into the numbers it seems that items in S2 should be more similar than the items in S1 as in S1 they occurred together only once compared to S2 where the items occurred together 101 times. In both the cases k11 + k12 + k21 + k22 is same.

I used LLH formula from Mahout.

LLH(k11, k12, k13, k14) = 2 * (total*log(total) + k11*log(k11) + k12*log(k12) + k21*log(k21) + k22*log(k22) - (k11+k12)*log(k11+k12) - (k21+k22)*log(k21+k22) - (k11+k21)*log(k11+k21) - (k12+k22)*log(k12+k22))

Am I missing something or the formula I am using is incorrect?

Ted Dunning ... apparently Bayesian said...
This comment has been removed by the author.
Ted Dunning ... apparently Bayesian said...

Bipul,

That's a great question.
Your numbers are actually just right.

The issue is that LLR is not a measure of similarity. It is a measure of anomaly. It tells you where there is likely to be a non-zero interaction, but doesn't tell you even the sign of the interaction.

In your case, S1 has anomalously low cooccurrence of the two products. You often see this with items with strong brand loyalty like, say, razor blades. In S2, you have anomalously high cooccurrence.

To make this easier to see and deal with, I sometimes use the square root of the LLR score and add a sign according to whether k11 is larger than you might expect or smaller. Since LLR asymptotically $chi^2(1)$ distributed, the square root will be half-normal distributed. With the sign, you now have a measure which is (very) roughly calibrated in units of standard deviations above or below expectations. The Mahout code you mention has an implementation of this. See the Mahout implementation.

Ted Dunning ... apparently Bayesian said...
This comment has been removed by the author.
Sander Stepanov said...

thanks a lot,
1
you mention availability of your PH.D. thesis , it is only needs to ask?
2
may you pls help find some simple code example , better in Python to really understand it

Ted Dunning ... apparently Bayesian said...

Sander,

See here for the dissertation: http://arxiv.org/abs/1207.1847

See here for the python version of the code: https://github.com/tdunning/python-llr

Sander Stepanov said...

thanks a lot
https://github.com/tdunning/python-llr
is really good
thesis
http://arxiv.org/pdf/1207.1847v1.pdf
may you help find short description, pls

Ted Dunning ... apparently Bayesian said...

This blog and the original paper are the best short sources.

In the dissertation, note that lots of the stuff isn't required for all readers. There are 5 chapters that describe particular applications of the technique to different domains. Each of those is relatively short, stands alone and can be read separately if you have a similar interest.

Sander Stepanov said...

Ted,
Thank you very much again for Python code for LLR!!
PracticalMachineLearning MAPR.pdf
Only it would be very kind of you share some simple language (Python, C, ...) education code to understand how LLR works for recommender systems.
You did something with PIG
https://github.com/tdunning/ponies
Sample recommender flow for search as recommendation
but it is black box
By the way per
https://github.com/tdunning/sequencemodel
The sequence anomaly detector from our second in the Practical Machine Learning Series
Looking forward read the second book

Ted Dunning ... apparently Bayesian said...

Sebastian Schelter's article at http://dl.acm.org/citation.cfm?id=2365984 provides some more details.

Currently the best code implementing these ideas is the Spark itemrecommender in Apache Mahout.

guiyang xie said...

Hi,Ted,
I'm using your LLR to find similar items.
Can you send me a copy? My email is keen.guiyang.xie@gmail.com
Thanks!

Ted Dunning ... apparently Bayesian said...

As I pointed out to Sander about 4 comments ago, you can get my dissertation from:

http://arxiv.org/abs/1207.1847

P. said...

Hi Ted,

I'am not sure that you can help me, but...
I don't have any problems to use LLR ratio to find similar items when I have users and items. I use spark-itemsimilarity nad I get results with item, item, LLR. I use Chi-squared distribution with one degrees of freedom and I know which two items are significant similary.

But I want to use spark-rowsimilarity, where I have data like: item, and 4 atributes (for example. author, subject, etc.). I read on mahout page (https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html) that it's LLR too, but I'm not sure how it's calculaded, so I don't know which test it is exactly. It's still chi-squared? If yes, with one degree of freedom beacause we compare 2 item, or for example (4-1)(4-1)=9, because we have 4 atributes?

Thanks a lot, if you know answer!

Ted Dunning ... apparently Bayesian said...

Hey there,

Glad to hear that things are working for you.

I do worry that you seem to be looking at LLR scores too much in the lens of a significance test. That is fine when you are looking at a single test, but when you are doing cooccurrence testing, you have millions to billions of potential tests that you are doing. Furthermore, any upstream downsampling will make significance even harder to interpret. Also, you mention using the chi^2 distribution. I find it easier to compare the signed square root of the LLR to the normal distribution. This allows me to differentiate over and under representation and a scale denominated in standard deviations is easier to talk about with many people. In any case, I usually set the global cutoff to something like 3 to 5 standard deviations (AKA chi^2 of 9 - 25). I typically set this limit by looking at the indicators produced and picking a limit such that there is a large, but not overwhelming number of garbage indicators being produced.

The limit to 100 indicators often implies local cutoffs much larger (say 20-50 standard deviations which are comparable to chi^2 scores 400-2500). These cutoffs are very high in terms of single test significance levels, but may still be lower than what you might need to use if you used something like the Bonferroni correction. In any case, we don't care about *whether* there is structure that violates the null hypothesis. We *know* that there is. We want to predict behavior in the future.

I think that it is very useful to step outside of traditional hypothesis testing. What I find more useful for many situations today is to think more about building models and trying to understand how they will perform on data that we haven't yet seen. IF you look at Breiman's famous paper at https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726, you can see a really good description of the situation and the culture gap.

So what I think is better for building models is to use a global cutoff for all scores that is estimated from first principle, but then in each specific case have a secondary cutoff in terms of number of interesting connections (typically limited to 50 to 100).

For you second question about cooccurrence with attributes, this is most easily done by simply pretending that the characteristics themselves are specially labeled items in the user history and throwing them into the mix for cooccurrence. This isn't a great way to do it, but it can be done very quickly. Better still is to use code specifically designed for cross-occurrence.

For cross-occurrence, it is still the same 2x2 test as with cooccurrence and the same distributions still hold asymptotically in the case of the null hypothesis. The same objections arise about using a threshold test as a test of significance as well.

Unknown said...

Hi,

How to derive "lr = (C11+C21)log(r)+(C12+C22)log(1−r)−C11log(r1)
−C12log(1−r1)−C21log(r2)−C22log(1−r2)"
? From paper "Sentiment Analyzer: Extracting Sentiments about a Given Topic using Natural Language Processing Techniques" .

thanks

Ted Dunning ... apparently Bayesian said...

Well, since I didn't have anything to do with writing that paper and know know what the symbols mean, it is hard to say exactly how to derive that.

That said, it looks like a restatement of the form based on entropy of the matrix minus the row and column-wise sum entropies. You can see it above in the blog posting:

LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))

Christos Zigkolis said...

Hello Ted,

Do you think that LLR can be used in conjunction with association rules? For example, can we somehow make use of the significance test results of LLR to refine/filter/back up the most significant rules given based on lift or confidence?

Thanks a lot for your time,
Christos

Ted Dunning ... apparently Bayesian said...

Christos,

Yes. The LLR is very well suited as a first screen for lots of features such as association rules.

You should be careful not to use the LLR score as a weight, of course.

Christos Zigkolis said...

Let me give you a simple example. Consider two rules A->B and A->C where B is an extremely popular product in my dataset and of course more popular than C. Based on confidence metric, it is likely to have a higher value for the first rule as opposed to the second one. In such cases, can the LLR come in and provide a strong indication of independence for the first rule and dependence of the second?

I have concerns that popular items that happen to be along with others a lot will have high scores and overshadow other strong correlations. I am confident that LLR can help with this right?

Ted Dunning ... apparently Bayesian said...

Christos,

Yes, LLR is a good way to select pairs like this. These aren't really association rules as such any more though. The name that I use is for this is "indicator". I would write the association rule as indicator -> target.

Commonly, the way that this works is to pick the highest N indicators for a particular target by LLR score. If the indicator is very common in general, it will take a whole lot of cooccurrences to get a high LLR score. The frequency of the target doesn't really matter all that much since all of the indicators for that target have the same target prevalence (by definition).

In your example, you have two rules A->B and A->C which indicates you are thinking about things in terms of a common indicator. It is usually better to be a bit more target centric in your thinking (that is, consider A->C, B->C instead). After all, once you have your rules, you can always sort them by indicator instead of target.