Comments on Surprise and Coincidence - musings from the long tail: Surprise and Coincidence

Christos, Yes, LLR is a good way to select pairs ...

2018-05-01T13:25:01.907-07:00

Christos,

Yes, LLR is a good way to select pairs like this. These aren't really association rules as such any more, though. The name that I use for this is "indicator". I would write the association rule as indicator -> target.

Commonly, the way that this works is to pick the highest N indicators for a particular target by LLR score. If the indicator is very common in general, it will take a whole lot of cooccurrences to get a high LLR score. The frequency of the target doesn't really matter all that much since all of the indicators for that target have the same target prevalence (by definition).
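The point about common indicators can be seen with a small numeric sketch. The counts below are hypothetical (10,000 users, a target seen by 100 of them, 10 cooccurrences in both cases), and the function names are illustrative rather than taken from any library:

```python
from math import log

def xlogx(x):
    # x * log(x), with the usual convention that 0 * log(0) = 0
    return 0.0 if x == 0 else x * log(x)

def llr(k11, k12, k21, k22):
    # k11: indicator and target together, k12: indicator without target,
    # k21: target without indicator, k22: neither
    total = k11 + k12 + k21 + k22
    cells = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
    rows = xlogx(k11 + k12) + xlogx(k21 + k22)
    cols = xlogx(k11 + k21) + xlogx(k12 + k22)
    # clamp tiny negative values caused by floating point rounding
    return max(0.0, 2.0 * (xlogx(total) + cells - rows - cols))

# Hypothetical counts: same target, same 10 cooccurrences in both cases.
rare_indicator = llr(10, 10, 90, 9890)     # indicator seen by only 20 users
common_indicator = llr(10, 990, 90, 8910)  # indicator seen by 1,000 users

print(rare_indicator, common_indicator)
```

With these made-up numbers, the common indicator cooccurs with the target exactly as often as chance would predict, so its score is essentially zero, while the rare indicator with the same 10 cooccurrences scores high.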

In your example, you have two rules A->B and A->C which indicates you are thinking about things in terms of a common indicator. It is usually better to be a bit more target centric in your thinking (that is, consider A->C, B->C instead). After all, once you have your rules, you can always sort them by indicator instead of target.

Thank you Ted for your prompt reply. Let me give ...

2018-04-27T00:24:28.010-07:00

Thank you Ted for your prompt reply.

Let me give you a simple example. Consider two rules A->B and A->C where B is an extremely popular product in my dataset and of course more popular than C. Based on the confidence metric, the first rule is likely to have a higher value than the second one. In such cases, can the LLR come in and provide a strong indication of independence for the first rule and dependence for the second?

I have concerns that popular items that happen to occur alongside others a lot will have high scores and overshadow other strong correlations. I am confident that LLR can help with this, right?

Christos, Yes. The LLR is very well suited as a f...

2018-04-20T14:53:06.764-07:00

Christos,

Yes. The LLR is very well suited as a first screen for lots of features such as association rules.

You should be careful not to use the LLR score as a weight, of course.

Hello Ted, Do you think that LLR can be used in c...

2018-04-20T01:55:18.520-07:00

Hello Ted,

Do you think that LLR can be used in conjunction with association rules? For example, can we somehow make use of the significance test results of LLR to refine/filter/back up the most significant rules given based on lift or confidence?

Thanks a lot for your time,
Christos

Well, since I didn't have anything to do with ...

2016-09-12T12:19:11.336-07:00

Well, since I didn't have anything to do with writing that paper and don't know what the symbols mean, it is hard to say exactly how to derive that.

That said, it looks like a restatement of the form based on entropy of the matrix minus the row and column-wise sum entropies. You can see it above in the blog posting:

LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
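The entropy form above can be checked numerically against the count-based form quoted from Mahout elsewhere in these comments. This is a minimal sketch under one stated assumption: H is taken to be the sum of p*log(p) over the normalized counts, without the usual leading minus sign (with the conventional negative sign, the signs inside the parentheses would flip). The function names and the example counts are made up for illustration:

```python
from math import log

def h(counts):
    # Assumed convention: sum of p * log(p) over normalized counts,
    # WITHOUT the leading minus sign of the usual Shannon entropy.
    total = sum(counts)
    return sum((c / total) * log(c / total) for c in counts if c > 0)

def llr_entropy(k11, k12, k21, k22):
    # LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
    total = k11 + k12 + k21 + k22
    return 2.0 * total * (
        h([k11, k12, k21, k22])
        - h([k11 + k12, k21 + k22])
        - h([k11 + k21, k12 + k22])
    )

def xlogx(x):
    return 0.0 if x == 0 else x * log(x)

def llr_counts(k11, k12, k21, k22):
    # Count-based form: 2 * (N log N + sum k log k - row terms - column terms)
    total = k11 + k12 + k21 + k22
    cells = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
    rows = xlogx(k11 + k12) + xlogx(k21 + k22)
    cols = xlogx(k11 + k21) + xlogx(k12 + k22)
    return 2.0 * (xlogx(total) + cells - rows - cols)

# Hypothetical counts, only to compare the two forms.
a = llr_entropy(10, 10, 90, 9890)
b = llr_counts(10, 10, 90, 9890)
print(a, b)
```

The two values should agree up to floating point rounding, since multiplying each entropy term by sum(k) turns the p*log(p) sums into the k*log(k) sums of the count-based form.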

Hi, How to derive "lr = (C11+C21)log(r)+(C1...

2016-09-12T03:52:22.279-07:00

Hi,

How to derive "lr = (C11+C21)log(r) + (C12+C22)log(1−r) − C11log(r1) − C12log(1−r1) − C21log(r2) − C22log(1−r2)"? It is from the paper "Sentiment Analyzer: Extracting Sentiments about a Given Topic using Natural Language Processing Techniques".

thanks

Hey there, Glad to hear that things are working f...

2016-08-13T16:14:04.178-07:00

Hey there,

Glad to hear that things are working for you.

I do worry that you seem to be looking at LLR scores too much through the lens of a significance test. That is fine when you are looking at a single test, but when you are doing cooccurrence testing, you have millions to billions of potential tests that you are doing. Furthermore, any upstream downsampling will make significance even harder to interpret. Also, you mention using the chi^2 distribution. I find it easier to compare the signed square root of the LLR to the normal distribution. This allows me to differentiate over- and under-representation, and a scale denominated in standard deviations is easier to talk about with many people. In any case, I usually set the global cutoff to something like 3 to 5 standard deviations (AKA chi^2 of 9 - 25). I typically set this limit by looking at the indicators produced and picking a limit such that there is a large, but not overwhelming, number of garbage indicators being produced.

The limit to 100 indicators often implies local cutoffs much larger (say 20-50 standard deviations which are comparable to chi^2 scores 400-2500). These cutoffs are very high in terms of single test significance levels, but may still be lower than what you might need to use if you used something like the Bonferroni correction. In any case, we don't care about *whether* there is structure that violates the null hypothesis. We *know* that there is. We want to predict behavior in the future.

I think that it is very useful to step outside of traditional hypothesis testing. What I find more useful for many situations today is to think more about building models and trying to understand how they will perform on data that we haven't yet seen. If you look at Breiman's famous paper at https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726, you can see a really good description of the situation and the culture gap.

So what I think is better for building models is to use a global cutoff for all scores that is estimated from first principles, but then in each specific case have a secondary cutoff in terms of number of interesting connections (typically limited to 50 to 100).
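The two-stage selection described here (a global cutoff, then a per-target limit) can be sketched roughly as follows. This assumes the scores are signed root LLR values in units of standard deviations and that only over-represented pairs are wanted; the function name, parameter names, and sample data are hypothetical:

```python
from collections import defaultdict

def select_indicators(scored_pairs, global_cutoff=4.0, max_per_target=100):
    """Keep pairs at or above a global cutoff, then the top few per target.

    scored_pairs: iterable of (target, indicator, score) tuples, where score
    is assumed to be a signed root LLR (roughly, standard deviations).
    """
    by_target = defaultdict(list)
    for target, indicator, score in scored_pairs:
        # Stage 1: global cutoff. Negative scores (under-representation)
        # also fall below it and are dropped.
        if score >= global_cutoff:
            by_target[target].append((score, indicator))

    selected = {}
    for target, candidates in by_target.items():
        # Stage 2: per-target limit on the number of indicators kept.
        best = sorted(candidates, key=lambda c: c[0], reverse=True)
        selected[target] = [indicator for _, indicator in best[:max_per_target]]
    return selected

# Made-up scores for illustration only.
pairs = [("C", "A", 12.5), ("C", "B", 3.1), ("C", "D", 6.0), ("E", "A", -7.0)]
top_one = select_indicators(pairs, global_cutoff=4.0, max_per_target=1)
top_two = select_indicators(pairs, global_cutoff=4.0, max_per_target=2)
print(top_one, top_two)
```

In practice the global cutoff would be tuned by inspecting the indicators that come out, as described above, rather than fixed in advance.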

For your second question about cooccurrence with attributes, this is most easily done by simply pretending that the characteristics themselves are specially labeled items in the user history and throwing them into the mix for cooccurrence. This isn't a great way to do it, but it can be done very quickly. Better still is to use code specifically designed for cross-occurrence.
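As a rough illustration of the quick approach, attributes can be turned into prefixed tokens and mixed into each history before running ordinary cooccurrence. The "name:value" labeling scheme, the function name, and the data below are all assumptions made for this sketch:

```python
def add_attribute_tokens(history, item_attributes):
    """Return the history with each item's attributes added as labeled tokens.

    history: list of item ids for one user.
    item_attributes: dict mapping item id -> dict of attribute name -> value.
    """
    augmented = []
    for item in history:
        augmented.append(item)
        for name, value in sorted(item_attributes.get(item, {}).items()):
            # The prefix keeps attribute tokens distinct from real item ids.
            augmented.append(f"{name}:{value}")
    # Drop repeated tokens while preserving first-seen order.
    return list(dict.fromkeys(augmented))

# Hypothetical data for illustration.
attributes = {
    "book1": {"author": "smith", "subject": "ml"},
    "book2": {"author": "smith"},
}
tokens = add_attribute_tokens(["book1", "book2"], attributes)
print(tokens)
```

The augmented histories can then be fed to whatever cooccurrence code is already in use, with the attribute tokens treated like any other item.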

For cross-occurrence, it is still the same 2x2 test as with cooccurrence and the same distributions still hold asymptotically in the case of the null hypothesis. The same objections arise about using a threshold test as a test of significance as well.

Hi Ted, I'am not sure that you can help me, ...

2016-08-12T05:22:38.465-07:00

Hi Ted,

I'm not sure that you can help me, but...
I don't have any problems using the LLR to find similar items when I have users and items. I use spark-itemsimilarity and I get results as item, item, LLR. I use the chi-squared distribution with one degree of freedom, so I know which pairs of items are significantly similar.

But I want to use spark-rowsimilarity, where I have data like an item and 4 attributes (for example author, subject, etc.). I read on the Mahout page (https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html) that it's LLR too, but I'm not sure how it's calculated, so I don't know which test it is exactly. Is it still chi-squared? If yes, is it with one degree of freedom because we compare 2 items, or with, for example, (4-1)(4-1)=9 because we have 4 attributes?

Thanks a lot if you know the answer!

As I pointed out to Sander about 4 comments ago, y...

2016-08-06T20:27:46.504-07:00

As I pointed out to Sander about 4 comments ago, you can get my dissertation from:

http://arxiv.org/abs/1207.1847

Hi,Ted, I'm using your LLR to find similar ite...

2016-08-06T19:54:53.143-07:00

Hi, Ted,
I'm using your LLR to find similar items.
I'd like to read your doctoral dissertation.
Can you send me a copy? My email is keen.guiyang.xie@gmail.com
Thanks!

Sebastian Schelter's article at http://dl.acm....

2016-07-31T16:49:57.608-07:00

Sebastian Schelter's article at http://dl.acm.org/citation.cfm?id=2365984 provides some more details.

Currently the best code implementing these ideas is the Spark itemrecommender in Apache Mahout.

Ted, Thank you very much again for Python code for...

2016-07-31T15:00:29.189-07:00

Ted,
Thank you very much again for the Python code for LLR!
It is really my pleasure to read your book (PracticalMachineLearning MAPR.pdf).
It would be very kind of you to share some educational code in a simple language (Python, C, ...) to help understand how LLR works for recommender systems.
You did something with Pig at https://github.com/tdunning/ponies ("Sample recommender flow for search as recommendation"), but it is a black box.
By the way, regarding https://github.com/tdunning/sequencemodel ("The sequence anomaly detector from our second in the Practical Machine Learning Series"): I am looking forward to reading the second book.

This blog and the original paper are the best shor...

2016-07-28T17:34:10.606-07:00

This blog and the original paper are the best short sources.

In the dissertation, note that lots of the stuff isn't required for all readers. There are 5 chapters that describe particular applications of the technique to different domains. Each of those is relatively short, stands alone and can be read separately if you have a similar interest.

thanks a lot https://github.com/tdunning/python-l...

2016-07-28T16:37:37.575-07:00

Thanks a lot. https://github.com/tdunning/python-llr is really good.
The thesis at http://arxiv.org/pdf/1207.1847v1.pdf is too big to read.
Could you please help me find a short description?

Sander, See here for the dissertation: http://arx...

2016-07-28T16:30:47.621-07:00

Sander,

See here for the dissertation: http://arxiv.org/abs/1207.1847

See here for the python version of the code: https://github.com/tdunning/python-llr

thanks a lot, 1 you mention availability of your P...

2016-07-28T09:33:46.657-07:00

Thanks a lot.
1. You mention the availability of your Ph.D. thesis. Do I only need to ask for it?
2. Could you please help me find a simple code example, preferably in Python, to really understand it?

Bipul, That's a great question. Your numbers ...

2016-04-28T10:27:02.771-07:00

Bipul,

That's a great question.
Your numbers are actually just right.

The issue is that LLR is not a measure of similarity. It is a measure of anomaly. It tells you where there is likely to be a non-zero interaction, but doesn't tell you even the sign of the interaction.

In your case, S1 has anomalously low cooccurrence of the two products. You often see this with items with strong brand loyalty like, say, razor blades. In S2, you have anomalously high cooccurrence.

To make this easier to see and deal with, I sometimes use the square root of the LLR score and add a sign according to whether k11 is larger or smaller than you might expect. Since LLR is asymptotically $\chi^2(1)$ distributed, the square root will be half-normal distributed. With the sign, you now have a measure which is (very) roughly calibrated in units of standard deviations above or below expectations. The Mahout code you mention has an implementation of this.
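The signed square root can be sketched in a few lines. This is an illustrative version, not Mahout's code: the function names are made up, and the sign is chosen here by comparing k11 with its expected value under independence, which is one reasonable reading of "larger than you might expect". The counts are the S1 and S2 scenarios from Bipul's question:

```python
from math import log, sqrt

def xlogx(x):
    # x * log(x), with 0 * log(0) taken as 0
    return 0.0 if x == 0 else x * log(x)

def llr(k11, k12, k21, k22):
    total = k11 + k12 + k21 + k22
    cells = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
    rows = xlogx(k11 + k12) + xlogx(k21 + k22)
    cols = xlogx(k11 + k21) + xlogx(k12 + k22)
    # clamp tiny negative values caused by floating point rounding
    return max(0.0, 2.0 * (xlogx(total) + cells - rows - cols))

def signed_root_llr(k11, k12, k21, k22):
    total = k11 + k12 + k21 + k22
    # cooccurrence count expected if the two items were independent
    expected_k11 = (k11 + k12) * (k11 + k21) / total
    root = sqrt(llr(k11, k12, k21, k22))
    return root if k11 >= expected_k11 else -root

s1 = signed_root_llr(1, 12636, 6979, 1292420)    # far fewer cooccurrences than expected
s2 = signed_root_llr(101, 12586, 6929, 1292420)  # more cooccurrences than expected
print(s1, s2)
```

With these counts S1 comes out negative and S2 positive, so the larger raw LLR for S1 reflects anomalously low cooccurrence rather than greater similarity.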

Hi Ted, We are thinking of using your likelihood...

2016-04-28T07:56:04.749-07:00

Hi Ted,

We are thinking of using your likelihood to find the similarity between two products(e-commerce).

While evaluating the scores we found out that for the scenario S1 the score is higher than the score for the scenario S2.

S1 : k11 = 1, k12 = 12636, k21 = 6979 and k22 = 1292420 //LLH score: 125.025
S2 : k11 = 101, k12 = 12586, k21 = 6929, k22 = 1292420 //LLH score: 14.79

Just by looking at the numbers, it seems that the items in S2 should be more similar than the items in S1, since in S1 they occurred together only once compared to S2 where the items occurred together 101 times. In both cases k11 + k12 + k21 + k22 is the same.

I used LLH formula from Mahout.

LLH(k11, k12, k21, k22) = 2 * (total*log(total) + k11*log(k11) + k12*log(k12) + k21*log(k21) + k22*log(k22) - (k11+k12)*log(k11+k12) - (k21+k22)*log(k21+k22) - (k11+k21)*log(k11+k21) - (k12+k22)*log(k12+k22))

Am I missing something, or is the formula I am using incorrect?

Andrea, I hope I am not too emphatic here, but wh...

2016-01-15T19:30:27.896-08:00

Andrea,

I hope I am not too emphatic here, but why in the world would you really want to predict ratings? Prediction of ratings is just a left-over from academic work in the 90's and has no real validity in the real world.

When you build a recommendation system in the world, there is only one goal. That is whether it made your users happy. Presenting them with content that they want to see and causing them to engage with what you showed them is what makes them happy. Your users could not possibly care less about whether you predicted what rating they might put on content.

Not only is prediction of the rating not useful, it is counter-productive because it is matching a very odd behavior (rating your content) that is done by a minute part of your audience (typically a few percent unless you force ratings). The results are, not surprisingly, very odd.

It is much better to try to measure whether users actually engage with content you offer. This doesn't mean rate it. It might not even mean consume. With videos, I have found 30 second watches to be good surrogates for engagement. With products, simple measures of product page engagement such as scrolling or clicking to view more are good surrogates. My experience with clicks was that it teaches the recommender to spam users and my experience with ratings is that you get very little, very odd data that seems to have little to do with normal user behavior.

Thanks so much. I have another quick question abou...

2016-01-15T17:44:12.542-08:00

Thanks so much. I have another quick question about predicting the ratings. Assume that I build the matrix with all the log-likelihood ratios (llr) for the items; then the so-called recommendation vector is r = h_p*llr, where h_p is the history of the user. Unfortunately, the vector r does not contain predicted ratings, but it can be used to rank the items to recommend. Is there a way to extract predicted ratings from r?
Do you think that interpreting the llr entries as weights I can use weighted average to predict ratings? For example r_1 = (h_p*llr_{...,1})/sum(llr_{...,1} if h_p_i >0) where llr_{...,1} is the first column of the llr matrix and the if statement sum only the weights corresponding to ratings >0 (i.e. the user has seen the movies).
Thanks again.

Hey there Andrea, Yes, you can do that. I think ...

2016-01-15T13:27:36.201-08:00

Hey there Andrea,

Yes, you can do that.

I think that it is better to use "did not rate as having liked" as the opposite of "liked", however. In fact, the simple act of rating, regardless of value, may be just as valuable a feature.

What I would recommend is that you try using this in the context of a multi-modal recommender. Check out Pat Ferrell's blog on the topic: http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/

Hi Ted, Thanks for the post. I am trying to apply...

2016-01-15T06:00:05.692-08:00

Hi Ted,

Thanks for the post. I am trying to apply this method to a movie recommendation system. Each movie has a rating between 1 and 5, so I think the matrix k in my case is:
k_11 = number of users who 'like' movie A and 'like' B
k_12 = number of users who 'like' movie A and 'dislike' B
k_21 = number of users who 'dislike' movie A and 'like' B
k_22 = number of users who 'dislike' movie A and 'dislike' B
where 'like' means a rating >3 and 'dislike' means a rating <=3.
What do you think?
Thanks in advance.

Oh, I really missed that instead of k's I shou...

2015-10-19T10:18:51.539-07:00

Oh, I really missed that instead of k's I should use k/sum(k). Mahout's spark-itemsimilarity uses plain k's.