Surprise and Coincidence - musings from the long tail

Christos, Yes, LLR is a good way to select pairs ...

2018-05-01T13:25:01.907-07:00

Christos,

Yes, LLR is a good way to select pairs like this. These aren't really association rules as such any more though. The name that I use for this is "indicator". I would write the association rule as indicator -> target.

Commonly, the way that this works is to pick the highest N indicators for a particular target by LLR score. If the indicator is very common in general, it will take a whole lot of cooccurrences to get a high LLR score. The frequency of the target doesn't really matter all that much since all of the indicators for that target have the same target prevalence (by definition).

In your example, you have two rules A->B and A->C which indicates you are thinking about things in terms of a common indicator. It is usually better to be a bit more target centric in your thinking (that is, consider A->C, B->C instead). After all, once you have your rules, you can always sort them by indicator instead of target.

Thank you Ted for your prompt reply. Let me give ...

2018-04-27T00:24:28.010-07:00

Thank you Ted for your prompt reply.

Let me give you a simple example. Consider two rules A->B and A->C where B is an extremely popular product in my dataset and of course more popular than C. Based on the confidence metric, the first rule is likely to score higher than the second. In such cases, can the LLR come in and provide a strong indication of independence for the first rule and dependence for the second?

I am concerned that popular items that happen to appear alongside others a lot will have high scores and overshadow other strong correlations. I am confident that LLR can help with this, right?

Christos, Yes. The LLR is very well suited as a f...

2018-04-20T14:53:06.764-07:00

Christos,

Yes. The LLR is very well suited as a first screen for lots of features such as association rules.

You should be careful not to use the LLR score as a weight, of course.

Hello Ted, Do you think that LLR can be used in c...

2018-04-20T01:55:18.520-07:00

Hello Ted,

Do you think that LLR can be used in conjunction with association rules? For example, can we somehow make use of the significance test results of LLR to refine/filter/back up the most significant rules given based on lift or confidence?

Thanks a lot for your time,
Christos

Well, since I didn't have anything to do with ...

2016-09-12T12:19:11.336-07:00

Well, since I didn't have anything to do with writing that paper and don't know what the symbols mean, it is hard to say exactly how to derive that.

That said, it looks like a restatement of the form based on entropy of the matrix minus the row and column-wise sum entropies. You can see it above in the blog posting:

LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))

Hi, How to derive "lr = (C11+C21)log(r)+(C1...

2016-09-12T03:52:22.279-07:00

Hi,

How do I derive "lr = (C11+C21)log(r)+(C12+C22)log(1−r)−C11log(r1)−C12log(1−r1)−C21log(r2)−C22log(1−r2)"? It is from the paper "Sentiment Analyzer: Extracting Sentiments about a Given Topic using Natural Language Processing Techniques".

thanks

Hey there, Glad to hear that things are working f...

2016-08-13T16:14:04.178-07:00

Hey there,

Glad to hear that things are working for you.

I do worry that you seem to be looking at LLR scores too much through the lens of a significance test. That is fine when you are looking at a single test, but when you are doing cooccurrence testing, you have millions to billions of potential tests that you are doing. Furthermore, any upstream downsampling will make significance even harder to interpret. Also, you mention using the chi^2 distribution. I find it easier to compare the signed square root of the LLR to the normal distribution. This allows me to differentiate over- and under-representation, and a scale denominated in standard deviations is easier to talk about with many people. In any case, I usually set the global cutoff to something like 3 to 5 standard deviations (AKA chi^2 of 9 - 25). I typically set this limit by looking at the indicators produced and picking a limit such that there is a large, but not overwhelming, number of garbage indicators being produced.

Limiting to 100 indicators often implies local cutoffs that are much larger (say 20-50 standard deviations, which is comparable to chi^2 scores of 400-2500). These cutoffs are very high in terms of single test significance levels, but may still be lower than what you might need to use if you used something like the Bonferroni correction. In any case, we don't care about *whether* there is structure that violates the null hypothesis. We *know* that there is. We want to predict behavior in the future.

I think that it is very useful to step outside of traditional hypothesis testing. What I find more useful for many situations today is to think more about building models and trying to understand how they will perform on data that we haven't yet seen. If you look at Breiman's famous paper at https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726, you can see a really good description of the situation and the culture gap.

So what I think is better for building models is to use a global cutoff for all scores that is estimated from first principles, but then in each specific case have a secondary cutoff in terms of number of interesting connections (typically limited to 50 to 100).
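A minimal Python sketch of that two-stage selection might look like the following. It assumes the pairs have already been scored with the signed square root of the LLR, so that scores are in rough standard-deviation units and only over-represented pairs pass a positive cutoff. The function name `select_indicators` and its parameters `global_cutoff` and `max_per_target` are hypothetical names for this sketch, not anything from Mahout.

```python
from collections import defaultdict


def select_indicators(scored_pairs, global_cutoff=3.0, max_per_target=100):
    """Keep pairs above a global cutoff, then the top-N indicators per target.

    scored_pairs: iterable of (indicator, target, score) tuples, where score is
    assumed to be a signed root LLR value.
    """
    by_target = defaultdict(list)
    for indicator, target, score in scored_pairs:
        if score >= global_cutoff:  # stage 1: global cutoff on every score
            by_target[target].append((score, indicator))
    return {
        # stage 2: per-target limit on the number of indicators, best first
        target: [ind for _, ind in sorted(items, reverse=True)[:max_per_target]]
        for target, items in by_target.items()
    }
```

For example, `select_indicators([("A", "C", 25.0), ("B", "C", 4.0), ("D", "C", 1.5)])` keeps A and B as indicators for C and drops D, which falls under the global cutoff.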

For your second question about cooccurrence with attributes, this is most easily done by simply pretending that the characteristics themselves are specially labeled items in the user history and throwing them into the mix for cooccurrence. This isn't a great way to do it, but it can be done very quickly. Better still is to use code specifically designed for cross-occurrence.
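The quick approach could be sketched roughly as below. This is only an illustration under assumed data shapes: a history is a list of item ids, attributes are a dict per item, and the `name=value` label format and the function name `augment_history` are invented for the example.

```python
def augment_history(history, item_attributes):
    """Append attribute labels (e.g. 'author=X') to a user history as pseudo-items."""
    augmented = list(history)
    for item in history:
        for name, value in item_attributes.get(item, {}).items():
            label = f"{name}={value}"  # specially labeled so it cannot clash with item ids
            if label not in augmented:
                augmented.append(label)
    return augmented
```

The augmented histories can then be fed to the same cooccurrence counting as before, with attribute labels showing up as candidate indicators alongside real items.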

For cross-occurrence, it is still the same 2x2 test as with cooccurrence and the same distributions still hold asymptotically in the case of the null hypothesis. The same objections arise about using a threshold test as a test of significance as well.

Hi Ted, I'am not sure that you can help me, ...

2016-08-12T05:22:38.465-07:00

Hi Ted,

I'm not sure that you can help me, but...
I don't have any problems using LLR to find similar items when I have users and items. I use spark-itemsimilarity and I get results with item, item, LLR. I use the chi-squared distribution with one degree of freedom and I know which pairs of items are significantly similar.

But I want to use spark-rowsimilarity, where I have data like: item, and 4 attributes (for example, author, subject, etc.). I read on the Mahout page (https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html) that it's LLR too, but I'm not sure how it's calculated, so I don't know which test it is exactly. Is it still chi-squared? If yes, is it with one degree of freedom because we compare 2 items, or, for example, (4-1)(4-1)=9 because we have 4 attributes?

Thanks a lot if you know the answer!

As I pointed out to Sander about 4 comments ago, y...

2016-08-06T20:27:46.504-07:00

As I pointed out to Sander about 4 comments ago, you can get my dissertation from:

http://arxiv.org/abs/1207.1847

Hi,Ted, I'm using your LLR to find similar ite...

2016-08-06T19:54:53.143-07:00

Hi, Ted,
I'm using your LLR to find similar items.
I'd like to read your doctoral dissertation.
Can you send me a copy? My email is keen.guiyang.xie@gmail.com
Thanks!

Sebastian Schelter's article at http://dl.acm....

2016-07-31T16:49:57.608-07:00

Sebastian Schelter's article at http://dl.acm.org/citation.cfm?id=2365984 provides some more details.

Currently the best code implementing these ideas is the Spark itemrecommender in Apache Mahout.

Fixed the link. (simple google search found it)

2016-07-31T16:47:59.159-07:00

Fixed the link. (simple google search found it)

I continue to believe that recommendations based ...

2016-07-31T15:24:47.638-07:00

"I continue to believe that recommendations based on ratings are inherently flawed."
That was written 8 years ago.
Could you please share an end-to-end code example for this statement in a language like Python or C?
Or at least some paper or blog post with details?

very useful article thanks this link http://www...

2016-07-31T15:15:10.788-07:00

Very useful article, thanks.
This link
http://www.thalesians.com/archive/public/academic/finance/papers/Zumbach_2000.pdf
is really great, too.
Thanks.

Ted, Thank you very much again for Python code for...

2016-07-31T15:00:29.189-07:00

Ted,
Thank you very much again for the Python code for LLR!!
It is really my pleasure to read your book
PracticalMachineLearning MAPR.pdf
It would be very kind of you to share some educational code in a simple language (Python, C, ...) to help understand how LLR works for recommender systems.
You did something with Pig at
https://github.com/tdunning/ponies
("Sample recommender flow for search as recommendation"),
but it is a black box.
By the way, regarding
https://github.com/tdunning/sequencemodel
("The sequence anomaly detector from our second in the Practical Machine Learning Series"):
I am looking forward to reading the second book.

web links to blog are not correct may you pls find...

2016-07-31T14:40:02.074-07:00

The web links in the blog are not correct.
Could you please find the correct ones?

This blog and the original paper are the best shor...

2016-07-28T17:34:10.606-07:00

This blog and the original paper are the best short sources.

In the dissertation, note that lots of the stuff isn't required for all readers. There are 5 chapters that describe particular applications of the technique to different domains. Each of those is relatively short, stands alone and can be read separately if you have a similar interest.

thanks a lot https://github.com/tdunning/python-l...

2016-07-28T16:37:37.575-07:00

Thanks a lot.
https://github.com/tdunning/python-llr
is really good.
The thesis at
http://arxiv.org/pdf/1207.1847v1.pdf
is too big to read.
Could you please help me find a short description?

Sander, See here for the dissertation: http://arx...

2016-07-28T16:30:47.621-07:00

Sander,

See here for the dissertation: http://arxiv.org/abs/1207.1847

See here for the python version of the code: https://github.com/tdunning/python-llr

thanks a lot, 1 you mention availability of your P...

2016-07-28T09:33:46.657-07:00

Thanks a lot.
1. You mention the availability of your Ph.D. thesis. Do I only need to ask for it?
2. Could you please help me find a simple code example, preferably in Python, to really understand it?

Bipul, That's a great question. Your numbers ...

2016-04-28T10:27:02.771-07:00

Bipul,

That's a great question.
Your numbers are actually just right.

The issue is that LLR is not a measure of similarity. It is a measure of anomaly. It tells you where there is likely to be a non-zero interaction, but doesn't tell you even the sign of the interaction.

In your case, S1 has anomalously low cooccurrence of the two products. You often see this with items with strong brand loyalty like, say, razor blades. In S2, you have anomalously high cooccurrence.

To make this easier to see and deal with, I sometimes use the square root of the LLR score and add a sign according to whether k11 is larger than you might expect or smaller. Since LLR is asymptotically $\chi^2(1)$ distributed, the square root will be half-normal distributed. With the sign, you now have a measure which is (very) roughly calibrated in units of standard deviations above or below expectations. The Mahout code you mention has an implementation of this. See the Mahout implementation.
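For readers who want to try this on the two scenarios from the question, here is a minimal Python sketch. It uses the x*log(x) form of the LLR quoted in the question and takes the expected value of k11 under independence as (row sum * column sum) / total. The function names are made up for this sketch and it is not the Mahout code itself.

```python
import math


def xlogx(x):
    """x * log(x), with the usual convention that 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 contingency table."""
    total = k11 + k12 + k21 + k22
    return 2.0 * (
        xlogx(total)
        + xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
        - xlogx(k11 + k12) - xlogx(k21 + k22)
        - xlogx(k11 + k21) - xlogx(k12 + k22)
    )


def signed_root_llr(k11, k12, k21, k22):
    """sqrt(LLR), negated when k11 is below its expected value under independence."""
    total = k11 + k12 + k21 + k22
    expected_k11 = (k11 + k12) * (k11 + k21) / total
    root = math.sqrt(max(llr(k11, k12, k21, k22), 0.0))  # clamp round-off below zero
    return root if k11 >= expected_k11 else -root


s1 = signed_root_llr(1, 12636, 6979, 1292420)    # S1 from the question
s2 = signed_root_llr(101, 12586, 6929, 1292420)  # S2 from the question
```

Taking the scores quoted in the question (125.025 and 14.79), `s1` should come out near -11.2 and `s2` near +3.8, which makes the under-representation in S1 and the over-representation in S2 visible at a glance.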

Hi Ted, We are thinking of using your likelihood...

2016-04-28T07:56:04.749-07:00

Hi Ted,

We are thinking of using your likelihood ratio to find the similarity between two products (e-commerce).

While evaluating the scores we found out that for the scenario S1 the score is higher than the score for the scenario S2.

S1 : k11 = 1, k12 = 12636, k21 = 6979 and k22 = 1292420 //LLH score: 125.025
S2 : k11 = 101, k12 = 12586, k21 = 6929, k22 = 1292420 //LLH score: 14.79

Just looking at the numbers, it seems that the items in S2 should be more similar than the items in S1, since in S1 they occurred together only once compared to S2, where the items occurred together 101 times. In both cases k11 + k12 + k21 + k22 is the same.

I used LLH formula from Mahout.

LLH(k11, k12, k21, k22) = 2 * (total*log(total) + k11*log(k11) + k12*log(k12) + k21*log(k21) + k22*log(k22) - (k11+k12)*log(k11+k12) - (k21+k22)*log(k21+k22) - (k11+k21)*log(k11+k21) - (k12+k22)*log(k12+k22))

Am I missing something, or is the formula I am using incorrect?

Andrea, I hope I am not too emphatic here, but wh...

2016-01-15T19:30:27.896-08:00

Andrea,

I hope I am not too emphatic here, but why in the world would you really want to predict ratings? Prediction of ratings is just a left-over from academic work in the 90's and has no real validity in the real world.

When you build a recommendation system in the world, there is only one goal. That is whether it made your users happy. Presenting them with content that they want to see and causing them to engage with what you showed them is what makes them happy. Your users could not possibly care less about whether you predicted what rating they might put on content.

Not only is prediction of the rating not useful, it is counter-productive because it is matching a very odd behavior (rating your content) that is done by a minute part of your audience (typically a few percent unless you force ratings). The results are, not surprisingly, very odd.

It is much better to try to measure whether users actually engage with content you offer. This doesn't mean rate it. It might not even mean consume. With videos, I have found 30 second watches to be good surrogates for engagement. With products, simple measures of product page engagement such as scrolling or clicking to view more are good surrogates. My experience with clicks was that they teach the recommender to spam users, and my experience with ratings is that you get very little, very odd data that seems to have little to do with normal user behavior.