tag:blogger.com,1999:blog-4453805095942812863.post6655800277795930385..comments2018-01-30T11:00:01.141-08:00Comments on Surprise and Coincidence - musings from the long tail: Surprise and CoincidenceTed Dunning ... apparently Bayesianhttp://www.blogger.com/profile/02498665124454933570noreply@blogger.comBlogger76125tag:blogger.com,1999:blog-4453805095942812863.post-46474866309246456062016-09-12T12:19:11.336-07:002016-09-12T12:19:11.336-07:00Well, since I didn't have anything to do with ...Well, since I didn't have anything to do with writing that paper and know know what the symbols mean, it is hard to say exactly how to derive that.<br /><br />That said, it looks like a restatement of the form based on entropy of the matrix minus the row and column-wise sum entropies. You can see it above in the blog posting:<br /><br />LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))<br />Ted Dunning ... apparently Bayesianhttps://www.blogger.com/profile/02498665124454933570noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-89833906919686439122016-09-12T03:52:22.279-07:002016-09-12T03:52:22.279-07:00Hi,
How to derive "lr = (C11+C21)log(r)+(C1...Hi,<br /><br />How to derive <b> "lr = (C11+C21)log(r)+(C12+C22)log(1−r)−C11log(r1)<br />−C12log(1−r1)−C21log(r2)−C22log(1−r2)" </b>? From paper <i> "Sentiment Analyzer: Extracting Sentiments about a Given Topic using Natural Language Processing Techniques" </i>.<br /><br />thanksUnknownhttps://www.blogger.com/profile/15504255831513203487noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-29425178444856904512016-08-13T16:14:04.178-07:002016-08-13T16:14:04.178-07:00Hey there,
Glad to hear that things are working f...Hey there,<br /><br />Glad to hear that things are working for you. <br /><br />I do worry that you seem to be looking at LLR scores too much in the lens of a significance test. That is fine when you are looking at a single test, but when you are doing cooccurrence testing, you have millions to billions of potential tests that you are doing. Furthermore, any upstream downsampling will make significance even harder to interpret. Also, you mention using the chi^2 distribution. I find it easier to compare the signed square root of the LLR to the normal distribution. This allows me to differentiate over and under representation and a scale denominated in standard deviations is easier to talk about with many people. In any case, I usually set the global cutoff to something like 3 to 5 standard deviations (AKA chi^2 of 9 - 25). I typically set this limit by looking at the indicators produced and picking a limit such that there is a large, but not overwhelming number of garbage indicators being produced.<br /><br />The limit to 100 indicators often implies local cutoffs much larger (say 20-50 standard deviations which are comparable to chi^2 scores 400-2500). These cutoffs are very high in terms of single test significance levels, but may still be lower than what you might need to use if you used something like the Bonferroni correction. In any case, we don't care about *whether* there is structure that violates the null hypothesis. We *know* that there is. We want to predict behavior in the future.<br /><br />I think that it is very useful to step outside of traditional hypothesis testing. What I find more useful for many situations today is to think more about building models and trying to understand how they will perform on data that we haven't yet seen. IF you look at Breiman's famous paper at https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726, you can see a really good description of the situation and the culture gap.<br /><br />So what I think is better for building models is to use a global cutoff for all scores that is estimated from first principle, but then in each specific case have a secondary cutoff in terms of number of interesting connections (typically limited to 50 to 100).<br /><br />For you second question about cooccurrence with attributes, this is most easily done by simply pretending that the characteristics themselves are specially labeled items in the user history and throwing them into the mix for cooccurrence. This isn't a great way to do it, but it can be done very quickly. Better still is to use code specifically designed for cross-occurrence.<br /><br />For cross-occurrence, it is still the same 2x2 test as with cooccurrence and the same distributions still hold asymptotically in the case of the null hypothesis. The same objections arise about using a threshold test as a test of significance as well.<br />Ted Dunning ... apparently Bayesianhttps://www.blogger.com/profile/02498665124454933570noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-25426822327729162802016-08-12T05:22:38.465-07:002016-08-12T05:22:38.465-07:00Hi Ted,
I'am not sure that you can help me, ...Hi Ted, <br /><br />I'am not sure that you can help me, but...<br />I don't have any problems to use LLR ratio to find similar items when I have users and items. I use spark-itemsimilarity nad I get results with item, item, LLR. I use Chi-squared distribution with one degrees of freedom and I know which two items are significant similary. <br /><br />But I want to use spark-rowsimilarity, where I have data like: item, and 4 atributes (for example. author, subject, etc.). I read on mahout page (https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html) that it's LLR too, but I'm not sure how it's calculaded, so I don't know which test it is exactly. It's still chi-squared? If yes, with one degree of freedom beacause we compare 2 item, or for example (4-1)(4-1)=9, because we have 4 atributes? <br /><br />Thanks a lot, if you know answer!P.https://www.blogger.com/profile/13586791509589115933noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-887900727983839612016-08-06T20:27:46.504-07:002016-08-06T20:27:46.504-07:00As I pointed out to Sander about 4 comments ago, y...As I pointed out to Sander about 4 comments ago, you can get my dissertation from:<br /><br />http://arxiv.org/abs/1207.1847<br /><br />Ted Dunning ... apparently Bayesianhttps://www.blogger.com/profile/02498665124454933570noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-90344346872795001602016-08-06T19:54:53.143-07:002016-08-06T19:54:53.143-07:00Hi,Ted,
I'm using your LLR to find similar ite...Hi,Ted,<br />I'm using your LLR to find similar items.<br />I'd like to read your doctoral dissertation.<br />Can you send me a copy? My email is keen.guiyang.xie@gmail.com<br />Thanks!guiyang xiehttps://www.blogger.com/profile/04150615950086522668noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-35109037844515993822016-07-31T16:49:57.608-07:002016-07-31T16:49:57.608-07:00Sebastian Schelter's article at http://dl.acm....<br />Sebastian Schelter's article at http://dl.acm.org/citation.cfm?id=2365984 provides some more details.<br /><br />Currently the best code implementing these ideas is the Spark itemrecommender in Apache Mahout.<br />Ted Dunning ... apparently Bayesianhttps://www.blogger.com/profile/02498665124454933570noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-7264984597181308932016-07-31T15:00:29.189-07:002016-07-31T15:00:29.189-07:00Ted,
Thank you very much again for Python code for...Ted,<br />Thank you very much again for Python code for LLR!!<br />It is really my pleasure to read your book <br />PracticalMachineLearning MAPR.pdf<br />Only it would be very kind of you share some simple language (Python, C, ...) education code to understand how LLR works for recommender systems.<br />You did something with PIG<br />https://github.com/tdunning/ponies<br />Sample recommender flow for search as recommendation<br />but it is black box<br />By the way per<br />https://github.com/tdunning/sequencemodel<br />The sequence anomaly detector from our second in the Practical Machine Learning Series<br />Looking forward read the second book <br /> <br />Sander Stepanovhttps://www.blogger.com/profile/00815025625155764930noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-24523600632453513392016-07-28T17:34:10.606-07:002016-07-28T17:34:10.606-07:00This blog and the original paper are the best shor...<br />This blog and the original paper are the best short sources.<br /><br />In the dissertation, note that lots of the stuff isn't required for all readers. There are 5 chapters that describe particular applications of the technique to different domains. Each of those is relatively short, stands alone and can be read separately if you have a similar interest.<br /><br />Ted Dunning ... apparently Bayesianhttps://www.blogger.com/profile/02498665124454933570noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-4589892950432763802016-07-28T16:37:37.575-07:002016-07-28T16:37:37.575-07:00thanks a lot
https://github.com/tdunning/python-l...thanks a lot <br />https://github.com/tdunning/python-llr<br />is really good<br />thesis<br />http://arxiv.org/pdf/1207.1847v1.pdf<br />is too big to read<br />may you help find short description, plsSander Stepanovhttps://www.blogger.com/profile/00815025625155764930noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-15056275411050724922016-07-28T16:30:47.621-07:002016-07-28T16:30:47.621-07:00Sander,
See here for the dissertation: http://arx...Sander,<br /><br />See here for the dissertation: http://arxiv.org/abs/1207.1847<br /><br />See here for the python version of the code: https://github.com/tdunning/python-llr<br /><br />Ted Dunning ... apparently Bayesianhttps://www.blogger.com/profile/02498665124454933570noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-27416111412568765192016-07-28T09:33:46.657-07:002016-07-28T09:33:46.657-07:00thanks a lot,
1
you mention availability of your P...thanks a lot,<br />1<br />you mention availability of your PH.D. thesis , it is only needs to ask?<br />2<br />may you pls help find some simple code example , better in Python to really understand itSander Stepanovhttps://www.blogger.com/profile/00815025625155764930noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-62734686293272975512016-04-28T10:29:49.369-07:002016-04-28T10:29:49.369-07:00This comment has been removed by the author.Ted Dunning ... apparently Bayesianhttps://www.blogger.com/profile/02498665124454933570noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-80885155086922825862016-04-28T10:27:02.771-07:002016-04-28T10:27:02.771-07:00Bipul,
That's a great question.
Your numbers ...Bipul,<br /><br />That's a great question.<br />Your numbers are actually just right.<br /><br />The issue is that LLR is not a measure of similarity. It is a measure of anomaly. It tells you where there is likely to be a non-zero interaction, but doesn't tell you even the sign of the interaction.<br /><br />In your case, S1 has anomalously low cooccurrence of the two products. You often see this with items with strong brand loyalty like, say, razor blades. In S2, you have anomalously high cooccurrence.<br /><br />To make this easier to see and deal with, I sometimes use the square root of the LLR score and add a sign according to whether k11 is larger than you might expect or smaller. Since LLR asymptotically $chi^2(1)$ distributed, the square root will be half-normal distributed. With the sign, you now have a measure which is (very) roughly calibrated in units of standard deviations above or below expectations. The Mahout code you mention has an implementation of this. See <a href="https://github.com/apache/mahout/blob/master/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java#L113-L136" rel="nofollow">the Mahout implementation</a>.Ted Dunning ... apparently Bayesianhttps://www.blogger.com/profile/02498665124454933570noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-31584282078771971782016-04-28T10:24:51.193-07:002016-04-28T10:24:51.193-07:00This comment has been removed by the author.Ted Dunning ... apparently Bayesianhttps://www.blogger.com/profile/02498665124454933570noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-48413132678661512612016-04-28T07:56:04.749-07:002016-04-28T07:56:04.749-07:00Hi Ted,
We are thinking of using your likelihood...Hi Ted, <br /><br />We are thinking of using your likelihood to find the similarity between two products(e-commerce).<br /><br />While evaluating the scores we found out that for the scenario S1 the score is higher than the score for the scenario S2.<br /><br />S1 : k11 = 1, k12 = 12636, k21 = 6979 and k22 = 1292420 //LLH score: 125.025<br />S2 : k11 = 101, k12 = 12586, k21 = 6929, k22 = 1292420 //LLH score: 14.79<br /><br />By just looking into the numbers it seems that items in S2 should be more similar than the items in S1 as in S1 they occurred together only once compared to S2 where the items occurred together 101 times. In both the cases k11 + k12 + k21 + k22 is same.<br /><br />I used LLH formula from Mahout.<br /><br />LLH(k11, k12, k13, k14) = 2 * (total*log(total) + k11*log(k11) + k12*log(k12) + k21*log(k21) + k22*log(k22) - (k11+k12)*log(k11+k12) - (k21+k22)*log(k21+k22) - (k11+k21)*log(k11+k21) - (k12+k22)*log(k12+k22))<br /><br />Am I missing something or the formula I am using is incorrect?Bipul Kumar Balhttps://www.blogger.com/profile/05533710808920377061noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-68965308917712684722016-01-15T19:30:27.896-08:002016-01-15T19:30:27.896-08:00Andrea,
I hope I am not too emphatic here, but wh...Andrea,<br /><br />I hope I am not too emphatic here, but why in the world would you really want to predict ratings? Prediction of a ratings is just a left-over from academic work in the 90's and has no real validity in the real world. <br /><br />When you build a recommendation system in the world, there is only one goal. That is whether it made your users happy. Presenting them with content that they want to see and causing them to engage with what you showed them is what makes them happy. Your users could not possibly care less about whether you predicted what rating they might put on content.<br /><br />Not only is prediction of the rating not useful, it is counter-productive because it is matching a very odd behavior (rating your content) that is done by a minute part of your audience (typically a few percent unless you force ratings). The results are, not surprisingly, very odd.<br /><br />It is much better to try to measure whether users actually engage with content you offer. This doesn't mean rate it. It might not even mean consume. With videos, I have found 30 second watches to be good surrogates for engagement. With products, simple measures of product page engagement such as scrolling or click to view more is a good surrogate. My experience with clicks was that it teaches the recommender to spam users and my experience with ratings is that you get very little, very odd data that seems to have little to do with normal user behavior.<br />Ted Dunning ... apparently Bayesianhttps://www.blogger.com/profile/02498665124454933570noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-27658302609029123782016-01-15T17:44:12.542-08:002016-01-15T17:44:12.542-08:00Thanks so much. I have another quick question abou...Thanks so much. I have another quick question about predicting the ratings. Assuming that I build the matrix with all the log-likelihood ratios (llr) for the items then the so called recommendation vector r = h_p*llr where h_p is the history of the user. Unfortunately, the vector r does not contain predicted retings but it can be used to rank the items to recommend. Is there a way to extract predicted ratings from r?<br />Do you think that interpreting the llr entries as weights I can use weighted average to predict ratings? For example r_1 = (h_p*llr_{...,1})/sum(llr_{...,1} if h_p_i >0) where llr_{...,1} is the first column of the llr matrix and the if statement sum only the weights corresponding to ratings >0 (i.e. the user has seen the movies).<br />Thanks again.andreahttps://www.blogger.com/profile/17967507680945820641noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-10727239278835643642016-01-15T13:27:36.201-08:002016-01-15T13:27:36.201-08:00Hey there Andrea,
Yes, you can do that.
I think ...Hey there Andrea,<br /><br />Yes, you can do that.<br /><br />I think that it is better to use "did not rate as having liked" as the opposite of "liked", however. In fact, the simple act of rating, regardless of value may be just as valuable a feature.<br /><br />What I would recommend is that you try using this in the context of a multi-modal recommender. Check out Pat Ferrell's blog on the topic: http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/Ted Dunning ... apparently Bayesianhttps://www.blogger.com/profile/02498665124454933570noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-58815120733562504182016-01-15T06:00:05.692-08:002016-01-15T06:00:05.692-08:00Hi Ted,
Thanks for the post. I am trying to apply...Hi Ted,<br /><br />Thanks for the post. I am trying to apply this method to a movie recommandation system. Each movie has a rating between 1 and 5 so I think the matrik k in my case is:<br />k_11 = number of users 'like' movie A and B<br />k_12 = number of users 'like' movie A and 'dislike' B<br />k_11 = number of users 'dislike' movie A and 'like' B<br />k_11 = number of users 'dislike' movie A and 'dislike' B<br />where 'like' means a rating >3 and 'dislike' <=3.<br />What do you think?<br />Thanks in advance.<br />andreahttps://www.blogger.com/profile/17967507680945820641noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-72494747895696851202015-10-19T10:18:51.539-07:002015-10-19T10:18:51.539-07:00Oh, I really missed that instead of k's I shou...Oh, I really missed that instead of k's I should use k/sum(k). mahout's spark-itemsimilarity uses plain k's.Shestov Petrhttps://www.blogger.com/profile/06800202772164323879noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-62972941115936761382015-10-18T14:43:20.126-07:002015-10-18T14:43:20.126-07:00Regarding you comment about the increasing sensiti...Regarding you comment about the increasing sensitivity of the test, yes, smaller and smaller effects become visible. Of course, as N increases, you also typically get more and more things that you are testing (more words, more word pairs, more products, whatever). That makes any traditional statistical interpretation quite difficult.<br /><br />What I recommend is that you simply rank all the results you get rather than choosing a threshold from a table. The biggest ones will tend to be the most interesting.<br /><br />Ted Dunning ... apparently Bayesianhttps://www.blogger.com/profile/02498665124454933570noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-12021904067313928472015-10-18T03:24:47.569-07:002015-10-18T03:24:47.569-07:00In this article: http://wortschatz.uni-leipzig.de/...In this article: http://wortschatz.uni-leipzig.de/~sbordag/papers/BordagMC08.pdf<br />I saw that formula L=(MatrixEntropy - RowEntropy - ColumnEntropy) and LLR = 2 log(L). It seems to be wrong.<br /><br />If I'm building a recommender system than 'N' is total number of users, right? If N is big enough, like several millions, than almost every co-occurrence seems to be significant with high confidence since chi square distribution even with alpha=0.005 equals only 7.879 (http://sites.stat.psu.edu/~mga/401/tables/Chi-square-table.pdf). That is a little strange for me.Shestov Petrhttps://www.blogger.com/profile/06800202772164323879noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-24494342941984155732015-10-16T14:34:40.854-07:002015-10-16T14:34:40.854-07:00Petr,
Yes, LLR is chi^2 distributed exactly as Pe...Petr,<br /><br />Yes, LLR is chi^2 distributed exactly as Pearson's test is distributed (asymptotically in both cases). For 2x2 tables, this gives one degree of freedom. With n x m tables, there will be (n-1) (m-1) degrees of freedom.<br /><br />And yes, you can use standard tables. The only proviso is that the square root of LLR is commonly used with a sign that reflects whether the upper left entry in the contingency table is larger or smaller than expected. Root LLR is normally distributed, ideally.<br /><br />The formula should be 2 N (MatrixEntropy - RowEntropy - ColumnEntropy), but you have to watch out because un-normalized implementations of entropy are sometimes used which incorporate the factor of N. <br /><br />Can you point me to articles that use each formula? (especially if I have been inconsistent?)<br />Ted Dunning ... apparently Bayesianhttps://www.blogger.com/profile/02498665124454933570noreply@blogger.comtag:blogger.com,1999:blog-4453805095942812863.post-75563826622954789732015-10-16T05:48:51.606-07:002015-10-16T05:48:51.606-07:00Hi Ted!
Is LLR chi square distributed with 1 degre...Hi Ted!<br />Is LLR chi square distributed with 1 degree of freedom? If so, than I can use tables for chi square distribution to select threshold for significant co-occurrences with desired level of confidence. Right?<br />The other question is about formula. In some articles I saw 2*log(MatrixEntropy - RowEntropy - ColumnEntropy) and you have 2*(MatrixEntropy - RowEntropy - ColumnEntropy). It doesn't matter for ranking, but matters for statistical analysis. Which one is right?<br />Thanks in advance!Shestov Petrhttps://www.blogger.com/profile/06800202772164323879noreply@blogger.com