Sunday, November 22, 2009

Twitter and small counts

I just got a very nice email from SlideShare that says
"Transactional Data Mining" is being tweeted more than any other document on SlideShare right now. So we've put it on the homepage of SlideShare.net (in the "Hot on Twitter" section).

That sounds sooo exciting. People magazine, here I come!

Of course, when you look into the matter, I really don't think that I am going to have to worry about dodging paparazzi any time soon. What they mean by "tweeted more" seems to be 3 times in a day. Sounds like they should have read my paper about surprise and coincidence!

In fact, for applications like popularity ranking it is often really important to have a robust idea about what is hot and what is not. There are two very common problems to solve: one is to find out which items are rapidly rising, and the other is to put a reasonable rank on things.

The problem of what is rising is surprisingly well solved by ranking items by the log of the ratio of the number of hits/views/tweets in a recent period of time to the number of hits in a somewhat older (and typically longer) period of time. Well, not quite the ratio. In practice, it is better to offset both numerator and denominator by constants that are chosen by looking at the results, but which are actually well-founded in terms of MAP estimators for rates. The reason this works so well is that popularities are typically distributed in a roughly Zipfian way, so the log of the rate is linearly related to the log of the rank.
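
For concreteness, here is a minimal sketch of that offset log-ratio score. The particular offset values below are illustrative assumptions, not prescribed numbers; as noted above, they should be chosen by looking at results on your own data.

    # A minimal sketch of the offset log-ratio "rising" score described above.
    # The offsets alpha and beta are illustrative assumptions that act as
    # MAP-style smoothing; tune them by looking at your own results.
    import math

    def rising_score(recent_hits, older_hits, alpha=5.0, beta=25.0):
        """Score how fast an item is rising.

        recent_hits: count in the recent (short) window
        older_hits:  count in the older (typically longer) window
        """
        return math.log((recent_hits + alpha) / (older_hits + beta))

    # Something that went from 2 hits to 40 hits scores far higher than
    # something that went from 0 hits to 1 hit, because the offsets damp
    # the small-count case.
    print(rising_score(40, 2))   # clearly rising
    print(rising_score(1, 0))    # barely moves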

A large change in the log-rank of an item is a great indicator of dramatically rising popularity, but rank itself is a pretty fuzzy measurement with small counts (the guys at SlideShare need to hear this). Log of the rate, however, is linearly related to the log of rank, so change in log-rank is proportional to change in log-rate. Moreover, the offset trick when computing the rate ratio largely deals with the small count problems that you have when something goes from no hits to one hit in a time period.

The second problem is computing what the rank of an item really is given bursty, nasty data like hit counts. The problem is that you only have a nice tight estimate of the hit rate for the most popular items. Since you want recent rankings, you want to use a shorter period of time for your estimates, so the problem of small counts is even worse. Conceptually, a very principled way to deal with the problem is to embrace the uncertainty: sample the rates associated with each item given the data you have and rank the items by those sampled rates. Do this a bunch of times and you have the posterior distribution of the ranks for each item. Moreover, you also have a lookup table that tells you what any particular observed rate means relative to what rank the item might have.
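
Here is a sketch of that sampling idea. The post does not pin down a model, so assume (as one plausible choice, not the author's stated one) that each item's hits are Poisson with a Gamma prior on its rate, which makes the posterior a Gamma that is trivial to sample.

    # Rank-sampling sketch under an assumed Gamma/Poisson model: the posterior
    # for an item's rate is Gamma(a + hits, b + time), which is easy to sample.
    import numpy as np

    rng = np.random.default_rng(42)

    def sample_rank_distribution(hits, time=1.0, a=1.0, b=1.0, n_samples=1000):
        """hits: counts per item in a window of length `time`.
        Returns an (n_samples, n_items) array of sampled ranks (0 = top)."""
        hits = np.asarray(hits, dtype=float)
        n_items = len(hits)
        ranks = np.empty((n_samples, n_items), dtype=int)
        for s in range(n_samples):
            # draw one plausible rate for every item from its posterior
            rates = rng.gamma(shape=a + hits, scale=1.0 / (b + time))
            # rank items by that sampled rate, highest rate first
            order = np.argsort(-rates)
            ranks[s, order] = np.arange(n_items)
        return ranks

    hits = [120, 3, 3, 1, 0]
    ranks = sample_rank_distribution(hits)
    # The popular item is essentially always rank 0; the small-count items
    # trade places, which is exactly the uncertainty we want to expose.
    print(ranks.mean(axis=0))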

This sampling can be done in a few seconds in R. I think it is cooler to add dithering to the ranked list of items so that each time you look at the list, you get another sample from this distribution. This means that items that might plausibly be in the top 100 or 1000 actually get to be there some of the time, while items that are definitely in the top few places are always where they should be. It also adds dynamism to your top-n list, which is always good on the web.
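
A corresponding sketch of the dithering trick, again under the assumed Gamma/Poisson model from the previous sketch: each time the list is rendered, rank by a single fresh draw from each item's posterior rate.

    # Dithered ranking sketch: one fresh posterior draw per rendering, so
    # borderline items swap around between views while clear winners stay put.
    import numpy as np

    rng = np.random.default_rng()

    def dithered_top_n(hits, n=3, time=1.0, a=1.0, b=1.0):
        hits = np.asarray(hits, dtype=float)
        # one posterior draw of the rate per item, Gamma(a + hits, b + time)
        rates = rng.gamma(shape=a + hits, scale=1.0 / (b + time))
        return list(np.argsort(-rates)[:n])

    # Three renderings of the same data give slightly different lists.
    for _ in range(3):
        print(dithered_top_n([120, 3, 3, 1, 0]))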

6 comments:

Rashmi said...

Ted,

We look at the number of views coming in from the tweet, rather than the number of tweets itself. Also, the same URL often gets tweeted through a number of different shortened URLs.

Having said that, your feedback is well taken. My background is in recommender systems (before SlideShare, I worked on recommender systems), and I noticed that you have some background in that yourself.

Rashmi
CEO & Cofounder, SlideShare

Ted Dunning ... apparently Bayesian said...

Good answer. This makes it likely that the numbers are larger than they seemed.

I hope you didn't think that I was picking on you. I was actually just using this as a hook for talking about small counts.

The methods I describe will, of course, work with other events such as views or references, but there is an interesting problem that arises in trying to combine multiple modalities of interaction into an overall ranking. More about that in another posting.

Anonymous said...

To clarify... what do you mean by (log-)rate here? (Log of) number of hits in some short time period?

How do you compute the offset? Just add 1 to numerator and denominator, as in early smoothing work?

Ted Dunning ... apparently Bayesian said...

I meant the log of the rate, and the rate is, just as you say, a count over a period of time.

The offsets are indeed a smoothing device, although typically 1 is not the value you want. The motivation comes from an examination of the distribution of rates and the construction of a prior that has useful properties. You could make the prior exactly reflect the properties of your items, but I find it more useful to take a somewhat different tack (see my next blog posting).

Rashmi said...

Ted,

Thanks for the followup posts. I am going to look into your suggestions.

And I did not think you were picking - I appreciated the feedback.

Enjoyed your SlideShare presentation as well.

Any chance you could share a spreadsheet with different scenarios as to how this might play out? (Referring to your next blog post.)

Rashmi

Anonymous said...

Hello Ted,
For the last couple of months, I have been following your blog posts, talks, and paper on surprise and coincidence. They have provided me with many valuable insights. I'm working as a research assistant at a company called LogPoint, and my primary task is to use statistical methods for log analytics. I also found your t-digest paper (hosted on GitHub) very useful and used it straight away. Now, I want to get more insight into LLR on sequential/transactional data mining. Is there a paper for this? I found the slides a little harder to understand given my present state of knowledge of statistics and linear algebra.

Umanga Bista,
Research Assistant, LogPoint,
Kathmandu, Nepal