Tuesday, December 8, 2009

A call to action

The Guardian and numerous other newspapers around the world published a simultaneous editorial about climate change and global warming.

This call to action is urgent. We should heed it. This is both the greatest crisis and the greatest opportunity our species has ever faced. Sadly, failure is an option and many people are working to guarantee it by spreading rumors and lies.

The rest of us have to succeed in spite of that. For everyone's sake.

Monday, November 23, 2009

How to smooth rate estimates

I mentioned in a previous posting that I recommend smoothing rate estimates using offsets on numerator (number of events) and denominator (time period). While it is possible to build a principled prior distribution and do very detailed rate estimation based on this, I find that for "top-100" lists and for "hot-100" lists, it is more useful to adjust the prior with larger strategic goals in mind.

In particular, what I like to do is set the ratio of the offsets so that an item with no data whatsoever (event count = 0, time = 0) lands somewhere in the top half of all items, but generally not in the top 10%. This expresses a general optimism about new items: it lets them achieve high rankings with a very modest burst of enthusiasm from the audience, but it only demotes them once the data provide some proof that they really are dogs.

Once this ratio is set, it remains to set the actual magnitudes. This is done by deciding how many events you want an item to accumulate, or how long you want it to languish, before the data overcome the prior. If you want the first 10 events to have equal weight to the prior, set the numerator offset to 10. Alternately, if you want the first day of data to have equal weight to the prior, set the denominator offset to 1 day.
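To make this concrete, here is a minimal Python sketch of the smoothed estimate; the particular offsets and the implied starting rate are made up for illustration, not recommendations.

def smoothed_rate(events, time, prior_events=10.0, prior_time=5.0):
    # (events + prior_events) / (time + prior_time).  The ratio
    # prior_events / prior_time (2.0 here) is where a brand-new item
    # starts; the magnitudes decide how fast real data overcome the prior.
    return (events + prior_events) / (time + prior_time)

print(smoothed_rate(0, 0))    # 2.0 -- a new item starts at the prior rate
print(smoothed_rate(0, 5))    # 1.0 -- languishing with no events drifts down
print(smoothed_rate(40, 5))   # 5.0 -- real data quickly overwhelm the prior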

Done. Works. Simple.

I have gone to the much more complex effort of building detailed prior models and actually estimating rates in a completely principled fashion, but I have found two things:

a) business people have other agendas than pure estimation and they want a vote

b) the early estimates are pretty unreliable until the data dominate the prior (duh!). Thus, you may as well set up the prior to make (a) work.

Sunday, November 22, 2009

Twitter and small counts

I just got a very nice email from SlideShare that says:
"Transactional Data Mining" is being tweeted more than any other document on SlideShare right now. So we've put it on the homepage of SlideShare.net (in the "Hot on Twitter" section).

That sounds sooo exciting. People magazine, here I come!

Of course, when you look into the matter, I really don't think that I am going to have to worry about dodging paparazzi any time soon. What they mean by "tweeted more" seems to be 3 times in a day. Sounds like they should have read my paper about surprise and coincidence!

In fact, for applications like popularity ranking it is often really important to have a robust idea about what is hot and what is not. There are two very common problems to solve: one is to find out which items are rapidly rising, and the other is to put a reasonable rank onto things.

The problem of what is rising is surprisingly well solved by ranking items by the log of the ratio of the number of hits/views/tweets in a recent period of time to the number of hits in a somewhat older (and typically longer) period of time. Well, not quite the ratio. In practice, it is better to offset both numerator and denominator by constants that are well chosen by looking at the results, but which are actually well-founded in terms of MAP estimators for rates. The reason that this works so well is that popularities are typically distributed in a roughly Zipfian way, so the log of the rate is linearly related to the log of the rank.

A large change in the log-rank of an item is a great indicator of dramatically rising popularity, but rank itself is a pretty fuzzy measurement with small counts (the guys at SlideShare need to hear this). Log of the rate, however, is linearly related to the log of rank, so change in log-rank is proportional to change in log-rate. Moreover, the offset trick when computing the rate ratio largely deals with the small count problems that you have when something goes from no hits to one hit in a time period.
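Here is a small Python sketch of that score; the offsets and counts below are invented, and in practice you would tune the offsets by looking at the results, as noted above.

import math

def hotness(recent, older, recent_offset=3.0, older_offset=10.0):
    # Log of the offset ratio of recent activity to older activity.
    # The offsets act as the MAP-style prior mentioned above; the values
    # here are arbitrary placeholders.
    return math.log((recent + recent_offset) / (older + older_offset))

# (name, hits in the last hour, hits in the previous day) -- made-up data
items = [("a", 50, 400), ("b", 3, 0), ("c", 1, 0), ("d", 120, 5000)]
for name, recent, older in sorted(items, key=lambda i: -hotness(i[1], i[2])):
    print(name, round(hotness(recent, older), 2))

Note that item b, with only three recent hits, rises to the top of this toy list, but its score stays finite and comparable to the others; without the offsets it would be infinite.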

The second problem is computing what the rank of an item really is given bursty, nasty data like hit counts. The problem is that you only have a nice tight estimate of the hit rate for the most popular items. Since you want recent rankings, you want to use a shorter period of time for your estimates, so the problem of small counts is even worse. Conceptually, a very principled way to deal with the problem is to embrace the uncertainty: sample the rates associated with each item given the data you have and rank the items. Do this a bunch of times and you have the posterior distribution of the ranks for each item. Moreover, you also have a lookup table that tells you what any particular observed rate means relative to what rank the item might have.

This sampling can be done in a few seconds in R. I think it is cooler to add dithering to the ranked list of items so that each time you look at the list, you get another sample from this distribution. This means that items that might plausibly be in the top 100 or 1000 actually get to be there some of the time, while items that are definitely in the top few places are always where they should be. It also adds dynamism to your top-n list, which is always good on the web.
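To illustrate the idea, here is a numpy sketch (not the R code I actually used). It assumes a Gamma prior on each item's rate, so an item with k events in a window of length t has a Gamma posterior with shape a + k and rate b + t; the counts and prior offsets below are invented.

import numpy as np

rng = np.random.default_rng(42)

counts = np.array([250, 40, 7, 3, 1, 0])   # made-up hit counts for six items
t = 1.0                                    # length of the observation window
a, b = 1.0, 1.0                            # prior offsets (shape and rate)

def sample_ranks(n_samples=1000):
    # Draw a rate from each item's posterior, then rank them (0 = hottest).
    rates = rng.gamma(a + counts, 1.0 / (b + t), size=(n_samples, len(counts)))
    order = np.argsort(-rates, axis=1)     # best-first ordering per sample
    return np.argsort(order, axis=1)       # rank of each item per sample

ranks = sample_ranks()
print(ranks.mean(axis=0))                  # expected rank of each item
print((ranks < 3).mean(axis=0))            # chance each item is in the top 3

# Dithering: serve one fresh sample of the ranking on each page view.
print(np.argsort(-rng.gamma(a + counts, 1.0 / (b + t))))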

Sunday, November 1, 2009

Video of talk on transactional data

If you want to see more about some of the techniques I have used in projects, take a look at this talk I gave to the Bay Area ACM chapter.

See "Transactional Data Mining"

Slides are available here

Wednesday, April 29, 2009

Sampling Dirichlet Distributions

[I just realized that this post from last year was only half the story. See this post about using the gamma distribution directly to sample Dirichlet distributions.]

I just commented on a post by Andrew Gelman about methods for sampling Dirichlet distributions. Those comments were pretty non-specific and deserve a bit of amplification.

First off, a Dirichlet distribution is a distribution of real-valued tuples,
\[(x_1 \ldots x_n) \sim \mathrm {Dir}(\pi_1 \ldots \pi_n) \]
such that \(x_i \ge 0\) and \(\sum_i x_i = 1\).

The parameters \(\pi_i\) are all positive.

The original question had to do with sampling the Dirichlet parameters, especially from a conjugate distribution. The one true answer in mathematical terms is that there is, indeed, a continuous distribution that is conjugate to the Dirichlet. In practical terms, however, that isn't the answer you really want.

A much more practical answer is that the Dirichlet parameters can be sampled from a prior that is characterized by \(n+1\) positive real parameters using the following procedure:
\[\begin{aligned}
\left(m_1 \ldots m_n \right) & \sim \mathrm {Dir} (\beta_1 \ldots \beta_n ) \\
\alpha & \sim \exp (\beta_0) \\
\pi_i & = \alpha m_i
\end{aligned}
\]
Alternative distributions for \(\alpha\) include the gamma distribution and the log-normal (an exponentiated normal), such as
\[ \log \alpha \sim \mathcal N (0, 2)\]
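Here is a short numpy sketch of that recipe; the \(\beta\) values are arbitrary, and I am reading \(\exp(\beta_0)\) as an exponential distribution with mean \(\beta_0\) (if it is meant as a rate, use \(1/\beta_0\) instead).

import numpy as np

rng = np.random.default_rng(0)

def sample_dirichlet_params(beta0, beta, size=1):
    # m ~ Dir(beta), alpha ~ Exponential(mean beta0), pi = alpha * m
    m = rng.dirichlet(beta, size=size)              # shape (size, n)
    alpha = rng.exponential(beta0, size=(size, 1))  # numpy takes the mean/scale
    return alpha * m

beta = np.array([1.0, 2.0, 3.0])               # arbitrary illustration values
pi = sample_dirichlet_params(beta0=5.0, beta=beta, size=4)
x = np.array([rng.dirichlet(p) for p in pi])   # draws from Dir(pi)
print(pi)
print(x)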

Tuesday, March 24, 2009

What would really help map-reduce programs?

David Hall is touting a Scala package that he calls SMR (Scala for Map Reduce).

I built something like this quite some time ago using Groovy. I called the resulting Groovy map-reduce facade grool because it was pretty thin. The results were even more impressive for toy programs than what SMR achieved. For instance, here is a fully Hadoop-compatible map-reduce version of word count:

wc = mapReduce(
    // map: emit (word, 1) for each word on the input line
    {key, line, out, report -> line.split().each {w -> out.collect(w, 1)}},
    // reduce: sum the counts collected for each word
    {w, counts, out, report -> out.collect(w, counts.sum())}
)

In this example, wc is a function that can then be applied to various inputs as in

wc(["this is input text", "for the word-counter"])

or

wc(new File("/local/input/file"))

or even

wc(functionUsingMapReduce(new HdfsFile("/existing/file")))

In spite of how pretty these tiny examples are, I eventually quit using it for day-to-day map-reduce coding.

The problem with things like grool that attempt to make writing map-reduce programs easier is that they don't really solve the problems that you face in writing map-reduce programs. Yes, grool makes word-count much shorter (5-7 lines of code versus >200), and it involves a very cool hack so that the closures passed as arguments are available as map and reduce objects on the right machines. But I found in using grool for production code that it didn't solve the right problem.

The big problems involved in map-reduce programming are:

  • map-reduce is just a tad too low-level for most problems.

  • debugging programs written using a facade like this is harder than for programs written using raw java + map-reduce.

  • the boilerplate of map-reduce isn't proportional to problem size so large programs exhibit much less improvement in grool than small programs do.

What I would much rather have is something that provides much higher-level semantics, such as those provided by Pig, but with better integration into real programming languages. The better integration that I need goes in two directions. First, I want to be able to, for instance, compute every date in a date range, use those dates to form input paths, and then run a computation on those input paths. Pig can't do that except via a very ugly argument substitution hack.
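As a sketch of the kind of integration I mean (the path template and the wordCount call here are hypothetical, just to show the shape of it):

from datetime import date, timedelta

def daily_paths(start, end, template="/logs/{d:%Y/%m/%d}/part-*"):
    # One input path per day in [start, end]; the template is made up.
    d = start
    while d <= end:
        yield template.format(d=d)
        d += timedelta(days=1)

paths = list(daily_paths(date(2009, 3, 1), date(2009, 3, 24)))
# In the system I want, these paths would feed straight into the next
# computation, e.g. something like wordCount(paths), instead of being
# pasted into a Pig script by argument substitution.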

A second kind of better integration is what Chris Wensel has been pushing for some time with Cascading. I would like it if I could write programs in what might be called PL/Pig (by analogy with PL/SQL) that would then be tightly integrated into a Cascading flow. Ideally, the PL/Pig program would expose its execution plan to Cascading for global optimization and scheduling.

Thursday, January 15, 2009

Real-time decision making using map-reduce

Recently Tim Bass described his happiness that the Mahout project has taken on some important tasks that are often applied to real-time decision making.

I think that Tim's joy is justified, although Mahout is still a very new project that is definitely not a finished work by any means. There are a number of important algorithms being worked on that could provide some very important capabilities for real-time decision makers.

The fact is, though, that map-reduce is no silver bullet. It won't make problems go away. It is an important technology for large-scale computing that lends itself to the batch training of real-time models, if not quite to high-availability real-time decisioning. For that, I tend to add systems like ZooKeeper and to build systems in the style of the Katta framework.

In my mind, the really important change that needs to happen is that designers of real-time decisioning systems need to embrace what I call scale-free computation.

Scale-free computation is what you get when you don't let the software engineers include the scale of the process as a design parameter. Without that knowledge, they have to build systems that will not require changes when the scale changes. Map-reduce is a great example of this because the map and reduce functions are pure functions. The person who writes those functions has no idea how many machines will be used to compute them ... moreover, the framework is free to apply these functions more than once to the data if it finds a need.
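Here is a tiny, purely illustrative Python sketch of that point: because the map and reduce functions below are pure, the toy driver can split the work across any number of chunks without their knowledge, and the answer never changes.

from collections import defaultdict
from itertools import chain

def map_fn(line):                      # pure: knows nothing about scale
    return [(w, 1) for w in line.split()]

def reduce_fn(word, counts):           # pure: safe to re-run if needed
    return word, sum(counts)

def run(lines, n_chunks):
    # The "framework": split the input across n_chunks imaginary machines.
    chunks = [lines[i::n_chunks] for i in range(n_chunks)]
    pairs = chain.from_iterable(map_fn(l) for c in chunks for l in c)
    grouped = defaultdict(list)
    for w, n in pairs:
        grouped[w].append(n)
    return dict(reduce_fn(w, ns) for w, ns in grouped.items())

lines = ["this is input text", "for the word counter"]
assert run(lines, 1) == run(lines, 5)  # the same result at any "scale"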

Katta does something similar by allowing the manager of a search system to specify what should happen without specifying quite how. The master system uses ZooKeeper to maintain state reliably; it examines the specification of what is desired and allocates tasks to search engines. The search engines detect changes in their to-do lists and download index shards as necessary. Clients examine the state in ZooKeeper and autonomously dispatch requests to search engines and combine the results.

The overall effect, again, is that the person implementing a new Katta search system has very little idea how many machines will be used to run the search. If the corpus is large, then many shards will be used. If throughput is required, then shards will be replicated many times. Neither requires any change to the code.

Scale-free design patterns like this are required for modern, highly reliable decision systems. The change in mind-set required is non-trivial, and dealing with legacy code will be prodigiously difficult. As usual with major changes in technology, the best results will come after the old code is retired. Hopefully that can happen before we old coders retire!