## Monday, June 7, 2021

### What's new in T-Digest New Version 3.3?

There's a new version of $t$-digest in town. This new version doesn't much change how things work from the standpoint of the API, but it does improve accuracy under most circumstances. If you are using $t$-digest, it is a generally good idea to upgrade to the 3.2 release.

Let's dig into the changes after a quick review of how the $t$-digest works.

### Density estimation through clustering

The basic idea behind the $t$-digest is that if you cluster your data, the position and number of samples in each cluster can be used to estimate the distribution of your original data. Further, if you guarantee that the clusters near the extreme values are kept small and the clusters near the median are allowed to grow, you can maintain very good accuracy for extreme quantiles while sacrificing only a small amount of accuracy for central values. Because clustering in one dimension is very different from clustering in multiple dimensions, the $t$-digest can provide a very good trade-off between memory consumption and accuracy.

The $t$-digest guarantees not to use more than a set amount of memory, but this can theoretically be shown to make it impossible to give guarantees on the relative accuracy of quantile estimates. Indeed, a pathological distribution of data can cause a $t$-digest to give large errors. Happily, a sufficiently pathological distribution of values is essentially impossible to have in any physical measurement. For instance, if you have times that range from picoseconds to the age of the universe with massive skew, you will still be hundreds of powers of ten short of a distribution sufficiently pathological to cause large errors.

The sizes of the clusters in a $t$-digest are bounded using a so-called scale function and a compression factor. The scale factor determines how aggressive the digest is about keeping extremal clusters small and the compression factor limits the number of clusters retained.

### Better boundary conditions

The latest version of the $t$-digest maintains several invariants more assiduously than previous version. First of all, the minimum and maximum values ever seen are retained explicitly. Secondly, the digest makes sure that the first and last centroid always have unit weight except when the K_0 scale function is used. This improves the quality of interpolation at the end points as we will see in the next section.

### Better interpolation

The most important accuracy changes to the $t$-digest (besides general code simplification and cleanups) have to do with the way that interpolation is handled. The basic idea is that centroids that represent a single point need special handling because we know exactly where that point is. Moreover, we know for the first and last centroid that at least one sample is at the global minimum (or maximum respectively) and thus we can treat the first (or last) centroid as if it has one fewer sample than observed. With the K_0 scale function, extreme centroids can have more than one sample and if such an extreme centroid has exactly two samples, we can infer the exact location of the second sample allowing us to do better interpolation. Even if there are more samples than that, we can pull the extreme sample out of the mean and get slightly better interpolation.

The case where a singleton centroid is next to a centroid with more samples is handled better as well.

How this works is illustrated here. Here we have two cases with three centroids. On the left, all three centroids have weight greater than one while on the right, two centroids have weight equal to one.

The basic idea is that we normally assume that half of the points for each centroid land on either side of the centroid. That means that the number of samples between two centroids is assumed, for interpolation purposes, to be the sum of the halves from the two adjacent centroids. But when we have a neighboring centroid with just one sample, we don't have to linearly interpolate all the way up to that centroid and thus can spread the weight from the non-singleton centroid over a slightly narrower range.

This is important near the tails of any digest, but it is really important to get this right when the total amount of data that has been added to a digest is a small-ish multiple of the compression factor of the digest. This is because any such digest will have many singleton centroids near the tail and a clean transition to non-singleton centroids near the center can be a major source of error.

The impact of this can be seen in comparisons against the relative error version of the KLL digest. Here the compression factor of the $t$-digest was adjusted to give a stored digest size of roughly 1kB or 4kB.  The first value represents relatively normal operation for a $t$-digest with compression factor in the range of 100 or so while the larger value matches the size of the KLL digest. The following figure shows this comparison for varying numbers of samples.

In these figures, you can see that all three digests have zero error for small number of samples such that $qn \approx 10$. This is because both digests are remembering the exact sample values for the 10 smallest samples by collapsing more medial samples together except in the notable case of the smaller $t$-digest with $q=10^{-4}$ and $n=10^6$. When $qn\approx 100$, this is no longer possible and the ability to interpolate between singletons and larger centroids becomes crucial. This transition point is the worst case for the $t$-digest and for $qn > 100$, the ability to interpolate with larger centroids allows even the small $t$-digest to achieve substantially smaller errors than the KLL digest.

### Two new papers

The $t$-digest was described in a recent open access paper in the journal Software Impacts.  It doesn't introduce any new material, but is an open-access peer-reviewed citation if you need one.

Another article is a bit more confrontational and is as yet unpublished as far as I know but is available in pre-print form. This article concisely describes both the $t$-digest and the KLL sketch, but is notable for providing an attack on the $t$-digest accuracy in the form of a pathological distribution (appropriately named $\mathcal D_{hard}$) that is designed to be roughly as skewed as any distribution can be when represented with floating point numbers. In fact, the central half of the attack distribution is in the range $[ -10^{-154}, 10^{-154}]$ while the maximum value is $\approx 10^{308}$. For reference, the ratio of the age of the observable universe to the time light takes to transit a proton's diameter is only about 33 orders of magnitude.

Note that such an adversarially designed distribution actually must exist because the fixed memory guarantees of the $t$-digest are impossible to meet if you maintain relative accuracy for all distributions. Thus, the question is not whether there are distributions that degrade the accuracy of the $t$-digest. Rather, it is a question of whether it is plausible that any real data will ever be like these pathological cases and how gracefully $t$-digest behaves under duress. The evidence so far indicates that no real data is anything like $\mathcal D_{hard}$ and that $t$-digest is relatively well behaved even when it cannot succeed.

The $\mathcal D_{hard}$ distribution is still useful in less extreme forms for hardening the $t$-digest even if it doesn't apply much to real data. As an example, here is a graph that shows the accuracy of estimating the CDF of $\mathcal D_{hard}$ with a scale parameter of $e_{Max}=50$ which corresponds to a dynamic range of $10^{75}$. Note that the axes are on a log-odds scale which can de-emphasize errors.

In contrast, when the scale is pushed to the maximum level $e_{Max}=308$, we can see that the quantile estimation of a $t$-digest with compression factor of 200 is still accurate for the tails but essentially all resolution has been lost in the range from $q \in [0.1, 0.9]$. For reference, the range for the problematic region is $6 \times 10^{221}$ smaller than the total range. The range for the central half of the distribution is smaller than the total range by an amount quite a bit larger than the maximum dynamic range for 64-bit IEEE floating point numbers.

Increasing $\delta$ to a value of 500 decreases the region where $t$-digest is unable to produce useful estimates to roughly the central half of the $\mathcal D_{hard}$ distribution. So, if you actually have data with this much dynamic range, you have a work-around.

### Lots of code cleanups

This release has benefited from several years to code cleanups, many due to community contributions by an army of careful readers of the code. Most of these improvements have been made to the MergingDigest to the partial neglect of the AVLTreeDigest

## Tuesday, October 13, 2020

### What is the right number of clusters?

This is a question that comes up *all* the time when people first encounter clustering. Sadly, the information that they get about how to choose the right number of clusters is exactly wrong in a lot of the situations. The single right answer is that you should choose the parameterization of your clustering that optimizes whatever downstream use you have for clustering.

Far too often, that downstream use is simply ignored and advice like finding the knee in the residual is given as if it applied universally.
There are some really cool algorithms that apparently let the number of clusters come from the data. For instance, check thisOr take a look at this description of DP-means which applies these ideas of Bayesian modeling to the problem of clustering.

These are very exciting and you can use these methods for some pretty impressive practical results. For instance, this really fast version of $k$-means uses something akin to DP-means as an internal step to build good sketches of an empirical distribution that can give you a high quality clustering (probably).

But there is a really dark side here. The fact is that you are just moving the arbitrary decision about the number of clusters into an arbitrary decision about which prior distribution to pick.

The fact is, there is no universal best number of clusters for $k$-means clustering. You have to make assumptions and those assumptions may run very, very counter to reality. Whether those assumptions concern a heuristic choice of $k$ for $k$-means, or a heuristic choice of a scaling parameter $\lambda$ for DP-means, or a base distribution $G_0$ for a Dirichlet process, or even just deciding which data to ignore, you are still making a strong decision about what is best and that decision does not come from the data. Each of these decisions, together with choices like how you define distance are just different ways of encoding our prejudices.

And it is completely necessary to encode our prejudices.

The real-world term for these assumptions is called “domain knowledge”. You need to know why you are clustering data to understand how to cluster it.

As a classic example, the data in the figure below are all the exact same synthetic dataset viewed with different definitions of distance. Should there be one, two, three or four clusters? Or should there be 200? Each of these is a correct answer for different applications.

(see  for the R script that generated these figures)

So let's step through some practical thoughts on clustering:

- first, think about what you want to do. Is the clustering going to make pretty pictures? Or are you using the clustering as an input to a machine learning system?

- second, think about distance and scaling. It is very unlikely that the data as you get it will be what you want. Some variables might be nearly insignificant and should be removed or scaled down. Others might be important, but measured on a very different scale from the rest of the variables you observe. There might well be other ways to improve the sense of proximity than Euclidean distance on your original data. That takes us to the next point.

- consider dimensionality reduction before clustering. Algorithms like UMAP can vastly simply the structure of your data (if you have a decent metric). The sophistication of your projection can have the result of allowing a simpler clustering algorithm.

- try something like $k$-means or $DP$-means to see how it goes. To evaluate this, test your data against its ultimate use. If that is making pictures, look at the pictures. If it is intended to generate features for a downstream model check how much lift your clustering gives you.

As an example, suppose that our data is heavily intertwined like these two spirals.

The conventional wisdom is that this kind of data is not suitable for simple clustering methods like $k$-means. But here, I have used 40 clusters (see the x marks) and you can see that the data nearest to each cluster is from a single class. That means that the distance to each of the clusters is an interesting restatement of the original data that makes things much simpler.

Note that this is the opposite of dimensionality reduction. The original data had 2 dimensions, but the restated data as distances to clusters now has 40 dimensions. Classification in this restated space is vastly simpler than in the original space because we just have to find which of the 40 values is smallest to find which color the point should be. And, indeed, that is what we see.

Here is an ROC curve for two linear models. The brown line shows a linear classifier that is based on the original data. You can see how it barely exceeds the performance of a random classifier (the black diagonal). On the other hand, the classifier that uses the inverse distance to each of the 40 clusters is very good (you can tell because the green line goes very close to the upper left corner).

In fact, this model is so simple that you could hand code it by finding the nearest cluster and using the dominant color for that cluster.

This is wonderful and cool. It should be remembered, however, that you could get similarly good results by using a fairly simple neural network with enough hidden nodes. This is because the hidden nodes in a neural network function very similarly to the way that this cluster based classifier is working. This happens because the multiplication by the weights for the first layer is a dot product and this dot product is closely related to Euclidean distance.

The point here is that 40 clusters is far more than would have been recommended by most of the simple criteria that many people recommend (such as finding the knee in the total distance). That difference is because the intended use is very different.

So to mis-quote Forrest Gump, clustering is as clustering does!

## Sunday, March 25, 2018

### Getting the right answer ... much later

A whole lot of years ago, I was a poor student at the University of Colorado in Boulder. Having nothing better to do (like school work) I was wandering the halls of the upper floors of the Engineering Center, and one name tag caught my eye.  It said "S. Ulam".

Being a geek and having gone to high school in Los Alamos, I knew about Stanislaw Ulam. At least, I knew a bit about the work he had done during the Manhattan project at Los Alamos. And that he was among the greats of mathematics. But it seemed totally implausible that the person behind that door could really be Stanislaw Ulam. Ulam was a creature of myth to me. Not just from another generation, but practically from another world. There was no way that it could be him. No way.

So I had to find out.

I knocked.

And was invited in.

So I asked, "Are you Stanislaw Ulam?". He seemed amused. "Yes".  "Uh..." I said.

Just as a note to anyone meeting a god who steps down onto the earth to walk among us, I suggest you have a question figured out ahead of time because you aren't going to come up with anything very good in the moment. At least I couldn't. In a similar vein, figure out your three wishes ahead of time as well in case you meet a genie.

After a moment of hurried thought, I came up with "Could we talk?". Seriously. That is the best I could do.

He said that would be fine, but that he had about 20 minutes of work to do first. To keep me busy, he gave me a problem to work on during that time. Not surprisingly, that problem was a really good one for assessing my mathematical sophistication.

That sophistication was not very high (and really still isn't decades later) partly because of youth and partly because of attitude and partly because schools can't deal with students who learn quickly. My history with math was that during an explosive few weeks in the 8th grade I had gotten the bug and worked my way through all of the high school math curriculum including calculus. It was a wonderful time and didn't feel like learning so much as just remembering how things worked. But there wasn't much anything to be done beyond that because there wasn't much anybody to talk to who knew what lay past that because we lived on a military base in Germany. I could wander through books, but I didn't really have enough understanding to go much further than a bit of differential geometry and multivariate calculus. Later, when we moved to Los Alamos, there were all kinds of people I could have learned from but I was engrossed in electronics and computing and German and Russian.

The problem that he gave me was to prove a simple fixed point theorem. Given a set $S$ and a continuous function $f: S \rightarrow S$ such that $f$ is a contraction, show that $f$ has a fixed point in $S$. A contraction is a function that maps distinct points closer together, $x_1 \ne x_2 \implies \rho(x_1, x_2) > \rho(f(x_1), f(x_2))$ where $\rho$ is some metric on $S$.

As with all good math questions, the first response should always be clarifications to make sure that the problem is actually well understood. Even though it has been a very long time, I remember asking if I could assume that $S$ is compact and that $f$ is continuous. As I remember, I didn't actually finish a proof during the short time I had then, but I did sketch out an approach that started by noting that $\rho(x, f(x)) > \rho(f(x), f(f(x)))$. Unfortunately, I took a wrong turn at that point and thought about limits. That apparently didn't matter and I had some great talks with Professor Ulam and his friend Jan Micielski.

One of the cool things about mathematics is that questions like this wind up in the back of your head and can pop out at any time. That just happened as I was reading a review of the latest edition of Proofs From the Book. This looks like a great book with all kinds of interesting insights to be had, but I just looked at their wonderfully short proof of the fundamental theorem of algebra. That proof goes something like this

1. Every real-valued function $f$ over a compact set $S$ has a minimum in $S$
2. Every point $x$ such that a complex polynomial is non-zero, $|p(x)|>0$, has a nearby point $x'$ such that $|p(x')| < |p(x)|$
3. Thus, the minimum of $|p(x)|$ exists, but all points $x$ such that $|p(x)|>0$ are not the minimum. Therefore, for some point $x_0$ we have $|p(x_0)|=0$
Point 2 takes a tiny bit of algebra, but overall this proof is stunningly simple and beautiful. But in the way that these things happen, after reading this proof I remembered the contraction theorem that I hadn't quite come up with so long ago and it occurred to me that this same trick would apply there.

The outline is
1. For any fixed point $x^*$, we have $\rho(x^*,f(x^*)) = 0$
2. For any other point $x$ we have $\rho(x, f(x)) > 0$
3. The function $d(x) = \rho(x, f(x))$ is continuous and has a minimum in $S$
4. For any point $x$ such that $d(x) \ne 0$, we have $d(x) > d(f(x))$ and thus $d(x)$ is not minimized at $x$
5. Thus, there exists some point $x^*$ where $d(x^*)=0$, that is to say $x^*$ is a fixed point.
Isn't it just the way of the world that you think of the perfect answer just a little bit too late?

## Saturday, June 17, 2017

### Generalized Log-Likelihood Test for Poisson Distribution

In another blog, I described how a generalized log-likelihood ratio can be used to find interesting differences in counts. In the simplest cases, we have counts for some kind of observation (A and not A) under two conditions (B and not B). The question of interest is whether the the rate of A varies with whether or not condition B applies. In classical statistics, we would be talk about having a test against a null hypothesis, but in practice we want to look at lots of A's under lots of different conditions B. Testing so many situations makes it very hard to use the classical machinery well, so we have to look at the problem a bit differently.

As I mentioned in other blogs, we can still use a classically derived test known as the generalized log-likelihood ratio as a way of simply ranking different A-B combinations against each other according to how interesting they are. Even without being able to interpret the statistical score as a statistical test, we get useful results in practice.

The generalized log-likelihood ratio most commonly used in these situations is derived assuming we have two binomial observations. This test can be extended to compare two multinomial conditions for independence, but this is rarely done, if only because comparing two binomials is so darned useful.

With the binomial test, we look at the number of positive observations out of some total number of observations for each condition. In some situations, it is much more natural to talk about the number of positive observations not as a fraction of all observations, but as a fraction of the duration of the condition. For instance, we might talk about the number of times we noticed a particular kind of network error under different conditions. In such a case, we probably can say how long we looked for the errors under each condition, but it can be very hard to say how many observations there were without an error.

For a Poisson distribution under two conditions $A$ and $\neg A$, we can observe a count for each of the conditions as well as the total time over which the count is taken. We can arrange our results this way:

 Count $$\Delta t$$ $$A$$ $$k_1$$ $$\,\,\,\, t_1 \,\,\,\,$$ $$\neg A$$ $$k_2$$ $$t_2$$

This is suggestive of the way that counts from two binomial observations can be arranged to look for a difference under different conditions \cite{dunning93}.

We can investigate whether the Poisson distribution the same under both conditions using the generalized log-likelihood ratio test. Such a test uses $\lambda$, the generalized log-likelihood ratio,
$\lambda = \frac{ \max_{\theta_1 = \theta_2 = \theta_0} p(k_1 | \theta_0)p(k_2 | \theta_0) }{ \max_{\theta_1, \theta_2} p(k_1 | \theta_1)p(k_2 | \theta_2) }$
According to Wilks\cite{wilks1938} and also later Chernoff\cite{Chernoff1954}, the quantity $-2 \log \lambda$ is asymptotically $\chi^2$ distributed with one degree of freedom.

For the Poisson distribution,
$p(k | \theta, t) = \frac{(\theta t)^k e^{-\theta t}}{k!} \\ \log p(k|\theta, t) = k \log \theta t - \theta t - \log k!$
The maximum likelihood estimator $\hat\theta$ can be computed by maximizing the log probability
$\max_\theta \frac{(\theta t)^k e^{-\theta t}}{k!} = \max_\theta \log \frac{(\theta t)^k e^{-\theta t}}{k!} \\ \log \frac{(\theta t)^k e^{-\theta t}}{k!} = k \log \theta t - \theta t - \log k! \\ \frac{\partial \log L(k | \theta, t)}{\partial \theta} = \frac{k}{ \theta} - t =0 \\ \hat \theta = k$
Returning to the log-likelihood ratio test, after some cancellation we get
$-\log \lambda = k_1 \log k_1 + k_2 \log k_2 - k_1 \log \frac{k_1+k_2}{t_1+t_2} t_1 - k_2 \log \frac{k_1+k_2}{t_1+t_2} t_2$
Some small rearrangement gives the following preferred form that is very reminiscent of the form most commonly to compute the log-likelihood ratio test for binomials and multinomials
$-2 \log \lambda = 2 \left( k_1 \log \frac{k_1}{t_1} + k_2 \log \frac{k_2}{t_2} - (k_1+k_2) \log \frac{k_1+k_2}{t_1+t_2} \right)$

## Wednesday, September 18, 2013

### How Accurate is Mahout for Summing Numbers?

A question was recently posted on the Mahout mailing list suggesting that the Mahout math library was "unwashed" because it didn't use Kahan summation.  My feeling is that this complaint is not founded and Mahout is considerably more washed than the original poster suggests.  Here is why I think this.

As a background, if you add up lots of numbers using a straightforward loop, you can lose precision. In the worse case the loss is $O(n \epsilon)$, but in virtually all real examples the lossage is $O(\epsilon \sqrt n)$. If we are summing a billion numbers, the square root is $\approx 10^5$ so we can potentially lose 5 sig figs (out of 17 available with double precision).

Kahan summation increases the number of floating point operations by $4 \times$, but using a clever trick and manages to retain most of the bits that would otherwise be lost. Shewchuk summation uses divide and conquer to limit the lossage with $O(\log n)$ storage and no increase in the number of flops.

There are several cases to consider:

1) online algorithms such as OnlineSummarizer.

2) dot product and friends.

3) general matrix decompositions

In the first case, we can often have millions or even billions of numbers to analyze. that said, however, the input data is typically quite noisy and signal to noise ratios $> 100$ are actually kind of rare in Mahout applications. Modified Shewchuk estimation (see below for details) could decrease summation error from a few parts in $10^{12}$ to less than 1 part in $10^{12}$ at minimal cost. These errors are $10^{10}$ smaller than the noise in our data so this seems not useful.

In the second case, we almost always are summing products of sparse vectors. Having thousands of non-zero elements is common but millions of non-zeros are quite rare. Billions of non-zeros are unheard of. This means that the errors are going to be trivial.

In the third case, we often have dense matrices, but the sizes are typically on the order of $100 \times 100$ or less. This makes the errors even smaller than our common dot products.

To me, this seems to say that this isn't worth doing. I am happy to be corrected if you have counter evidence.

Note that BLAS does naive summation and none of the Mahout operations are implemented using anything except double precision floating point.

Here is an experiment that tests to see how big the problem really is:

@Test
public void runKahanSum() {
Random gen = RandomUtils.getRandom();

double ksum = 0;                // Kahan sum
double c = 0;                   // low order bits for Kahan
float sum = 0;                 // naive sum
float[] vsum = new float[16];  // 8 way decomposed sum
for (int i = 0; i < 1e9; i++) {
float x = (float) (2 * gen.nextDouble() - 1);
double y = x - c;
double t = ksum + y;
c = (t - ksum) - y;
ksum = t;
sum += x;
vsum[i % 16] += x;
}
// now add up the decomposed pieces
double zsum = 0;
for (int i = 0; i < vsum.length; i++) {
zsum += vsum[i];
}
System.out.printf("%.4f %.4f %.4f\n",
ksum, 1e6 * (sum - ksum) / ksum,
1e6 * (zsum - ksum) / ksum);
}

A typical result here is that naive summation gives results that are accurate to within 1 part in $10^{12}$  8 way summation manages $< 0.05$ parts in $10^{12}$ and 16 way summation is only slightly better than 8 way summation.

If the random numbers being summed are changed to have a mean of zero, then the relative error increases to 1.7 parts in $10^{12}$ and 0.3 parts in $10^{12}$, but the absolute error is much smaller.

Generally, it doesn't make sense to do the accumulation in float's because these operations are almost always memory channel bound rather than CPU bound.  Changing to floating point arithmetic in spite of this decreases the accuracy to about 500 parts per million, 200 parts per million respectively for naive summation and 8 way summation

## Wednesday, April 24, 2013

### Learning to Rank, in a Very Bayesian Way

The problem of ranking comments by a crowd-sourced version of "quality" is a common one on the internet.

James Neufeld suggests that Bayesian Bandit algorithms can be applied to this problem. The basic idea is that you would define a stochastic quality metric whose distribution for each comment depends on the up and down votes that comment has received.

Normal ranking algorithms try to estimate the single best value for this quality metric. Neufeld suggests that this value should be sampled from a beta distribution which models the probability that a user would mark the comment positively given that they have marked the comment at all. To present comments to a user, the metric would be sampled independently for each comment and the comments would be sorted according to the resulting scores. Different presentations would necessarily result in different orders, but as users mark comments positively or negatively, the order should converge to one where the comments presented near the top of the list have the highest probability of being marked positively.

One very nice thing about this approach is that it doesn't waste any cycles on determining the ranking of low quality comments. Once the quality of these comments has been determined to be relatively lower than the best columns, no more learning need be done with those comments. This accelerates learning of the ranking of the best options dramatically.

This idea is interesting enough that I built a quick implementation which you can find on github.  The main sample code there invents several hundred "comments" each with a uniformly sampled probability of getting a positive rating.  The ideal behavior for ordering the comments would be to put the comment with the highest probability of getting a positive rating first and the one with the lowest probability last.  The way that the program proceeds is that it picks a pageful of twenty comments to show and then proceeds to generate ratings for each of the comments on that page according to the underlying probability associated with the items displayed.  The process of generating pages of comments to show and applying feedback is repeated and performance is measured.

Here are some results of running the program.  Here we have 200 total comments, of which 20 are shown on the page that defines which comments are rated.  Precision is measured here to determine how many of the best 10 comments are shown on the page.  As can be seen, the system shows immediate improvement as ratings are collected.  The performance rises from the initially random 10% precision and passes 50% after 30 pages of ratings.

As James demonstrated in his article and as others have demonstrated elsewhere, this class of algorithm is very effective for this sort of bandit problem.  What is much less well known is how easily you can build a system like this.

### Try it yourself

To run this code, you will need git, maven and java 1.7.  To download the source code and compile the system, do this

$git clone git://github.com/tdunning/bandit-ranking.git$ cd bandit-ranking
$mvn package This will download all dependencies of the code, compile the code and run some tests. To run the test program, do this$ java -jar target/bandit-ranking-*-with-dependencies.jar

The output is a thousand lines of numbers that you can drop into R, OmniGraphSketcher or even Excel to produce a plot like the one above.

### Quick code dissection

In the main program for this demo, a BetaBayesFactory is used to construct several BayesianBandit objects (for average results later).  This pattern can be used with other kinds of bandit factories.

The BayesianBandit objects allow you to do a variety of things include sampling (BayesianBandit.sample) for the current best alternative, ranking (BayesianBandit.rank) a number of alternatives and providing training data (BayesianBandit.train).  Sampling is used in a traditional multi-armed bandit setting such as with A/B testing.  Ranking is used as it is here for getting a list of best alternatives and training is used ubiquitously for feeding back training data to the bandit.

Evaluation can be done by computing precision as is done here (how many good items are in the top 20?) or by computing regret.  Regret is defined as the difference between the mean payoff of the best possible choice and the mean payoff of the choice made by the bandit.  For the ranking problem here, I assume that payoff of a page is the sum of the probabilities of positively rating each item on a page.

The BetaBayesFactory internally uses a beta-binomial distribution to model the likelihood of a positive rating for each rank. A more general alternative would be to use a gamma-normal distribution. This can be done by using the GammaNormalBayesFactory instead. This extra generality comes at a cost, however, as the graph to the left shows. Here, the beta-binomial distribution results in considerably faster convergence to perfect precision than the gamma-normal. This is to be expected since the beta-binomial starts off with the assumption that we are modeling a binary random variable that can only take on values of 0 and 1. The gamma-normal distribution has to learn about this constraint itself. That extra learning costs about 50 pages of ratings. Put another way, the cumulative regret is nearly doubled by the choice of the gamma-normal distribution.

In order to understand what the algorithm is really doing at a high level, the graph on the right is helpful.  What it shows is the number of times comments that are at different ranks are shown.  What is striking here is that comments that are below the fourth page get very few trials and even on the second page, the number of impression falls precipitously relative to the first page of comments.  This is what you would expect because in this experiment, it takes only a few ratings on the worst comments to know that they stand essentially no chance of being one of the best.  It is this pattern of not sampling comments that don't need precise ranking that makes Bayesian Bandits so powerful.

## Friday, October 26, 2012

### References for On-line Algorithms

[See also my more recent blog on this talking about using bandits for ranking rated items like comments]

I have been speaking lately about how various on-line algorithms have substantial potential for various real-time learning applications.  The most notable of these algorithms are Thompson sampling for real-time handling of multi-armed bandit and contextual bandit problems and an algorithm due to Shindler, Myerson and Wang for fast $k$-means clustering of data.  Since I have had a number of requests for references back to the original sources for these works, I figured a few blog posts would be a good thing.  This post will describe the multi-armed bandit work and the next will describe the clustering work.

#### The Basic Problem - Multi-armed Bandits

For the bandit problems, there are two basic problems to be dealt with.  The first and most basic problem is that of the multi-armed bandit.  In this problem, you can sample from any of a finite number of distributions and your goal is to maximize the average value of the values that you get.  This can be cast into a number of practical settings in which you select which slot machine to put a quarter into, or you select which on-line ad to present to a user or you select which landing page to deliver to a user's browser should see when they visit a particular URL.  It is common to simplify this case further by assuming a stationary distribution.  Obviously, at least one of the distributions you are picking from has a mean equal to the large mean of any alternative.  Any time you take a sample from a distribution that has a smaller mean, you fall behind the theoretical best, on average, that you could have achieved by picking from (one of) the best distributions.  The degree by which you fall behind is known as the regret that you incur.

The key to the multi-armed bandit problem is that you cannot know which distribution might have the largest mean.  This means that you have to sample all of the distributions in order to estimate their means, but this implies that you have to sample from the lesser distributions in order to determine that their are, in fact, inferior.

There are well known bounds on how well you can actually solve this problem.  There are also a number of algorithms that have regret on par with these bounds or come reasonably close to these bounds.  Mostly, however, these known solutions have limitations either on the number of distributions they can consider or on the complexity of the solution.

Kuleshov and Precup provide some good examples of how to compare different bandit algorithms in their paper.  This tutorial on bandits provides a wider view of different forms of multi-armed bandit problems with a number of references.

Conspicuously missing from most lists of references, however, is all the recent work using Thompson sampling.  These algorithms, which I have referred to as Bayesian Bandits, have particularly nice properties of simplicity and optimality.  Chapelle and Li provide an empirical look at performance with these algorithms compared to upper confidence bound (UCB) algorithms.  The last paragraph of that paper laments the lack of a theoretical analysis of these algorithms, but that lack was cured shortly in this paper by Agrawal and Goyal.  Scott provided a more comprehensive view of these algorithms under the name of randomized probability matching.

The idea behind Bayesian Bandits is quite simple.  For each bandit, we maintain use the observations so far to build a posterior distribution for the mean of the associated payoff distribution.  For binary payoffs, it is common to use a $\beta$-binomial distribution and for other cases a $\gamma$-normal distribution works well.  To pick a bandit, we sample a mean for each bandit from these posterior distributions and then pick the bandit with the largest sampled mean.  The new sample from that bandit gives us more data which refines the posterior distribution for that bandit.  We can repeat this process as long as desired.

#### Extensions to Contextual Bandits

One of the most important characteristics of the Thompson sampling approaches (aka randomized probability matching aka Bayesian Bandits) is that they can be extended to more complex situations.  One setting that I have found particularly useful involves optimizing return not just from a few bandits, but from a parameterized set of bandits that could conceivably even be infinite.  The transformation from the parameters to the bandit distribution is unknown, but if we could know that, we would be able to search the parameter space to find the bandit with the highest mean payoff.

This formulation is a generalization of the previous case because we can take the parameter to be an integer from $1 \ldots k$ where there are $k$ bandits and the transformation consists of the mean payoffs for each of the $k$ bandits.

The algorithm in the contextual case simply consists of sampling the transformation from some posterior distribution and then solving for the parameters of the bandit that we would like to use.  Some of the parameters might be fixed by the context we are working in which is where the name contextual bandits comes in.

The paper by Scott alludes to this formulation, but the most approachable work on this that I know of is the paper by Graepel, Candela, Borchert, and Herbrich.  In this paper, they describe the operation of AdPredictor, a system used by the Bing search engine to target ads using context.