## Wednesday, September 18, 2013

### How Accurate is Mahout for Summing Numbers?

A question was recently posted on the Mahout mailing list suggesting that the Mahout math library was "unwashed" because it didn't use Kahan summation.  My feeling is that this complaint is not founded and Mahout is considerably more washed than the original poster suggests.  Here is why I think this.

As a background, if you add up lots of numbers using a straightforward loop, you can lose precision. In the worse case the loss is $O(n \epsilon)$, but in virtually all real examples the lossage is $O(\epsilon \sqrt n)$. If we are summing a billion numbers, the square root is $\approx 10^5$ so we can potentially lose 5 sig figs (out of 17 available with double precision).

Kahan summation increases the number of floating point operations by $4 \times$, but using a clever trick and manages to retain most of the bits that would otherwise be lost. Shewchuk summation uses divide and conquer to limit the lossage with $O(\log n)$ storage and no increase in the number of flops.

There are several cases to consider:

1) online algorithms such as OnlineSummarizer.

2) dot product and friends.

3) general matrix decompositions

In the first case, we can often have millions or even billions of numbers to analyze. that said, however, the input data is typically quite noisy and signal to noise ratios $> 100$ are actually kind of rare in Mahout applications. Modified Shewchuk estimation (see below for details) could decrease summation error from a few parts in $10^{12}$ to less than 1 part in $10^{12}$ at minimal cost. These errors are $10^{10}$ smaller than the noise in our data so this seems not useful.

In the second case, we almost always are summing products of sparse vectors. Having thousands of non-zero elements is common but millions of non-zeros are quite rare. Billions of non-zeros are unheard of. This means that the errors are going to be trivial.

In the third case, we often have dense matrices, but the sizes are typically on the order of $100 \times 100$ or less. This makes the errors even smaller than our common dot products.

To me, this seems to say that this isn't worth doing. I am happy to be corrected if you have counter evidence.

Note that BLAS does naive summation and none of the Mahout operations are implemented using anything except double precision floating point.

Here is an experiment that tests to see how big the problem really is:

@Test
public void runKahanSum() {
Random gen = RandomUtils.getRandom();

double ksum = 0;                // Kahan sum
double c = 0;                   // low order bits for Kahan
float sum = 0;                 // naive sum
float[] vsum = new float[16];  // 8 way decomposed sum
for (int i = 0; i < 1e9; i++) {
float x = (float) (2 * gen.nextDouble() - 1);
double y = x - c;
double t = ksum + y;
c = (t - ksum) - y;
ksum = t;
sum += x;
vsum[i % 16] += x;
}
// now add up the decomposed pieces
double zsum = 0;
for (int i = 0; i < vsum.length; i++) {
zsum += vsum[i];
}
System.out.printf("%.4f %.4f %.4f\n",
ksum, 1e6 * (sum - ksum) / ksum,
1e6 * (zsum - ksum) / ksum);
}

A typical result here is that naive summation gives results that are accurate to within 1 part in $10^{12}$  8 way summation manages $< 0.05$ parts in $10^{12}$ and 16 way summation is only slightly better than 8 way summation.

If the random numbers being summed are changed to have a mean of zero, then the relative error increases to 1.7 parts in $10^{12}$ and 0.3 parts in $10^{12}$, but the absolute error is much smaller.

Generally, it doesn't make sense to do the accumulation in float's because these operations are almost always memory channel bound rather than CPU bound.  Changing to floating point arithmetic in spite of this decreases the accuracy to about 500 parts per million, 200 parts per million respectively for naive summation and 8 way summation

## Wednesday, April 24, 2013

### Learning to Rank, in a Very Bayesian Way

The problem of ranking comments by a crowd-sourced version of "quality" is a common one on the internet.

James Neufeld suggests that Bayesian Bandit algorithms can be applied to this problem. The basic idea is that you would define a stochastic quality metric whose distribution for each comment depends on the up and down votes that comment has received.

Normal ranking algorithms try to estimate the single best value for this quality metric. Neufeld suggests that this value should be sampled from a beta distribution which models the probability that a user would mark the comment positively given that they have marked the comment at all. To present comments to a user, the metric would be sampled independently for each comment and the comments would be sorted according to the resulting scores. Different presentations would necessarily result in different orders, but as users mark comments positively or negatively, the order should converge to one where the comments presented near the top of the list have the highest probability of being marked positively.

One very nice thing about this approach is that it doesn't waste any cycles on determining the ranking of low quality comments. Once the quality of these comments has been determined to be relatively lower than the best columns, no more learning need be done with those comments. This accelerates learning of the ranking of the best options dramatically.

This idea is interesting enough that I built a quick implementation which you can find on github.  The main sample code there invents several hundred "comments" each with a uniformly sampled probability of getting a positive rating.  The ideal behavior for ordering the comments would be to put the comment with the highest probability of getting a positive rating first and the one with the lowest probability last.  The way that the program proceeds is that it picks a pageful of twenty comments to show and then proceeds to generate ratings for each of the comments on that page according to the underlying probability associated with the items displayed.  The process of generating pages of comments to show and applying feedback is repeated and performance is measured.

Here are some results of running the program.  Here we have 200 total comments, of which 20 are shown on the page that defines which comments are rated.  Precision is measured here to determine how many of the best 10 comments are shown on the page.  As can be seen, the system shows immediate improvement as ratings are collected.  The performance rises from the initially random 10% precision and passes 50% after 30 pages of ratings.

As James demonstrated in his article and as others have demonstrated elsewhere, this class of algorithm is very effective for this sort of bandit problem.  What is much less well known is how easily you can build a system like this.

### Try it yourself

To run this code, you will need git, maven and java 1.7.  To download the source code and compile the system, do this

$git clone git://github.com/tdunning/bandit-ranking.git$ cd bandit-ranking
$mvn package This will download all dependencies of the code, compile the code and run some tests. To run the test program, do this$ java -jar target/bandit-ranking-*-with-dependencies.jar

The output is a thousand lines of numbers that you can drop into R, OmniGraphSketcher or even Excel to produce a plot like the one above.

### Quick code dissection

In the main program for this demo, a BetaBayesFactory is used to construct several BayesianBandit objects (for average results later).  This pattern can be used with other kinds of bandit factories.

The BayesianBandit objects allow you to do a variety of things include sampling (BayesianBandit.sample) for the current best alternative, ranking (BayesianBandit.rank) a number of alternatives and providing training data (BayesianBandit.train).  Sampling is used in a traditional multi-armed bandit setting such as with A/B testing.  Ranking is used as it is here for getting a list of best alternatives and training is used ubiquitously for feeding back training data to the bandit.

Evaluation can be done by computing precision as is done here (how many good items are in the top 20?) or by computing regret.  Regret is defined as the difference between the mean payoff of the best possible choice and the mean payoff of the choice made by the bandit.  For the ranking problem here, I assume that payoff of a page is the sum of the probabilities of positively rating each item on a page.

The BetaBayesFactory internally uses a beta-binomial distribution to model the likelihood of a positive rating for each rank. A more general alternative would be to use a gamma-normal distribution. This can be done by using the GammaNormalBayesFactory instead. This extra generality comes at a cost, however, as the graph to the left shows. Here, the beta-binomial distribution results in considerably faster convergence to perfect precision than the gamma-normal. This is to be expected since the beta-binomial starts off with the assumption that we are modeling a binary random variable that can only take on values of 0 and 1. The gamma-normal distribution has to learn about this constraint itself. That extra learning costs about 50 pages of ratings. Put another way, the cumulative regret is nearly doubled by the choice of the gamma-normal distribution.

In order to understand what the algorithm is really doing at a high level, the graph on the right is helpful.  What it shows is the number of times comments that are at different ranks are shown.  What is striking here is that comments that are below the fourth page get very few trials and even on the second page, the number of impression falls precipitously relative to the first page of comments.  This is what you would expect because in this experiment, it takes only a few ratings on the worst comments to know that they stand essentially no chance of being one of the best.  It is this pattern of not sampling comments that don't need precise ranking that makes Bayesian Bandits so powerful.