As a background, if you add up lots of numbers using a straightforward loop, you can lose precision. In the worse case the loss is \(O(n \epsilon)\), but in virtually all real examples the lossage is \(O(\epsilon \sqrt n)\). If we are summing a billion numbers, the square root is \(\approx 10^5\) so we can potentially lose 5 sig figs (out of 17 available with double precision).

Kahan summation increases the number of floating point operations by \(4 \times\), but using a clever trick and manages to retain most of the bits that would otherwise be lost. Shewchuk summation uses divide and conquer to limit the lossage with \(O(\log n)\) storage and no increase in the number of flops.

There are several cases to consider:

1) online algorithms such as OnlineSummarizer.

2) dot product and friends.

3) general matrix decompositions

In the first case, we can often have millions or even billions of numbers to analyze. that said, however, the input data is typically quite noisy and signal to noise ratios \(> 100\) are actually kind of rare in Mahout applications. Modified Shewchuk estimation (see below for details) could decrease summation error from a few parts in \(10^{12}\) to less than 1 part in \(10^{12}\) at minimal cost. These errors are \(10^{10}\) smaller than the noise in our data so this seems not useful.

In the second case, we almost always are summing products of sparse vectors. Having thousands of non-zero elements is common but millions of non-zeros are quite rare. Billions of non-zeros are unheard of. This means that the errors are going to be trivial.

In the third case, we often have dense matrices, but the sizes are typically on the order of \(100 \times 100\) or less. This makes the errors even smaller than our common dot products.

To me, this seems to say that this isn't worth doing. I am happy to be corrected if you have counter evidence.

Note that BLAS does naive summation and none of the Mahout operations are implemented using anything except double precision floating point.

Here is an experiment that tests to see how big the problem really is:

@Test

public void runKahanSum() {

Random gen = RandomUtils.getRandom();

double ksum = 0; // Kahan sum

double c = 0; // low order bits for Kahan

float sum = 0; // naive sum

float[] vsum = new float[16]; // 8 way decomposed sum

for (int i = 0; i < 1e9; i++) {

float x = (float) (2 * gen.nextDouble() - 1);

double y = x - c;

double t = ksum + y;

c = (t - ksum) - y;

ksum = t;

sum += x;

vsum[i % 16] += x;

}

// now add up the decomposed pieces

double zsum = 0;

for (int i = 0; i < vsum.length; i++) {

zsum += vsum[i];

}

System.out.printf("%.4f %.4f %.4f\n",

ksum, 1e6 * (sum - ksum) / ksum,

1e6 * (zsum - ksum) / ksum);

}

A typical result here is that naive summation gives results that are accurate to within 1 part in \(10^{12}\) 8 way summation manages \(< 0.05\) parts in \(10^{12}\) and 16 way summation is only slightly better than 8 way summation.

If the random numbers being summed are changed to have a mean of zero, then the relative error increases to 1.7 parts in \(10^{12}\) and 0.3 parts in \(10^{12}\), but the absolute error is much smaller.

Generally, it doesn't make sense to do the accumulation in float's because these operations are almost always memory channel bound rather than CPU bound. Changing to floating point arithmetic in spite of this decreases the accuracy to about 500 parts per million, 200 parts per million respectively for naive summation and 8 way summation