Monday, June 7, 2021

What's new in t-digest version 3.3?

There's a new version of $t$-digest in town. This new version doesn't change much about how things work from the standpoint of the API, but it does improve accuracy under most circumstances. If you are using $t$-digest, it is generally a good idea to upgrade to the 3.3 release.

Let's dig into the changes after a quick review of how the $t$-digest works.

Density estimation through clustering

The basic idea behind the $t$-digest is that if you cluster your data, the position and number of samples in each cluster can be used to estimate the distribution of your original data. Further, if you guarantee that the clusters near the extreme values are kept small while the clusters near the median are allowed to grow, you can maintain very good accuracy for extreme quantiles while sacrificing only a small amount of accuracy for central values. Because clustering in one dimension is a much simpler problem than clustering in multiple dimensions, the $t$-digest can provide a very good trade-off between memory consumption and accuracy.
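
To make this concrete, here is a minimal sketch in Java of how a quantile can be estimated from nothing more than a list of cluster means and counts. This is deliberately simplified (it ignores the singleton and boundary refinements discussed below) and is not the library's implementation; the Cluster record and quantile method are illustrative names only.

    import java.util.List;

    public class ClusterQuantileSketch {
        // A cluster is summarized by the mean of its samples and how many it holds.
        record Cluster(double mean, long count) {}

        // Estimate the q-th quantile from clusters sorted by mean, assuming half
        // of each cluster's samples lie on either side of its mean.
        static double quantile(List<Cluster> clusters, double q) {
            long total = clusters.stream().mapToLong(Cluster::count).sum();
            double target = q * total;

            // Rank at which each cluster's mean is assumed to sit.
            double[] centerRank = new double[clusters.size()];
            double seen = 0;
            for (int i = 0; i < clusters.size(); i++) {
                centerRank[i] = seen + clusters.get(i).count() / 2.0;
                seen += clusters.get(i).count();
            }

            if (target <= centerRank[0]) return clusters.get(0).mean();
            int last = clusters.size() - 1;
            if (target >= centerRank[last]) return clusters.get(last).mean();

            // Linear interpolation between the two bracketing cluster means.
            for (int i = 1; i <= last; i++) {
                if (target <= centerRank[i]) {
                    double t = (target - centerRank[i - 1]) / (centerRank[i] - centerRank[i - 1]);
                    return clusters.get(i - 1).mean()
                            + t * (clusters.get(i).mean() - clusters.get(i - 1).mean());
                }
            }
            return clusters.get(last).mean(); // unreachable
        }

        public static void main(String[] args) {
            List<Cluster> clusters = List.of(
                    new Cluster(1.0, 1), new Cluster(3.0, 4), new Cluster(7.0, 8),
                    new Cluster(12.0, 4), new Cluster(15.0, 1));
            System.out.println(quantile(clusters, 0.5));   // lands on the heavy central cluster
            System.out.println(quantile(clusters, 0.99));  // stays near the largest value
        }
    }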

The $t$-digest guarantees not to use more than a set amount of memory, but it can be shown that this makes it theoretically impossible to give guarantees on the relative accuracy of quantile estimates. Indeed, a pathological distribution of data can cause a $t$-digest to give large errors. Happily, a sufficiently pathological distribution of values essentially never arises from any physical measurement. For instance, if you have times that range from picoseconds to the age of the universe with massive skew, you will still be hundreds of powers of ten short of a distribution pathological enough to cause large errors.

The sizes of the clusters in a $t$-digest are bounded using a so-called scale function and a compression factor. The scale function determines how aggressively the digest keeps extremal clusters small, and the compression factor limits the number of clusters retained.
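
To see how a scale function controls cluster sizes, here is a sketch of the $k_1$ scale function from the $t$-digest paper together with the rule it implies: a cluster covering the quantile range $[q_{left}, q_{right}]$ is acceptable only if it spans at most one unit in $k$. The function and method names here are illustrative, not the library's internals.

    public class ScaleFunctionSketch {
        // The k_1 scale function; delta is the compression factor.
        // Larger delta means more (and therefore smaller) clusters.
        static double k1(double q, double delta) {
            return delta / (2 * Math.PI) * Math.asin(2 * q - 1);
        }

        // A cluster covering quantiles [qLeft, qRight] respects the size bound
        // if it spans no more than one unit of k.
        static boolean fits(double qLeft, double qRight, double delta) {
            return k1(qRight, delta) - k1(qLeft, delta) <= 1;
        }

        public static void main(String[] args) {
            double delta = 100;
            // Near the median a cluster may cover a few percent of the data ...
            System.out.println(fits(0.49, 0.52, delta));   // true
            // ... but the same width near the tail is far too wide, so tail
            // clusters stay tiny (often single samples).
            System.out.println(fits(0.969, 0.999, delta)); // false
        }
    }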

Better boundary conditions

The latest version of the $t$-digest maintains several invariants more assiduously than previous versions. First of all, the minimum and maximum values ever seen are retained explicitly. Secondly, the digest makes sure that the first and last centroids always have unit weight except when the $k_0$ scale function is used. This improves the quality of interpolation at the end points, as we will see in the next section.
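
As a quick illustration of what that buys you, the following hedged example assumes the com.tdunning.math.stats API (TDigest.createMergingDigest, add, and quantile). With the exact minimum and maximum retained and unit-weight centroids at the ends, the extreme quantiles should come back as the exact extreme samples rather than interpolated approximations.

    import com.tdunning.math.stats.TDigest;
    import java.util.Random;

    public class BoundaryCheck {
        public static void main(String[] args) {
            TDigest digest = TDigest.createMergingDigest(100);
            Random rand = new Random(42);
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < 1_000_000; i++) {
                double x = rand.nextGaussian();
                digest.add(x);
                min = Math.min(min, x);
                max = Math.max(max, x);
            }
            // Both comparisons are expected to print true with the new release.
            System.out.println(digest.quantile(0) == min);
            System.out.println(digest.quantile(1) == max);
        }
    }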

Better interpolation

The most important accuracy changes to the $t$-digest (besides general code simplification and cleanup) have to do with the way that interpolation is handled. The basic idea is that centroids that represent a single point need special handling because we know exactly where that point is. Moreover, we know that for the first and last centroids at least one sample is at the global minimum (or maximum, respectively), and thus we can treat the first (or last) centroid as if it has one fewer sample than observed. With the $k_0$ scale function, extreme centroids can have more than one sample; if such an extreme centroid has exactly two samples, we can infer the exact location of the second sample, allowing us to do better interpolation. Even if there are more samples than that, we can pull the extreme sample out of the mean and get slightly better interpolation.
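
Here is a small sketch of the arithmetic behind pulling the known extreme sample out of the first centroid's mean. This is purely illustrative and not the library's internal code.

    public class ExtremeCentroidSketch {
        // If the first centroid holds `weight` samples with mean `mean`, and one
        // of them is known to be exactly the recorded minimum, the remaining
        // weight - 1 samples have this mean. When weight == 2, the result is the
        // exact location of the second sample.
        static double meanWithoutMin(double mean, long weight, double min) {
            if (weight <= 1) {
                return mean; // a singleton first centroid *is* the minimum
            }
            return (mean * weight - min) / (weight - 1);
        }

        public static void main(String[] args) {
            // Three samples with mean 2.0 and a global minimum of 0.5: the other
            // two samples must average (3 * 2.0 - 0.5) / 2 = 2.75.
            System.out.println(meanWithoutMin(2.0, 3, 0.5)); // 2.75
        }
    }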

The case where a singleton centroid is next to a centroid with more samples is handled better as well.

How this works is illustrated in the accompanying figure, which shows two cases with three centroids. On the left, all three centroids have weight greater than one, while on the right, two of the centroids have weight equal to one.

The basic idea is that we normally assume that half of the points for each centroid land on either side of the centroid. That means that the number of samples between two centroids is assumed, for interpolation purposes, to be the sum of the halves from the two adjacent centroids. But when we have a neighboring centroid with just one sample, we don't have to linearly interpolate all the way up to that centroid and thus can spread the weight from the non-singleton centroid over a slightly narrower range.
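
In code, the weight credited to the gap between two adjacent centroids might look roughly like this. It is an illustrative sketch of the rule just described, not the library's implementation.

    public class NeighborWeightSketch {
        // Samples assumed to lie strictly between the means of two adjacent
        // centroids with weights w1 and w2. A singleton contributes nothing to
        // the gap because its single sample sits exactly at its mean.
        static double weightBetween(long w1, long w2) {
            double left = (w1 == 1) ? 0 : w1 / 2.0;
            double right = (w2 == 1) ? 0 : w2 / 2.0;
            return left + right;
        }

        public static void main(String[] args) {
            System.out.println(weightBetween(8, 6)); // 7.0: both centroids spread into the gap
            System.out.println(weightBetween(8, 1)); // 4.0: only the non-singleton spreads,
                                                     // and over a slightly narrower range
        }
    }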

This is important near the tails of any digest, but it is really important to get this right when the total amount of data added to a digest is a smallish multiple of the compression factor. This is because any such digest will have many singleton centroids near the tails, and the transition to non-singleton centroids nearer the center can be a major source of error unless it is handled cleanly.

The impact of this can be seen in comparisons against the relative-error version of the KLL digest. Here the compression factor of the $t$-digest was adjusted to give a stored digest size of roughly 1 kB or 4 kB. The first value represents relatively normal operation for a $t$-digest with a compression factor in the range of 100 or so, while the larger value matches the size of the KLL digest. The following figure shows this comparison for varying numbers of samples.

In these figures, you can see that all three digests have zero error for small numbers of samples such that $qn \approx 10$. This is because both kinds of digest remember the exact sample values for the 10 smallest samples by collapsing more medial samples together, except in the notable case of the smaller $t$-digest with $q=10^{-4}$ and $n=10^6$. When $qn \approx 100$, this is no longer possible and the ability to interpolate between singletons and larger centroids becomes crucial. This transition point is the worst case for the $t$-digest, and for $qn > 100$ the ability to interpolate with larger centroids allows even the small $t$-digest to achieve substantially smaller errors than the KLL digest.

Two new papers

The $t$-digest was described in a recent paper in the journal Software Impacts. It doesn't introduce any new material, but it is an open-access, peer-reviewed citation if you need one.

Another article is a bit more confrontational and, as far as I know, is as yet unpublished, but it is available in pre-print form. This article concisely describes both the $t$-digest and the KLL sketch, but is notable for providing an attack on the accuracy of the $t$-digest in the form of a pathological distribution (appropriately named $\mathcal D_{hard}$) that is designed to be roughly as skewed as any distribution can be when represented with floating point numbers. In fact, the central half of the attack distribution is in the range $[-10^{-154}, 10^{-154}]$ while the maximum value is $\approx 10^{308}$. For reference, the ratio of the age of the observable universe to the time light takes to transit a proton's diameter is only about 41 orders of magnitude.

Note that such an adversarially designed distribution actually must exist because the fixed memory guarantees of the $t$-digest are impossible to meet if you maintain relative accuracy for all distributions. Thus, the question is not whether there are distributions that degrade the accuracy of the $t$-digest. Rather, it is a question of whether it is plausible that any real data will ever be like these pathological cases and how gracefully $t$-digest behaves under duress. The evidence so far indicates that no real data is anything like $\mathcal D_{hard}$ and that $t$-digest is relatively well behaved even when it cannot succeed.

The $\mathcal D_{hard}$ distribution is still useful in less extreme forms for hardening the $t$-digest, even if it doesn't apply much to real data. As an example, here is a graph that shows the accuracy of estimating the CDF of $\mathcal D_{hard}$ with a scale parameter of $e_{Max}=50$, which corresponds to a dynamic range of $10^{75}$. Note that the axes are on a log-odds scale, which can de-emphasize errors.


In contrast, when the scale is pushed to the maximum level of $e_{Max}=308$, we can see that the quantile estimation of a $t$-digest with a compression factor of 200 is still accurate in the tails, but essentially all resolution has been lost in the range $q \in [0.1, 0.9]$. For reference, the extent of the problematic region is smaller than the total range by a factor of $6 \times 10^{221}$. The extent of the central half of the distribution is smaller than the total range by a factor quite a bit larger than the maximum dynamic range of 64-bit IEEE floating point numbers.

Increasing the compression factor $\delta$ to 500 shrinks the region where the $t$-digest is unable to produce useful estimates to roughly the central half of the $\mathcal D_{hard}$ distribution. So, if you actually have data with this much dynamic range, you have a work-around.
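
If you do have data like that, a hedged sketch of the work-around looks like the following. The synthetic input is a stand-in for $\mathcal D_{hard}$ (sign-symmetric values with magnitudes spread over roughly 300 decades), not the exact construction from the paper, and the calls assume the com.tdunning.math.stats TDigest API.

    import com.tdunning.math.stats.TDigest;
    import java.util.Random;

    public class WideRangeExample {
        public static void main(String[] args) {
            Random rand = new Random(1);
            // The argument to createMergingDigest is the compression factor delta.
            TDigest normal = TDigest.createMergingDigest(200);
            TDigest large = TDigest.createMergingDigest(500);
            for (int i = 0; i < 1_000_000; i++) {
                // Magnitude spread nearly uniformly in log space, with random sign.
                double magnitude = Math.pow(10, 300 * (rand.nextDouble() - 0.5));
                double x = Math.copySign(magnitude, rand.nextDouble() - 0.5);
                normal.add(x);
                large.add(x);
            }
            // The delta = 500 digest should retain usefully more resolution in the
            // central region of this extremely skewed distribution.
            for (double q : new double[] {0.25, 0.5, 0.75, 0.99}) {
                System.out.printf("q=%.2f  delta=200: %12.4e  delta=500: %12.4e%n",
                        q, normal.quantile(q), large.quantile(q));
            }
        }
    }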

Lots of code cleanups 

This release has benefited from several years of code cleanups, many due to community contributions by an army of careful readers of the code. Most of these improvements have been made to the MergingDigest, to the partial neglect of the AVLTreeDigest.