There's a new version of $t$-digest in town. This new version doesn't change much from the standpoint of the API, but it does improve accuracy under most circumstances. If you are using $t$-digest, it is generally a good idea to upgrade to the 3.2 release.
Let's dig into the changes after a quick review of how the $t$-digest works.
Density estimation through clustering
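Briefly, a $t$-digest summarizes a distribution by clustering incoming samples into centroids, each recorded as a weighted mean and a sample count. A scale function (such as $K_0$) bounds how many samples each cluster may absorb, keeping clusters small near the extreme quantiles and letting them grow larger near the median. Quantiles are then estimated by interpolating through the sorted centroids, which is why the interpolation details discussed below matter so much.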
Better boundary conditions
The latest version of the $t$-digest maintains several invariants more assiduously than previous versions. First, the minimum and maximum values ever seen are retained explicitly. Second, the digest makes sure that the first and last centroids always have unit weight, except when the $K_0$ scale function is used. This improves the quality of interpolation at the end points, as we will see in the next section.
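To make these invariants concrete, here is a minimal sketch of the idea. The class and method names are invented for illustration and this is not the library's actual internals; it just shows why tracking the minimum and maximum explicitly lets the extreme quantiles come back exactly:

```java
class BoundaryAwareDigest {
    // Sketch only: the global min and max are tracked explicitly so that
    // queries at q = 0 and q = 1 return exact values, never interpolated ones.
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    void add(double x) {
        min = Math.min(min, x);
        max = Math.max(max, x);
        // ... ordinary centroid insertion and merging would happen here ...
    }

    double quantile(double q) {
        if (q <= 0) return min;  // exact, by invariant
        if (q >= 1) return max;  // exact, by invariant
        // ... interior quantiles would interpolate over the centroids ...
        throw new UnsupportedOperationException("sketch only covers the extremes");
    }
}
```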
Better interpolation
The most important accuracy changes in the $t$-digest (besides general code simplification and cleanup) have to do with the way that interpolation is handled. The basic idea is that centroids that represent a single point need special handling because we know exactly where that point is. Moreover, we know that the first and last centroids each contain at least one sample at the global minimum (or maximum, respectively), and thus we can treat the first (or last) centroid as if it has one fewer sample than observed. With the $K_0$ scale function, extreme centroids can have more than one sample, and if such an extreme centroid has exactly two samples, we can infer the exact location of the second sample, allowing us to do better interpolation. Even if there are more samples than that, we can pull the extreme sample out of the mean and get slightly better interpolation.
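To see why the two-sample case is fully determined, note that one of the two samples is pinned at the global minimum (or maximum), so the other sample's location follows directly from the definition of the mean. A tiny sketch, with invented names:

```java
class TailInterpolation {
    // If an extreme centroid holds exactly two samples, one of them is the
    // global minimum (or maximum), so from mean = (min + x) / 2 the other
    // sample must be at x = 2 * mean - min.
    static double inferSecondSample(double centroidMean, double globalMin) {
        return 2 * centroidMean - globalMin;
    }

    public static void main(String[] args) {
        // A two-sample centroid with mean 3.0 whose first sample is the
        // minimum at 1.0 must have its second sample at exactly 5.0.
        System.out.println(inferSecondSample(3.0, 1.0));  // prints 5.0
    }
}
```

Interpolation can then use that exact location instead of treating both samples as blurred around the mean.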
The case where a singleton centroid sits next to a centroid with more samples is handled better as well.
The figure below illustrates how this works with two cases, each involving three centroids. On the left, all three centroids have weight greater than one; on the right, two of the centroids have weight exactly one.
The basic idea is that we normally assume half of the samples in each centroid land on either side of its mean. That means the number of samples between two centroids is taken, for interpolation purposes, to be the sum of the halves from the two adjacent centroids. But when a neighboring centroid holds just one sample, we don't have to linearly interpolate all the way up to that centroid, and we can spread the weight from the non-singleton centroid over a slightly narrower range.
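As a rough sketch of that accounting (the function is invented for illustration and glosses over the library's actual interpolation code), the weight assumed to lie strictly between two adjacent centroids might be computed like this:

```java
class GapWeight {
    // Weight assumed to lie strictly between two adjacent centroids with
    // sample counts w1 and w2. Normally half of each centroid's samples are
    // assumed to fall on either side of its mean, so the gap carries
    // w1/2 + w2/2. A singleton contributes nothing to the gap: its one sample
    // sits exactly at its mean, so the neighbor's half-weight can be spread
    // over a correspondingly narrower range.
    static double weightBetween(double w1, double w2) {
        double left = w1 == 1 ? 0 : w1 / 2;
        double right = w2 == 1 ? 0 : w2 / 2;
        return left + right;
    }

    public static void main(String[] args) {
        System.out.println(weightBetween(4, 6));  // 5.0: two ordinary neighbors
        System.out.println(weightBetween(4, 1));  // 2.0: singleton on the right
    }
}
```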
Getting this right matters near the tails of any digest, but it is especially important when the total amount of data added to a digest is a small-ish multiple of the compression factor. Any such digest will have many singleton centroids near the tails, and the transition to non-singleton centroids nearer the center can be a major source of error if it isn't handled cleanly.
The impact of this can be seen in comparisons against the relative-error version of the KLL digest. Here the compression factor of the $t$-digest was adjusted to give a stored digest size of roughly 1 kB or 4 kB. The smaller value represents relatively normal operation for a $t$-digest, with a compression factor in the range of 100 or so, while the larger value matches the size of the KLL digest.
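If you want to check the stored size for a given compression factor, a sketch along these lines works with the public API of the Java implementation (the data and setup here are illustrative, not the benchmark harness used for the comparison):

```java
import com.tdunning.math.stats.TDigest;

import java.nio.ByteBuffer;
import java.util.Random;

public class SizeCheck {
    public static void main(String[] args) {
        Random rand = new Random(42);
        // A compression factor of 100 is the "normal operation" end of the range.
        TDigest digest = TDigest.createMergingDigest(100);
        for (int i = 0; i < 1_000_000; i++) {
            digest.add(rand.nextGaussian());
        }
        // Serialize into the compact encoding and report the stored size.
        ByteBuffer buf = ByteBuffer.allocate(digest.smallByteSize());
        digest.asSmallBytes(buf);
        System.out.printf("compression = 100, stored size = %d bytes%n", buf.position());
    }
}
```

The following figure shows this comparison for varying numbers of samples.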