Editor's Note: In this week's Whiteboard Walkthrough, Ted Dunning, Chief Application Architect at MapR, gets you up to speed on the t-digest, an algorithm you can add to any anomaly detector to set the number of alarms that you get as a percentage of the total samples. It estimates percentiles very accurately–especially high or low percentiles–and allows you to set a threshold for alarms. You can find Ted's GitHub repository on t-digest here: https://github.com/tdunning/t-digest
Here's the transcript:
I'm here at Strata. It's winding down for the day, but I'd like to talk about a little bit of data mining. In particular about a technique called the T-Digest. T-Digest is a really cool box that you can add to an anomaly detector so that you can set the number of alarms as a percentage of the total samples that you get. What it does is it actually estimates percentiles very, very accurately. Especially high or low percentiles. And by estimating these accurately for you, you can set a threshold automatically by defining what percentile of the samples you'd like to have alarms. Before we go, lose the hat. Let's have a talk here. We're going to talk about this algorithm called T-digest. It's named kind of in analogy to another algorithm and an earlier one called Q-digest. And almost all the all of the previous algorithms for computing quantiles did one particular thing.
They computed approximate quantiles and the quantile, the value that you got as an estimate for quantile or percentile was accurate in terms of quantile space to within a constant error. That means you might get a median plus or minus point one percent or the 90th percentile plus or minus one percent. Now, that makes very little sense to have the 99.999 percentile plus or minus .1 percent because, that's... it. Not really the five nines anymore. The interesting thing about T-digest is that it is a very simple algorithm and it allows controllable accuracy especially at the tails. The way it does, we take some value, X that has an unknown distribution and we're going to take samples from X one through N of those samples. What we're going to do is we're going to compute this high nines or high zeros percentile and we're do it controlled accuracy.
Typically, we want part per million sort of accuracies for these very high percentiles. That's part per million in terms of percentiles. Of course if it's a very skewed distribution the actual accuracy in the original measurements might be a very large error, which might result in a very small percentile error. What we do is, T-digest is just clustering. The basic algorithm is almost identical to K means clustering, except that there's a size limit. The clusters in T-digest can be large when Q is near point five, but there's a very strict limit applied to centroids that are near zero or near one in this quantile space. In fact, the limit that we use nowadays in the T digest is the square root of Q times one minus Q.
That goes to zero at the ends, and so on. The square root comes about because the error in the estimation is proportional to the square of the fraction of the samples that is in a particular bucket, a particular centroid. That's because the linear interpolation errors are second order in this size. And that means by making the square root of this, then our error is proportional, our relative error is proportional to q times one minus Q. That means its constant relative error at the boundaries. Either at zero or at one, you get constant relative error there by having this variable limit. Now, most algorithms don't subject themselves to this limitation on size very easily. With T-digest, it's just clustering. The algorithm, essentially without some special cases is to find the nearest clusters. Now there might be several at the same distance. If one of them is small enough, meaning less than the size limit, you can just add x to it. That means increase the weight that it has and adjust the mean.
These are standard sorts of things. If the nearest cluster is too big, according to this limit, then we make X be a new cluster. That's the algorithm. There's special cases where you might have data in order or nearly in order and the number of clusters would grow because X would always be nearest the biggest thing which is always at the extreme. It always has a zero acceptable size. Therefore we keep adding new things. In that case we take all the samples we collected, randomize the order, and just rerun on those samples. That will give us the balance that we like. That's a special case, but it's not a very interesting or difficult special case.
The exciting thing is, that even though we get part per million accuracy of these high or low quantiles, the size of the T-digest for the square root limit is known to be Pi over two delta. Delta is a quality factor. One over delta is kind of the size that we're aiming at. Very common setting is one over delta equal to 200. Pi over two is one point five seven or so. Two hundred times that would be about 300 samples. That gives us part per million accuracy at the tails and very good accuracy at the median. If we wanted a little less accuracy we might pick 50. That would mean that the number of samples that are kept, the number of centroids would converge to about 78. That is a very small number. No matter how many data points we have. Millions, tens of millions, billions even. We could capture that in just a few hundred centroids.
If we use a constant size limit across there, which would be more like other algorithms in this space, then we get to a size limit of just one over delta. We would come up with 50 samples if one over delta is 50. Now, if we get rid of the square root, which gives us slightly better accuracy, and it might in particular cases a very high skew give us better accuracies. Then, the size of the q digest grows without bound but it grows very slowly. With a hundred million data points we typically have a few hundred centroids required. It still grows without bounds. That's theoretically not so nice, but practically it may not be a big problem.
This is really, really useful if you want to have quantiles of a huge number of statistical buckets. How's the viewing time in this location, in that location, in this demographic, in this demographic in that location. These things are bounded to such small sizes we can keep billions of them, literally billions, and still get very accurate quantiles. That's exciting because quantiles are a very important sort of aggregate that we might want. We could even combine these, these T-digests that are computed on several pieces of data and get a combined T-digest which has the same properties with regard to size and accuracy.
That's the T-digest. It's really simple even with the Greek letters. I don't even have any Greek letters. See? There you go. Except Pi. That hardly counts. Thank you very much!