With the approximate CDF aggregator about to go in, I thought it would be good to collect in one place all the other approximate methods I could potentially add, and some possible uses of each. I’m sure you could all think of other uses, and help decide which of these might be impactful additions.
I’ll refer to the data being aggregated as “the stream”, though I’m only considering methods which work in the partitioned stream model. All methods compute a sketch/sample/summary S in a single pass over the data (an aggregator), and can then support some class of queries using only the information in S.
Unweighted Streams

Approximate CDF. (Already done.) Using only S, given any value x, compute an approximation S.R(x) of the true rank R(x), the number of values in the stream less than x. Given fixed 𝜀 and 𝛿, we want to compute an S with the property that, with probability 1 − 𝛿, for all values of x simultaneously, abs(S.R(x) − R(x)) ≤ 𝜀N, where N is the stream size. The algorithm I implemented can do this using a sketch with asymptotic size O((1/𝜀) log² log(1/(𝜀𝛿))).

I made some measurements of my implementation: in practice, a sketch holding 500 values achieves an average error of .0025 and a max error of .0072. That means if you ask for the median, you will get a value x with R(x)/N = .5 ± .0072, probably better. The error goes down roughly linearly as you increase the size of the sketch.

Example uses:
- Approximate quantiles aggregator, e.g. `mt.annotate_cols(quartiles = hl.agg.approx_quantiles(mt.foo, [0, .25, .5, .75, 1]))`
- Distribution visualization. Compute a sketch, collect to python, then interactively explore the distribution, drawing histograms with various bin sizes, smoothed kernel approximations, etc., all locally.
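To make the query interface concrete: whatever the real sketch looks like internally, it only needs to support `add` during the pass and `rank`/`quantile` afterwards. Here is a toy stand-in (all names hypothetical) that answers the same queries from a uniform reservoir sample; it has a much worse space/accuracy tradeoff than the implemented algorithm, but the interface is the same.

```python
import random

class SampleRankSketch:
    """Toy rank/quantile sketch backed by a uniform reservoir sample.
    (Illustrative stand-in, NOT the compactor-based sketch described above.)"""

    def __init__(self, k, seed=0):
        self.k, self.n, self.sample = k, 0, []
        self.rng = random.Random(seed)

    def add(self, x):
        """One-pass reservoir sampling: keep a uniform sample of size k."""
        self.n += 1
        if len(self.sample) < self.k:
            self.sample.append(x)
        else:
            j = self.rng.randrange(self.n)
            if j < self.k:
                self.sample[j] = x

    def rank(self, x):
        """S.R(x): estimated number of stream values less than x."""
        return self.n * sum(v < x for v in self.sample) / len(self.sample)

    def quantile(self, q):
        """Invert the rank estimate: a value with R(x)/N approximately q."""
        s = sorted(self.sample)
        return s[min(int(q * len(s)), len(s) - 1)]
```

A user would build one sketch per partition, merge (for the reservoir version, merging needs a little more care than shown), and then ask for `quantile(.5)` locally.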
Cardinality estimation. Estimate the number of distinct values in the stream. (Like HyperLogLog, but there is a new method that appears to be better.) If I understand the paper correctly, for a certain parameter setting this sketch uses at most 20kB of space, and gives an unbiased estimate of the number of distinct values with standard deviation less than .004*N.

Cardinality estimation of set operations. A more flexible version of cardinality estimation. Using only S, for an arbitrary filter predicate P, estimate the cardinality of the stream filtered by P. Can also use sketches from two streams to estimate the cardinalities of their union and intersection.

Example uses:
- Estimate the size of a filter or join, and use that to choose the most efficient computation path. E.g., if we expect a filter result to be sufficiently small, collect and broadcast it before joining.
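One simple sketch family with these set-operation properties is k-minimum-values (KMV): keep the k smallest hash values seen. Two KMV sketches can be merged for unions, and intersections fall out via an estimated Jaccard similarity. This is an illustrative sketch of KMV under my own naming, not necessarily the method alluded to above.

```python
import hashlib

def _h(v):
    """Hash a value to a float in [0, 1)."""
    return int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big") / 2**64

def kmv(values, k=256):
    """KMV sketch: the k smallest distinct hash values in the stream."""
    return sorted(set(map(_h, values)))[:k]

def estimate_distinct(sk, k=256):
    if len(sk) < k:
        return len(sk)            # fewer than k distinct values: count is exact
    return (k - 1) / sk[k - 1]    # k-th smallest hash of n uniforms is ~ k/(n+1)

def estimate_union(sk_a, sk_b, k=256):
    # the k smallest hashes of A ∪ B are all present in one of the two sketches
    return estimate_distinct(sorted(set(sk_a) | set(sk_b))[:k], k)

def estimate_intersection(sk_a, sk_b, k=256):
    merged = sorted(set(sk_a) | set(sk_b))[:k]
    sa, sb = set(sk_a), set(sk_b)
    jaccard = sum(h in sa and h in sb for h in merged) / len(merged)
    return jaccard * estimate_union(sk_a, sk_b, k)
```

The same trick supports the filtered-cardinality query if the sketch stores the original values alongside their hashes, so P can be applied to the sample.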

Counting filters. Like Bloom filters, but better. Using only S, for any value x, estimate the count of x (number of occurrences). With probability 1 − 𝛿 this count is exact, and it is always greater than or equal to the true count, so, like Bloom filters, if the filter says the count is 0, the count is really 0. Similar to heavy hitters below, but accurate for more than just the most frequent items, in exchange using significantly more space. This filter has a better space/accuracy tradeoff than Bloom filters even in the case where values are distinct.

Example uses:
- In joins where neither side is sorted by the join key, we could compute a counting filter approximating the keys of one side, broadcast it, and use it to filter the other side to approximately the set of rows that will participate in the join (?).
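A related structure with the same one-sided guarantee is the count-min sketch: estimates never undercount, and only exceed the true count when a value collides in every row. It's probably not the exact construction meant above (it lacks the "exact with probability 1 − 𝛿 even under collisions" property), but it shows the shape of the API.

```python
import hashlib

class CountMin:
    """Count-min sketch. count(x) is always >= the true count of x, and
    exceeds it only if x collides with other values in every row.
    (Illustrative; likely not the exact 'counting filter' meant above.)"""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _idx(self, x, row):
        # a different hash function per row, derived by salting with the row index
        h = hashlib.sha1(f"{row}:{x}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.width

    def add(self, x):
        for row in range(self.depth):
            self.table[row][self._idx(x, row)] += 1

    def count(self, x):
        # the min over rows is the least-collided, hence tightest, overestimate
        return min(self.table[row][self._idx(x, row)] for row in range(self.depth))
```

For the join use case, the filtering side would call `count(key) > 0` per row; a 0 answer is always safe to drop.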
Weighted Streams
Consider a key-value stream where the values are all positive. I’ll refer to the values as weights. The keys don’t need to be unique, but conceptually we’re interested in the deduplicated data assigning the sum of weights w(x) to each key x.

Heavy hitters. Using only S, given any key x, compute an approximation S.w(x) of the true weight w(x) of x. Also using only S, given any threshold 𝜑, identify all keys x with w(x) ≥ 𝜑N, where N is the sum of all weights in the stream, possibly also returning a few false positives whose weight is slightly less than 𝜑N.

The simplest quantitative error bound (conservative; stronger bounds have been proven) is that, for a sketch of size k and any x, 0 ≤ w(x) − S.w(x) ≤ N/k. Note that this is only a good estimate when w(x) is significantly larger than N/k (i.e. x is “heavy”).

Example uses:
- An aggregator for categorical data to find the most frequent categories, with estimates of their frequencies.
- I think this could be used as a primitive to make a smarter map-side combiner.
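One classic algorithm achieving exactly that 0 ≤ w(x) − S.w(x) ≤ N/k bound is Misra-Gries (the post doesn't commit to a specific algorithm; this is one candidate, shown here for unit weights):

```python
def misra_gries(stream, k):
    """Misra-Gries heavy-hitters sketch with k counters, unit weights.
    Guarantees 0 <= true_count(x) - est(x) <= N/k for every x,
    where est(x) = counters.get(x, 0)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # all counters full: decrement everything (the new item is
            # conceptually decremented too, so it is not stored)
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Because estimates never overcount, any key with `est(x) ≥ 𝜑N − N/k` is a candidate heavy hitter, and every true heavy hitter survives in the counter set.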

Subset sums. Using only S, given any filter predicate P on the keys, estimate the sum of the weights of the filtered stream. If the keys are unique, there is an optimal method which achieves the minimum possible variance in the estimates. If the keys are non-unique (the stream is “unaggregated”), the method generalizes and performs well, but no single method can be optimal for all streams.

Example uses:
- Exploratory analyses: quickly trying different QC filters and estimating their effects on marginal sums (?).

Papers: http://dblp.org/rec/conf/soda/CohenDKLT09, http://dblp.org/rec/journals/pvldb/CohenDKLT09
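To show the shape of the interface, here is a sketch of priority sampling (a simpler relative of the variance-optimal method in the papers above, not that method itself), for the unique-keys case: each item gets priority w/u for uniform u, we keep the k highest-priority items, and any subset sum is estimated from the sample alone.

```python
import random

def priority_sample(weighted_items, k, seed=0):
    """Priority sampling over (key, weight) pairs with unique keys.
    (Illustrative; the cited papers' variance-optimal method differs.)
    Returns (sample, tau), where sample holds (key, weight, estimated_weight)."""
    rng = random.Random(seed)
    prios = [(w / rng.random(), key, w) for key, w in weighted_items]
    prios.sort(reverse=True)
    if len(prios) <= k:
        # everything fits: estimates are exact
        return [(key, w, w) for _, key, w in prios], 0.0
    tau = prios[k][0]  # the (k+1)-th highest priority is the threshold
    # each sampled item contributes max(w, tau); this makes estimates unbiased
    return [(key, w, max(w, tau)) for _, key, w in prios[:k]], tau

def subset_sum(sample, pred):
    """Unbiased estimate of the total weight of keys satisfying pred."""
    return sum(est for key, w, est in sample if pred(key))
```

The key point for the QC use case: the predicate is applied only at query time, so one sample supports any number of different filters after the fact.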