List of approximate methods

With the approximate CDF aggregator about to go in, I thought it would be good to collect in one place all the other approximate methods I could potentially add, and some possible uses of each. I’m sure you could all think of other uses, and help decide which of these might be impactful additions.

I’ll refer to the data being aggregated as “the stream”, though I’m only considering methods which work in the partitioned stream model. All methods compute a sketch/sample/summary S in a single pass over the data (an aggregator), and can then support some class of queries using only the information in S.

Unweighted Streams

  • Approximate CDF. (Already done.) Using only S, given any value x, compute an approximation S.R(x) of the true rank R(x), the number of values in the stream less than x. Given fixed 𝜀 and 𝛿, we want to compute an S with the property that, with probability 1-𝛿, for all values of x simultaneously, abs(S.R(x) - R(x)) ≤ 𝜀N, where N is the stream size. The algorithm I implemented can do this using a sketch of asymptotic size O((1/𝜀) · log² log(1/(𝜀𝛿))).

    I made some measurements of my implementation: in practice, a sketch holding 500 values achieves an average error of .0025 and a max error of .0072. That means if you ask for the median, you will get a value x whose normalized rank R(x)/N is .5 ± .0072, and probably better. The error decreases roughly in inverse proportion to the size of the sketch. (A toy illustration of the rank/quantile query interface follows the paper link below.)

    Example uses:

    • Approximate quantiles aggregator, e.g.
      mt.annotate_cols(quartiles = hl.agg.approx_quantiles(mt.foo, [0, .25, .5, .75, 1]))
      
    • Distribution visualization. Compute a sketch, collect it to Python, then interactively explore the distribution, drawing histograms with various bin sizes, smoothed kernel approximations, etc., all locally.

    Paper: http://arxiv.org/abs/1603.05346v2
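
    For a sense of the query interface (independent of the KLL algorithm in the paper), here is a toy version that answers the same rank and quantile queries from a plain uniform reservoir sample. Everything here (the class name, the choice k=500) is just illustrative; the real sketch should give noticeably better error for the same space.

      import bisect
      import random

      class ReservoirRankSketch:
          """Toy rank/quantile summary: a uniform reservoir sample of size k.

          Not the KLL sketch from the paper; it only illustrates the
          aggregator shape: one pass over the stream, then rank/quantile
          queries answered from the summary alone.
          """
          def __init__(self, k=500, seed=0):
              self.k = k
              self.n = 0                # stream size seen so far
              self.sample = []          # uniform sample of the stream
              self.rng = random.Random(seed)

          def add(self, value):
              self.n += 1
              if len(self.sample) < self.k:
                  self.sample.append(value)
              else:
                  j = self.rng.randrange(self.n)
                  if j < self.k:
                      self.sample[j] = value

          def rank(self, x):
              """Approximate R(x): number of stream values less than x."""
              s = sorted(self.sample)
              return self.n * bisect.bisect_left(s, x) / len(s)

          def quantile(self, q):
              """Approximate q-th quantile, 0 <= q <= 1."""
              s = sorted(self.sample)
              return s[min(int(q * len(s)), len(s) - 1)]

      # One pass to build the summary, then queries use only the summary,
      # e.g. quantile(.5) for the median, or rank() at histogram bin edges
      # to draw a histogram locally.
      sketch = ReservoirRankSketch(k=500)
      for _ in range(100000):
          sketch.add(random.gauss(0, 1))
      print(sketch.quantile(.5), sketch.rank(0.0))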

  • Cardinality estimation. Estimate the number of distinct values in the stream. (Like HyperLogLog, but there is a newer method that appears to be better.) If I understand the paper correctly, for a certain parameter setting this sketch uses at most 20kB of space and gives an unbiased estimate of the number of distinct values with standard deviation less than .004*N. (A toy hash-based estimator is sketched after the paper link.)

    Paper: https://arxiv.org/abs/1708.06839
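
    The estimator in the paper is more involved; as a rough illustration of the hash-based idea, here is a k-minimum-values (KMV) sketch, a classical distinct-count estimator. The class, the choice k=1024, and the use of SHA-1 are all just for illustration.

      import hashlib

      def hash01(value):
          """Hash a value to a float in [0, 1)."""
          h = hashlib.sha1(repr(value).encode()).digest()
          return int.from_bytes(h[:8], "big") / 2**64

      class KMVSketch:
          """k-minimum-values distinct-count estimator (illustrative only)."""
          def __init__(self, k=1024):
              self.k = k
              self.mins = set()   # the k smallest distinct hashes seen so far

          def add(self, value):
              h = hash01(value)
              if h in self.mins or (len(self.mins) >= self.k and h >= max(self.mins)):
                  return
              self.mins.add(h)
              if len(self.mins) > self.k:
                  self.mins.remove(max(self.mins))

          def merge(self, other):
              """Combine sketches from two partitions: keep the k smallest overall."""
              self.mins = set(sorted(self.mins | other.mins)[:self.k])

          def estimate(self):
              if len(self.mins) < self.k:
                  return len(self.mins)   # exact (up to hash collisions) when not full
              return (self.k - 1) / max(self.mins)

    Because two sketches merge by just keeping the k smallest hashes of their union, this style of estimator fits the partitioned stream model directly.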

  • Cardinality estimation of set operations. A more flexible version of cardinality estimation. Using only S, for an arbitrary filter predicate P, estimate the cardinality of the stream filtered by P. Sketches from two streams can also be combined to estimate the cardinalities of their union and intersection (see the combining example below).

    Example uses:

    • Estimate the size of a filter or join, and use that to choose the most efficient computation path. E.g., if we expect the result of a filter to be sufficiently small, collect and broadcast it before joining.

    Paper: http://arxiv.org/abs/1510.01455v2
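
    This is not the method from the paper, but the KMVSketch from the previous item already supports this style of query in a theta-sketch-like way: take the smaller of the two inclusion thresholds and count the surviving hashes. The helper names below are mine.

      # Assumes the KMVSketch class sketched under cardinality estimation above.

      def theta(sk):
          """Inclusion threshold: the k-th smallest hash, or 1.0 if not yet full."""
          return max(sk.mins) if len(sk.mins) >= sk.k else 1.0

      def estimate_union(a, b):
          t = min(theta(a), theta(b))
          return len({h for h in (a.mins | b.mins) if h < t}) / t

      def estimate_intersection(a, b):
          t = min(theta(a), theta(b))
          return len({h for h in (a.mins & b.mins) if h < t}) / t

    Intersection estimates get noisy when the overlap is a small fraction of either stream, so for the filter/join-size use case k would need to be large enough for the selectivities involved.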

  • Counting filters. Like Bloom filters, but better. Using only S, for any value x, estimate the count of x (number of occurrences). With probability 1-𝛿 this count is exact, and it is always greater than or equal to the true count, so, like Bloom filters, if the filter says the count is 0, the count is really 0. Similar to heavy hitters below, but accurate for more than just the most frequent items, in exchange for significantly more space. This filter has a better space/accuracy tradeoff than Bloom filters even when the values are distinct. (An illustration of the never-undercount behavior follows the paper link.)

    Example uses:

    • In joins where neither side is sorted by the join key, we could compute a counting filter approximating the keys of one side, broadcast it, and use it to filter the other side down to approximately the set of rows that will participate in the join (?).

    Paper: http://dl.acm.org/citation.cfm?doid=3035918.3035963
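
    The structure in the paper is different (and more space-efficient), but a count-min sketch shares the crucial property that estimates never undercount, so an estimate of 0 is a guaranteed absence, which is what the join-filtering idea above relies on. The width/depth parameters here are arbitrary.

      import hashlib

      class CountMinSketch:
          """Approximate per-value counts; estimates never undercount."""
          def __init__(self, width=2048, depth=4):
              self.width = width
              self.depth = depth
              self.table = [[0] * width for _ in range(depth)]

          def _index(self, value, row):
              h = hashlib.sha1(f"{row}:{value!r}".encode()).digest()
              return int.from_bytes(h[:8], "big") % self.width

          def add(self, value, count=1):
              for row in range(self.depth):
                  self.table[row][self._index(value, row)] += count

          def count(self, value):
              """Upper bound on the number of occurrences of value."""
              return min(self.table[row][self._index(value, row)]
                         for row in range(self.depth))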

Weighted Streams

Consider a key-value stream where the values are all positive. I’ll refer to the values as weights. The keys don’t need to be unique, but conceptually we’re interested in the deduplicated data, which assigns to each key x the total weight w(x): the sum of the weights of all entries with key x.

  • Heavy hitters. Using only S, given any key x, compute an approximation S.w(x) of the true weight w(x). Also using only S, given any threshold 𝜑, identify all keys x with w(x) ≥ 𝜑N, where N is the sum of all weights in the stream, possibly also returning a few false positives whose weight is slightly less than 𝜑N.

    The simplest quantitative error bound (conservative; stronger bounds have been proven) is: for a sketch of size k and any x, 0 ≤ w(x) - S.w(x) ≤ N/k. Note that this is only a good estimate when w(x) is significantly larger than N/k, i.e. when x is “heavy”. (A minimal summary with this bound is sketched after the paper link.)

    Example uses:

    • An aggregator for categorical data to find the most frequent categories, and estimates of their frequencies.
    • I think this could be used as a primitive to make a smarter map-side combiner.

    Paper: http://arxiv.org/abs/1603.05346v2
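
    One classical summary with exactly the bound above is Misra-Gries (not necessarily the algorithm the paper analyzes); a minimal weighted version:

      class MisraGriesSketch:
          """Heavy-hitters summary with at most k counters.

          Guarantees 0 <= w(x) - estimate(x) <= N/(k+1), where N is the
          total weight seen so far.
          """
          def __init__(self, k=100):
              self.k = k
              self.counters = {}

          def add(self, key, weight=1):
              self.counters[key] = self.counters.get(key, 0) + weight
              if len(self.counters) > self.k:
                  # subtract the minimum from every counter and drop the zeros
                  m = min(self.counters.values())
                  self.counters = {x: c - m for x, c in self.counters.items() if c > m}

          def estimate(self, key):
              """Lower bound on w(key), within N/(k+1) of the truth."""
              return self.counters.get(key, 0)

          def heavy_hitters(self, phi, total_weight):
              """Keys that may have w(x) >= phi * total_weight (no false negatives)."""
              cutoff = (phi - 1 / (self.k + 1)) * total_weight
              return {x: c for x, c in self.counters.items() if c >= cutoff}

    For the categorical-data use, heavy_hitters with the observed total weight returns the frequent categories along with (under-)estimates of their frequencies.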

  • Subset sums. Using only S, given any filter predicate P on the keys, estimate the sum of the weights of the filtered stream. If the keys are unique, there is an optimal method which achieves the minimum possible variance in the estimates. If the keys are non-unique (the stream is “unaggregated”), the method generalizes and performs well, but no single method can be optimal for all streams. (A simplified sampling-based sketch follows the paper links.)

    Example uses:

    • Exploratory analyses, quickly trying different QC filters and estimating their effects on marginal sums (?).

    Papers: http://dblp.org/rec/conf/soda/CohenDKLT09, http://dblp.org/rec/journals/pvldb/CohenDKLT09
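
    The variance-optimal scheme in those papers is fairly involved; to illustrate the query interface, here is priority sampling, a simpler unbiased subset-sum estimator for the unique-key case (not variance-optimal, and all names and parameters are illustrative).

      import heapq
      import random

      class PrioritySampleSketch:
          """Priority sample of size k over (key, weight) pairs, keys assumed unique.

          Supports unbiased subset-sum estimates from the sample alone.
          """
          def __init__(self, k=1000, seed=0):
              self.k = k
              self.rng = random.Random(seed)
              self.heap = []     # min-heap of (priority, tiebreak, key, weight)
              self.tau = 0.0     # (k+1)-th largest priority seen so far
              self.count = 0

          def add(self, key, weight):
              u = 1.0 - self.rng.random()          # uniform in (0, 1]
              self.count += 1
              heapq.heappush(self.heap, (weight / u, self.count, key, weight))
              if len(self.heap) > self.k:
                  evicted, _, _, _ = heapq.heappop(self.heap)
                  self.tau = max(self.tau, evicted)

          def subset_sum(self, predicate):
              """Unbiased estimate of the total weight of keys satisfying predicate."""
              return sum(max(weight, self.tau)
                         for _, _, key, weight in self.heap
                         if predicate(key))

    For example, subset_sum(lambda key: key.startswith("chr2")) would estimate a marginal sum after the fact, without another pass over the data, which is the kind of interactive exploration mentioned above.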