估计频次值
Snowflake 使用节省空间算法,这是一种节省空间和时间的有效方法,用于估计数据集中的近似频次值。
概述
Snowflake provides an implementation of the Space-Saving algorithm presented in Efficient Computation of Frequent and Top-k Elements in Data Streams (https://www.cs.ucsb.edu/research/tech-reports/2005-23) by Metwally, Agrawal and Abbadi. It is implemented through the APPROX_TOP_K family of functions.
Additionally, the APPROX_TOP_K_COMBINE function utilizes the parallel Space-Saving algorithm (https://arxiv.org/abs/1401.0702) outlined by Cafaro, Pulimeno and Tempesta.
算法的错误百分比很大程度上取决于数据的偏斜程度以及算法中使用的计数器数量。随着数据变得更加偏斜,或者使用更多的计数器,输出将更加准确。
SQL 函数¶
The following Aggregate functions are provided for using Space-Saving to estimate frequent values:
- APPROX_TOP_K: Returns an approximation of frequent values in the input.
- APPROX_TOP_K_ACCUMULATE: Skips the final estimation step and returns the Space-Saving state at the end of an aggregation.
- APPROX_TOP_K_COMBINE: Combines (i.e. merges) input states into a single output state.
- APPROX_TOP_K_ESTIMATE: Computes a cardinality estimate of a Space-Saving state produced by APPROX_TOP_K_ACCUMULATE and APPROX_TOP_K_COMBINE.
实施细节
实施中的每个计数器都会跟踪一个项目及其频率。值得注意的是,我们的实施不跟踪计数器的 epsilon 值,因为它们仅用于保证算法的输出,不用于算法本身。
计数器的最大数量设置为 10 万。在这种情况下,内存中存储了 10 万个计数器,但其中只有一小部分以导出状态存储。
The maximum number of k is 100 thousand. This value is automatically reduced if all the values cannot fit in the output.
在大多数情况下,实施的运行时不依赖于计数器的数量。实施确保计数器的数量不会对算法的运行时产生明显影响。
Each counter in each aggregation state uses a constant amount of memory overhead of around 100 bytes. Thus, if an aggregation uses c counters and there are g aggregation groups, the
aggregation will use c * g * 100B of memory, plus memory to store the values. If this memory exceeds the total memory budget, memory is spilled to disk. This is far less memory than the
exact version would use, especially when there a large number of unique values.