估算百分位值
Snowflake 使用改进版的 T-Digest 算法,能以节省空间和时间的方式估算数据集中的近似百分位值。
概述
Snowflake provides an improved version of an implementation of the t-Digest algorithm papers (https://github.com/tdunning/t-digest/tree/master/docs/t-digest-paper) by Dunning and Ertl. It has been implemented through the APPROX_PERCENTILE family of functions.
根据记载,该算法具有恒定的相对误差。请注意,该算法有大量的经验支持,但未经严格证明其能够保证精度。
SQL 函数¶
The following Aggregate functions are provided for using t-Digest to approximate percentile values:
- APPROX_PERCENTILE: Returns an approximation of the desired percentile value.
- APPROX_PERCENTILE_ACCUMULATE: Skips the final estimation step and, instead, returns the intermediate t-Digest state at the end of an aggregation.
- APPROX_PERCENTILE_COMBINE: Combines (i.e. merges) multiple input states into a single output state.
- APPROX_PERCENTILE_ESTIMATE: Computes a percentile estimate of a t-Digest state produced by APPROX_PERCENTILE_ACCUMULATE or APPROX_PERCENTILE_COMBINE.
实施细节
- 无论输入的大小如何,估算都使用恒定的空间量。
- T-Digest 状态与百分位值无关。这样一来,计算一次 T-Digest 状态便能查询多个百分位值的状态。