估计非重复值的数量

Snowflake 使用 HyperLogLog 估计数据集中非重复值的大致数量。HyperLogLog 是一种更先进的基数估计算法，能够估计数万亿行的不同基数，平均相对误差为百分之几。

HyperLogLog can be used in place of COUNT(DISTINCT …) in situations where estimating cardinality is acceptable.

概述

Snowflake provides a bias-corrected implementation of the HyperLogLog algorithm presented in HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm (http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) by Flajolet et al.

We recommend using HyperLogLog whenever the input is potentially large and an approximate result is acceptable. The average relative error of our HyperLogLog implementation is 1.62338% (i.e. the average relative difference to the corresponding COUNT(DISTINCT …) result).

SQL 函数¶

The following Aggregate functions are provided for estimating cardinality using HyperLogLog:

HLL: Returns an approximation of the distinct cardinality of the input.
HLL_ACCUMULATE: Skips the final estimation step and returns the HyperLogLog state at the end of an aggregation.
HLL_COMBINE: Combines (i.e. merges) input states into a single output state.
HLL_ESTIMATE: Computes a cardinality estimate of a HyperLogLog state produced by HLL_ACCUMULATE and HLL_COMBINE.
HLL_EXPORT: Converts HyperLogLog states from BINARY format to an OBJECT (which can then be printed and exported as JSON).
HLL_IMPORT: Converts HyperLogLog states from OBJECT format to BINARY format.

实施细节

我们的实现将输入行哈希为 64 位值，其中上 12 位或“精度”（如 HyperLogLog 算法论文中所述；有关详细信息，请参阅上面的文档链接）用于将输入值划分为所谓的子流。这会产生以下平均相对误差：

sqrt(3*ln(2)-1)/sqrt(2^precision) = 0.0162338 = 1.62338%

In other words, for a query where COUNT(DISTINCT …) would return a result of 1,000,000, HyperLogLog typically returns a result in the range of 983,767 to 1,016,234.

For each sub-stream, HyperLogLog maintains the maximum leading-zero count (between 0 and 52 for 64-bit values at precision = 12). The most straight-forward representation of this state is a simple byte array, one byte for each of the 2^12 = 4096 sub-streams. Our implementation indeed requires at most 4096 Byte (2^precision = 2^12 = 4096) of memory per aggregation group. Technically, only 6 bits (rather than 8 bits) are required per sub-stream, but we trade some space efficiency for computational efficiency.

For small input cardinalities, most of the sub-streams will never be hit. So rather than allocating an entire block of 4096 Byte per aggregation group up-front, our implementation uses a space-optimized “sparse” representation of this state whenever beneficial. Consequently, the memory cost of HyperLogLog can be substantially lower than 4096 Byte per aggregation group (down to about 32 Byte per aggregation group). This allows cardinality estimation over many aggregation groups (millions or even billions, as determined by the GROUP BY or OVER clause of the query), using orders of magnitude less memory and CPU time than a corresponding COUNT(DISTINCT …) query.

在（罕见）情况下，如果一个非常大的输入表和许多聚合组导致 HyperLogLog 超出其总内存预算，Snowflake 仍然能够溢出到临时空间并执行递归聚合，就像任何其他聚合函数一样。

导出的状态格式

The state of the HyperLogLog algorithm can be exported and imported (or reimported) using the HLL_EXPORT and HLL_IMPORT functions, respectively. The exported state is of type OBJECT and contains the following fields.

密集格式

version:: HyperLogLog 实现的版本号。
precision:: 用于选择子流的哈希值位数。目前固定为 12。
dense:: 整数数组，每个整数包含相应子流的最大前导零计数 + 1。0 表示尚未命中相应的子流。合法值在 0 到 53 的范围内。相应的子流索引由数组中的元素位置给出。

例如：

{
  "version" : 3,
  "precision" : 12,
  "dense" : [3,3,3,3,5,3,4,3,5,6,2,4,4,7,5,6,6,3,2,2,3,2,4,5,5,5,2,5,5,3,6,1,4,2,2,4,4,5,2,5,...,4,6,3]
}

稀疏格式

version:

HyperLogLog 实现的版本号。

precision:

用于选择子流的哈希值位数。目前固定为 12。

sparse:

indices: An array of integers, each containing a sub-stream index (base 0). Legal values are in the range of 0 to 4095.

maxLzCounts: An array of integers, each containing the maximum leading-zero count + 1 for the corresponding sub-stream. 0 indicates that the corresponding sub-stream has not been hit yet.: Legal values are in the range of 0 to 53. The sub-stream for a given leading-zero count is given by the corresponding element in the indices array.

The indices and maxLzCounts arrays must have the same length. The HLL_IMPORT function also checks that sub-stream indices are in the valid range, and that there are no duplicate sub-stream indices. The indices array need not be sorted. The leading-zero counts are not validated. Invalid values will not cause query failures, but will lead to undefined results for HLL_ESTIMATE.

例如：

{
  "version" : 3,
  "precision" : 12,
  "sparse" : {
    "indices": [1131,1241,1256,1864,2579,2699,3730],
    "maxLzCounts":[2,4,2,1,3,2,1]
  }
}

示例

环境设置：

USE WAREHOUSE dontdrop;
USE DATABASE stressdb;
USE SCHEMA bdb_5nodes;

SELECT COUNT(*) FROM uservisits;

-----------+
 COUNT(*)  |
-----------+
 751754869 |
-----------+

第 1 步：

Create a table that contains the calendar date (year/month/day) and the HLL structure. We use HLL_EXPORT to store the binary structure as a text object:

CREATE OR REPLACE TABLE daily_uniques
AS
SELECT
 visitdate,
 hll_export(hll_accumulate(sourceip)) AS hll_sourceip
FROM uservisits
GROUP BY visitdate;

第 2 步：

We can calculate the unique IPs by month by aggregating each day’s HLL structure from Step 1. We use HLL_IMPORT to transform the text to the binary structure, then HLL_COMBINE to combine multiple HLL structures into a single structure, then HLL_ESTIMATE to compute the number of distinct values:

SELECT
  EXTRACT(year FROM visitdate) AS visit_year,
  EXTRACT(month FROM visitdate) AS visit_month,
  hll_estimate(hll_combine(hll_import(hll_sourceip))) AS distinct_ips
FROM daily_uniques
WHERE visitdate BETWEEN '2000-01-01' AND '2000-12-31'
GROUP BY 1,2
ORDER BY 1,2;

------------+-------------+--------------+
 VISIT_YEAR | VISIT_MONTH | DISTINCT_IPS |
------------+-------------+--------------+
       2000 |           1 |      1515168 |
       2000 |           2 |      1410289 |
       2000 |           3 |      1491997 |
       2000 |           4 |      1460837 |
       2000 |           5 |      1546647 |
       2000 |           6 |      1485599 |
       2000 |           7 |      1522643 |
       2000 |           8 |      1492831 |
       2000 |           9 |      1488507 |
       2000 |          10 |      1553201 |
       2000 |          11 |      1461140 |
       2000 |          12 |      1515772 |
------------+-------------+--------------+

Elapsed 1.3s

比较：

We compare the use of the aggregation using the HLL functions to HLL on the detail level data. In this case, HLL() is equivalent to the HLL_ESTIMATE(HLL_COMBINE(HLL_IMPORT())) from Step 2:

SELECT
  EXTRACT(year FROM visitdate) AS visit_year,
  EXTRACT(month FROM visitdate) AS visit_month,
  hll(sourceip) AS distinct_ips
FROM uservisits
WHERE visitdate BETWEEN '2000-01-01' AND '2000-12-31'
GROUP BY 1,2
ORDER BY 1,2;

------------+-------------+--------------+
 VISIT_YEAR | VISIT_MONTH | DISTINCT_IPS |
------------+-------------+--------------+
       2000 |           1 |      1515168 |
       2000 |           2 |      1410289 |
       2000 |           3 |      1491997 |
       2000 |           4 |      1460837 |
       2000 |           5 |      1546647 |
       2000 |           6 |      1485599 |
       2000 |           7 |      1522643 |
       2000 |           8 |      1492831 |
       2000 |           9 |      1488507 |
       2000 |          10 |      1553201 |
       2000 |          11 |      1461140 |
       2000 |          12 |      1515772 |
------------+-------------+--------------+

Elapsed 2m 29s

正如您所看到的，HLL 结构的聚合明显快于基础数据的聚合，例如，在这个小示例中，1.3 秒对 149 秒，这意味着查询时间减少了 100 倍。