Categories:

Aggregate functions (Frequency Estimation) , Window function syntax and usage

APPROX_TOP_K_ACCUMULATE

Returns the Space-Saving summary at the end of aggregation. (For more information about the Space-Saving summary, see Estimating Frequent Values.)

The function APPROX_TOP_K discards its internal, intermediate state when the final cardinality estimate is returned. However, in certain advanced use cases, such as estimating incremental frequent values during bulk loading, you might want to keep the intermediate state, in which case you would use APPROX_TOP_K_ACCUMULATE instead of APPROX_TOP_K.

In contrast to APPROX_TOP_K, APPROX_TOP_K_ACCUMULATE does not return a frequency estimate of items. Instead, it returns the algorithm state itself. The intermediate state can later be:

  • Combined (that is, merged) with intermediate states from separate but related batches of data.
  • Processed by other functions that operate directly on the intermediate state, for example, APPROX_TOP_K_ESTIMATE. (For an example, see the Examples section below.)
  • 导出到外部工具。
See also:

APPROX_TOP_K_COMBINE, APPROX_TOP_K_ESTIMATE

语法

APPROX_TOP_K_ACCUMULATE( <expr> , <counters> )

实参

expr

您想要查找最常见值的表达式(例如列名称)。

counters

这是在估算过程中一次可以跟踪的非重复值数量上限。

For example, if counters is set to 100000, then the algorithm tracks 100,000 distinct values, attempting to keep the 100,000 most frequent values.

The maximum number of counters is 100000 (100,000).

使用说明

  • Decimal-float (DECFLOAT) values aren’t supported.

示例

此示例说明如何使用三个相关函数 APPROX_TOP_K_ACCUMULATE、APPROX_TOP_K_ESTIMATE 和 APPROX_TOP_K_COMBINE。

Note

此示例使用比不同数据值更多的计数器,以便获得一致的结果。在实际应用中,非重复值的数量通常大于计数器的数量,因此近似值可能会有所不同。

此示例生成一个表,其中包含值为 1-8 的 8 行,以及另一个包含 8 行的表,其值为 5-12。因此,两个表的并集中比较常见的值是值 5-8,每个值的计数为 2。

创建简单的表和数据:

CREATE OR REPLACE SEQUENCE seq91;
CREATE OR REPLACE TABLE sequence_demo (c1 INTEGER DEFAULT seq91.NEXTVAL, dummy SMALLINT);
INSERT INTO sequence_demo (dummy) VALUES (0);

INSERT INTO sequence_demo (dummy) SELECT dummy FROM sequence_demo;
INSERT INTO sequence_demo (dummy) SELECT dummy FROM sequence_demo;
INSERT INTO sequence_demo (dummy) SELECT dummy FROM sequence_demo;

Create a table that contains the “state” that represents the current approximate Top K information for the table named sequence_demo:

CREATE OR REPLACE TABLE resultstate1 AS (
  SELECT APPROX_TOP_K_ACCUMULATE(c1, 50) AS rs1
    FROM sequence_demo);

Now create a second table and add data. (In a more realistic situation, the user could have loaded more data into the first table and divided the data into non-overlapping sets based on the time that the data was loaded.)

CREATE OR REPLACE TABLE test_table2 (c1 INTEGER);
INSERT INTO test_table2 (c1) SELECT c1 + 4 FROM sequence_demo;

仅获取新数据的“状态”信息。

CREATE OR REPLACE TABLE resultstate2 AS
  (SELECT APPROX_TOP_K_ACCUMULATE(c1, 50) AS rs1
     FROM test_table2);

合并两批行的“状态”信息:

CREATE OR REPLACE TABLE combined_resultstate (c1) AS
  SELECT APPROX_TOP_K_COMBINE(rs1) AS apc1
    FROM (
      SELECT rs1 FROM resultstate1
      UNION ALL
      SELECT rs1 FROM resultstate2
    );

获取组合行集的近似 Top K 值:

SELECT APPROX_TOP_K_ESTIMATE(c1, 4)
  FROM combined_resultstate;
+------------------------------+
| APPROX_TOP_K_ESTIMATE(C1, 4) |
%------------------------------%
| [                            |
|   [                          |
|     5,                       |
|     2                        |
|   ],                         |
|   [                          |
|     6,                       |
|     2                        |
|   ],                         |
|   [                          |
|     7,                       |
|     2                        |
|   ],                         |
|   [                          |
|     8,                       |
|     2                        |
|   ]                          |
| ]                            |
+------------------------------+