- 类别:
:doc:`/sql-reference/functions-aggregation`(相似性估算)、:doc:`窗口函数 </sql-reference/functions-window-syntax>`(相似性估算)
MINHASH_COMBINE¶
将多个 MinHash 输入状态组合为单个 MinHash 输出状态。然后,可以将 Minhash 状态输入到 APPROXIMATE_SIMILARITY 函数中,以估算与其他 MinHash 状态的相似性。
这允许在同一表的水平行集上运行 MINHASH 的用例,为每个行集生成 MinHash 状态。然后,这些状态可以使用 MINHASH_COMBINE 进行组合,生成与整个表的 MINHASH 单次运行相同的输出状态。
有关 MinHash 状态的更多信息,请参阅 估计两个或多个集合的相似度。
- 另请参阅:
语法¶
聚合函数
MINHASH_COMBINE( [ DISTINCT ] <state> )
窗口函数
MINHASH_COMBINE( [ DISTINCT ] <state> ) OVER ( [ PARTITION BY <expr> ] )
有关 OVER 子句的详细信息,请参阅 窗口函数的语法和用法。
实参¶
state一个表达式,其中包含通过调用 MINHASH 生成的 MinHash 状态信息。输入 MinHash 状态必须具有长度相等的数组。
使用说明¶
示例¶
USE SCHEMA snowflake_sample_data.tpch_sf1;
SELECT MINHASH_COMBINE(mh) FROM
(
(SELECT MINHASH(5, c2) mh FROM orders WHERE c2 <= 10000)
UNION
(SELECT MINHASH(5, c2) mh FROM orders WHERE c2 > 10000 AND c2 <= 20000)
UNION
(SELECT MINHASH(5, C2) mh FROM orders WHERE c2 > 20000)
);
+-----------------------+
| MINHASH_COMBINE(MH) |
|-----------------------|
| { |
| "state": [ |
| 628914288006793, |
| 1071764954434168, |
| 991489123966035, |
| 2395105834644106, |
| 680224867834949 |
| ], |
| "type": "minhash", |
| "version": 1 |
| } |
+-----------------------+
此处是更广泛的示例,显示了三个相关的函数 MINHASH、MINHASH_COMBINE 和 APPROXIMATE_SIMILARITY。此示例创建 3 个表(ta、tb 和 tc),其中两个表(ta 和 tb)相似,两个表(ta 和 tc)完全不同。
创建并使用以下值填充表:
CREATE TABLE ta (i INTEGER); CREATE TABLE tb (i INTEGER); CREATE TABLE tc (i INTEGER); -- Insert values into the 3 tables. INSERT INTO ta (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10); -- Almost the same as the preceding values. INSERT INTO tb (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (11); -- Different values and different number of values. INSERT INTO tc (i) VALUES (-1), (-20), (-300), (-4000);计算初始数据集的 minhash 信息:
CREATE TABLE minhash_a_1 (mh) AS SELECT MINHASH(100, i) FROM ta; CREATE TABLE minhash_b (mh) AS SELECT MINHASH(100, i) FROM tb; CREATE TABLE minhash_c (mh) AS SELECT MINHASH(100, i) FROM tc;向其中一个表添加更多数据:
INSERT INTO ta (i) VALUES (12);演示
MINHASH_COMBINE函数:-- Record minhash information about only the new rows: CREATE TABLE minhash_a_2 (mh) AS SELECT MINHASH(100, i) FROM ta WHERE i > 10; -- Now combine all the minhash info for the old and new rows in table ta. CREATE TABLE minhash_a (mh) AS SELECT MINHASH_COMBINE(mh) FROM ( (SELECT mh FROM minhash_a_1) UNION ALL (SELECT mh FROM minhash_a_2) );此查询显示两个相似表(ta 和 tb)的近似相似度:
SELECT APPROXIMATE_SIMILARITY (mh) FROM ( (SELECT mh FROM minhash_a) UNION ALL (SELECT mh FROM minhash_b) ); +-----------------------------+ | APPROXIMATE_SIMILARITY (MH) | |-----------------------------| | 0.75 | +-----------------------------+此查询显示两个截然不同的表(ta 和 tc)的近似相似度:
SELECT APPROXIMATE_SIMILARITY (mh) FROM ( (SELECT mh FROM minhash_a) UNION ALL (SELECT mh FROM minhash_c) ); +-----------------------------+ | APPROXIMATE_SIMILARITY (MH) | |-----------------------------| | 0 | +-----------------------------+