MINHASH_COMBINE¶
将多个 MinHash 输入状态组合为单个 MinHash 输出状态。然后,可以将 Minhash 状态输入到 APPROXIMATE_SIMILARITY 函数中,以估算与其他 MinHash 状态的相似性。
这允许在同一表的水平行集上运行 MINHASH 的用例,为每个行集生成 MinHash 状态。然后,这些状态可以使用 MINHASH_COMBINE 进行组合,生成与整个表的 MINHASH 单次运行相同的输出状态。
有关 MinHash 状态的更多信息,请参阅 估计两个或多个集合的相似度。
- 另请参阅:
语法¶
MINHASH_COMBINE( [ DISTINCT ] <state> )
使用说明¶
DISTINCT 可以作为实参包含在内,但不起作用。
输入 MinHash
state
必须具有长度相等的 MinHash 数组。
示例¶
USE SCHEMA snowflake_sample_data.tpch_sf1;
SELECT MINHASH_COMBINE(mh) FROM
(
(SELECT MINHASH(5, c2) mh FROM orders WHERE c2 <= 10000)
UNION
(SELECT MINHASH(5, c2) mh FROM orders WHERE c2 > 10000 AND c2 <= 20000)
UNION
(SELECT MINHASH(5, C2) mh FROM orders WHERE c2 > 20000)
);
+-----------------------+
| MINHASH_COMBINE(MH) |
|-----------------------|
| { |
| "state": [ |
| 628914288006793, |
| 1071764954434168, |
| 991489123966035, |
| 2395105834644106, |
| 680224867834949 |
| ], |
| "type": "minhash", |
| "version": 1 |
| } |
+-----------------------+
此处是更广泛的示例,显示了三个相关的函数 MINHASH
、MINHASH_COMBINE
和 APPROXIMATE_SIMILARITY
。此示例创建 3 个表(ta、tb 和 tc),其中两个表(ta 和 tb)相似,两个表(ta 和 tc)完全不同。
创建并使用以下值填充表:
CREATE TABLE ta (i INTEGER); CREATE TABLE tb (i INTEGER); CREATE TABLE tc (i INTEGER); -- Insert values into the 3 tables. INSERT INTO ta (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10); -- Almost the same as the preceding values. INSERT INTO tb (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (11); -- Different values and different number of values. INSERT INTO tc (i) VALUES (-1), (-20), (-300), (-4000);计算初始数据集的 minhash 信息:
CREATE TABLE minhash_a_1 (mh) AS SELECT MINHASH(100, i) FROM ta; CREATE TABLE minhash_b (mh) AS SELECT MINHASH(100, i) FROM tb; CREATE TABLE minhash_c (mh) AS SELECT MINHASH(100, i) FROM tc;向其中一个表添加更多数据:
INSERT INTO ta (i) VALUES (12);演示
MINHASH_COMBINE
函数:-- Record minhash information about only the new rows: CREATE TABLE minhash_a_2 (mh) AS SELECT MINHASH(100, i) FROM ta WHERE i > 10; -- Now combine all the minhash info for the old and new rows in table ta. CREATE TABLE minhash_a (mh) AS SELECT MINHASH_COMBINE(mh) FROM ( (SELECT mh FROM minhash_a_1) UNION ALL (SELECT mh FROM minhash_a_2) );此查询显示两个相似表(ta 和 tb)的近似相似度:
SELECT APPROXIMATE_SIMILARITY (mh) FROM ( (SELECT mh FROM minhash_a) UNION ALL (SELECT mh FROM minhash_b) ); +-----------------------------+ | APPROXIMATE_SIMILARITY (MH) | |-----------------------------| | 0.75 | +-----------------------------+此查询显示两个截然不同的表(ta 和 tc)的近似相似度:
SELECT APPROXIMATE_SIMILARITY (mh) FROM ( (SELECT mh FROM minhash_a) UNION ALL (SELECT mh FROM minhash_c) ); +-----------------------------+ | APPROXIMATE_SIMILARITY (MH) | |-----------------------------| | 0 | +-----------------------------+