- 类别:
聚合函数 (Similarity Estimation) , Window functions (Similarity Estimation)
MINHASH¶
返回 MinHash 状态,此状态包含一个大小为 k 的数组,该数组的构造方式为:向输入行应用 k 个不同哈希函数,并保留每个哈希函数的最小值。然后,可以将此 MinHash 状态输入到 APPROXIMATE_SIMILARITY 函数中,以估计与一个或多个其他 MinHash 状态的相似性。
有关 MinHash 状态的更多信息,请参阅 估计两个或多个集合的相似度。
- 另请参阅:
语法¶
聚合函数
MINHASH( <k> , [ DISTINCT ] expr+ )
MINHASH( <k> , * )
窗口函数
MINHASH( <k> , [ DISTINCT ] expr+ ) OVER ( [ PARTITION BY <expr1> ] )
MINHASH( <k> , * ) OVER ( [ PARTITION BY <expr1> ] )
有关 OVER 子句的详细信息,请参阅 窗口函数的语法和用法。
实参¶
kThe number of hash functions to create. The larger the value, the better the approximation; however, this value has a linear impact on the computation time for estimating similarity using APPROXIMATE_SIMILARITY. The suggested value is 100. The maximum value is 1024.
expr一个或多个表达式(通常是列名),用于确定要哈希的值。
*对输入行中的所有列进行哈希处理。
使用说明¶
示例¶
USE SCHEMA snowflake_sample_data.tpch_sf1;
SELECT MINHASH(5, *) FROM orders;
+----------------------+
| MINHASH(5, *) |
|----------------------|
| { |
| "state": [ |
| 78678383574307, |
| 586952033158539, |
| 525995912623966, |
| 508991839383217, |
| 492677003405678 |
| ], |
| "type": "minhash", |
| "version": 1 |
| } |
+----------------------+
此处是更广泛的示例,显示了三个相关的函数 MINHASH、MINHASH_COMBINE 和 APPROXIMATE_SIMILARITY。此示例创建 3 个表(ta、tb 和 tc),其中两个表(ta 和 tb)相似,两个表(ta 和 tc)完全不同。
创建并使用以下值填充表:
CREATE TABLE ta (i INTEGER); CREATE TABLE tb (i INTEGER); CREATE TABLE tc (i INTEGER); -- Insert values into the 3 tables. INSERT INTO ta (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10); -- Almost the same as the preceding values. INSERT INTO tb (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (11); -- Different values and different number of values. INSERT INTO tc (i) VALUES (-1), (-20), (-300), (-4000);计算初始数据集的 minhash 信息:
CREATE TABLE minhash_a_1 (mh) AS SELECT MINHASH(100, i) FROM ta; CREATE TABLE minhash_b (mh) AS SELECT MINHASH(100, i) FROM tb; CREATE TABLE minhash_c (mh) AS SELECT MINHASH(100, i) FROM tc;向其中一个表添加更多数据:
INSERT INTO ta (i) VALUES (12);演示
MINHASH_COMBINE函数:-- Record minhash information about only the new rows: CREATE TABLE minhash_a_2 (mh) AS SELECT MINHASH(100, i) FROM ta WHERE i > 10; -- Now combine all the minhash info for the old and new rows in table ta. CREATE TABLE minhash_a (mh) AS SELECT MINHASH_COMBINE(mh) FROM ( (SELECT mh FROM minhash_a_1) UNION ALL (SELECT mh FROM minhash_a_2) );此查询显示两个相似表(ta 和 tb)的近似相似度:
SELECT APPROXIMATE_SIMILARITY (mh) FROM ( (SELECT mh FROM minhash_a) UNION ALL (SELECT mh FROM minhash_b) ); +-----------------------------+ | APPROXIMATE_SIMILARITY (MH) | |-----------------------------| | 0.75 | +-----------------------------+此查询显示两个截然不同的表(ta 和 tc)的近似相似度:
SELECT APPROXIMATE_SIMILARITY (mh) FROM ( (SELECT mh FROM minhash_a) UNION ALL (SELECT mh FROM minhash_c) ); +-----------------------------+ | APPROXIMATE_SIMILARITY (MH) | |-----------------------------| | 0 | +-----------------------------+