MINHASH¶
返回 MinHash 状态,此状态包含一个大小为 k
的数组,该数组的构造方式为:向输入行应用 k
个不同哈希函数,并保留每个哈希函数的最小值。然后,可以将此 MinHash 状态输入到 APPROXIMATE_SIMILARITY 函数中,以估计与一个或多个其他 MinHash 状态的相似性。
有关 MinHash 状态的更多信息,请参阅 估计两个或多个集合的相似度。
- 另请参阅:
语法¶
MINHASH( <k> , [ DISTINCT ] expr+ )
MINHASH( <k> , * )
使用说明¶
k
指定要创建的哈希函数的数量。该值越大,近似值越准确;但是,此值对使用 APPROXIMATE_SIMILARITY 相似性估算的计算时间具有线性影响。建议值为 100。最大值为 1024。
DISTINCT 可以作为实参包含在内,但不起作用。
示例¶
USE SCHEMA snowflake_sample_data.tpch_sf1;
SELECT MINHASH(5, *) FROM orders;
+----------------------+
| MINHASH(5, *) |
|----------------------|
| { |
| "state": [ |
| 78678383574307, |
| 586952033158539, |
| 525995912623966, |
| 508991839383217, |
| 492677003405678 |
| ], |
| "type": "minhash", |
| "version": 1 |
| } |
+----------------------+
此处是更广泛的示例,显示了三个相关的函数 MINHASH
、MINHASH_COMBINE
和 APPROXIMATE_SIMILARITY
。此示例创建 3 个表(ta、tb 和 tc),其中两个表(ta 和 tb)相似,两个表(ta 和 tc)完全不同。
创建并使用以下值填充表:
CREATE TABLE ta (i INTEGER); CREATE TABLE tb (i INTEGER); CREATE TABLE tc (i INTEGER); -- Insert values into the 3 tables. INSERT INTO ta (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10); -- Almost the same as the preceding values. INSERT INTO tb (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (11); -- Different values and different number of values. INSERT INTO tc (i) VALUES (-1), (-20), (-300), (-4000);计算初始数据集的 minhash 信息:
CREATE TABLE minhash_a_1 (mh) AS SELECT MINHASH(100, i) FROM ta; CREATE TABLE minhash_b (mh) AS SELECT MINHASH(100, i) FROM tb; CREATE TABLE minhash_c (mh) AS SELECT MINHASH(100, i) FROM tc;向其中一个表添加更多数据:
INSERT INTO ta (i) VALUES (12);演示
MINHASH_COMBINE
函数:-- Record minhash information about only the new rows: CREATE TABLE minhash_a_2 (mh) AS SELECT MINHASH(100, i) FROM ta WHERE i > 10; -- Now combine all the minhash info for the old and new rows in table ta. CREATE TABLE minhash_a (mh) AS SELECT MINHASH_COMBINE(mh) FROM ( (SELECT mh FROM minhash_a_1) UNION ALL (SELECT mh FROM minhash_a_2) );此查询显示两个相似表(ta 和 tb)的近似相似度:
SELECT APPROXIMATE_SIMILARITY (mh) FROM ( (SELECT mh FROM minhash_a) UNION ALL (SELECT mh FROM minhash_b) ); +-----------------------------+ | APPROXIMATE_SIMILARITY (MH) | |-----------------------------| | 0.75 | +-----------------------------+此查询显示两个截然不同的表(ta 和 tc)的近似相似度:
SELECT APPROXIMATE_SIMILARITY (mh) FROM ( (SELECT mh FROM minhash_a) UNION ALL (SELECT mh FROM minhash_c) ); +-----------------------------+ | APPROXIMATE_SIMILARITY (MH) | |-----------------------------| | 0 | +-----------------------------+