估计两个或多个集合的相似度

Snowflake 使用 MinHash 来估计两个或多个数据集之间的近似相似度。MinHash 方案在不计算集合的交集或并集的情况下比较集合,从而实现高效和有效的估计。

概述

Typically, the Jaccard similarity coefficient (or index) is used to compare the similarity between two sets. For two sets, A and B, the Jaccard index is defined to be the ratio of the size of their intersection and the size of their union:

J(A,B) = (A ∩ B) / (A ∪ B)

但是,这种计算可能会消耗大量资源和时间,因此对于大型数据集来说并不理想。

In contrast, the goal of the MinHash scheme is to estimate J(A,B) quickly, without computing the intersection or union.

SQL 函数

The following Aggregate functions are provided for estimating approximate similarity using MinHash:

实施细节

As detailed in MinHash (in Wikipedia):

“Let H be a hash function that maps the members of A and B to distinct integer values and, for any set S, define H_min(S) to be the minimal member of S with respect to H, i.e. the member s of S with the minimum value of H(s), as expressed in the following equation:

H_min(S) = argmin_{s in S} (H(s))

If we apply H_min to both A and B, we will get the same value exactly when the element of the union A ∪ B with minimum hash value lies in the intersection A ∩ B. The probability of this being true is the above ratio, therefore:

Pr[H_min(A) = H_min(B)] = J(A,B)

Namely, assuming randomly chosen sets A and B, the probability that H_min(A) = H_min(B) holds is equal to J(A,B). In other words, if X is the random variable that is 1 when H_min(A) = H_min(B) and 0 otherwise, then X is an unbiased estimator of J(A,B). Note that X has a too large variance to be a good estimator for the Jaccard index on its own (since it is always 0 or 1).

The MinHash scheme reduces this variance by averaging together several variables constructed in the same way using k number of different hash functions.”

In order to achieve this, the MINHASH function initially creates k number of different hash functions and applies them to every element of each input set, retaining the minimum of each one, to produce a MinHash array (also called a MinHash state) for each set. More specifically, for i = 0 to k-1, the entry i of the MinHash array for set A (shown by MinHash_A) corresponds to the minimum value of hash function H_i applied to every element of set A.

Finally, an approximation for the similarity of the two sets A and B is calculated as:

J_apprx(A,B) = (# of entries MinHash_A and MinHash_B agree on) / k

示例

在下面的示例中,我们展示了如何使用此方案和相应的函数来求出两个元素集合的近似相似度。

首先,创建两个示例表,并插入一些示例数据:

CREATE OR REPLACE TABLE mhtab1(c1 NUMBER,c2 DOUBLE,c3 TEXT,c4 DATE);
CREATE OR REPLACE TABLE mhtab2(c1 NUMBER,c2 DOUBLE,c3 TEXT,c4 DATE);

INSERT INTO mhtab1 VALUES
    (1, 1.1, 'item 1', to_date('2016-11-30')),
    (2, 2.31, 'item 2', to_date('2016-11-30')),
    (3, 1.1, 'item 3', to_date('2016-11-29')),
    (4, 44.4, 'item 4', to_date('2016-11-30'));

INSERT INTO mhtab2 VALUES
    (1, 1.1, 'item 1', to_date('2016-11-30')),
    (2, 2.31, 'item 2', to_date('2016-11-30')),
    (3, 1.1, 'item 3', to_date('2016-11-29')),
    (4, 44.4, 'item 4', to_date('2016-11-30')),
    (6, 34.23, 'item 6', to_date('2016-11-29'));

Then, approximate the similarity of the two sets (tables mhtab1 and mhtab2) using their MinHash states:

SELECT APPROXIMATE_SIMILARITY(mh) FROM
    ((SELECT MINHASH(100, *) AS mh FROM mhtab1)
    UNION ALL
    (SELECT MINHASH(100, *) AS mh FROM mhtab2));

+----------------------------+
| APPROXIMATE_SIMILARITY(MH) |
%----------------------------%
|                       0.79 |
+----------------------------+

这两个表的相似度指数约为 0.79,而确切值为 0.8(即 4/5)。