估计两个或多个集合的相似度
Snowflake 使用 MinHash 来估计两个或多个数据集之间的近似相似度。MinHash 方案在不计算集合的交集或并集的情况下比较集合,从而实现高效和有效的估计。
概述
Typically, the Jaccard similarity coefficient (or index) is used to compare the similarity between two sets. For two sets, A and B, the Jaccard index is defined to be the ratio of the
size of their intersection and the size of their union:
J(A,B) = (A ∩ B) / (A ∪ B)
但是,这种计算可能会消耗大量资源和时间,因此对于大型数据集来说并不理想。
In contrast, the goal of the MinHash scheme is to estimate J(A,B) quickly, without computing the intersection or union.
SQL 函数¶
The following Aggregate functions are provided for estimating approximate similarity using MinHash:
- MINHASH: Returns a MinHash state containing a MinHash array of length k (input argument).
- MINHASH_COMBINE: Combines two (or more) input MinHash states into a single output MinHash state.
- APPROXIMATE_SIMILARITY (or APPROXIMATE_JACCARD_INDEX): Returns an estimation of the similarity (Jaccard index) of input sets based on their MinHash states.
实施细节
As detailed in MinHash (in Wikipedia):
“Let
Hbe a hash function that maps the members ofAandBto distinct integer values and, for any setS, defineH_min(S)to be the minimal member ofSwith respect toH, i.e. the membersofSwith the minimum value ofH(s), as expressed in the following equation:
H_min(S) = argmin_{s in S} (H(s))If we apply
H_minto bothAandB, we will get the same value exactly when the element of the unionA ∪ Bwith minimum hash value lies in the intersectionA ∩ B. The probability of this being true is the above ratio, therefore:
Pr[H_min(A) = H_min(B)] = J(A,B)Namely, assuming randomly chosen sets
AandB, the probability thatH_min(A) = H_min(B)holds is equal toJ(A,B). In other words, ifXis the random variable that is 1 whenH_min(A) = H_min(B)and 0 otherwise, thenXis an unbiased estimator ofJ(A,B). Note thatXhas a too large variance to be a good estimator for the Jaccard index on its own (since it is always 0 or 1).The MinHash scheme reduces this variance by averaging together several variables constructed in the same way using
knumber of different hash functions.”
In order to achieve this, the MINHASH function initially creates k number of different hash functions and applies them to every element of each input set, retaining
the minimum of each one, to produce a MinHash array (also called a MinHash state) for each set. More specifically, for i = 0 to k-1, the entry i of the MinHash array for set
A (shown by MinHash_A) corresponds to the minimum value of hash function H_i applied to every element of set A.
Finally, an approximation for the similarity of the two sets A and B is calculated as:
J_apprx(A,B) = (# of entries MinHash_A and MinHash_B agree on) / k
示例
在下面的示例中,我们展示了如何使用此方案和相应的函数来求出两个元素集合的近似相似度。
首先,创建两个示例表,并插入一些示例数据:
Then, approximate the similarity of the two sets (tables mhtab1 and mhtab2) using their MinHash states:
这两个表的相似度指数约为 0.79,而确切值为 0.8(即 4/5)。