Categories:

System functions

EXTRACT_SEMANTIC_CATEGORIES

Note

EXTRACT_SEMANTIC_CATEGORIES is a legacy function. Snowflake recommends using other methods of implementing sensitive data classification.

Returns a set of categories (semantic and privacy) for each supported column in the specified table or view. To return the categories for a column, the column must use a data type that supports classification and does not contain all NULL values.

这些类别来自列中包含的元数据和数据,以及有关列和数据的元数据。隐私类别依赖于生成的语义类别(如果有的话)。

语法

EXTRACT_SEMANTIC_CATEGORIES( '<object_name>' [ , <max_rows_to_scan> ] )

实参

必填:

object_name

包含待分类列的表、外部表、视图或物化视图的名称。如果当前会话中未使用数据库和架构,则名称必须完全限定。

名称必须与数据库中存储的名称完全相同。如果名称包含特殊字符、大写字母或空格,则名称必须先用双引号括起来,然后再放在单引号内。

可选:

max_rows_to_scan

用于确定指定表/视图中分类类别的行样本大小。

Valid values: 1 to 10000

Default: 10000

返回

作为一个代表性示例,JSON 对象具有以下结构:

{
  "valid_value_ratio": 1.0,
  "recommendation": {
    "semantic_category": "PASSPORT",
    "privacy_category": "IDENTIFIER",
    "confidence": "HIGH",
    "coverage": 0.7,
    "details": [
      {
        "semantic_category": "US_PASSPORT",
        "coverage": 0.7
      },
      {
        "semantic_category": "CA_PASSPORT",
        "coverage": 0.1
      }
    ]
  },
  "alternates": [
    {
      "semantic_category": "NATIONAL_IDENTIFIER",
      "privacy_category": "IDENTIFIER",
      "confidence": "LOW",
      "coverage": 0.3,
      "details": [
        {
          "semantic_category": "US_SSN",
          "privacy_category": "IDENTIFIER",
          "coverage": 0.3
        }
      ]
    }
  ]
}

其中:

valid_value_ratio

指定样本大小中有效值的比率。无效值包括 NULL、空字符串和字符数超过 256 个的字符串。

recommendation

指定有关每个标签和值的信息。这些信息包括:

semantic_category

指定语义类别标签值。

For possible tag values, see Native semantic categories of sensitive data classification.

privacy_category

指定隐私类别标签值。

The possible values are IDENTIFIER, QUASI-IDENTIFIER and SENSITIVE.

confidence

Specifies one of the following values: HIGH, MEDIUM, or LOW. This value indicates the relative confidence that Snowflake has based upon the column sampling process and how the column data aligns with how Snowflake classifies data.

coverage

指定与特定类别的规则匹配的采样单元格值的百分比。

details

指定可为 SEMANTIC_CATEGORY 标签指定地理标签值的字段和值。

alternates

指定除推荐标签之外要考虑的每个标签和值的信息。

使用说明

  • 该函数需要一个运行中的仓库。仓库会影响性能和成本。
  • This function is no longer being updated to coincide with additional enhancements to Data Classification.

示例

Extract the semantic and privacy categories for the my_db.my_schema.hr_data table using the default (10000) for the number of rows to scan:

USE ROLE data_engineer;

USE WAREHOUSE classification_wh;

SELECT EXTRACT_SEMANTIC_CATEGORIES('my_db.my_schema.hr_data');

与上一个示例相同,但仅限于扫描表中的 5000 行:

USE ROLE data_engineer;

SELECT EXTRACT_SEMANTIC_CATEGORIES('my_db.my_schema.hr_data', 5000);

与第一个示例相同,但将结果存储在表中:

USE ROLE data_engineer;

CREATE OR REPLACE TABLE classification_results(v VARIANT) AS
  SELECT EXTRACT_SEMANTIC_CATEGORIES('my_db.my_schema.hr_data');

Once the results are stored in a table, you can revise them before using ASSOCIATE_SEMANTIC_CATEGORY_TAGS to apply them.