Automatically classify sensitive data

Automatic sensitive data classification is a serverless feature that enables the automatic detection and tagging of sensitive data. The feature continuously monitors tables within a specific database and classifies their columns using native and custom classification categories.

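Automatic sensitive data classification helps data engineers and administrators do the following:

  • Demonstrate how tables are automatically classified to meet internal governance and compliance requirements.

  • Ensure that sensitive data is properly tagged.

  • Ensure that the correct access controls are in place to protect sensitive data.

Get started

The basic workflow for automatically classifying sensitive data consists of the following steps: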

  1. Create a classification profile that controls how often sensitive data in a database is automatically classified, including whether system tags should be automatically applied after classification.

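  2. (Optional) Use the classification profile to map user-defined tags to system tags so that columns containing sensitive data can be associated with a user-defined tag based on their classification.

  3. (Optional) Add a custom classifier to the classification profile so that sensitive data can be automatically classified with user-defined semantic and privacy categories.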

  4. Set the classification profile on a database so that tables in the database get automatically classified.

For an end-to-end example of this workflow, see Examples.

About classification profiles

A data engineer creates a classification profile by creating an instance of the CLASSIFICATION_PROFILE class to define the criteria that are used to automatically classify tables in a database. These criteria include:

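  • How long a table should exist before it is automatically classified.

  • How long before a previously classified table should be reclassified.

  • Whether system and custom tags are automatically set on columns after classification. You can let Snowflake apply the recommended tags automatically, or review the suggested tag assignments first and then apply them yourself.

  • A mapping between system classification tags and user-defined object tags so that the user-defined tags can be applied automatically.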

When the data engineer assigns the classification profile to a database, sensitive data in the tables that belong to the database is automatically classified on the schedule defined by the profile. A data engineer can assign the same classification profile to multiple databases, or create multiple classification profiles if there is a need to set different classification criteria for different databases.
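
As a rough sketch of sharing one profile across databases, the statements below create a profile that classifies on a schedule but leaves tag application for manual review ('auto_tag': false), then set it on two databases. The names governance_db, classify_sch, review_first_profile, hr_db, and sales_db are placeholders for illustration, not objects defined in this topic.

```sql
-- Hypothetical schema that holds governance objects.
USE SCHEMA governance_db.classify_sch;

-- Classify tables that are at least 1 day old, reclassify every 30 days,
-- and do not apply tags automatically (suggestions are reviewed first).
CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CLASSIFICATION_PROFILE
  review_first_profile(
    {
      'minimum_object_age_for_classification_days': 1,
      'maximum_classification_validity_days': 30,
      'auto_tag': false
    });

-- Assign the same profile to two databases.
ALTER DATABASE hr_db
  SET CLASSIFICATION_PROFILE = 'governance_db.classify_sch.review_first_profile';
ALTER DATABASE sales_db
  SET CLASSIFICATION_PROFILE = 'governance_db.classify_sch.review_first_profile';
```

If the two databases later need different schedules, a second profile can be created and set on one of them instead.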

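The automatic classification process needs access to the raw data in a table. This includes tables that have a masking policy assigned to a column. However, Snowflake preserves the intent of regulating access to protected data by using an internal role to classify it automatically. The internal role can access data protected by a masking policy, but users cannot access that role.

For an example of using the CREATE CLASSIFICATION_PROFILE command to create a classification profile, see Examples.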

Excluding objects from automatic sensitive data classification

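By default, Snowflake automatically classifies all sensitive data in a database on which a classification profile is set. You can configure Snowflake to exclude schemas, tables, or columns from automatic classification so that they are skipped during the classification process.

For more information, see Exclude data from automatic sensitive data classification.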
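
A minimal sketch of an exclusion, assuming the mechanism described in the linked topic is the SNOWFLAKE.CORE.SKIP_SENSITIVE_DATA_CLASSIFICATION system tag. This topic does not name the tag, so confirm the tag name and its accepted values in that topic before relying on it. The schema and table names are placeholders.

```sql
-- Assumption: setting this system tag to 'TRUE' tells automatic
-- classification to skip the tagged object.
ALTER SCHEMA my_db.staging_sch
  SET TAG SNOWFLAKE.CORE.SKIP_SENSITIVE_DATA_CLASSIFICATION = 'TRUE';

-- The same idea applied to a single table (hypothetical name).
ALTER TABLE my_db.sch.audit_log
  SET TAG SNOWFLAKE.CORE.SKIP_SENSITIVE_DATA_CLASSIFICATION = 'TRUE';
```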

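About tag mapping

You can use the classification profile to map the SEMANTIC_CATEGORY system tag to one or more object tags. This tag mapping lets a column be automatically assigned a user-defined tag based on the classification of its sensitive data. The tag map can be added when you create the classification profile, or later by calling the <classification_profile_name>!SET_TAG_MAP method.

Because a user-defined object tag can be associated with a masking policy, you can use tag mapping to enable automatic tag-based masking. If you choose to apply tags automatically after classification, you can automate the entire data protection process: based on the classification results, masking policies are automatically assigned to the relevant columns. As new data is added to the schema, these tag-based masking policies are also applied automatically to columns that contain sensitive information.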

Regardless of whether you are defining the tag map while creating the classification profile or after, the contents of the map are specified as a JSON object. This JSON object contains the 'column_tag_map' key, which is an array of objects that specify a user-defined tag, the string value of that tag, and the semantic categories to which the tag is being mapped. After the tag map is associated with a classification profile and you automatically classify tables in a database, the tag is assigned to the columns that correspond to the semantic categories.

The following is an example of a tag map:

'tag_map': {
  'column_tag_map': [
    {
      'tag_name':'tag_db.sch.pii',
      'tag_value':'Highly Confidential',
      'semantic_categories':[
        'NAME',
        'NATIONAL_IDENTIFIER'
      ]
    },
    {
      'tag_name': 'tag_db.sch.pii',
      'tag_value':'Confidential',
      'semantic_categories': [
        'EMAIL'
      ]
    }
  ]
}
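
Based on this mapping, if you have a column of email addresses and the classification process determines that the column contains them, the tag_db.sch.pii = 'Confidential' tag is set on that column.

If your tag map includes multiple JSON objects that map tags, tag values, and category values, the order of the JSON objects determines which tag and value are set on a column when there is a conflict. Specify the JSON objects in the desired order of assignment from left to right or, if the JSON is formatted, from top to bottom.

Tip

Each object in the column_tag_map field has only one required key: tag_name. If you omit the tag_value and semantic_categories keys, the user-defined tag is applied to every column to which the SEMANTIC_CATEGORY system tag is applied, and the value of the user-defined tag matches the value of the SEMANTIC_CATEGORY tag for the given column.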
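
A minimal sketch of a tag map that specifies only tag_name, assuming a classification profile named my_classification_profile and a tag tag_db.sch.pii already exist in your account:

```sql
-- Only tag_name is given, so tag_db.sch.pii is applied to every column
-- that receives the SEMANTIC_CATEGORY system tag, using that column's
-- SEMANTIC_CATEGORY value as the tag value.
CALL my_classification_profile!SET_TAG_MAP(
  {'column_tag_map':[
    {
      'tag_name':'tag_db.sch.pii'
    }]});
```

With this map, a column classified as EMAIL would be expected to carry tag_db.sch.pii = 'EMAIL', and a column classified as NAME would carry tag_db.sch.pii = 'NAME'.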

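If a manually assigned tag conflicts with a tag applied by automatic classification, an error occurs. For information about tracking these errors, see Troubleshooting.

Implement custom classification

With Snowflake, you can define a custom classifier that uses custom logic to identify and classify sensitive data. For example, you can create a custom classifier that uses a regular expression to identify ICD-10 codes and assigns them to the ICD_10_CODES semantic category.

After you create a custom classifier, you can add it to the classification profile so that Snowflake automatically classifies data based on its logic. You can add custom classifiers when you create the classification profile or by calling the <classification_profile_name>!SET_CUSTOM_CLASSIFIERS method.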

Adding both custom classifiers and a tag map in your classification profile provides a powerful governance solution. It allows you to automatically classify data based on your knowledge of what is sensitive and apply a user-defined tag that you can track. If you use this user-defined tag to implement tag-based masking, your domain-specific sensitive data is automatically protected by a masking policy as data is added to a database.

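Important

Automatic classification stores the definition of a custom classifier, not a reference to it. If you change a custom classifier, you must use the SET_CUSTOM_CLASSIFIERS method to update the classification profile with the new definition.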
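
The following sketch shows one way the ICD-10 scenario might look end to end. It assumes the CUSTOM_CLASSIFIER class and its ADD_REGEX method take the arguments shown (semantic category, privacy category, value regex, column-name regex, description, match threshold); check the custom classifier reference for the exact signature. The regular expression is a simplified illustration, not a complete ICD-10 validator, and my_classification_profile is assumed to exist already.

```sql
-- Create an empty custom classifier instance (the name is a placeholder).
CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER medical_codes();

-- Register a simplified regex for the ICD_10_CODES semantic category.
CALL medical_codes!ADD_REGEX(
  'ICD_10_CODES',                               -- semantic category
  'IDENTIFIER',                                 -- privacy category
  '[A-TV-Z][0-9][0-9AB][.]?[0-9A-TV-Z]{0,4}',   -- pattern for column values
  'ICD.*',                                      -- pattern for column names
  'Identifies ICD-10 medical codes in a column',
  0.8                                           -- assumed match threshold
);

-- Store the classifier definition in the profile. Because the profile keeps
-- a copy of the definition, rerun this call after every change to medical_codes.
CALL my_classification_profile!SET_CUSTOM_CLASSIFIERS(
  {'medical_codes': medical_codes!LIST()});
```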

Set a classification profile on a database

You implement automatic sensitive data classification by setting a classification profile on a database. After you set the classification profile on the database, all tables and views within that database are automatically monitored by sensitive data classification.

You can also set a classification profile on a schema. If you set a classification profile on a schema that exists within a database that is also associated with a classification profile, the profile set on the schema overrides the profile set on the database.

To set a classification profile, use an ALTER DATABASE or ALTER SCHEMA command to set the CLASSIFICATION_PROFILE parameter. For example, to set a classification profile my_profile so all tables and views in the my_db database are monitored by automatic sensitive data classification, run the following command:

ALTER DATABASE my_db
  SET CLASSIFICATION_PROFILE = 'governance_db.classify_sch.my_profile';
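
As a sketch of the schema-level override described above, the following statements set a different profile on one schema of my_db and later remove it. The schema name finance_sch and the profile finance_profile are placeholders, and the UNSET form is an assumption based on how other object parameters are cleared.

```sql
-- The profile on the schema takes precedence over the profile on my_db.
ALTER SCHEMA my_db.finance_sch
  SET CLASSIFICATION_PROFILE = 'governance_db.classify_sch.finance_profile';

-- Assumed syntax: remove the schema-level profile so that the
-- database-level profile applies to the schema again.
ALTER SCHEMA my_db.finance_sch
  UNSET CLASSIFICATION_PROFILE;
```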
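
Classifying views

By default, sensitive data classification does not classify data in views. If you don't change the default, only tables are classified.

Classifying views can cost more than classifying tables. The additional cost varies with the complexity of the query that defines the view. Materialized views don't incur the additional cost that other views do.

The classify_views key in the configuration object of the classification profile determines whether views are classified. The following classification profile changes the default so that views are classified: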

CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CLASSIFICATION_PROFILE
  my_classification_profile(
    {
      'minimum_object_age_for_classification_days': 0,
      'maximum_classification_validity_days': 30,
      'classify_views': true
    });

You can also enable or disable the classification of views using the SET_CLASSIFY_VIEWS method.
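
A short sketch of toggling view classification on an existing profile, assuming SET_CLASSIFY_VIEWS accepts a single Boolean argument:

```sql
-- Start classifying views for databases and schemas that use this profile.
CALL my_classification_profile!SET_CLASSIFY_VIEWS(TRUE);

-- Return to the default behavior, in which only tables are classified.
CALL my_classification_profile!SET_CLASSIFY_VIEWS(FALSE);

-- Confirm the current setting.
SELECT my_classification_profile!DESCRIBE();
```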

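Determine which objects are monitored by automatic classification

You can determine which data is monitored by automatic sensitive data classification by listing the databases and schemas that are associated with a classification profile. If a database or schema is associated with a classification profile, all tables and views in that entity are automatically classified according to the criteria defined in the profile.

Use the SYSTEM$SHOW_SENSITIVE_DATA_MONITORED_ENTITIES function to list the databases and schemas associated with a classification profile. You can list only databases, only schemas, or all databases and schemas. For example, to list all databases and schemas associated with a classification profile, run the following command: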

SELECT SYSTEM$SHOW_SENSITIVE_DATA_MONITORED_ENTITIES();

The output lists the name of the database or schema, its type, and its classification profile. The following is an example of the output:

[
{"name":"HR_DB","type":"DATABASE","profile_name":"GOV_DB.CLASSIFY_SCH.MY_CLASSIFICATION_PROFILE"},
{"name":"SALES_DB.SCH1","type":"SCHEMA","profile_name":"GOV_DB.CLASSIFY_SCH.TEST_PROFILE"}
]
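
To narrow the listing, the function is assumed here to accept an optional entity type argument. The literals 'DATABASE' and 'SCHEMA' are an assumption, so check the function reference for the accepted values.

```sql
-- List only databases that have a classification profile set.
SELECT SYSTEM$SHOW_SENSITIVE_DATA_MONITORED_ENTITIES('DATABASE');

-- List only schemas that have a classification profile set directly on them.
SELECT SYSTEM$SHOW_SENSITIVE_DATA_MONITORED_ENTITIES('SCHEMA');
```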

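View the results of automatic classification

You can view the results of automatic classification in the following ways:

  • Call the SYSTEM$GET_CLASSIFICATION_RESULT stored procedure. For example: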

    CALL SYSTEM$GET_CLASSIFICATION_RESULT('mydb.sch.t1');
    

    You cannot return results until the classification process completes. The automatic classification process does not start until one hour after setting the classification profile on the database.

  • Using a role that has been granted the SNOWFLAKE.GOVERNANCE_VIEWER database role, query the DATA_CLASSIFICATION_LATEST view. For example:

    SELECT * FROM snowflake.account_usage.data_classification_latest;
    
    Results might not appear until three hours after classification completes.
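
A narrower variant of the query above might look like the following sketch. The column names (database_name, schema_name, table_name, last_classified_on, result) are assumed from the view's reference page and should be verified there; MYDB is a placeholder.

```sql
-- Most recent classification result for each table in one database.
SELECT
  database_name,
  schema_name,
  table_name,
  last_classified_on,
  result
  FROM snowflake.account_usage.data_classification_latest
  WHERE database_name = 'MYDB'
  ORDER BY last_classified_on DESC;
```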

Limitations

  • A classification profile cannot be set on a reader account.

  • Only one classification profile can be set on a database or schema.

  • A classification profile cannot be set on more than 1,000 databases.

  • A classification profile cannot be directly set on more than 10,000 schemas.

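  • A maximum of 100 million tables can be classified in a schema.

  • A table cannot be automatically classified if it has any of the following characteristics: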

    • More than 10,000 columns.

    • A column with a name that has more than 255 characters.

    • A column with a name that includes the $ character.

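    • Comes from a share.

Access control

This section describes the privileges and roles that let you work with classification profiles and enable automatic sensitive data classification.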

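The requirements below are grouped by task. Each entry names the required privilege or role, followed by any notes.

Create a classification profile:

  • SNOWFLAKE.CLASSIFICATION_ADMIN database role. For information about granting this database role to other roles, see Using SNOWFLAKE database roles.

  • CREATE SNOWFLAKE.DATA_PRIVACY.CLASSIFICATION_PROFILE privilege on the schema. You need this privilege on the schema where you want to create the classification profile instance.

  • USAGE privilege on the database and schema. You need USAGE on the database and schema where you want to create the classification profile instance.

Set a classification profile on a database or schema:

  • One of the following:

    • EXECUTE AUTO CLASSIFICATION privilege on the account.

    • EXECUTE AUTO CLASSIFICATION privilege on the database or schema.

    By default, the owner of the database or schema has the EXECUTE AUTO CLASSIFICATION privilege.

  • Any privilege on the schema's database. If you are setting a classification profile on a schema, you need at least one privilege on the database that contains that schema.

  • Any privilege on the database or schema. For the schema that contains the tables you want to automatically classify, you need at least one privilege on the database and schema. The EXECUTE AUTO CLASSIFICATION privilege satisfies this requirement.

  • One of the following:

    • OWNERSHIP privilege on the classification profile instance.

    • <classification_profile>!PRIVACY_USER instance role on the classification profile.

    For information about granting the PRIVACY_USER instance role to other roles, see Instance roles.

  • APPLY TAG privilege on the account.

Call methods on a classification profile instance:

  • <classification_profile>!PRIVACY_USER instance role. For information about granting this instance role to other roles, see Instance roles.

List classification profiles:

  • <classification_profile>!PRIVACY_USER instance role.

Drop a classification profile:

  • OWNERSHIP privilege on the classification profile instance.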

For an example of granting these privileges and database roles to a data engineer role, see Basic example: Automatically classifying tables in a database.
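
The sketch below covers a case that the basic example does not: letting a second role use an existing profile without owning it. It assumes the standard instance-role grant syntax for classes and a hypothetical role named data_steward; the profile is assumed to live in mydb.sch and the target schema mydb.sales_sch is a placeholder.

```sql
-- Run as the role that owns the classification profile instance.
GRANT USAGE ON DATABASE mydb TO ROLE data_steward;
GRANT USAGE ON SCHEMA mydb.sch TO ROLE data_steward;

-- Grant the PRIVACY_USER instance role so data_steward can call methods on
-- the profile and set it on databases or schemas.
GRANT SNOWFLAKE.DATA_PRIVACY.CLASSIFICATION_PROFILE ROLE
  mydb.sch.my_classification_profile!PRIVACY_USER
  TO ROLE data_steward;

-- Allow data_steward to set a profile on one schema only.
GRANT EXECUTE AUTO CLASSIFICATION ON SCHEMA mydb.sales_sch TO ROLE data_steward;

-- Required so that tags can be applied to classified columns.
GRANT APPLY TAG ON ACCOUNT TO ROLE data_steward;
```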

Cost of automatic sensitive data classification

Automatic sensitive data classification consumes credits as it uses serverless compute resources to classify tables in the database. For more information about pricing for this consumption, see Table 5 in the Snowflake Service Consumption Table.

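You can query views in the ACCOUNT_USAGE and ORGANIZATION_USAGE schemas to determine how much is spent on automatic sensitive data classification. To monitor credit usage, query the following views:

METERING_HISTORY view (ACCOUNT_USAGE)

By filtering on SENSITIVE_DATA_CLASSIFICATION in the SERVICE_TYPE column, you can retrieve the hourly cost of automatic classification. For example: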

SELECT
  service_type,
  start_time,
  end_time,
  entity_id,
  name,
  credits_used_compute,
  credits_used_cloud_services,
  credits_used,
  budget_id
  FROM snowflake.account_usage.metering_history
  WHERE service_type = 'SENSITIVE_DATA_CLASSIFICATION';

METERING_DAILY_HISTORY view (ACCOUNT_USAGE and ORGANIZATION_USAGE)

By filtering on SENSITIVE_DATA_CLASSIFICATION in the SERVICE_TYPE column, you can retrieve the daily cost of automatic classification. For example:

SELECT
  service_type,
  usage_date,
  credits_used_compute,
  credits_used_cloud_services,
  credits_used
  FROM snowflake.account_usage.metering_daily_history
  WHERE service_type = 'SENSITIVE_DATA_CLASSIFICATION';
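
USAGE_IN_CURRENCY_DAILY view (ORGANIZATION_USAGE)

By filtering on SENSITIVE_DATA_CLASSIFICATION in the SERVICE_TYPE column, you can retrieve the daily cost of automatic classification. Use this view to determine the cost in currency rather than in credits.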
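
A query against this view might look like the following sketch. The columns other than SERVICE_TYPE (usage_date, account_name, currency, usage_in_currency) are assumed from the view's reference page and should be checked there.

```sql
-- Daily cost of automatic classification per account, in billing currency.
SELECT
  usage_date,
  account_name,
  currency,
  SUM(usage_in_currency) AS cost_in_currency
  FROM snowflake.organization_usage.usage_in_currency_daily
  WHERE service_type = 'SENSITIVE_DATA_CLASSIFICATION'
  GROUP BY usage_date, account_name, currency
  ORDER BY usage_date DESC;
```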

Examples

Basic example: Automatically classifying tables in a database

Complete these steps to automatically classify a table in the database:

  1. As an administrator, give the data engineer the roles and privileges they need to automatically classify tables in a database.

    USE ROLE ACCOUNTADMIN;
    
    GRANT USAGE ON DATABASE mydb TO ROLE data_engineer;
    GRANT EXECUTE AUTO CLASSIFICATION ON DATABASE mydb TO ROLE data_engineer;
    
    GRANT DATABASE ROLE SNOWFLAKE.CLASSIFICATION_ADMIN TO ROLE data_engineer;
    GRANT CREATE SNOWFLAKE.DATA_PRIVACY.CLASSIFICATION_PROFILE ON SCHEMA mydb.sch TO ROLE data_engineer;
    
    GRANT APPLY TAG ON ACCOUNT TO ROLE data_engineer;
    
  2. Switch to the data engineer role:

    USE ROLE data_engineer;
    
  3. Create the classification profile as an instance of the CLASSIFICATION_PROFILE class:

    CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CLASSIFICATION_PROFILE
      my_classification_profile(
        {
          'minimum_object_age_for_classification_days': 0,
          'maximum_classification_validity_days': 30,
          'auto_tag': true,
          'classify_views': true
        });
    
  4. Call the DESCRIBE method on the instance to confirm its properties:

    SELECT my_classification_profile!DESCRIBE();
    
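  5. Set the classification profile instance on the database. This starts the background process that monitors the tables in the database and automatically classifies them for sensitive data.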

    ALTER DATABASE mydb
     SET CLASSIFICATION_PROFILE = 'mydb.sch.my_classification_profile';
    
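    Note

    There is a one-hour delay between setting the classification profile on the database and Snowflake starting to classify it.

  6. After waiting one hour, call the SYSTEM$GET_CLASSIFICATION_RESULT stored procedure to get the results of automatic classification.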

    CALL SYSTEM$GET_CLASSIFICATION_RESULT('mydb.sch.t1');
    

Example: Using a tag map and custom classifiers

  1. As an administrator, give the data engineer the roles and privileges they need to automatically classify tables in a database and set tags on columns.

  2. Create a classification profile:

    CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CLASSIFICATION_PROFILE
      my_classification_profile(
        {
          'minimum_object_age_for_classification_days': 0,
          'maximum_classification_validity_days': 30,
          'auto_tag': true,
          'classify_views': true
        });
    
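  3. Call the SET_TAG_MAP method on the instance to add a tag map to the classification profile. This lets a user-defined tag be applied automatically to columns that contain sensitive data.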

    CALL my_classification_profile!SET_TAG_MAP(
      {'column_tag_map':[
        {
          'tag_name':'tag_db.sch.pii',
          'tag_value':'sensitive',
          'semantic_categories':['NAME']
        }]});
    
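    Alternatively, you can add this tag map when you create the classification profile.

  4. Call the SET_CUSTOM_CLASSIFIERS method to add custom classifiers to the classification profile. This lets sensitive data be automatically classified with user-defined semantic and privacy categories.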

    CALL my_classification_profile!set_custom_classifiers(
      {
        'medical_codes': medical_codes!list(),
        'finance_codes': finance_codes!list()
      });
    
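    Alternatively, you can add the custom classifiers when you create the classification profile.

  5. Call the DESCRIBE method on the instance to confirm that the tag map and custom classifiers were added to the classification profile.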

    SELECT my_classification_profile!DESCRIBE();
    
  6. Set the classification profile instance on the database.

    ALTER DATABASE mydb
     SET CLASSIFICATION_PROFILE = 'mydb.sch.my_classification_profile';
    
  7. Attach a masking policy to the tag_db.sch.pii tag to enable tag-based masking.

    ALTER TAG tag_db.sch.pii SET MASKING POLICY pii_mask;
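
The ALTER TAG statement in this step assumes that a masking policy named pii_mask already exists. As a minimal sketch, such a policy for string columns could be defined as follows before it is attached to the tag. The PII_READER role is hypothetical, and the masking rule is only an illustration.

```sql
-- Show the real value to an authorized role; mask it for everyone else.
CREATE MASKING POLICY IF NOT EXISTS pii_mask AS (val STRING)
  RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_READER') THEN val
    ELSE '***MASKED***'
  END;
```

A tag-based masking policy protects only columns whose data type matches the policy signature, so columns of other types would need additional policies on the same tag.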
    
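
Example: Testing a classification profile before enabling automatic classification

  1. As an administrator, give the data engineer the roles and privileges they need to automatically classify tables in a database and set tags on columns.

  2. Create a classification profile with a tag map and custom classifiers: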

    CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CLASSIFICATION_PROFILE my_classification_profile(
      {
        'minimum_object_age_for_classification_days':0,
        'auto_tag':true,
        'tag_map': {
          'column_tag_map':[
            {
              'tag_name':'tag_db.sch.pii',
              'tag_value':'highly sensitive',
              'semantic_categories':['NAME','NATIONAL_IDENTIFIER']
            },
            {
              'tag_name':'tag_db.sch.pii',
              'tag_value':'sensitive',
              'semantic_categories':['EMAIL','MEDICAL_CODE']
            }
          ]
        },
        'classify_views': true,
        'custom_classifiers': {
          'medical_codes': medical_codes!list(),
          'finance_codes': finance_codes!list()
        }
      }
    );
    
  3. Before enabling automatic classification, call the SYSTEM$CLASSIFY stored procedure to test the tag map on the table1 table.

    CALL SYSTEM$CLASSIFY(
     'db.sch.table1',
     'db.sch.my_classification_profile'
    );
    
    The tags key in the output contains details about whether the tag was set (true if it was set, false otherwise), the name of the tag that was set, and the tag value:

    {
      "classification_profile_config": {
        "classification_profile_name": "db.schema.my_classification_profile"
      },
      "classification_result": {
        "EMAIL": {
          "alternates": [],
          "recommendation": {
            "confidence": "HIGH",
            "coverage": 1,
            "details": [],
            "privacy_category": "IDENTIFIER",
            "semantic_category": "EMAIL",
            "tags": [
              {
                "tag_applied": true,
                "tag_name": "snowflake.core.semantic_category",
                "tag_value": "EMAIL"
              },
              {
                "tag_applied": true,
                "tag_name": "snowflake.core.privacy_category",
                "tag_value": "IDENTIFIER"
              },
              {
                "tag_applied": true,
                "tag_name": "tag_db.sch.pii",
                "tag_value": "sensitive"
              }
            ]
          },
          "valid_value_ratio": 1
        },
        "FIRST_NAME": {
          "alternates": [],
          "recommendation": {
            "confidence": "HIGH",
            "coverage": 1,
            "details": [],
            "privacy_category": "IDENTIFIER",
            "semantic_category": "NAME",
            "tags": [
              {
                "tag_applied": true,
                "tag_name": "snowflake.core.semantic_category",
                "tag_value": "NAME"
              },
              {
                "tag_applied": true,
                "tag_name": "snowflake.core.privacy_category",
                "tag_value": "IDENTIFIER"
              },
              {
                "tag_applied": true,
                "tag_name": "tag_db.sch.pii",
                "tag_value": "highly sensitive"
              }
            ]
          },
          "valid_value_ratio": 1
        }
      }
    }
    
  4. Having verified that automatic classification based on the classification profile will have the desired result, set the classification profile instance on the database.

    ALTER DATABASE mydb
     SET CLASSIFICATION_PROFILE = 'mydb.sch.my_classification_profile';
    

Troubleshooting

The simplest way to start troubleshooting a table that wasn't classified is to query the table directly (for example, SELECT * FROM my_table). If a table can't be queried, it can't be automatically classified.

If an object can't be automatically classified, Snowflake logs an event to an event table. By default, the event is logged to the account-level event table. If you have an event table defined for the failed object's database, then the event is logged there instead.
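
To find out which event table receives these events, one option is to inspect the EVENT_TABLE parameter at both levels. This sketch assumes the failed object lives in a database named mydb.

```sql
-- Event table associated with the account (the default destination).
SHOW PARAMETERS LIKE 'EVENT_TABLE' IN ACCOUNT;

-- Event table associated with the database, if one has been set.
SHOW PARAMETERS LIKE 'EVENT_TABLE' IN DATABASE mydb;
```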

In general, there is a delay before Snowflake tries to classify the object again. Every additional failed attempt is logged to the event table. This delay and retry process continues until the object is fixed or removed from automatic classification.

备注

To help avoid unnecessary costs, Snowflake waits additional time to retry classification for some errors, such as timeouts. For these timeout errors, Snowflake doesn't retry classification until all objects are reclassified; the schedule on which objects are reclassified is controlled by the maximum_classification_validity_days key of the classification profile.

If you want to prevent classification events from being logged, set the ENABLE_AUTOMATIC_SENSITIVE_DATA_CLASSIFICATION_LOG account parameter to FALSE.
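
As a sketch, an account administrator could change and then verify the parameter like this:

```sql
-- Stop logging automatic classification events to the event table.
ALTER ACCOUNT SET ENABLE_AUTOMATIC_SENSITIVE_DATA_CLASSIFICATION_LOG = FALSE;

-- Confirm the current value of the parameter.
SHOW PARAMETERS LIKE 'ENABLE_AUTOMATIC_SENSITIVE_DATA_CLASSIFICATION_LOG' IN ACCOUNT;
```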

Listing general errors

The following query against the event table returns general errors related to automatic classification:

SELECT
  record_type,
  record:severity_text::string log_level,
  parse_json(value) error_message
  FROM <event_db>.<event_schema>.<event_table>
  WHERE record_type='LOG' and scope:name ='snow.automatic_sensitive_data_classification'
  ORDER BY log_level;

For a subset of the possible error messages returned by this query, see Tag-related error messages.

Listing object-level classification errors

The following query against the event table returns errors related to the classification of a specific object. For example, it returns errors that occurred when Snowflake tried to classify a specific table.

SELECT
  RECORD_ATTRIBUTES:"object_name"::string AS object_name,
  parse_json(value):"error_message" error_message,
  PARSE_JSON(VALUE):"profile_name" classification_profile_name,
  timestamp
  FROM <event_db>.<event_schema>.<event_table>
  WHERE record_type='LOG'
    AND scope:name ='snow.automatic_sensitive_data_classification'
    AND RECORD_ATTRIBUTES:"event_type" = 'CLASSIFICATION_ERROR'
  ORDER BY TIMESTAMP DESC;