Using custom classifiers to implement custom semantic categories¶
The CUSTOM_CLASSIFIER class allows data engineers to extend their sensitive data classification capabilities based on their own knowledge of their data. To classify sensitive data into custom semantic categories, create an instance of the CUSTOM_CLASSIFIER class in a schema and call instance methods to add regular expressions associated with the instance.
有关使用 CUSTOM_CLASSIFIER 实例创建自定义语义类别的端到端示例,请参阅 示例。
命令和方法¶
支持以下方法和 SQL 命令:
访问控制¶
下列各节总结了使用实例所需的各种对象的角色和权限。
角色¶
您可以使用以下角色进行自定义分类:
SNOWFLAKE.CLASSIFICATION_ADMIN: database role that enables you to create a custom classifier instance.
custom_classifier!PRIVACY_USER:实例角色,使您能够在实例上调用以下方法:ADD_REGEX
LIST
DELETE_CATEGORY
对拥有实例 OWNERSHIP 权限的账户角色可以运行以下命令:
DROP CUSTOM_CLASSIFIER
SHOW CUSTOM_CLASSIFIER
权限¶
要创建和管理实例,您可以选择为角色授予 CREATE SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER 权限,或向角色授予 PRIVACY_USER 实例角色。
You can grant the instance roles to account roles and database roles to enable other users to work with custom classifier instances:
GRANT SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER ROLE <name>!PRIVACY_USER
TO ROLE <role_name>
REVOKE SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER ROLE <name>!PRIVACY_USER
FROM ROLE <role_name>
GRANT SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER ROLE <name>!PRIVACY_USER
TO DATABASE ROLE <database_role_name>
REVOKE SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER ROLE <name>!PRIVACY_USER
FROM DATABASE ROLE <database_role_name>
其中:
nameSpecifies the name of the custom classifier instance.
role_name指定账户角色的名称。
database_role_name指定数据库角色的名称。
您必须使用仓库在实例上调用方法。
要为自定义角色 my_classification_role 授予所需的实例角色和权限,以创建和使用 CUSTOM_CLASSIFIER 类的实例,请执行以下语句:
USE ROLE ACCOUNTADMIN;
GRANT DATABASE ROLE SNOWFLAKE.CLASSIFICATION_ADMIN
TO ROLE my_classification_role;
GRANT USAGE ON DATABASE mydb TO ROLE my_classification_role;
GRANT USAGE ON SCHEMA mydb.instances TO ROLE my_classification_role;
GRANT USAGE ON WAREHOUSE wh_classification TO ROLE my_classification_role;
如果您想启用特定角色(例如 data_analyst)以使用特定实例,请执行以下操作:
GRANT SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER ROLE
mydb.sch.my_instance!PRIVACY_USER TO ROLE data_analyst;
示例¶
使用自定义分类器对数据进行分类的高级方法如下:
确定要分类的表。
使用 SQL 执行以下操作:
Create a custom classifier instance.
Add the custom semantic category and regular expressions to the instance.
对表进行分类。
完成以下步骤以创建用于对表进行分类的自定义分类器:
以一个
data.tables.patient_diagnosis表为例,其中一列包含诊断代码,例如 `ICD-10 代码<https://en.wikipedia.org/wiki/ICD-10>`_。+-------------+----------------------------------------------------------+ | ICD_10_CODE | DESCRIPTION | +-------------+----------------------------------------------------------+ | G30.9 | Alzheimer's disease, unspecified | | G80.8 | Other cerebral palsy | | S13.4XXA | Sprain of ligaments of cervical spine, initial encounter | +-------------+----------------------------------------------------------+
该表还可能包括用于识别在医疗机构接受治疗的患者的列,例如名字和姓氏、唯一的健康保险标识符和出生日期。数据所有者可以对表进行分类,确保正确标记列,以便对表进行监控。
在此示例中,数据所有者已经为其角色授予了以下权限:
对表的 OWNERSHIP 权限,以便进行分类。
对包含表的架构的 OWNERSHIP 权限。
对包含架构和表的数据库的 USAGE 权限。
允许数据所有者通过将 SNOWFLAKE.CLASSIFICATION_ADMIN 数据库角色授予数据所有者角色来对表进行分类:
USE ROLE ACCOUNTADMIN; GRANT DATABASE ROLE SNOWFLAKE.CLASSIFICATION_ADMIN TO ROLE data_owner;
As the data owner, create a schema to store your custom classifier instances:
USE ROLE data_owner; CREATE SCHEMA data.classifiers;
Use the CREATE CUSTOM_CLASSIFIER command to create a custom classifier instance in the
data.classifiersschema:CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER medical_codes();
您可以选择性地更新您的 搜索路径,如下所示:
添加
SNOWFLAKE.DATA_PRIVACY,这样在创建类的新实例时,不必指定类的完全限定名称。添加
DATA.CLASSIFIERS,这样在调用实例的方法或使用与实例相关的命令时,不必指定实例的完全限定名称。
使用 SHOW CUSTOM_CLASSIFIER 命令列出您创建的每个实例。例如:
SHOW SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER;
返回:
+----------------------------------+---------------+---------------+-------------+-----------------+---------+-------------+ | created_on | name | database_name | schema_name | current_version | comment | owner | +----------------------------------+---------------+---------------+-------------+-----------------+---------+-------------+ | 2023-09-08 07:00:00.123000+00:00 | MEDICAL_CODES | DATA | CLASSIFIERS | 1.0 | None | DATA_OWNER | +----------------------------------+---------------+---------------+-------------+-----------------+---------+-------------+
对实例调用 custom_classifier!ADD_REGEX 方法来指定系统标签和正则表达式,以识别列中的 ICD-10 代码。此示例中的正则表达式匹配所有可能的 ICD-10 代码。匹配列名
ICD.*和注释的正则表达式是可选的:CALL medical_codes!ADD_REGEX( SEMANTIC_CATEGORY => 'ICD_10_CODES', PRIVACY_CATEGORY => 'IDENTIFIER', VALUE_REGEX => '[A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}', COL_NAME_REGEX => 'ICD.*', DESCRIPTION => 'Add a regex to identify ICD-10 medical codes in a column', THRESHOLD => 0.8 );
返回:
+---------------+ | ADD_REGEX | +---------------+ | ICD_10_CODES | +---------------+
小技巧
Test the regular expression before adding a regular expression to the custom classifier instance. For example:
SELECT icd_10_code FROM medical_codes WHERE icd_10_code REGEXP('[A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}');
+-------------+ | ICD-10-CODE | +-------------+ | G30.9 | | G80.8 | | S13.4XXA | +-------------+
在此查询中,仅返回与正则表达式匹配的有效值。该查询不会返回无效值,例如
xyz。有关详细信息,请参阅 字符串函数(正则表达式)。
对实例调用 custom_classifier!LIST 方法以验证添加到实例的正则表达式:
SELECT medical_codes!LIST();
返回:
+--------------------------------------------------------------------------------+ | MEDICAL_CODES!LIST() | +--------------------------------------------------------------------------------+ | { | | "ICD-10-CODES": { | | "col_name_regex": "ICD.*", | | "description": "Add a regex to identify ICD-10 medical codes in a column", | | "privacy_category": "IDENTIFIER", | | "threshold": 0.8, | | "value_regex": "[A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}" | | } | | } | +--------------------------------------------------------------------------------+
要删除类别,请对实例调用 custom_classifier!DELETE_CATEGORY 方法。
调用 SYSTEM$CLASSIFY_SCHEMA 存储过程对表进行分类。
If the instance is no longer needed, use the DROP CUSTOM_CLASSIFIER command to remove a custom classifier instance from the system:
DROP SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER data.classifiers.medical_codes;
审计自定义分类器¶
You can use the following queries to audit the creation of custom classifier instances, adding regular expressions to instances, and dropping the instance.
To audit the creation of custom classifier instances, use the following query:
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY WHERE query_text ILIKE 'create % snowflake.data_privacy.custom_classifier%';
要对将正则表达式添加到特定实例的操作进行审计,请使用以下查询并将
DB.SCH.MY_INSTANCE替换为您要审计的实例名称:SELECT QUERY_HISTORY.user_name, QUERY_HISTORY.role_name, QUERY_HISTORY.query_text, QUERY_HISTORY.query_id FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY query_history, SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY access_history, TABLE(FLATTEN(input => access_history.direct_objects_accessed)) flattened_value WHERE flattened_value.value:"objectName" = 'DB.SCH.MY_INSTANCE!ADD_REGEX' AND QUERY_HISTORY.query_id = ACCESS_HISTORY.query_id;
To audit dropping a custom classifier instance, use the following query:
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY WHERE query_text ILIKE 'drop % snowflake.data_privacy.custom_classifier%';