隐去个人身份信息 (PII)

PII includes names, addresses, phone numbers, email addresses, tax identification numbers, and other data that can be used (alone or with other information) to identify an individual. Most organizations have regulatory and compliance requirements around handling PII data. AI_REDACT is a fully-managed Cortex AI Function that helps you help redact PII from unstructured text data, using a large language model (LLM) hosted by Snowflake to identify PII and replace it with placeholder values.

AI_REDACT 可以帮助您为呼叫中心指导、情绪分析、保险和医疗分析以及 ML 模型训练等用例准备文本。

小技巧

使用 AI_PARSE_DOCUMENT 或 AI_TRANSCRIBE,在应用 AI_REDACT 之前将文档或语音数据转换为文本。

AI_REDACT

AI_REDACT 函数将输入文本中的个人身份信息 (PII) 替换为占位符值。

重要

AI_REDACT 使用 AI 模型对敏感信息进行尽力隐去处理。请务必查看输出,以确保符合组织的数据隐私策略。如果 AI_REDACT 无法隐去数据中的任何 PII,请告知 Snowflake。

区域可用性

请参阅 Regional availability

限制

  • 使用 AI 模型进行隐去处理,可能无法找到所有个人身份信息。请务必查看输出,以确保符合组织的数据隐私策略。如果 AI_REDACT 无法隐去某些 PII,请联系 Snowflake 支持部门。

  • COUNT_TOKENS 和 AI_COUNT_TOKENS 函数尚不支持 AI_REDACT。

  • 此时,AI_REDACT 最适合格式正确的英文文本。性能可能因其他语言或存在许多拼写、标点符号或语法错误的文本而异。

  • AI_REDACT currently redacts only US PII and some UK and Canadian PII, where noted in 检测到 PII 类别.

  • AI_REDACT is currently limited in the number of tokens it can input and output. Input and output together can be up to 4,096 tokens. Output is limited to 1,024 tokens. If the input text is longer, split it into smaller chunks and redact each chunk separately, perhaps using SPLIT_TEXT_RECURSIVE_CHARACTER. See Chunking example for an example of redacting text that exceeds token limits.

    备注

    词元是 AI 模型处理的最小数据单位。对于英文文本,行业准则认为一个词元约为 4 个字符,即 0.75 个单词。

检测到 PII 类别

AI_REDACT supports redacting the following categorise of PII. The values in the Category column are the strings supported in the optional categories argument.

类别

备注

NAME

Recognizes full name, first name, middle name, and last name

EMAIL

PHONE_NUMBER

DATE_OF_BIRTH

GENDER

Recognizes male, female, and nonbinary

AGE

ADDRESS

Identifies:

  • complete postal address (US, UK, CA)

  • street address (US, UK, CA)

  • postal code (US, UK, CA)

  • city (US, UK, CA)

  • 州 (US) 或省 (CA)

  • county, borough, or township (US)

NATIONAL_ID

Identifies Social Security numbers (US)

PASSPORT

Identifies passport numbers (US, UK, CA)

TAX_IDENTIFIER

识别个人纳税人编号 (ITNs)

PAYMENT_CARD_DATA

Identifies complete card information, card number, expiration date, and CVV

DRIVERS_LICENSE

Supported US, UK, CA

IP_ADDRESS

备注

AI_REDACT 支持对某些 PII 类别的部分匹配。例如,仅使用名字即可通过 [NAME] 占位符触发信息隐去处理。

Error handling

Ordinarily, AI_REDACT raises an error if it cannot process the input text. When a query redacts multiple rows, an error causes the entire query to fail. To allow processing to continue with other rows, you can set the session parameter AI_SQL_ERROR_HANDLING_USE_FAIL_ON_ERROR to FALSE. Errors then return NULL instead of stopping the query.

ALTER SESSION SET AI_SQL_ERROR_HANDLING_USE_FAIL_ON_ERROR=FALSE;
Copy

With this parameter set to FALSE, you can also pass TRUE as the final argument to AI_REDACT, which causes the return value to be an OBJECT that contains separate fields for the redacted text and any error message. One of these fields is NULL depending on whether the AI_REDACT call processed successfully.

成本注意事项

AI_REDACT incurs costs based on the number of input and output tokens processed, as with other Cortex AI Functions. See the Snowflake Pricing Guide for details.

示例

基本示例

以下示例从输入文本中隐去姓名和地址。

SELECT AI_REDACT(
    input => 'My name is John Smith and I live at twenty third street, San Francisco.'
);
Copy

输出:

My name is [NAME] and I live at [ADDRESS]

The following example redacts only names and email addresses from the input text. Note that the text only contains a first name, which is recognized and redacted as [NAME]. The input text does not contain an email address, so no email placeholder appears in the output.

SELECT AI_REDACT(
    input => 'My name is John and I live at twenty third street, San Francisco.',
    categories => ['NAME', 'EMAIL']
);
Copy

输出:

My name is [NAME] and I live at twenty third street, San Francisco.

端到端示例

以下示例对一个表中的行进行处理,并将隐去敏感信息后的结果插入到另一个表中。您可以使用类似的方法,将隐去敏感信息的数据存储到现有表的某一列中。

隐去敏感信息后,文本将被传递到 AI_SENTIMENT 以提取整体情绪信息。

-- Create a table with unredacted text
CREATE OR REPLACE TABLE raw_table AS
  SELECT 'My previous manager, Washington, used to live in Kirkland. His first name was Mike.' AS my_column
  UNION ALL
  SELECT 'My name is William and I live in San Francisco. You can reach me at (415).450.0973';

-- view unredacted data
SELECT * FROM raw_table;

-- Create a redaction table
CREATE OR REPLACE TABLE redaction_table (
  value VARCHAR
);

-- Redact PII from raw_table and insert into redaction_table
INSERT INTO redaction_table
SELECT AI_REDACT(my_column) AS value FROM raw_table;

-- view redacted results
SELECT * FROM redaction_table;

-- Run AI_SENTIMENT on redacted text
SELECT
    value AS redacted_text,
    AI_SENTIMENT(value) AS summary_sentiment
FROM redaction_table;
Copy

错误处理示例

本示例基于前例,演示如何在使用 AI_REDACT 处理多行数据时处理错误。该示例设置会话参数 AI_SQL_ERROR_HANDLING_USE_FAIL_ON_ERROR,并将 TRUE 作为 AI_REDACT 的最后一个实参传递。此操作将使函数返回一个包含两个独立字段的 OBJECT:一个存储经过隐去处理后的文本,另一个存储错误信息(根据函数执行结果,其中一个字段将为 NULL)。

ALTER SESSION SET AI_SQL_ERROR_HANDLING_USE_FAIL_ON_ERROR=FALSE;

-- Create a redaction table with columns for value and error message
CREATE OR REPLACE TABLE redaction_table (
  value VARCHAR,
  error VARCHAR
);

-- Redact PII from raw_table and insert into redaction_table
-- Both the redacted text and any error message are stored
INSERT INTO redaction_table
SELECT
  result:value::STRING AS value,
  result:error::STRING AS error
  FROM (SELECT AI_REDACT(my_column, TRUE) AS result FROM raw_table);
Copy

Chunking example

本示例演示如何通过将长文本拆分为较小块、分别对每块进行隐去处理,再将经过隐去处理后的块重新组合为最终输出,从而实现对 PII 的隐去处理。此方法可规避 AI_REDACT 的词元限制。

CREATE OR REPLACE TABLE patients (
    patient_id INT PRIMARY KEY,
    patient_notes text
);

CREATE OR REPLACE TABLE final_temp_table AS
WITH chunked_data AS (
    -- Step 1: Split text into chunks
    SELECT
        patient_id,
        chunk.value AS chunk_text,
        chunk.index AS chunk_index
    FROM
        patients,
        LATERAL FLATTEN(
            input => SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
                patient_notes,
                'none',
                1000
            )
        ) AS chunk
    WHERE
        patient_notes IS NOT NULL
        AND LENGTH(patient_notes) > 0
),
redacted_chunks AS (
    -- Step 2: Apply AI_REDACT to each chunk
  SELECT
  patient_id,
        chunk_index,
        chunk_text,
        TO_VARIANT(results:value) AS redacted_chunk,
        TO_VARIANT(results:error) AS error_string
        from (
    SELECT
        patient_id,
        chunk_index,
        chunk_text,
        AI_REDACT(chunk_text,TRUE) AS results
    FROM
        chunked_data
)
),

-- Step 3: Concatenate redacted chunks
final AS (
SELECT
chunk_text as original,
IFF(error_string IS NOT NULL, chunk_text, redacted_chunk) AS redacted_text,
patient_id,
chunk_index
FROM
    redacted_chunks
)
SELECT * FROM final;

SELECT
  patient_id,
  LISTAGG(redacted_text, '') WITHIN GROUP (ORDER BY chunk_index) AS full_output
  FROM final_temp_table
  GROUP BY patient_id;
Copy
语言: 中文