GENERATE_SYNTHETIC_DATA

该过程将根据输入表中的数据,从一个或多个表生成合成数据,并返回一个表,其中包含有关生成数据的指标,例如源数据和生成的数据之间的差异系数(相似度)。

This stored procedure uses the caller’s rights to generate the output table.

Read the requirements for running this procedure. If any requirements are not met, the request will fail before it starts generating data.

Learn more about synthetic data usage.

语法

SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA(<configuration_object>)

实参

configuration_object

An OBJECT that specifies the details of the request. You can use an OBJECT constant to specify this object.

OBJECT 值具有以下结构:

{
  'datasets': [
    {
      'input_table': '<input_table_name>',
      'output_table' : '<output_table_name>',
      'columns': {
        '<column_name>': {
          <property_name>: <property_value>
        }
        , ...
      }
    }
    , ...
  ],
  'similarity_filter': <boolean>,
  'replace_output_tables': <boolean>,
  'consistency_secret': <session_scoped_reference_string>
}

OBJECT 对象包含以下键值对:

datasets

An array specifying the data to generate. Each element in the array is an OBJECT value that defines a single input-output table pair. You can specify a maximum of five table pairs.

表示单个输入/输出表对的子 OBJECT 值具有以下属性:

input_table

The fully-qualified name of the input table from which to generate synthetic data. If the table does not exist or cannot be accessed, Snowflake returns an error message. See Using synthetic data in Snowflake for more input table requirements.

If the identifier contains spaces or special characters, the entire string must be enclosed in double quotes. Identifiers enclosed in double quotes are also case-sensitive.

For more information, see Identifier requirements.

output_table

The fully-qualified name of the output table to store the synthetic data generated from input_table. The generated table will have the same permissions and policies as if the user had called CREATE TABLE with default values. If the table already exists and replace_output_tables=TRUE, the existing table will be overwritten.

In addition, the identifier must start with an alphabetic character and cannot contain spaces or special characters unless the entire identifier string is enclosed in double quotes (for example, "My object"). Identifiers enclosed in double quotes are also case-sensitive.

For more information, see Identifier requirements.

columns

可选)OBJECT 值指定特定列的其他属性。OBJECT 中的每个字段都定义了单列的属性。您不需要为所有列或任何列定义属性。对于每个字段:

  • 键是列名称。值中的属性应适用于此列。
  • 该值是包含以下任意键/值对的 OBJECT 值:
    • join_key: BOOLEAN, where TRUE indicates that this is a join key column. This cannot be used in a column labeled categorical. Column must be a string, numeric, or Boolean value. Learn about join keys.

默认:FALSE。

  • categorical: BOOLEAN, used to specify whether the column is a categorical string. Set to TRUE to enable the output to mark the data as non-sensitive, and able to be used in the output. Set to FALSE to redact values from the output. If not specified, will be determined by examining the data. Can be specified only for STRING columns. If set to TRUE, you cannot specify the replace or join_key fields for this column

默认值:根据列数据推断。

  • replace: Specifies an output format for STRING values. Can be used only on categorical string columns. The only values that can be used with join_key columns are uuid and email. Cannot be used when categorical is TRUE. If specified, you must provide a value for consistency_secret. The following values are supported:

    replace values

    ValueDescription
    uuidA UUID. Example: 88d99a35-c4be-4022-b06a-41fb4629b46d
    nameA first and last name in US locale style. Example: George Washington
    first_nameA first name in US locale style. Example: George
    last_nameA last name in US locale style. Example: Washington
    addressAn abbreviated address in US locale style. Example: 1600 Pennsylvania Ave
    full_addressA detailed street address in US locale style. Example: 1600 Pennsylvania Ave NW, Washington DC 20500
    emailAn email address. Example: bdbQ6OPBS5ScOdJx8bVpFw@example.com
    phoneA US-style 10-digit phone number in US locale style. Example: 212-555-1234
    ssnA US-style Social Security number. Example: 123-45-6789

默认:

  • For join_key columns, uuid
  • 对于非联接键列,其值将被编辑。
similarity_filter

(可选)指定创建合成数据时是否使用相似性筛选器。将此项设置为 TRUE 会使用内置的隐私筛选器从目标表中移除与输入表中的行过于相似的行。如果为 FALSE,则每个输出表的行数与输入表相同;如果为 TRUE,则输出表的行数可能少于输入表。如果存在 NULL 值且列类型为非字符串列,则 TRUE 的合成数据生成将会失败。

默认:FALSE

For more information, see Enhancing privacy.

replace_output_tables

可选)指定在创建合成数据时是否覆盖输出合成数据表。将其设置为 TRUE 以覆盖输出表。

默认:FALSE

consistency_secret

Session-scoped reference STRING for a symmetric key SECRET. Required if either of the following conditions are met, (otherwise you can omit this field):

  • 若需确保多次运行中联接键的一致性。
  • If columns.replace or columns.join_key = TRUE are specified on any column, and this procedure is run in an owner’s rights stored procedure.

If you provide a secret, the procedure generates consistent values for STRING join keys across multiple runs that reuse the same consistency secret. If you provide a secret, you must have the READ or OWNERSHIP privilege on this secret.

If you don’t provide a secret, join keys are consistent between tables in the same run, but not across multiple runs. Learn more about consistency.

默认值:无一致性

输出

Column NameData TypeDescription
created_onTIMESTAMPTime the synthetic data was generated.
table_nameVARCHARName of the synthetic table.
table_schemaVARCHARSchema name of the synthetic table.
table_databaseVARCHARDatabase name of the synthetic table.
columnsVARCHARA pair of columns in the synthetic table.
source_table_idNUMBERInternal/system-generated identifier of the input table.
source_table_nameVARCHARName of the input table.
source_table_schemaVARCHARSchema name of the input table.
source_table_databaseVARCHARDatabase name of the input table.
source_columnsVARCHARNames of the source columns.
metric_typeENUM

correlation_coefficient_difference - Calculated as the absolute value of the correlation coefficient between two non-join columns in the source table and the same two columns in the generated data.

Currently, correlation_coefficient_difference is the only supported metric. This is the difference between the correlation coefficient of every combination of columns in the input table and the same coefficient in the generated data. Each row represents the correlation coefficient difference between one combination of columns. The column name pair is found in these columns: columns and source_columns.

metric_valueNUMBERValue of the metric.

访问控制要求

To generate synthetic data, you must use a role with each the following grants:

  • USAGE on the warehouse that you want to use for queries.
  • SELECT on the input table from which you want to generate synthetic data.
  • USAGE on the database and schema that contain the input table, and on the database that contains the output table.
  • CREATE TABLE on the schema that contains the output table.
  • OWNERSHIP on the output tables. The simplest way to do this is by granting OWNERSHIP to the schema where the output table is generated. (However, if someone has applied a FUTURE GRANT on this schema, table ownership will be silently overridden – that is, GRANT OWNERSHIP ON FUTURE TABLES IN SCHEMA db.my_schema TO ROLE some_role will automatically grant OWNERSHIP to some_role on any new tables created in schema my_schema.)

All users can access the SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA stored procedure. Access is made available using the SNOWFLAKE.CORE_VIEWER database role, which is granted to the PUBLIC role.

使用说明

  • JSON 键值必须为小写。
  • You must accept the Anaconda terms and conditions in your Snowflake account in order to enable this feature.
  • For additional requirements, see Requirements.
  • Any timestamps earlier than 1677-09-21 00:12:43.145224193 or later than 2262-04-11 23:47:16.854775807 in the source data are coerced to 1677-09-21 00:12:43.145224193 or 2262-04-11 23:47:16.854775807 respectively when generating synthetic data.

示例

此示例从包含医疗信息(血型、性别、年龄和种族)的输入表中生成合成数据。响应显示了源表和生成表之间数据的接近程度。未显示生成的合成数据表。

指定两列作为联接键

CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
    'datasets':[
        {
          'input_table': 'syndata_db.sch.faker_source_t',
          'output_table': 'syndata_db.sch.faker_synthetic_t',
          'columns': { 'blood_type': {'join_key': TRUE} , 'ethnicity': {'join_key': TRUE}}
        }
      ]
  });

没有指定列作为联接键

CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
  'datasets':[
      {
        'input_table': 'syndata_db.sch.faker_source_t',
        'output_table': 'syndata_db.sch.faker_synthetic_t'
      }
    ]
});

使用一致性键在多次运行中生成一致的值

CREATE OR REPLACE SECRET my_db.public.my_consistency_secret
  TYPE = SYMMETRIC_KEY
  ALGORITHM = GENERIC;

CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
  'datasets':[
      {
        'input_table': 'CLINICAL_DB.PUBLIC.BASE_TABLE',
        'output_table': 'my_db.public.test_syndata',
        'columns': { 'patient_id': {'join_key': TRUE, 'replace': 'uuid'}}
      }
    ],
    'consistency_secret': SYSTEM$REFERENCE('SECRET', 'MY_CONSISTENCY_SECRET', 'SESSION', 'READ')::STRING,
    'replace_output_tables': TRUE
});

调用函数的输出

+---------------------------+-------------------+--------------+----------------+------------------------+-------------------+---------------------+-----------------------+------------------------+------------------------------------+----------------+
| CREATED_ON                | TABLE_NAME        | TABLE_SCHEMA | TABLE_DATABASE | COLUMNS                | SOURCE_TABLE_NAME | SOURCE_TABLE_SCHEMA | SOURCE_TABLE_DATABASE | SOURCE_COLUMNS         | METRIC_TYPE                        | METRIC_VALUE   |
+---------------------------+-------------------+--------------+----------------+------------------------+-------------------+---------------------+-----------------------+------------------------+------------------------------------+----------------+
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch          | syndata_db     | "BLOOD_TYPE,GENDER"    | faker_source_t    | sch                 | syndata_db            | "BLOOD_TYPE,GENDER"    | CORRELATION_COEFFICIENT_DIFFERENCE | 0.02430214616  |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch          | syndata_db     | "BLOOD_TYPE,AGE"       | faker_source_t    | sch                 | syndata_db            | "BLOOD_TYPE,AGE"       | CORRELATION_COEFFICIENT_DIFFERENCE | 0.001919343586 |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch          | syndata_db     | "BLOOD_TYPE,ETHNICITY" | faker_source_t    | sch                 | syndata_db            | "BLOOD_TYPE,ETHNICITY" | CORRELATION_COEFFICIENT_DIFFERENCE | 0.003720197046 |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch          | syndata_db     | "GENDER,AGE"           | faker_source_t    | sch                 | syndata_db            | "GENDER,AGE"           | CORRELATION_COEFFICIENT_DIFFERENCE | 0.004348586645 |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch          | syndata_db     | "GENDER,ETHNICITY"     | faker_source_t    | sch                 | syndata_db            | "GENDER,ETHNICITY"     | CORRELATION_COEFFICIENT_DIFFERENCE | 0.001171535243 |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch          | syndata_db     | "AGE,ETHNICITY"        | faker_source_t    | sch                 | syndata_db            | "AGE,ETHNICITY"        | CORRELATION_COEFFICIENT_DIFFERENCE | 0.004265938158 |
+---------------------------+-------------------+--------------+----------------+------------------------+-------------------+---------------------+-----------------------+------------------------+------------------------------------+----------------+