自定义 Clean Room 模板参考

关于 Clean Room 模板

Clean room templates are written in JinjaSQL (https://github.com/sripathikrishnan/jinjasql). JinjaSQL is an extension to the Jinja templating language that generates a SQL query as output. This allows templates to use logic statements and run-time variable resolution to let the user specify table names, table columns, and custom values used in the query at run time.

Snowflake provides some pre-designed templates for common use cases. However, most users prefer to create custom query templates for their clean rooms. Custom templates are created using the clean rooms API, but can be run either in code or using the clean rooms UI.

模板一般有两种类型:

  • Analysis templates, which evaluate to a SELECT statement (or a set of SELECT operations) that show results to the template runner.

  • Activation templates, which are used to activate results to a Snowflake account or a third-party, rather than showing results in the immediate environment. An activation template is very similar to an analysis template with a few extra requirements.

    In the clean rooms UI, an analysis template can be associated with an activation template to enable the caller to run an analysis, see results, and then activate data to themselves or a third party. The activation template does not need to resolve to the same query as the associated analysis template.

创建和运行自定义模板

In a clean room with default settings, the provider adds a template to a clean room and the consumer runs the template, as described in the custom template usage documentation.

快速示例

下面是一个简单的 SQL 示例,它通过电子邮件加入提供商和使用者表,并显示每个城市的重叠计数:

SELECT COUNT(*), city FROM consumer_table
  INNER consumer_table
  ON consumer_table.hashed_email = provider_table.hashed_email
  GROUP BY city;
Copy

Here is how that query would look as a JinjaSQL template that allows the caller to choose the JOIN and GROUP BY columns, as well as the tables used:

SELECT COUNT(*), IDENTIFIER({{ group_by_col | column_policy }})
  FROM IDENTIFIER({{ my_table[0] }}) AS c
  INNER JOIN IDENTIFIER({{ source_table[0] }}) AS p
  ON IDENTIFIER({{ consumer_join_col | join_policy }}) = IDENTIFIER({{ provider_join_col | join_policy }})
  GROUP BY IDENTIFIER({{ group_by_col | column_policy }});
Copy

模板备注:

  • {{双括号对}} 内的值是自定义变量。group_by_colmy_tablesource_tableconsumer_join_colprovider_join_colgroup_by_col 都是调用方填充的自定义变量。

  • source_tablemy_table 是由调用方填充的 Snowflake 定义的字符串数组变量。数组成员是链接到 Clean Room 的提供商表和使用者表的完全限定名称。调用方指定每个数组中应该包括哪些表。

  • Provider tables must be aliased as lowercase p and consumer tables as lowercase c in a template. If you have multiple tables, you can index them as p1, p2, c1, c2, and so on.

  • 所有列名和表名都需要 IDENTIFIER,因为 {{双括号}} 中的变量计算为字符串字面量,而其并非有效标识符。

  • JinjaSQL filters can be applied to variables to enforce any join or column policies set by either side. Snowflake implements custom filters join_policy and column_policy, which verify whether a column complies with join or column policies in the clean room respectively, and fail the query if it does not. A filter is applied to a column name as {{ column_name | filter_name }}.

所有这些要点将在后面详细讨论。

下面介绍使用者如何在代码中运行此模板。请注意如何通过模板中声明的表别名来限定列名。

CALL SAMOOHA_BY_SNOWFLAKE_LOCAL_DB.CONSUMER.RUN_ANALYSIS(
  $cleanroom_name,
  $template_name,
  ['my_db.my_sch.consumer_table],       -- Populates the my_table variable
  ['my_db.my_sch.provider_table'],      -- Populates the source_table variable
  OBJECT_CONSTRUCT(                     -- Populates custom named variables
    'consumer_join_col','c.age_band',
    'provider_join_col','p.age_band',
    'group_by_col','p.device_type'
  )
);
Copy

To be able to use this template in the clean rooms UI, the provider must create a custom UI form for the template. The UI form has named form elements that correspond to template variable names, and the values provided in the form are passed into the template.

开发自定义模板

Clean Room 模板就是 JinjaSQL 模板。要创建模板,您应该熟悉以下主题:

  • Jinja templating basics (https://jinja.palletsprojects.com/en/stable/)

  • The JinjaSQL extension to Jinja (https://github.com/sripathikrishnan/jinjasql).

Use the consumer.get_jinja_sql procedure to test the validity of your template, then run the rendered template to see that it produces the results that you expect. Note that this procedure doesn't support clean room filter extensions, such as join_policy, so you must test your template without those filters, and add them later.

示例:

-- Template to test
SELECT {{ col1 | sqlsafe }}, {{ col2 | sqlsafe }}
  FROM IDENTIFIER({{ source_table[0] }}) AS p
  JOIN IDENTIFIER({{ my_table[0] }}) AS c
  ON {{ provider_join_col | sqlsafe }} = {{ consumer_join_col | sqlsafe}}
  {% if where_phrase %} WHERE {{ where_phrase | sqlsafe}}{% endif %};

-- Render the template.
USE WAREHOUSE app_wh;
USE ROLE SAMOOHA_APP_ROLE;

CALL SAMOOHA_BY_SNOWFLAKE_LOCAL_DB.CONSUMER.GET_SQL_JINJA(
$$
SELECT {{ col1 | sqlsafe }}, {{ col2 | sqlsafe }}
  FROM IDENTIFIER({{ source_table[0] }}) AS p
  JOIN IDENTIFIER({{ my_table[0] }}) AS c
  ON IDENTIFIER({{ provider_join_col }}) = IDENTIFIER({{ consumer_join_col }})
  {% if where_phrase %} WHERE {{ where_phrase | sqlsafe }}{% endif %};
  $$,
  object_construct(
'col1', 'c.status',
'col2', 'c.age_band',
'where_phrase', 'p.household_size > 2',
'consumer_join_col', 'c.age_band',
'provider_join_col', 'p.age_band',
'source_table', ['SAMOOHA_SAMPLE_DATABASE.DEMO.CUSTOMERS'],
'my_table', ['SAMOOHA_SAMPLE_DATABASE.DEMO.CUSTOMERS']
));
Copy

The rendered template looks like this:

SELECT c.status, c.age_band
  FROM IDENTIFIER('SAMOOHA_SAMPLE_DATABASE.DEMO.CUSTOMERS') AS p
  JOIN IDENTIFIER('SAMOOHA_SAMPLE_DATABASE.DEMO.CUSTOMERS') AS c
  ON p.age_band = c.age_band
  WHERE p.household_size > 2;

Try running the SQL statement above in your environment to see if it works, and gets the expected results.

Then test your template without a WHERE clause:

-- Render the template without a WHERE clause
CALL SAMOOHA_BY_SNOWFLAKE_LOCAL_DB.CONSUMER.GET_SQL_JINJA(
$$
SELECT {{ col1 | sqlsafe }}, {{ col2 | sqlsafe }}
  FROM IDENTIFIER({{ source_table[0] }}) AS p
  JOIN IDENTIFIER({{ my_table[0] }}) AS c
  ON {{ provider_join_col | sqlsafe }} = {{ consumer_join_col | sqlsafe}}
  {% if where_phrase %} WHERE {{ where_phrase | sqlsafe }}{% endif %};
  $$,
  object_construct(
'col1', 'c.status',
'col2', 'c.age_band',
'consumer_join_col', 'c.age_band',
'provider_join_col', 'p.age_band',
'source_table', ['SAMOOHA_SAMPLE_DATABASE.DEMO.CUSTOMERS'],
'my_table', ['SAMOOHA_SAMPLE_DATABASE.DEMO.CUSTOMERS']
));
Copy

Rendered template:

SELECT c.status, c.age_band
  FROM IDENTIFIER('SAMOOHA_SAMPLE_DATABASE.DEMO.CUSTOMERS') AS p
  JOIN IDENTIFIER('SAMOOHA_SAMPLE_DATABASE.DEMO.CUSTOMERS') AS c
  ON p.age_band = c.age_band
  ;

Add the policy filters to the template, and add the template to your clean room:

CALL samooha_by_snowflake_local_db.provider.add_custom_sql_template(
    $cleanroom_name,
    'simple_template',
    $$
    SELECT {{ col1 | sqlsafe | column_policy }}, {{ col2 | sqlsafe | column_policy }}
      FROM IDENTIFIER({{ source_table[0] }}) AS p
      JOIN IDENTIFIER({{ my_table[0] }}) AS c
      ON {{ provider_join_col | sqlsafe | join_policy }} = {{ consumer_join_col | sqlsafe | join_policy }}
      {% if where_phrase %} WHERE {{ where_phrase | sqlsafe }}{% endif %};
    $$,
);
Copy

数据保护

模板只能访问提供商和使用者链接到 Clean Room 的数据集。

Both the provider and consumer can set join, column, and activation policies on their data to protect which columns can be joined on, projected, or activated; however, the template must include the appropriate JinjaSQL policy filter on a column for the policy to be applied.

自定义模板语法

Snowflake Data Clean Room 支持 V3 JinjaSQL,并提供如前所述的一些扩展。

本部分包括以下主题:

模板命名规则

When creating a template, names must be all lowercase letters, numbers, spaces, or underscores. Activation templates (except for consumer-run provider activation) must have a name beginning with activation_. Template names are assigned when you call provider.add_custom_sql_template or consumer.create_template_request.

有效名称示例:

  • my_template

  • activation_template_1

无效名称示例:

  • my template - 不允许存在空格

  • My_Template - 只允许使用小写模板

模板变量

模板调用方可以将值传递给模板变量。JinjaSQL 语法允许对 {{ double_brackets }} 内的任何变量名进行变量绑定,但 Snowflake 保留了一些变量名,不应替换,如下所述。

小心

所有变量,无论是 Snowflake 定义的还是自定义的,都是由用户填充的,因此应谨慎对待。Snowflake Data Clean Room 模板必须解析为单个 SELECT 语句,但您仍应记住所有变量均由调用方传入。

Snowflake 定义的变量

所有 Clean Room 模板都可以访问由 Snowflake 定义但由调用方传入的以下全局变量:

source_table:

一个从零开始的字符串数组,表示 Clean Room 中可由模板使用的提供商关联表和视图。表名是完全限定的,例如:my_db.my_sch.provider_customers

例如: SELECT col1 FROM IDENTIFIER({{ source_table[0] }}) AS p;

my_table:

一个从零开始的字符串数组,表示 Clean Room 中可由模板使用的使用者表和视图。表名是完全限定的,例如:my_db.my_sch.consumer_customers

例如: SELECT col1 FROM IDENTIFIER({{ my_table[0] }}) AS c;

privacy:

A set of privacy-related values associated with users and templates. See the list of available child fields. These values can be set explicitly for the user, but you might want to set default values in the template. Access the child fields directly in your template, such as privacy.threshold.

示例: 以下是模板的示例片段,该模板使用 threshold_value 在聚合子句中强制执行最小组规模。

SELECT
  IFF(a.overlap > ( {{ privacy.threshold_value | default(2)  | sqlsafe }} ),
                    a.overlap,1 ) AS overlap,
  c.total_count AS total_count
  ...
Copy
measure_column:

dimensions:

where_clause:

Legacy clean room global variables. They are no longer recommended for use, but are still defined and appear in some legacy templates and documentation, so you should not alias tables or columns using either of these names to avoid naming collisions.

If your template uses measure_column or dimensions, the column policy is checked against any columns passed into these variables.

If your template uses a where_clause that has a join condition (for example, table1.column1 = table2.column2), the join policy is checked against any columns named there; otherwise, the column policy is checked against any columns named there.

自定义变量

模板创建者可以在模板中包含可由调用方填充的任意变量。除了 Snowflake 定义的变量或表别名之外,这些变量可以具有任何符合 Jinja 规范的任意名称。如果您希望模板在 Clean Room UI 中可用,还必须为 Clean Room UI 用户提供 UI 表单。对于 API 用户,您应为必填变量和可选变量提供完善的文档。

自定义变量可由模板访问,如这里所示的自定义变量 max_income

SELECT income FROM my_db.my_sch.customers WHERE income < {{ max_income }};
Copy

用户可以通过两种不同方式将变量传递给模板:

  • 在 Clean Room UI 中, 通过模板开发者创建的 UI 表单选择或提供值。此 UI 表单包含表单元素,用户可以在其中为您的模板提供值。表单元素的名称为变量的名称。模板仅使用表单元素的名称来访问值。使用 UIprovider.add_ui_form_customizations:ref:`<dcr_provider_add_ui_form_customizations> 创建 ` 表单。

  • 在代码中, 使用者调用 consumer.run_analysis 并将表名作为实参数组传入,将自定义变量作为名称-值对传入 analysis_arguments 实参。

备注

If you need to access user-provided values in any custom Python code uploaded to the clean room, you must explicitly pass variable values in to the code through Python function arguments; template variables are not directly accessible within the Python code using {{jinja variable binding syntax}}.

正确解析变量

传入模板的字符串值解析为最终模板中的字符串字面量。如果绑定变量处理不当,可能会导致 SQL 解析或逻辑错误:

  • SELECT {{ my_col }} FROM P; - This resolves to SELECT 'my_col' from P; which simply returns the string "my_col" - probably not what you want.

  • SELECT age FROM {{ my_table[0] }} AS P; - This resolves to SELECT age FROM 'somedb.somesch.my_table' AS P;, which causes a parsing error because a table must be an identifier, not a literal string.

  • SELECT age FROM IDENTIFIER({{ my_table[0] }}) AS P {{ where_clause }}; - Passing in "WHERE age < 50" evaluates to SELECT age FROM mytable AS P 'WHERE age < 50';, which is a parsing error because of the literal string WHERE clause.

因此,在适当情况下,您必须解析变量。下面介绍如何在模板中正确解析变量:

Resolving table and column names

指定表或列名称的变量必须通过以下两种方式之一转换为模板中的标识符:

  • IDENTIFIER:例如:SELECT IDENTIFIER({{ my_column }}) FROM P;

  • sqlsafe (https://github.com/sripathikrishnan/jinjasql?tab=readme-ov-file#sql-safe-strings):此 JinjaSQL 筛选器将标识符字符串解析为 SQL 文字。与前一个项目符号等效的语句是 SELECT {{ my_column | sqlsafe }} FROM P;

您的特定用法决定了何时使用 IDENTIFIER 或 sqlsafe。例如,c.{{ my_column | sqlsafe }} 不易使用 IDENTIFIER 重写。

Resolving dynamic SQL

当有字符串变量用作字面量 SQL 时,例如 WHERE 子句,请使用模板中的 sqlsafe 筛选器。例如:

SELECT age FROM IDENTIFIER({{ my_table[0] }}) AS C WHERE {{ where_clause }};
Copy

如果用户将“age < 50”传递给 where_clause,查询将解析为 SELECT age FROM sometable AS C WHERE 'age < 50';,由于字面量字符串 WHERE 条件导致其为无效 SQL。在这种情况下,您应该使用 sqlsafe 筛选器:

SELECT age FROM IDENTIFIER( {{ my_table[0] }} ) as c {{ where_clause | sqlsafe }};
Copy

必填表别名

At the top level of your query, all tables or subqueries must be aliased as either p (for provider-tables) or c (for consumer tables) in order for Snowflake to validate join and column policies correctly in the query. Any column that must be verified against join or column policies must be qualified with the lowercase p or c table alias. (Specifying p or c tells the back end whether to validate a column against the provider or the consumer policy respectively.)

If you use multiple provider or consumer tables in your query, add a numeric, sequential 1-based suffix to each table alias after the first. So: p, p1, p2, and so on for the first, second, and third provider tables, and c, c1, c2, and so on for the first, second, and third consumer tables. The p or c index should be sequential without gaps (that is, create the aliases p, p1, and p2, not p, p2, and p4).

示例

SELECT p.col1 FROM IDENTIFIER({{ source_table[0] }}) AS P
UNION
SELECT p1.col1 FROM IDENTIFIER({{ source_table[1] }}) AS P1;
Copy

Custom clean room template filters

Snowflake 支持所有 标准 Jinja 筛选器 (https://jinja.palletsprojects.com/en/stable/templates/#builtin-filters) 和大多数标准 JinjaSQL 筛选器 (https://github.com/search?q=repo%3Asripathikrishnan%2Fjinjasql+self.env.filters+path%3Ajinjasql%2Fcore.py&type=code&path+jinjasql%2Fcore.py=),以及一些扩展:

  • join_policy: Succeeds if the column is in the join policy of the data owner; fails otherwise.

  • column_policy: Succeeds if the column is in the column policy of the data owner; fails otherwise.

  • activation_policy: Succeeds if the column is in the activation policy of the data owner; fails otherwise.

  • join_and_column_policy: Succeeds if the column is in the join or column policy of the data owner; fails otherwise.

  • Snowflake 模板 不支持 identifier JinjaSQL 筛选器。

小技巧

JinjaSQL 语句是从左到右评估:

  • {{ my_col | column_policy }} 正确

  • {{ my_col | sqlsafe | column_policy }} 正确

  • {{ column_policy | my_col }} 不正确

  • {{ my_col | column_policy | sqlsafe }} 不正确: column_policy 将对照字符串形式的 my_col 值进行检查,这是一个错误。

强制实施 Clean Room 策略

Clean rooms do not automatically check clean room policies against columns used in a template. If you want to enforce a policy against a column:

  • You must apply the appropriate policy filter to that column in the template. For example:

JOIN IDENTIFIER({{ source_table[0] }}) AS p
  ON IDENTIFIER({{ c_join_col | join_policy }}) = IDENTIFIER({{ p_join_col | join_policy }})
Copy

Policies are checked only against columns owned by other collaborators; policies are not checked for your own data.

Note that column names cannot be ambiguous when testing policies. So if you have columns with the same name in two tables, you must qualify the column name in order to test the policy against that column.

运行自定义 Python 代码

模板可以运行上传到 Clean Room的 Python 代码。模板可以调用 Python 函数,该函数接受一行数据中的值,并返回值以在查询中使用或投射。

  • 提供商 将自定义的 Python 代码上传到 Clean Room 时,模板会使用语法 cleanroom.function_name 调用 Python 函数。更多详细信息请点击此处。

  • 使用者 将自定义 Python 代码上传到 Clean Room 时,模板会直接使用传递给 consumer.generate_python_request_template 的裸 function_name 值调用该函数(不会像提供商代码那样限定在 cleanroom)。更多详细信息请点击此处。

提供商代码示例:

-- Provider uploads a Python function that takes two numbers and returns the sum.
CALL samooha_by_snowflake_local_db.provider.load_python_into_cleanroom(
  $cleanroom_name,
  'simple_addition',                        -- Function name to use in the template
  ['someval integer', 'added_val integer'], -- Arguments
  [],                                       -- No packages needed
  'integer',                                -- Return type
  'main',                                   -- Handler for function name
  $$

def main(input, added_val):
  return input + int(added_val)
    $$
);

-- Template passes value from each row to the function, along with a
-- caller-supplied argument named 'increment'
CALL samooha_by_snowflake_local_db.provider.add_custom_sql_template(
    $cleanroom_name,
    'simple_python_example',
$$
    SELECT val, cleanroom.simple_addition(val, {{ increment | sqlsafe }})
    FROM VALUES (5),(8),(12),(39) AS P(val);
$$
);
Copy

安全注意事项

A clean room template is not executed with the identity of the current user.

用户不能直接访问 Clean Room 内的任何数据;所有访问都是通过模板结果,借助原生应用程序进行的。

Apply a policy filter any time a column is used in your template to ensure that your policies, and the policies of all collaborators, are respected.

Wrap user-provided variables with IDENTIFIER() when possible to strengthen your templates against SQL injection attacks.

激活模板

模板还可用于将查询结果保存到 Clean Room 外的表中;这称为 激活。目前,自定义模板支持的激活形式仅限提供商激活和使用者激活(分别将结果存储到提供商或使用者的 Snowflake 账户)。了解如何实施激活。

激活模板是具有以下附加要求的分析模板:

  • 激活模板是计算结果为 SQL 脚本块的 JinjaSQL 语句,可以是简单的 SELECT 语句,这点与分析模板不同。

  • Activation templates create a table in the clean room to store results, and return the table name (or a fragment of the name) to the template caller.

  • 脚本块应以一条 RETURN 语句结尾,该语句返回生成的表的名称,减去任何 cleanroom.cleanroom.activation_data_ 前缀。

  • The name of the template, the name of the internal table that the template creates, and the table name the template returns follow these patterns:

Activation type

Template name prefix

Table name prefix

Returned table name

Consumer-run consumer

activation_

cleanroom.activation_data_*

Table name without prefix

Consumer-run provider

No prefix required

cleanroom.activation_data_*

Table name without prefix

Provider-run provider

activation_

cleanroom.temp_result_data is the full table name.

temp_result_data

  • 任何被激活的列都必须在已链接数据的提供商或使用者的 激活策略 中列出,并且应对其应用 activation_policy 筛选器。请注意,列既可以是激活列,也可以是联接列。

  • 如果要在 Clean Room UI 中运行模板,您应 提供一个 Web 表单,其中包含 activation_template_nameenabled_activations 字段。在 UI 中使用的模板必须同时具有分析模板和关联的激活模板。

  • 由于表正在生成,因此所有计算的列都必须显式使用别名,而不是推断的名称。即:

    SELECT COUNT(*), p.status from T AS P; FAILS, because the COUNT column name is inferred.

    SELECT COUNT(*) AS COUNT_OF_ITEMS, p.status from T AS P; SUCCEEDS, because it explicitly aliases the COUNT column.

以下是两个基本激活模板示例。一个用于提供商运行的服务器激活,另一个用于其他激活类型。它们的不同之处在于两个突出显示的行,这两行包含结果表名。

Table must be named cleanroom.temp_result_data:

BEGIN
  CREATE OR REPLACE TABLE cleanroom.temp_result_data AS
    SELECT COUNT(c.status) AS ITEM_COUNT, c.status, c.age_band
      FROM IDENTIFIER({{ my_table[0] }}) AS c
    JOIN IDENTIFIER({{ source_table[0] }}) AS p
      ON {{ c_join_col | sqlsafe | activation_policy }} = {{ p_join_col | sqlsafe | activation_policy }}
    GROUP BY c.status, c.age_band
    ORDER BY c.age_band;
  RETURN 'temp_result_data';
END;
Copy

后续步骤

掌握模板系统后,请阅读使用模板类型实现 Clean Room 的详细说明:

  • 提供商模板 是提供商编写的模板。这是默认用例。

  • 使用者模板 是使用者编写的模板。在某些情况下,Clean Room 创建者希望让使用者能够创建、运行自己的模板并将其上传至 Clean Room。

  • 激活模板 在成功运行后创建一个结果表。根据激活模板,结果表可以保存到 Clean Room 之外的提供商或使用者账户,也可以发送到 Activation Hub 中所列的第三方激活提供商。

  • 链接模板 允许您将多个模板链接在一起,每个模板的输出由链中的下一个模板使用。

更多信息

  • Jinja 文档 (https://jinja.palletsprojects.com/en/stable/)

  • JinjaSQL 文档 (https://github.com/sripathikrishnan/jinjasql)

语言: 中文