Design custom templates¶
关于 Clean Room 模板¶
Clean room templates are written in JinjaSQL (https://github.com/sripathikrishnan/jinjasql). JinjaSQL is an extension to the Jinja templating language. A JinjaSQL template evaluates to a SQL statement when run in a clean room. The JinjaSQL templating language provides logic statements and run-time variable replacement, which enables the template to be customized at run time. For example, a user can provide table and column names when they run the template, and the template can adjust itself based on the values passed in.
模板一般有两种类型:
Analysis templates, which evaluate to a SQL DQL statement (a SELECT statement) that returns query results immediately to the template runner.
Activation templates, which are used to activate results to a Snowflake account, rather than showing results in the immediate environment. An activation template is very similar to an analysis template with a few extra requirements, and it evaluates to a DDL statement (CREATE TABLE).
Creating, sharing, and running a custom template¶
Any collaborator can register and share templates with specific analysis runners in a collaboration.
Let's start by looking at a simple SQL query, and how it would be written as a template.
1. The JinjaSQL template¶
Here is a simple SQL query that joins two tables by email and shows the overlap count per city:
Here is how that query would look as a JinjaSQL template that allows the caller to choose the JOIN and GROUP BY columns, as well as the tables used. The template includes some filters that enforce Snowflake Data Clean Room policies.
模板备注:
Values within {{ double bracket pairs }} are variables. The values are populated by the caller.
group_by_col,source_table,p1_join_col, andp2_join_colare all variables populated by the caller. These variables have arbitrary names chosen by the template designer.source_tableis a standard Snowflake-defined variable. This variable defines the views to use in the query. These views are datasets within data offerings that are linked into the clean room. Collaborators can list available datasets by calling VIEW_DATA_OFFERINGS.A dataset must be aliased as lowercase
pif you want to enforce Snowflake Data Clean Room policies on it. If a template uses multiple datasets, the first isporp1, and additional datasets are indexed asp2,p3, and so on.所有列名和表名都需要 IDENTIFIER,因为 {{双括号}} 中的变量计算为字符串字面量,而其并非有效标识符。
JinjaSQL filters are applied to columns to enforce Snowflake Data Clean Room policies on the column. Snowflake implements custom filters
join_policyandcolumn_policy, which verify whether a column complies with join or column policies in the clean room respectively, and fail the query if it doesn't. A filter is applied to a column name as{{ column_name | filter_name }}.
所有这些要点将在后面详细讨论。
2. The Collaboration template¶
A template is added to a collaboration by embedding it in a YAML specification and registering it, then linking it.
You must request to share a template with a given analysis runner, who can accept or reject the request. Additionally, all data providers for that analysis runner must accept the request for the template to be shared.
3. Running the template¶
Here is how an analysis runner might run this template in code. Note how column names are qualified by the table aliases declared in the template.
开发自定义模板¶
Clean Room 模板就是 JinjaSQL 模板。要创建模板,您应该熟悉以下主题:
Jinja templating basics (https://jinja.palletsprojects.com/en/stable/)
The JinjaSQL extension to Jinja (https://github.com/sripathikrishnan/jinjasql).
You can use Cortex Code to validate the SQL output of your JinjaSQL templates based on variable inputs that should be provided. See example prompts below that you can copy into Cortex Code to get final SQL outputs you can test:
示例:
The rendered template looks like this:
Try running the SQL statement above in your environment to see if it works and gets the expected results.
Then test your template without a WHERE clause:
Rendered template:
Add the template into your clean room, and test with an analysis run spec.
数据保护¶
Templates can access only datasets linked into the clean room by collaborators.
Collaborators specify join, column, and activation policies on their datasets to ensure that only those columns can be used as an input for a template variable.
重要
The template must include the appropriate JinjaSQL policy filter on a column for the policy to be applied.
自定义模板语法¶
Snowflake Data Clean Room 支持 V3 JinjaSQL,并提供如前所述的一些扩展。
本部分包括以下主题:
模板命名规则¶
When creating a template, names must contain only letters, numbers, or underscores.
Template names are assigned in the template specification's name field when you register the template.
有效名称示例:
my_templateactivation_template_1
无效名称示例:
my template- 不允许存在空格my_template!- Special characters not allowed
模板变量¶
Template callers can pass in values to template variables. JinjaSQL syntax enables variable binding for any variable name within {{ double_brackets }}, but Snowflake reserves a few variable names that you shouldn't override, as described below.
小心
All variables, whether Snowflake-defined or custom, are populated by the user and should be treated with appropriate caution. Analysis templates must resolve to a single SELECT statement (activation templates resolve to a script block). Remember that all variables are passed in by the caller.
Snowflake 定义的变量¶
All clean room templates have access to the following global variables defined by Snowflake, but passed in by the analysis runner:
source_table:A zero-based string array of tables and views from data offerings linked into the collaboration via LINK_DATA_OFFERING that can be used by the template.
Example:
SELECT col1 FROM IDENTIFIER({{ source_table[0] }}) AS p;my_table:In a Collaboration clean room,
my_tableis used only by Snowflake Standard Edition users. For these users,my_tableis a zero-based string array of datasets that the analysis runner linked by calling LINK_LOCAL_DATA_OFFERING.Example:
SELECT col1 FROM IDENTIFIER({{ my_table[0] }}) AS c;
自定义变量¶
Template creators can include arbitrary variables in a template that can be populated by the analysis runner. These variables can have any Jinja-compliant name except for the Snowflake-defined variables or table alias names. You should provide guidance in the parameter section of the template for required and optional variables.
自定义变量可由模板访问,如这里所示的自定义变量 max_income:
Analysis runners pass variables when calling RUN as defined in the analysis run spec.
正确解析变量¶
传入模板的字符串值解析为最终模板中的字符串字面量。如果绑定变量处理不当,可能会导致 SQL 解析或逻辑错误:
SELECT {{ my_col }} FROM p;- This resolves toSELECT 'my_col' from p;which simply returns the string "my_col" - probably not what you want.SELECT age FROM {{ source_table[0] }} AS p;- This resolves toSELECT age FROM 'somedb.somesch.source_table' AS p;, which causes a parsing error because a table must be an identifier, not a literal string.SELECT age FROM IDENTIFIER({{ source_table[0] }}) AS p {{ where_clause }};- Passing in "WHERE age < 50" evaluates toSELECT age FROM mytable AS p 'WHERE age < 50';, which is a parsing error because of the literal string WHERE clause.
因此,在适当情况下,您必须解析变量。下面介绍如何在模板中正确解析变量:
- Resolving table and column names
指定表或列名称的变量必须通过以下两种方式之一转换为模板中的标识符:
IDENTIFIER: For example:
SELECT IDENTIFIER({{ my_column }}) FROM p;sqlsafe (https://github.com/sripathikrishnan/jinjasql?tab=readme-ov-file#sql-safe-strings): This JinjaSQL filter resolves identifier strings to SQL text. An equivalent statement to the previous bullet is
SELECT {{ my_column | sqlsafe }} FROM p;
Your particular usage dictates when to use IDENTIFIER or
sqlsafe. For example,p.{{ my_column | sqlsafe }}can't easily be rewritten using IDENTIFIER.- Resolving dynamic SQL
当有字符串变量用作字面量 SQL 时,例如 WHERE 子句,请使用模板中的
sqlsafe筛选器。例如:If a user passes in "age < 50" to
where_clause, the query would resolve toSELECT age FROM sometable AS p WHERE 'age < 50';which is invalid SQL because of the literal string WHERE condition. In this case, you should use thesqlsafefilter:
必填表别名¶
At the top level of your query, all source_table datasets must be aliased as p, and all my_table datasets must be aliased as
c, in order for Snowflake to validate join and column policies correctly in the query. Any column that must be verified against join
or column policies must be qualified with the lowercase p or c table alias.
If you use multiple source_table or my_table datasets in your query, add a numeric, sequential 1-based suffix to each table alias
after the first. So: p or p1, p2, p3, and so on for the first, second, and third source_table datasets, and c or c1, c2,
c3, and so on for the first, second, and third my_table datasets. The p or c index should be sequential without gaps (that
is, create the aliases p1, p2, and p3, not p1, p2, and p4).
示例
Custom clean room template filters¶
Snowflake 支持所有 标准 Jinja 筛选器 (https://jinja.palletsprojects.com/en/stable/templates/#builtin-filters) 和大多数标准 JinjaSQL 筛选器 (https://github.com/search?q=repo%3Asripathikrishnan%2Fjinjasql+self.env.filters+path%3Ajinjasql%2Fcore.py&type=code&path+jinjasql%2Fcore.py=),以及一些扩展:
join_policy:Succeeds if the column is in the join policy of the data owner; fails otherwise. See 将数据保护策略应用于数据产品.
column_policy:Succeeds if the column is in the column policy of the data owner; fails otherwise. See 将数据保护策略应用于数据产品.
activation_policy:Succeeds if the column is in the activation policy of the data owner; fails otherwise. See 将数据保护策略应用于数据产品.
join_and_column_policy:Succeeds if the column is in the join or column policy of the data owner; fails otherwise. See 将数据保护策略应用于数据产品.
identifier:This JinjaSQL filter is not supported by Snowflake templates.
小技巧
JinjaSQL 语句是从左到右评估:
{{ my_col | column_policy }}正确{{ my_col | sqlsafe | column_policy }}正确{{ column_policy | my_col }}不正确{{ my_col | column_policy | sqlsafe }}Incorrect:column_policywill be checked against themy_colvalue as a string, which is an error.
强制实施 Clean Room 策略¶
Clean rooms don't automatically check clean room policies against columns used in a template. If you want to enforce a policy against a column:
You must apply the appropriate policy filter to that column in the template. For example:
You must alias the table as lowercase
porc. See 必填表别名.
Policies are checked only against columns of tables referenced in a source_table variable, which refer to views shared within the clean room. Policies are not checked against columns of tables referenced in a my_table variable, which are local tables not shared within the clean room.
Note that column names can't be ambiguous when testing policies. So if you have columns with the same name in two tables, you must qualify the column name in order to test the policy against that column.
Access considerations and best practices¶
A template is always executed in context to the clean room application role. A collaborator does not have direct access to any data within the clean room that is restricted to template access only; all access is through the native application roles and the template outputs.
As best practice, you should follow the below for templates you create or use in a clean room:
Ensure a policy filter is applied any time a column variable is used in a template, so that collaborator policies are respected.
Wrap user-provided variables with IDENTIFIER() when possible to strengthen templates against SQL injection attacks.
激活模板¶
A template can also be used to save query results to a table outside of the clean room; this is called activation. An activation template is an analysis template with the following additional requirements:
激活模板是计算结果为 SQL 脚本块的 JinjaSQL 语句,可以是简单的 SELECT 语句,这点与分析模板不同。
Activation templates must create an internal table in the clean room to store results. The table generated by the template must have the prefix
cleanroom.activation_data_, for example:cleanroom.activation_data_my_resultsAll columns in the internal results table should have the value
activation_allowed: TRUEin their data offering specification.The script block should end with a RETURN statement that returns the name of the generated table without the
cleanroom.activation_data_prefix, for example:RETURN 'my_results'.The template itself has no naming requirements.
Here is an example activation template specification:
Learn how to implement activation in a collaboration: Activating query results.
后续步骤¶
掌握模板系统后,请阅读使用模板类型实现 Clean Room 的详细说明:
Activation templates create a results table after a successful run and is shared outside of the clean room. Depending on the collaboration specification, the results table can be shared to the analysis runner or other collaborators.
Code bundles are used to upload custom Python UDFs and UDTFs into a collaboration. Templates in the collaboration can run these functions to perform complex data actions.
Internal tables are used to store intermediary or persistent results, which can be used downstream to support multistep workflows. These tables are accessible to templates or custom uploaded code inside the clean room.
更多信息¶
Jinja 文档 (https://jinja.palletsprojects.com/en/stable/)
JinjaSQL 文档 (https://github.com/sripathikrishnan/jinjasql)