在 Clean Room 中上传和运行自定义函数

概述

您可以将自定义 Python UDFs 和 UDTFs 上传到您的 Clean Room 中,然后通过模板运行它们来执行复杂的数据操作。这些操作包括机器学习或查询中的自定义数据操作,是单步或多步 流程 的一部分。Python 是唯一支持自定义 UDFs 的编码语言。

您上传的代码可以导入和使用 经批准的 Python 包捆绑包 (https://repo.anaconda.com/pkgs/snowflake/) 和 Snowpark API 中的包。

Both providers and consumers can upload custom Python code into a clean room, although the process is different for providers and consumers.

This topic shows how to upload and run custom Python UDFs and UDTFs as either a provider or a consumer.

小技巧

有关如何在 Clean Room 中开发自己的 Python UDFs 的背景信息,请参阅以下主题:

Running uploaded code

Each bundle of uploaded code can define multiple functions that call each other, but a bundle exposes only one handler function. This handler function can be called by templates created or run by anyone who uses the clean room. If the code creates internal tables, these tables can be accessed as described in 设计多步流程.

For example, if you uploaded a function named simple_add that takes in two numeric parameters, you can call it from a template as shown here. The function is always referenced using the scope cleanroom. For example, a template could call simple_add like this:

SELECT cleanroom.simple_add({{ price | sqlsafe | int }}, {{ tax | sqlsafe | int }}) ...
Copy

小技巧

If the provider wants to run the code above, they must alias all SELECT columns that use an aggregate or custom function, because a results table is generated behind the scenes:

SELECT
  cleanroom.simple_add(
    {{ price | sqlsafe | int }}, {{ tax | sqlsafe | int }}
    ) AS TOTAL_ITEM_COST
...
Copy

You can upload multiple functions in a single package, and functions within a single package can call each other, but functions can't call functions within other packages. (They can call the handler functions, though.) For example:

Clean room with two uploaded Python packages

Package 1

Package 2

  • Handler function A

  • Helper function A1

  • Helper function A2

  • Handler function B

  • Helper function B1

  • Helper function B2

Notes:

  • Code uploaded by either side (provider and consumer) can be run by either side.

  • A template can call function A or function B, but not A1, A2, B1, B2.

  • Function A can call function B, and the reverse.

  • Function A can't call B1 or B2 and Function B can't call A1 or A2.

  • A1 can call A2 and the reverse. A1 and A2 can call B. A1 and A2 can't call B1 or B2.

  • B1 can call B2 and the reverse. B1 and B2 can call A. B1 and B2 can't call A1 or A2.

Updating or deleting custom functions

You can upload or overwrite an existing function or template that you uploaded, but you can't delete an existing function or template. The only way to "remove" a function is to create a dummy function with the exact same name and signature that always succeeds.

如果上传的函数与先前上传的函数签名完全相同,则会覆盖现有函数。签名由外部处理程序的函数名称(不区分大小写)以及所有参数的数据类型(顺序相同)构成。参数名称无关紧要。您不能覆盖其他账户上传的函数。

由于在更新函数时,签名必须匹配,因此您无法更改现有函数的签名:如果您上传函数 foo(name VARIANT age INTEGER),然后又上传函数 foo(name VARIANT age FLOAT),则除了第一个函数外,第二个函数也会添加到 Clean Room 中,因为实参类型不同。

提供商提交的代码

提供商提交的函数可以作为内联代码上传,也可以从 Snowflake 暂存区上传。本文会介绍这两种技术。

您上传的代码可以原生导入并使用 一组经批准的 Python 包 (https://repo.anaconda.com/pkgs/snowflake/) 中的包。如果您需要非默认包,则必须 在 Clean Room 中使用 Snowpark Container Services 来托管代码。

You can't view uploaded provider code, even your own code, so be sure to include a copy of exactly what you upload into a clean room.

小技巧

更新提供商编写的代码后,您应该更新默认发布指令,然后调用 provider.create_or_update_cleanroom_listing 将更改传播给使用者。如果您未调用 provider.create_or_update_cleanroom_listing,则对于当前正在使用 Clean Room 的使用者,您的默认版本将不会更新。

下面简要介绍了提供商如何向 Clean Room 添加代码:

  1. 提供商以正常方式创建和配置 Clean Room。

  2. 提供商通过调用 provider.load_python_into_cleanroom 上传代码。您可以直接在该过程中 内联上传代码,也可以 将代码文件上传到某个暂存区,然后将暂存区位置提供给该过程。

    Although your code can include multiple functions, only one handler is exposed for each upload. To expose multiple functions to templates, upload each handler by calling provider.load_python_into_cleanroom.

  3. 每次成功上传代码后,都会生成一个 新的 Clean Room 补丁版本。然后,您必须使用新的补丁号调用 provider.set_default_release_directive,以递增默认版本号。如果 Clean Room 对外公开,则在安装代码之前会运行安全检查,并且在递增默认版本号之前,您必须调用 provider.view_cleanroom_scan_status 以确认安全检查已通过。

    • If you want to upload multiple functions in a single patch, you can bulk upload your code. However, bulk uploads can make it more challenging to debug if the upload has a security scan issue, because the file that caused the problem isn't reported in the error response.

  4. You create and upload a custom template that calls your code. The template must call the handler function using the cleanroom scope, that is: cleanroom.my_function(...).

  5. 使用者按照运行其他任何模板的相同方式运行您的模板。

    小技巧

    如果使用者在安装包含自定义代码的 Clean Room 时遇到挂载错误,这可能表明代码中存在语法错误。

您可以在 提供商编写的代码示例部分 中找到演示此流程的代码示例。

关于版本控制的重要注意事项

提供商每次上传函数时,都会递增补丁编号(补丁编号的数量上限为 99 个)。因此,在将代码添加到 Clean Room 之前,应尽可能对其进行彻底的测试和调试,以减少开发期间的版本更新。

如果您确实更新了补丁编号,则使用 Clean Room UI 的客户可能需要刷新页面才能看到更改。使用 API 的客户应该会立即看到更改,但实际上可能会有延迟,具体取决于可用资源。了解有关 Clean Room 版本控制的更多信息。

上传提供商编写的内联函数

您可以在 provider.load_python_into_cleanroomcode 参数中通过内联的方式上传代码。以下是通过内联的方式上传简单函数的示例:

CALL samooha_by_snowflake_local_db.provider.load_python_into_cleanroom(
$cleanroom_name,
'simple_add',                         -- Name used to call the UDF from a template.
['first INTEGER', 'second INTEGER'],  -- Arguments of the UDF, specified as '<variable_name> <SQL type>' pairs.
['numpy', 'pandas'],                  -- Packages imported by the UDF.
'INTEGER',                            -- SQL return type of UDF.
'add_two',                            -- Handler function in your code called when external name is called.
$$
import numpy as np   # Not used, but you can load supported packages.
import pandas as pd

def add_two(first, second):
    return first + second
$$
);
Copy

调用方模板会调用 cleanroom.simple_add 以调用此函数。提供商示例 演示如何上传内联代码。

从暂存区上传提供商编写的函数

You can upload Python files to a clean room stage and reference the stage when you call provider.load_python_into_cleanroom. Loading code from a stage allows you to develop the code in your local system in an editor, avoid copy/paste errors when loading it inline, and also have better versioning control. Note that you can upload multiple files in a single procedure call, but only one handler function is exposed for each upload.

在您调用 load_python_into_cleanroom 时,代码会从一个暂存区加载到 Clean Room 中;后续对暂存区中代码的更改不会传播到 Clean Room。

要将 UDF 上传到暂存区,请执行以下操作:

  1. 创建 .py 文件,并将其置于可以上传到 Snowsight 暂存区的位置。

  2. 要获取 Clean Room 的暂存区名称,请调用 provider.get_stage_for_python_files($cleanroom_name)。Clean Room 可以访问此暂存区,您不能使用您创建的任意暂存区。

  3. 将 .py 文件上传到 Clean Room 的暂存区。有 几种方法可以实现此目的,包括使用 CLI、Snowsight 或特定语言的驱动程序。

  4. 使用暂存区位置、处理程序、外部名称、实参和返回类型调用 provider.load_python_into_cleanroom。现在,Clean Room 中的模板可以调用此函数。

以下示例代码显示如何从暂存区将代码加载到 Clean Room 中。

-- Save the following code as reverser.py:
--import numpy as np
--def main(some_string):
--  '''Return the reverse of a string plus a random number 1-10'''
--  return some_string[::-1] + str(np.random.randint(1,10))

-- Get the stage for your clean room.
CALL samooha_by_snowflake_local_db.provider.get_stage_for_python_files($cleanroom_name);

-- Save the file to the stage. Here is how to do it by using the Snowflake CLI
PUT file://~/reverser.py <STAGE_NAME> overwrite=True auto_compress=False;

-- Load the code from the stage into the clean room.
CALL samooha_by_snowflake_local_db.provider.load_python_into_cleanroom(
    $cleanroom_name,
    'reverse', -- Name used to call the function
    ['some_string  STRING'], -- Arguments and SQL types
    ['numpy'],               -- Any required packages
    ['/reverser.py'],        -- Relative path to file on stage
    'STRING',                -- Return type
    'reverser.main'          -- <FILE_NAME>.<FUNCTION_NAME>
);

-- Uploading code, even from a stage, increases the patch number.
CALL samooha_by_snowflake_local_db.provider.set_default_release_directive(
  $cleanroom_name, 'V1_0', <NEW_PATCH_NUMBER>);

-- Upload a template that calls the function.
CALL samooha_by_snowflake_local_db.provider.add_custom_sql_template(
    $cleanroom_name,
    $udf_template_name,
    $$
    SELECT
      p.status,
      cleanroom.reverse(p.status)
    FROM SAMOOHA_SAMPLE_DATABASE.DEMO.CUSTOMERS AS p
    LIMIT 100;
    $$
);

-- Switch to the consumer account and run the template to see the results.
Copy

提供商示例 演示了如何从一个暂存区上传代码。

Troubleshooting syntax errors or scan failures in uploaded code

If you upload a function that fails because of a syntax error, or if a security scan fails, an unpublishable patch can be generated. Therefore, you should thoroughly test your code before upload to ensure that it has no syntax errors.

You can see the list of packages, and their review status, by running the following SQL command, providing the clean room ID in the place indicated:

SHOW VERSIONS IN APPLICATION PACKAGE samooha_cleanroom_cleanroom_id;

Security scans

A security scan is run after any action that generates a new patch version in an external clean room, such as when the provider uploads Python into the clean room. (Consumer-submitted code, described on this page, does not trigger a security scan.) Internal clean rooms do not run security scans, but if you change an internal clean room to an external clean room, it will trigger a security scan for that patch. A clean roomm patch cannot be published externally until the patch has been scanned.

Snowflake Data Clean Rooms uses the Snowflake Native App security scan framework. Follow the native app security best practices to avoid security scan errors.

You can perform additional patch-creating actions before the last security scan is complete. However, you must wait for provider.view_cleanroom_scan_status to show success before you can update the default release directive in order to serve the latest version of the clean room.

Uploading multiple Python functions in a single patch

If you want to upload multiple template-callable Python packages to your clean room, you can call prepare_python_for_cleanroom multiple times, then call load_prepared_python_into_cleanroom once to scan, upload, and generate a single patch for your clean room. The following example demonstrates uploading a UDF and a UDTF using bulk uploading:

---- Add custom inline UDF ----
CALL samooha_by_snowflake_local_db.provider.prepare_python_for_cleanroom(
    $cleanroom_name,
    'get_next_status',  -- Name of the UDF. Can be different from the handler.
    ['status VARCHAR'], -- Arguments of the UDF, specified as (variable name, SQL type).
    ['numpy'],          -- Packages needed by UDF.
    [],                 -- When providing the code inline, this is an empty array.
    'VARCHAR',          -- Return type of UDF.
    'get_next_status',  -- Handler.
    $$
import numpy as np
def get_next_status(status):
  """Return the next higher status, or a random status
  if no matching status found or at the top of the list."""

  statuses = ['MEMBER', 'SILVER', 'GOLD', 'PLATINUM', 'DIAMOND']
  try:
    return statuses[statuses.index(status.upper()) + 1]
  except:
    return 'NO MATCH'
    $$
);

---- Add custom inline UDTF. ----
CALL samooha_by_snowflake_local_db.provider.prepare_python_for_cleanroom(
    $cleanroom_name,
    'get_info',  -- Name of the UDTF. Can be different from the handler.
    ['hashed_email VARCHAR', 'days_active INT', 'status VARCHAR', 'income VARCHAR'],   -- Name/Type arguments of the UDTF.
    ['numpy'],         -- Packages used by UDTF.
    [],                -- When providing the code inline, this is an empty array.
    'TABLE(hashed_email VARCHAR, months_active INT, level VARCHAR)',  -- Return type of UDTF.
    'GetSomeVals',     -- Handler class name.
$$
class GetSomeVals:
  def __init__(self):
    self.month_days = 30

  def process(self, hashed_email, days_active, status, income):
    '''Change days into rough months, and also return whether we
    think the user's membership status is lower, higher, or equal to
    what is expected, based on their income.'''

    months_active = days_active // self.month_days
    brackets = ['0-50K', '50K-100K', '100K-250K', '250K+']
    statuses = ['MEMBER', 'SILVER', 'GOLD', 'PLATINUM']
    if(statuses.index(status) < brackets.index(income)):
      level = 'low'
    elif(statuses.index(status) > brackets.index(income)):
      level = 'high'
    else:
      level = 'equal'

    yield(hashed_email, months_active, level)
$$
);

-- Upload all stored procedures.
-- Note the new patch number returned by this procedure. Keep this number for later use.
CALL samooha_by_snowflake_local_db.provider.load_prepared_python_into_cleanroom($cleanroom_name);

-- Set the release directive specified by the last load_python_into_cleanroom call.
CALL samooha_by_snowflake_local_db.provider.set_default_release_directive($cleanroom_name, 'V1_0', <PATCH_NUMBER>);
Copy

提供商编写的代码示例

以下示例演示如何将提供商编写的 UDFs 和 UDTFs 添加到 Clean Room 中。

下载以下示例,并将其作为工作表文件上传到 Snowflake 账户中。您需要为提供商和使用者开设单独的账户,每个账户都要安装 Clean Room API。按照示例文件中的说明替换信息。

使用者提交的代码

Consumer-uploaded code is bundled and uploaded with a custom template using the consumer template upload flow. The uploaded code can be called by any template in the clean room.

To upload code as a consumer, you should understand custom template syntax.

Note that any code uploaded by a consumer can be seen by the provider when they request permission to upload. The consumer code is also visible whenever a provider or consumer examines the template.

以下是上传自定义使用者代码的步骤概述:

  1. 提供商以标准方式创建 Clean Room,然后邀请使用者。

  2. 使用者以标准方式安装和配置 Clean Room。

  3. The consumer prepares a template that calls the UDF or UDTF within the cleanroom namespace. For example, to call the consumer-defined calculate_tax function, a simple template might look like the following snippet:

    SELECT {{ cleanroom.calculate_tax(p.cost) }} AS Tax FROM my_db.my_sch.sales AS p;
    
    Copy
  4. The consumer prepares their Python code. We recommend using double quotation marks (" ") rather than single quotation marks (' ') in your code to avoid extra escaping needed later. Your code can reference these supported Python libraries (https://repo.anaconda.com/pkgs/snowflake/).

  5. 使用者将其 Python 代码传递到 consumer.generate_python_request_template 中。该过程以存储过程的形式返回 Python 代码,其中包含自定义 JinjaSQL 模板的占位符。模板中有几个将 $$ 用作多行分隔符的多行字符串。

  6. The consumer replaces the template placeholder in the output from generate_python_request_template with their JinjaSQL template.

  7. In the combined template, escape any single quotes like this: \'. This is because single quotes will be used as the outermost delimiter for the entire multi-line procedure string when you upload it to the clean room. Here is an example of a stored procedure that includes the consumer Python code and custom template, with character escaping:

      BEGIN
    
      CREATE OR REPLACE FUNCTION CLEANROOM.custom_compare(min_status STRING, max_status STRING, this_status STRING)
      RETURNS boolean
      LANGUAGE PYTHON
      RUNTIME_VERSION = 3.10
      PACKAGES = (\'numpy\')
    
      HANDLER = \'custom_compare\'
      AS $$
      import numpy as np
    
      def custom_compare(min_status:str, max_status:str, this_status:str):
        statuses = [\'MEMBER\', \'SILVER\', \'GOLD\', \'PLATINUM\']
        return ((statuses.index(this_status) >= statuses.index(min_status)) &
                (statuses.index(this_status) <= statuses.index(max_status)))
      $$;
    
      -- Custom template
      LET SQL_TEXT varchar := $$
      SELECT
        c.status,
        c.hashed_email
      FROM IDENTIFIER( {{ my_table[0] }} ) as c
      WHERE cleanroom.custom_compare({{ min_status }}, {{ max_status }}, c.status);
      $$;
    
      LET RES resultset := (EXECUTE IMMEDIATE :SQL_TEXT);
      RETURN TABLE(RES);
    
      END;
    
    Copy
  8. The consumer calls consumer.create_template_request with the combined template. Use single quotation marks (' ') instead of double dollar sign delimiters ($$...$$) around the code you provide for stored procedure in the template_definition argument. For example:

    CALL samooha_by_snowflake_local_db.consumer.create_template_request(
      $cleanroom_name,
      $template_name,
      '
    BEGIN
    
    -- First, define the Python UDF.
    CREATE OR REPLACE FUNCTION CLEANROOM.custom_compare(min_status STRING, max_status STRING, this_status STRING)
    RETURNS boolean
    LANGUAGE PYTHON
    RUNTIME_VERSION = 3.10
    PACKAGES = (\'numpy\')
    
    HANDLER = \'custom_compare\'
    AS $$
    import numpy as np
    
    def custom_compare(min_status:str, max_status:str, this_status:str):
      statuses = [\'MEMBER\', \'SILVER\', \'GOLD\', \'PLATINUM\']
      return ((statuses.index(this_status) >= statuses.index(min_status)) &
              (statuses.index(this_status) <= statuses.index(max_status)))
        $$;
    
    -- Then define and execute the SQL query.
    LET SQL_TEXT varchar := $$
    SELECT
      c.status,
      c.hashed_email
    FROM IDENTIFIER( {{ my_table[0] }} ) as c
    WHERE cleanroom.custom_compare({{ min_status }}, {{ max_status }}, c.status);
    $$;
    
    -- Execute the query and then return the result.
    LET RES resultset := (EXECUTE IMMEDIATE :SQL_TEXT);
    RETURN TABLE(RES);
    
    END;
    ');
    
    Copy
  9. 使用者和提供商继续使用标准 使用者定义的模板流程

    1. The provider views the template request (provider.list_pending_template_requests) and then approves it by calling approve_template_request. In the request, the provider can see the template and the bundled code.

    2. 使用者检查请求状态 (consumer.list_template_requests),当状态为 APPROVED 时,运行模板 (consumer.run_analysis)。

使用者编写的代码示例

以下示例演示了如何向 Clean Room 添加由提供商编写的 UDFs。

下载以下示例,并将其作为工作表文件上传到 Snowflake 账户中。您需要为提供商和使用者开设单独的账户,每个账户都要安装 Clean Room API。按照示例文件中的说明替换信息:

语言: 中文