设计高性能外部函数
本主题介绍外部函数的并发性、可靠性和可扩展性,包括有关使用异步外部函数的信息。
异步与同步远程服务
远程服务可以是同步的,也可以是异步的。
- asynchronous:
在调用方等待结果期间可以轮询异步远程服务。
异步处理降低了对于超时的敏感度。
For more information about asynchronous services, see Microsoft’s description of Asynchronous Request-Reply Pattern (https://docs.microsoft.com/en-us/azure/architecture/patterns/async-request-reply) . (The information is not limited to Microsoft Azure.)
同步远程服务接收 HTTP POST 请求,然后处理请求并返回结果。根据处理数据所需时间,从收到请求到返回结果之间可能会存在长时间的延迟。
异步远程服务接收 HTTP POST 请求,并返回表明已收到请求的确认(通常几乎是即时的)。然后,调用方 (Snowflake) 执行轮询循环,在该循环中发出一个或多个 HTTP GET 请求(通常每个请求之间会有较长时间的延迟)以检查异步处理的状态。GET 不会在请求正文中发送任何数据,但包含与原始 POST 标头相同的标头。
当远程服务超出代理服务(例如 Amazon API Gateway)等组件内置的超时时,异步远程服务很有用。
远程服务不一定是纯同步或纯异步形式。在不同的时间,远程服务可同步运行,也可异步运行,具体取决于请求中的数据量、正在处理的其他请求数量等因素。
Snowflake 的外部函数实现通常兼容同步和异步这两类第三方函数库。
下图对比了同步处理与异步处理。上面的路径是同步的。下面的路径(包括一个或多个 HTTP GET 请求)是异步的。

To view examples of synchronous and asynchronous external functions, see Snowflake Sample Functions.
同步远程服务
在用户调用外部函数之前,开发者和 Snowflake 账户管理员必须配置 Snowflake 以访问代理服务。通常,这些步骤大致按下方所示顺序完成(从上图右侧开始,向左朝着 Snowflake 移动)。
-
A developer must write the remote service, and that remote service must be exposed via the HTTPS proxy service. For example, the remote service might be a Python function running on AWS Lambda and exposed via a resource in the Amazon API Gateway.
-
In Snowflake, an ACCOUNTADMIN or a role with the CREATE INTEGRATION privilege must create an “API integration” object that contains authentication information that enables Snowflake to communicate with the proxy service. The API integration is created with the SQL command CREATE API INTEGRATION.
-
A Snowflake user must execute the SQL command CREATE EXTERNAL FUNCTION. The user must use a role that has USAGE privilege on the API integration and has sufficient privileges to create functions.
Note
The CREATE EXTERNAL FUNCTION command does not actually create an external function in the sense of loading code that will be “executed outside Snowflake”. Instead, the CREATE EXTERNAL FUNCTION command creates a database object that indirectly references the code that executes outside Snowflake. More precisely, the CREATE EXTERNAL FUNCTION command creates an object that contains:
- HTTPS 代理服务中充当中继函数的资源的 URL。
- 用于对代理服务进行身份验证的 API 集成的名称。
- A name that is effectively an alias for the remote service. This alias is used in SQL commands,
for example
SELECT MyAliasForRemoteServiceXYZ(col1) ...;
Snowflake 中的别名、HTTPS 代理服务资源的名称和远程服务的名称可能彼此不同。(但为三者使用相同的名称可以简化管理。)
尽管上述步骤是执行外部函数的最常见方式,但可以做一些调整。例如:
- The remote service might not be the final step in the chain; the remote service could call yet another remote service to do part of the work.
- If the remote service doesn’t accept and return JSON-formatted data, then the HTTPS proxy service’s resource (the relay function) could convert the data from JSON format to another format (and convert the returned data back to JSON).
- Although Snowflake recommends that the remote service behave as a true function (i.e. a piece of code that accepts 0 or more input parameters and returns an output) that has no side effects and keeps no state information, this is not strictly required. The remote service could perform other tasks, for example sending alerts if a value (such as a temperature reading in the data) is dangerously high. In rare cases, the remote service might keep state information, for example the total number of alerts issued.
异步远程服务
An asynchronous remote service is useful when a remote service exceeds the timeouts built into components such as the proxy service.
异步远程服务涉及到与上述相同的组件(客户端、Snowflake、代理服务和远程服务),以及与上述相同的一般步骤。但 HTTP 请求和响应的详细信息不同。
异步行为由编写远程服务的人员(以及 Snowflake)实现。异步远程服务的 SQL 语句与同步远程服务相同。
如果您正在编写自己的远程服务,且希望使其与 Snowflake 的异步处理兼容,请按如下方式编写远程服务的行为:
- When it initially receives an HTTP POST for a specific batch of rows, the remote service returns HTTP code 202 (“Processing…”).
- If the remote service receives any HTTP GET requests after the POST but before the output is ready, the remote service returns HTTP code 202.
- After the remote service has generated all of the output rows, it waits for the next HTTP GET with the same batch ID, and then returns the rows received, along with HTTP code 200 (“Successful completion…”).
简而言之,对于收到的每个批次,远程服务都会返回 202,直到结果准备就绪,随后下一个 GET 接收结果,再返回 HTTP 200。
对于每个批次,Snowflake 都使用异步远程服务,如下所示:
-
Snowflake 发送 HTTP POST,其中包含要处理的数据以及唯一批次 ID。
-
如果 Snowflake 收到 HTTP 202 响应,则 Snowflake 会进入循环,直到满足以下条件之一:
- Snowflake 接收到数据和 HTTP 200。
- 已达到 Snowflake 的内部超时时间。
- Snowflake 收到错误(例如 HTTP 响应代码 5XX)。
在循环的每次迭代中,Snowflake 都会延迟,然后发出 HTTP GET,其中包含的批次 ID 与响应方 HTTP POST 的批次 ID 相同,这样远程服务就可以返回正确批次的信息。
循环内部的延迟最初很短,但每次收到 HTTP 202 响应时,延迟就会延长,直至达到 Snowflake 的超时时间。
- 如果在返回 HTTP 200 之前就达到了 Snowflake 的超时时间,则 Snowflake 会中止 SQL 查询。
目前,Snowflake 的超时时间为 10 分钟(600 秒),用户不能对此进行配置。这个超时时间在未来可能会调整。
Note
The frequency with which queries hit timeouts depends in part upon the scalability of the remote service. If your remote service times out frequently, then see also the discussion of 可扩展性.
可扩展性
远程服务、代理服务以及 Snowflake 和远程服务之间的其他任何步骤都必须能处理发送给它们的峰值工作负载。
一些云平台提供商对代理服务和远程服务设有默认使用限制或其他配额,这可能会限制外部函数调用的吞吐量。
Larger Snowflake warehouse sizes can increase the concurrency with which requests are sent, which might exceed the proxy service’s quota.
Users can see how many times Snowflake had to retry sending request batches (due to throttling or other errors) for a query by looking at the value for Retries due to transient errors on the query profile.
远程服务的可扩展性
编写远程服务的开发者应该考虑:
- 远程服务的调用频率。
- 每次调用发送的行数。
- 处理每行所需的资源。
- 调用的时间分布(峰值与平均值)。
随着调用方从少数开发者和测试者转变为整个组织,容量可能需要随着时间的推移而增加。如果多个组织使用远程服务,则容量可能需要随着组织数量的增加而增加。此外,随着组织数量和多样性的增加,工作负载的大小和时间可能变得更加难以预测。
远程服务提供商负责提供足够的容量,以处理峰值工作负载。可以使用不同的技术来扩展服务。如果远程服务由其作者管理,则作者可能需要显式为该服务预配足够的容量,以处理峰值。或者,作者可能会决定使用托管的自动扩展/弹性服务,例如 AWS Lambda。
远程服务在超载时应返回 HTTP 响应代码 429。如果 Snowflake 看到 HTTP 429,则会降低其发送行的速率,并重试发送未成功处理的行批次。
For more information about troubleshooting scalability issues, see 排查可扩展性和性能问题.
If remote service invocations time out because each individual invocation takes a long time, rather than because the system is generally overloaded, then see the description of how to build an 异步远程服务.
代理服务的可扩展性
代理服务也应该具备可扩展性。幸运的是,主要云提供商提供的代理服务通常都具备可扩展型。
但部分代理服务存在默认的使用限制,包括 Amazon API Gateway 和 Azure API 管理在内。在请求速率超过该限制时,这些代理服务会对请求进行限制。如有必要,可以请 AWS 或 Azure 调高代理服务的配额。
开发或管理外部函数的用户应记住以下平台特定信息:
- Amazon API Gateway:
The Amazon API Gateway is itself a managed AWS service, which auto-scales to users’ workloads. Users should be familiar with various limits of API Gateway (https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html) .
可配置 Amazon API Gateway 以帮助扩展远程服务。具体而言,可以将 API Gateway 配置为启用请求的缓存和/或限制,以按需减少远程服务的负载:
- Enable caching (https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-caching.html)
- Enable throttling (https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-request-throttling.html)
由于限制会影响超时和重试,用户可能还需要查看有关 Snowflake 如何处理超时和重试的信息:
- Azure API Management Service:
For Azure API Management, the limits depend on the SKU chosen for the service. The limits are documented in the API Management limits section of the Azure Subscription Service Limits (https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/azure-subscription-service-limits#api-management-limits) .
由于限制会影响超时和重试,用户可能还需要查看有关 Snowflake 如何处理超时和重试的信息:
排查可扩展性和性能问题
- Use the QUERY_HISTORY , QUERY_HISTORY_BY_* function to observe performance characteristics and help debug performance issues.
- Use the Query History page page to see average latency per request.
- Use the Query History page page to see how many times requests were retried due to transient errors, including those listed in the section titled Do not assume that the remote service is passed each row exactly once.
- Monitor your remote service resource usage to see how it scales to the load, and ensure that the remote service has enough capacity to serve peak load.
- 登录 Amazon API Gateway 或远程服务,获取每个请求的详细信息。
- Control the concurrency with which Snowflake sends requests to their remote service. For more details, see concurrency.
- Return HTTP Response Code 429 from the remote service when it is overloaded. Return this as early as possible, rather than wait for latency to increase.
- Take into account the proxy service timeout. For example, as of July 2020, the timeout for Amazon API Gateway is 30 seconds. Timeouts can be caused by various factors, including overloading of the remote service.
Snowflake 尝试在合理的时间内重试瞬态错误/超时,但是如果服务持续过载且重试不成功,查询最终会中止。
并发性
资源需求取决于行在多个调用之间的分布方式(这两种分布方式截然不同:许多并行调用但每个调用只有几行,或是总行数不变,但仅使用一个调用)。支持大容量的系统不一定支持高并发性,反之亦然。应该对所需峰值并发性及最大合理的单个工作负载进行估测,并提供足够的资源来处理这两种类型的峰值。
此外,并发估测应考虑到 Snowflake 可以并行处理外部函数调用。单个用户发出的单个查询可能会导致对远程服务的多次并行调用。有几个因素会影响从 Snowflake 到代理服务或远程服务的并发调用次数,其中包括:
- 使用外部函数运行查询的并发用户的数量。
- 各用户的查询大小。
- The amount of compute resources in the virtual warehouse (i.e. the warehouse size).
- The number of warehouses.
Handling concurrency properly can be particularly complex if external functions have side effects. The results can vary depending upon the order in which user’s rows are processed. (Snowflake recommends that you avoid writing or using remote services that have side effects.)
可靠性
根据远程服务的运行位置,您可能需要考虑以下因素:
- 可靠性。
- 错误处理。
- 调试。
- 升级(如果远程服务有可能添加新功能或需要修复错误)。
如果远程服务并非无状态,则可能还需要考虑在失败后恢复。(Snowflake 强烈建议将远程服务设置为无状态服务。)
For information about timeouts and retries, see Account for timeout errors and Do not assume that the remote service is passed each row exactly once.