远程服务输入和输出数据格式

When Snowflake sends data to a remote service, or receives data from a remote service, the data must be formatted correctly. This topic provides information about the proper data formats. Data received from and returned to Snowflake must also be of an appropriate data type.

When executing an external function, for example, Snowflake sends and expects data in the format described here. It sends the data to a proxy service, not directly to the remote service (for more, see Introduction to external functions). Therefore, the proxy service must receive (and return) data in a Snowflake-compatible format. Although typically the proxy service passes data through unchanged, the proxy can reformat data (both sending and receiving) to meet the needs of both the remote service and Snowflake.

为简单起见,并帮助说明 Snowflake 期望发送和接收的格式,本部分中的大多数示例都假定远程服务读取和写入与 Snowflake 期望的格式相同的格式数据,且代理服务在两个方向上都不变地传递数据。

Snowflake 发送的数据格式

来自 Snowflake 的每个 HTTP 请求都是一个 POST 或一个 GET。

  • A POST request contains headers and a request body. The request body includes a batch of rows.
  • A GET contains only headers, and is used only for polling when the remote service returns results asynchronously.

正文格式

POST 请求的正文包含按 JSON 格式序列化的数据。

JSON 架构是:

  • 顶层是一个 JSON 对象(一组名称对/值对,也称为“字典”)。

  • 目前,该对象只有一个项目;该项目的键名为“数据”。

  • 该“数据”项目的值是一个 JSON 数组,其中:

    • 每个元素都是一行数据。
    • 每行数据都是由一列或多列组成的 JSON 数组。
    • 第一列始终是行号(即批中的行从 0 开始的索引)。
    • 其余列包含函数的实参。
  • 数据类型按如下方式序列化:

有关在每个平台上的远程服务中提取数据的示例,请参阅:

Optionally, the JSON can be compressed for transmission over the network. Compression is documented in CREATE EXTERNAL FUNCTION.

正文示例

Here’s an example of a serialized request for an external function with the signature f(integer, varchar, timestamp). The first column is the row number within the batch, and the next three values are the arguments to the external function.

{
    "data": [
                [0, 10, "Alex", "2014-01-01 16:00:00"],
                [1, 20, "Steve", "2015-01-01 16:00:00"],
                [2, 30, "Alice", "2016-01-01 16:00:00"],
                [3, 40, "Adrian", "2017-01-01 16:00:00"]
            ]
}

标头格式

标头信息通常以一组键对/值对的形式提供给远程服务。标头信息包括:

  • 以下 HTTP 标头:
    • 描述如何在请求正文中序列化数据的标头:

      • “sf-external-function-format”:当前始终设置为“json”。
      • “sf-external-function-format-version”:当前始终设置为“1.0”。
    • “sf-external-function-current-query-id”: This contains the query ID of the query that called this external function. You can use this to correlate Snowflake queries to calls of the remote service, for example to help debug issues.

    • “sf-external-function-query-batch-id”: The batch ID uniquely identifies the specific batch of rows processed with this request. The remote service can use this ID to track the status of a batch that is being processed. The ID can also be used as an idempotency token if requests are retried due to an error. The ID can also be used for logging/tracing of requests by the remote service.

GET 中的批 ID 与对应 POST 中的批 ID 相同。

批 ID 是由 Snowflake 生成的 Opaque 值。格式在将来的版本中可能会更改,因此远程服务不应依赖特定格式或尝试解释该值。

  • Headers that describe the signature (name and argument types) and return type of the external function that was called in the SQL query. These values can have characters that are not standard characters for Snowflake identifiers, so base64 versions of the information are included, and non-standard characters are replaced with a blank in the non-base64 versions.

特定标头包括:

  • sf-external-function-name
  • sf-external-function-name-base64
  • sf-external-function-signature
  • sf-external-function-signature-base64
  • sf-external-function-return-type
  • sf-external-function-return-type-base64

For example, the headers sent for the function ext_func(n integer) returns varchar are:

  • sf-external-function-name: ext_func
  • sf-external-function-name-base64: <base64 value>
  • sf-external-function-signature:(N NUMBER)
  • sf-external-function-signature-base64: <base64 value>
  • sf-external-function-return-type:VARCHAR(134217728)
  • sf-external-function-return-type-base64: <base64 value>

Because SQL INTEGER values are treated as SQL NUMBER, the SQL argument declared as type INTEGER is described as type NUMBER.

  • Additional optional metadata described in the “headers” and “context_headers” properties of CREATE EXTERNAL FUNCTION.

标头访问示例

要从用 Python 编写的 AWS Lambda 函数(该函数将标头接收为 Python 字典)中提取“sf-external-function-signature”标头,请执行以下操作:

def handler(event, context):

    request_headers = event["headers"]
    signature = request_headers["sf-external-function-signature"]

其他语言和其他云平台的详细信息会有所不同。

For remote services developed on AWS, more information about headers and lambda proxy integration is available in the AWS API Gateway documentation (https://docs.aws.amazon.com/apigateway/latest/developerguide/set-up-lambda-proxy-integrations.html#api-gateway-simple-proxy-for-lambda-input-format) .

Snowflake 接收的数据格式

正文格式

当远程服务完成处理批时,远程服务应以类似于 Snowflake 所发送数据的 JSON 格式将数据发送回 Snowflake。

返回到 Snowflake 的 JSON 响应应包含 Snowflake 发送的每一行的一行。每个返回的行包含两个值:

  • 行号(即批中的行从 0 开始的索引)。
  • The value returned from the function for that row. The value can be a compound value (for example, an OBJECT), but it must be exactly one value because all scalar Snowflake functions (external or otherwise) return a single value.

为了使 Snowflake 能够将响应与请求相关联,返回数据中的行号必须与 Snowflake 发送的数据中的行号相对应,并且必须按照接收的相同顺序返回。

正文访问示例

下面的 JSON 示例显示了包含 OBJECT 值的两行,每行前面都有一个行号:

{
    "data":
        [
            [ 0, { "City" : "Warsaw",  "latitude" : 52.23, "longitude" :  21.01 } ],
            [ 1, { "City" : "Toronto", "latitude" : 43.65, "longitude" : -79.38 } ]
        ]
}

若要使用 Python 编写这些返回行的其中一行,您可使用以下代码:

...
row_number = 0
output_value = {}

output_value["city"] = "Warsaw"
output_value["latitude"] = 21.01
output_value["longitude"] = 52.23
row_to_return = [row_number, output_value]
...

To access the OBJECT value of a returned row with SQL, use the notation described in Traversing Semi-structured Data. For example:

select val:city, val:latitude, val:longitude
    from (select ext_func_city_lat_long(city_name) as val from table_of_city_names);

标头格式

响应还可以包含以下可选的 HTTP 标头:

  • Content-MD5: Snowflake uses the optional Content-MD5 header to check the integrity of the response. If this header is included in the response, Snowflake computes an MD5 checksum on the response body to ensure that it matches the corresponding checksum in the returned header. If the values do not match, the SQL query fails. The checksum should be encoded in a base64 representation before being returned in the header. See the example code below.

Optionally, the JSON can be compressed for transmission over the network. Compression is documented in CREATE EXTERNAL FUNCTION.

For information about timeouts and retries, see Account for timeout errors and Do not assume that the remote service is passed each row exactly once.

状态代码

响应还包含 HTTP 状态代码。Snowflake 可识别以下 HTTP 状态代码:

CodeDescription
200Batch processed successfully.
202Batch received and still being processed.

其他值将被视为错误。

响应创建示例

下面的示例 Python 代码返回正确的响应,包括 HTTP 响应代码、处理的数据和 MD5 标头(可选)。

此示例基于 AWS Lambda 函数。某些代码可能需要针对不同的平台进行自定义。

import json
import hashlib
import base64

def handler(event, context):

    # The return value should contain an array of arrays (one inner array
    # per input row for a scalar function).
    array_of_rows_to_return = [ ]

    ...

    json_compatible_string_to_return = json.dumps({"data" : array_of_rows_to_return})

    # Calculate MD5 checksum for the response
    md5digest = hashlib.md5(json_compatible_string_to_return.encode('utf-8')).digest()
    response_headers = {
        'Content-MD5' : base64.b64encode(md5digest)
    }

    # Return the HTTP status code, the processed data, and the headers
    # (including the Content-MD5 header).
    return {
        'statusCode': 200,
        'body': json_compatible_string_to_return,
        'headers': response_headers
    }