自动刷新 Apache Iceberg™ 表

为新的或现有的外部管理 Apache Iceberg™ 表 配置自动元数据刷新。通过自动刷新,Snowflake 以连续和无服务器形式轮询您的外部 Iceberg 目录,以将元数据与最新的远程变更同步。

Iceberg 表的自动刷新与目录表或外部表的自动刷新工作方式不同,因为它不依赖云服务提供商的通知。相反,您可以按照以下步骤配置该功能:

  1. 设置目录集成的刷新间隔。Snowflake 支持以下外部 Iceberg 目录选项的自动刷新:

    • 符合 Apache Iceberg REST OpenAPI 规范的 REST 目录

    • Snowflake Open Catalog

    • 对象存储(仅限 Delta Lake)

    • AWS Glue

  2. 创建一个或多个使用该目录集成的 Iceberg 表

  3. 通过 AUTO_REFRESH 参数控制每个表的自动刷新

这种方法允许您通过目录集成集中管理刷新设置,同时根据需要对各个表进行单独控制。

设置目录集成的刷新间隔

运行 CREATE CATALOG INTEGRATION 命令时,可以为 REFRESH_INTERVAL_SECONDS 参数指定值。否则,默认刷新间隔为 30 秒。Snowflake 只在目录集成定义了 Iceberg 表的情况下轮询外部目录。

以下示例为 AWS Glue 创建目录集成,并指定 60 秒的刷新间隔:

CREATE CATALOG INTEGRATION auto_refresh_catalog_integration
  CATALOG_SOURCE = GLUE
  CATALOG_NAMESPACE = 'my_catalog_namespace'
  TABLE_FORMAT = ICEBERG
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789123:role/my-catalog-role'
  GLUE_CATALOG_ID = '123456789123'
  ENABLED = TRUE
  REFRESH_INTERVAL_SECONDS = 60;
Copy

要更新目录集成的刷新间隔,请使用 ALTER CATALOG INTEGRATION 命令。

例如:

ALTER CATALOG INTEGRATION auto_refresh_catalog_integration SET REFRESH_INTERVAL_SECONDS = 120;
Copy

创建一个自动刷新的 Iceberg 表

使用 CREATE ICEBERG TABLE 命令创建 Iceberg 表。要指定表应使用自动元数据刷新,请设置 AUTO_REFRESH = TRUE

下面的示例创建一个使用 AWS Glue 作为目录的 Iceberg 表,指定前面创建的目录集成 (auto_refresh_catalog_integration) 和来自 AWS Glue 的 CATALOG_TABLE_NAME

CREATE OR REPLACE ICEBERG TABLE auto_refresh_iceberg_table
  CATALOG_TABLE_NAME = 'myGlueTable'
  CATALOG = 'auto_refresh_catalog_integration'
  AUTO_REFRESH = TRUE;
Copy

启用或关闭自动刷新

备注

  • 如果表使用 Snowflake 8.22 版之前创建的目录集成,则必须先使用 ALTER CATALOG INTEGRATION 命令设置 REFRESH_INTERVAL_SECONDS 参数,然后再对表启用自动刷新。

  • 频繁地打开和关闭 Iceberg 表的自动刷新会降低表的元数据刷新速度。

使用 ALTER ICEBERG TABLE 命令启用或关闭现有 Iceberg 表的自动刷新。

例如:

ALTER ICEBERG TABLE my_iceberg_table SET AUTO_REFRESH = FALSE;
Copy

监控自动刷新状态

SHOW ICEBERG TABLES

要获取多个表的自动刷新状态,请使用 SHOW ICEBERG TABLES 命令。

SHOW ICEBERG TABLES;
Copy

命令输出包括一个名为 auto_refresh_status 的列,该列显示的信息与您具有访问权限的每个表的 SYSTEM$AUTO_REFRESH_STATUS 函数相同。

SYSTEM$AUTO_REFRESH_STATUS

要检索特定表的自动刷新状态,请调用 SYSTEM$AUTO_REFRESH_STATUS 函数。

SELECT SYSTEM$AUTO_REFRESH_STATUS('my_iceberg_table');
Copy

该函数返回 Snowflake 用于自动刷新表的管道的详细信息,例如快照队列的执行状态和大小。RUNNING 的执行状态指示自动刷新正在按预期运行。有关更多信息,请参阅 SYSTEM$AUTO_REFRESH_STATUS

ICEBERG_TABLE_SNAPSHOT_REFRESH_HISTORY

要检索有关特定表的最新刷新历史记录的元数据和快照信息,请使用 ICEBERG_TABLE_SNAPSHOT_REFRESH_HISTORY 函数。

SELECT *
FROM TABLE(INFORMATION_SCHEMA.ICEBERG_TABLE_SNAPSHOT_REFRESH_HISTORY(
  TABLE_NAME => 'my_iceberg_table'
));
Copy

Monitor automated refresh events

You can configure Snowflake to record an event that provides information about the status of automated refresh for an Iceberg table. Snowflake records the event in the event table for your account. You can't set it at the database or schema level. When Snowflake automatically refreshes an Iceberg table in your account, Snowflake records an event to your event table.

Monitoring automated refresh events can help you gain insight into the following areas:

  • Automated refresh progress: Track how snapshots move through the automated refresh process.

  • Aggregated statistics: Review summarized statistics for automated refresh operations.

You can also configure alerts for the following critical conditions:

  • Refresh errors

  • High refresh latencies

备注

Logging events for automated refresh incurs costs. For more information, see 遥测数据收集成本.

Snowflake records an event when automated refresh starts, completes, or results in error.

Set the severity level to capture events

To capture automated refresh events, you must set the LOG_LEVEL parameter at the Iceberg table level or account level. The LOG_LEVEL determines which events to capture based on the following values:

  • ERROR: Events that signal a change requiring human intervention to resolve.

  • WARN: Events that signal an issue that can be resolved without human intervention.

  • DEBUG: High-volume events.

备注

There is no default severity level. To capture events, you must set the severity level at either the account level or Iceberg table level.

For example, to capture DEBUG-level automated refresh events for a specific Iceberg table, use the following command:

ALTER ICEBERG TABLE <my_table_name> SET LOG_LEVEL = DEBUG;
Copy

有关更多信息,请参阅 为日志、指标和跟踪设置级别

Query your event table for automated refresh events

Before you can query for automated refresh events, you must set up an event table and set the severity level for event capture.

The following example shows how to retrieve Iceberg automated refresh events that are generated during snapshot processing:

SELECT record_type,
       record:"name" event_name,
       record:"severity_text" log_level,
       resource_attributes:"snow.database.name" database_name,
       resource_attributes:"snow.schema.name" schema_name,
       resource_attributes:"snow.table.name" table_name,
       resource_attributes:"snow.catalog.integration.name" catalog_integration_name,
       record_attributes:"snow.snapshot.id" snapshot_id,
       parse_json(value):metadata_file_location metadata_file_location,
       parse_json(value):snapshot_state snapshot_state
       parse_json(value):created created
       parse_json(value):detected detected
  FROM my_active_event_table
  WHERE record_type='EVENT' AND event_name='iceberg_auto_refresh_snapshot_lifecycle';
Copy

输出:

+-------------+-----------------------------------------+-----------+---------------+-------------+------------+--------------------------+---------------+------------------------+----------------+--------------------------+--------------------------+
| RECORD_TYPE | EVENT_NAME                              | LOG_LEVEL | DATABASE_NAME | SCHEMA_NAME | TABLE_NAME | CATALOG_INTEGRATION_NAME | SNAPSHOT_ID   | METADATA_FILE_LOCATION | SNAPSHOT_STATE | CREATED                  | DETECTED                 |
+-------------+-----------------------------------------+-----------+---------------+-------------+------------+--------------------------+---------------+------------------------+----------------+--------------------------+--------------------------+
| EVENT       | iceberg_auto_refresh_snapshot_lifecycle | DEBUG     | TESTDB        | TESTSH      | TESTTABLE  | glue_integration         | 4281775564368 | metadata.json          | detected       | 2025-05-21T20:38:06.822Z | 2025-10-17T21:01:49.695Z |
| EVENT       | iceberg_auto_refresh_snapshot_lifecycle | DEBUG     | TESTDB        | TESTSH      | TESTTABLE  | glue_integration         | 4281775564368 | metadata.json          | started        | 2025-05-21T20:38:06.822Z | 2025-10-17T21:01:49.695Z |
| EVENT       | iceberg_auto_refresh_snapshot_lifecycle | DEBUG     | TESTDB        | TESTSH      | TESTTABLE  | glue_integration         | 4281775564368 | metadata.json          | completed      | 2025-05-21T20:38:06.822Z | 2025-10-17T21:01:49.695Z |
+-------------+-----------------------------------------+-----------+---------------+-------------+------------+--------------------------+---------------+------------------------+----------------+--------------------------+--------------------------+

Query your event table for stale automated refresh events

You can query your event table for tables whose last successful refresh is older than a threshold you define.

  1. The following example defines a threshold of 20 minutes:

    SET STALENESS_THRESHOLD_MINUTES = 20;
    
    Copy
  2. Query for tables whose last successful refresh is older than the threshold you defined:

    WITH last_successful_refresh AS (
      -- Find the most recent 'completed' event for each table
      SELECT
        resource_attributes:"snow.table.name"::STRING AS table_name,
        MAX(timestamp) AS last_success_timestamp
      FROM
        <my_active_event_table>
      WHERE
        record:"name" = 'iceberg_auto_refresh_snapshot_lifecycle'
        AND parse_json(value):snapshot_state::STRING = 'completed'
      GROUP BY
        table_name
    )
    
    -- Select tables whose the last successful refresh was longer ago than our threshold
    SELECT
      table_name,
      last_success_timestamp
    FROM
      last_successful_refresh
    WHERE
      last_success_timestamp < DATEADD(minute, -$STALENESS_THRESHOLD_MINUTES, CURRENT_TIMESTAMP())
    ORDER BY
      last_success_timestamp ASC;
    
    Copy
  • Where my_active_event_table is your active event table.

输出:

+------------+-------------------------+
| TABLE_NAME | LAST_SUCCESS_TIMESTAMP  |
+------------+-------------------------+
| my_table   | 2025-10-10 07:24:30.854 |
+------------+-------------------------+

错误恢复

当自动刷新过程中发生错误时,Snowflake会将执行状态更新为以下值之一:

  • STALLED 表示Snowflake正在尝试从错误中恢复。如果恢复成功,自动刷新进程将继续按预期运行,执行状态将转换回正常的 RUNNING 状态。

  • STOPPED 表示自动刷新进程遇到了不可恢复的错误,并且表的自动刷新已停止。

    例如,当 Snowflake 无法在目标快照和当前快照之间建立直接沿袭时,可能会发生不可恢复的错误。

    要从 STOPPED 状态恢复,请执行以下操作:

    1. 关闭对表的自动刷新

    2. 执行手动元数据刷新。有关说明,请参阅 刷新表元数据

    3. 使用 ALTER ICEBERGTABLE 重新启用自动刷新...SET AUTO_REFRESH 语句重命名数据库角色。

    4. 通过调用 RUNNING 函数验证自动刷新是否处于 SYSTEM$AUTO_REFRESH_STATUS 状态。您还可以多次调用该函数,以确认排队快照的数量 (pendingSnapshotCount) 逐渐减少。

计费

Snowflake uses Snowpipe to automate refreshes for Iceberg tables, so charges for automated refresh appear in the same line item on your bill as Snowpipe charges. Using events to monitor automated refresh also incurs cost. For more information, see 遥测数据收集成本.

There are no Snowpipe file charges for this feature.

您可以通过检查 Account Usage PIPE_USAGE_HISTORY 视图 来估算产生的费用,pipe_name 列中显示了 Iceberg 表名。

对于基于 Delta 的 Iceberg 表,自动刷新管道会显示 NULL 管道名称。

有关 Iceberg 表收费的更多信息,请参阅 Iceberg 表计费

注意事项和限制

当您使用利用自动刷新的 Iceberg 表时,请考虑以下事项:

  • 对于在 Snowflake 8.22 版之前创建的目录集成(基于 Delta 表的目录集成对应版本为 9.2 之前),必须先手动设置 REFRESH_INTERVAL_SECONDS 参数,然后才能对依赖于该目录集成的表启用自动刷新。有关说明,请参阅 ALTER CATALOG INTEGRATION ... SET AUTO_REFRESH

  • 对于 对象存储的目录集成,只有与 TABLE_FORMAT = DELTA 的集成才支持自动刷新。

  • 对于频繁更新的表,使用较短的轮询间隔 (REFRESH_INTERVAL_SECONDS) 可能会导致性能下降。

语言: 中文