Manage Apache Iceberg™ tables in Snowflake

You can also convert an Iceberg table that uses an external catalog into a table that uses Snowflake as the Iceberg catalog. To learn more, see Convert an Apache Iceberg™ table to use Snowflake as the catalog.

Query a table

To query an Iceberg table, a user must be granted or inherit the following privileges:

  • The USAGE privilege on the database and schema that contain the table

  • The SELECT privilege on the table
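As a sketch, these privileges might be granted like the following (my_db, my_schema, my_iceberg_table, and analyst_role are placeholder names; substitute your own objects and role):

```sql
-- Placeholder names; grant on the actual database, schema, table, and role.
GRANT USAGE ON DATABASE my_db TO ROLE analyst_role;
GRANT USAGE ON SCHEMA my_db.my_schema TO ROLE analyst_role;
GRANT SELECT ON TABLE my_db.my_schema.my_iceberg_table TO ROLE analyst_role;
```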

You can query an Iceberg table by using a SELECT statement. For example:

SELECT col1, col2 FROM my_iceberg_table;

Use DML commands

Iceberg tables that use Snowflake as the catalog support the full set of Data Manipulation Language (DML) commands, including INSERT, UPDATE, DELETE, and MERGE.

Snowflake-managed tables also support efficient bulk loading with features such as COPY INTO <table> and Snowpipe. For more information, see Load data into Apache Iceberg™ tables.

Note

Snowflake also supports writing to externally managed Iceberg tables (preview). For more information, see Write support for externally managed Apache Iceberg™ tables.

Example: Update a table

You can use INSERT and UPDATE statements to modify a Snowflake-managed Iceberg table.

The following example inserts a new value into an Iceberg table named store_sales, and then updates the cola column, setting its value to 1 wherever the current value is -99.

INSERT INTO store_sales VALUES (-99);

UPDATE store_sales
  SET cola = 1
  WHERE cola = -99;

Generate snapshots of DML changes

For tables that use Snowflake as the catalog, Snowflake automatically generates Iceberg metadata and writes it to a folder named metadata on the external volume. To locate the metadata folder, see Data and metadata directories.

Alternatively, you can call the SYSTEM$GET_ICEBERG_TABLE_INFORMATION function to generate Iceberg metadata for new changes.

For tables that aren't managed by Snowflake, the function returns information about the most recently refreshed snapshot.

For example:

SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('db1.schema1.it1');

Output:

+-----------------------------------------------------------------------------------------------------------+
| SYSTEM$GET_ICEBERG_TABLE_INFORMATION('DB1.SCHEMA1.IT1')                                                   |
|-----------------------------------------------------------------------------------------------------------|
| {"metadataLocation":"s3://mybucket/metadata/v1.metadata.json","status":"success"}                         |
+-----------------------------------------------------------------------------------------------------------+

Use row-level deletes

Note

Supported only for externally managed Iceberg tables.

Snowflake supports querying tables that contain row-level deletes and writing to tables by using row-level deletes.

Query tables

Snowflake supports querying externally managed Iceberg tables when you have configured row-level deletes (https://iceberg.apache.org/spec/#row-level-deletes) for update, delete, and merge operations.

To configure row-level deletes, see Write properties (https://iceberg.apache.org/docs/latest/configuration/#write-properties) in the Apache Iceberg documentation.

Write to tables

Snowflake supports position row-level deletes for writing to externally managed Iceberg tables stored on Amazon S3, Azure, or Google Cloud. To turn off position deletes and run DML operations in copy-on-write mode instead, set the ENABLE_ICEBERG_MERGE_ON_READ parameter to FALSE at the table, schema, or database level.
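For example, a minimal sketch that turns off position deletes for a single table (the table name is a placeholder):

```sql
-- Run subsequent DML operations on this table in copy-on-write mode.
ALTER ICEBERG TABLE my_external_iceberg_table
  SET ENABLE_ICEBERG_MERGE_ON_READ = FALSE;
```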

Copy-on-write versus merge-on-read

Iceberg provides two modes that configure how compute engines handle row-level operations on externally managed tables. Snowflake supports both modes.

The following table describes when you might want to use each mode:

Mode                      Description
------------------------  ------------------------------------------------------
Copy-on-write (default)   This mode prioritizes read time at the expense of
                          write speed. When an update, delete, or merge
                          operation runs, the compute engine rewrites the
                          entire affected Parquet data file. This can slow
                          writes (especially for large data files), but it
                          doesn't affect read time. This is the default mode.

Merge-on-read             This mode prioritizes write speed and slightly
                          affects read time. When an update, delete, or merge
                          operation runs, the compute engine creates a delete
                          file that contains only information about the changed
                          rows. When the table is read, the query engine merges
                          the delete files with the data files, which can
                          increase read time. However, you can optimize read
                          performance by scheduling regular compaction and
                          table maintenance.

To learn more about Iceberg row-level changes, see Row-level deletes (https://iceberg.apache.org/spec/#row-level-deletes) in the Apache Iceberg documentation.

Considerations and limitations

Consider the following when you use row-level deletes with externally managed Iceberg tables:

  • Snowflake supports only position deletes (https://iceberg.apache.org/spec/#position-delete-files).

  • For the best read performance when you use row-level deletes, perform regular compaction and table maintenance to remove old delete files. For information, see Maintain tables that use an external catalog.

  • An excessive number of position deletes (particularly dangling position deletes) can impede table creation and refresh operations. To avoid this issue, perform table maintenance to remove the extra position deletes.

    The table maintenance method that you use depends on your external Iceberg engine. For example, you can use the Spark rewrite_data_files procedure with the delete-file-threshold or rewrite-all option. For more information, see rewrite_data_files (https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_data_files) in the Apache Iceberg™ documentation.
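As a sketch, such a Spark SQL call might look like the following (the catalog and table names are placeholders; check the Iceberg documentation for the options your engine version supports):

```sql
-- Spark SQL: rewrite data files that reference at least one delete file,
-- which removes the associated position deletes.
CALL my_catalog.system.rewrite_data_files(
  table => 'db.sample',
  options => map('delete-file-threshold', '1')
);
```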

Set a target file size

To improve query performance for external Iceberg engines such as Spark or Trino, you can configure a target file size for both Snowflake-managed and externally managed Iceberg tables with write support. You can either set a specific size (16MB, 32MB, 64MB, or 128MB) or use the AUTO option. AUTO works differently depending on the table type:

  • Snowflake-managed tables: AUTO specifies that Snowflake should choose the file size for the table based on table characteristics such as size, DML patterns, ingestion workload, and clustering configuration. Snowflake automatically adjusts the file size, starting at 16 MB, for better read and write performance in Snowflake.

  • Externally managed tables: AUTO specifies that Snowflake should aggressively scale to a larger file size.

You can set the target file size when you create an Iceberg table, or run the ALTER ICEBERG TABLE command to change the target file size for an existing Iceberg table. Snowflake attempts to maintain file sizes close to the target size when writing Parquet files for a table.

After you set a target file size, Snowflake immediately starts to create larger files for new Data Manipulation Language (DML) operations. Snowflake's table maintenance operations asynchronously change the existing table files according to the target file size.

The following example uses TARGET_FILE_SIZE to set a target file size of 128 MB for a Snowflake-managed table:

CREATE ICEBERG TABLE my_iceberg_table (col1 INT)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_external_volume'
  BASE_LOCATION = 'my_iceberg_table'
  TARGET_FILE_SIZE = '128MB';

Alternatively, use ALTER ICEBERG TABLE to set the TARGET_FILE_SIZE property for an existing table:

ALTER ICEBERG TABLE my_iceberg_table
  SET TARGET_FILE_SIZE = '32MB';

To check the value of the TARGET_FILE_SIZE property for a table, use the SHOW PARAMETERS command:

SHOW PARAMETERS LIKE 'target_file_size' FOR my_iceberg_table;

Table optimization for Snowflake-managed Iceberg tables

Table optimization automatically performs maintenance to improve the performance and reduce the storage costs of your Snowflake-managed Iceberg tables.

Note

To improve query performance, you can also set a target file size. For more information, see Set a target file size.

Snowflake supports the Iceberg table optimization features summarized in the following table:

Feature                   Notes
------------------------  -----------------------------------------------------
Automatic Clustering [1]  Billed. Disabled by default.

Data compaction           Billed. Enabled by default.

Manifest compaction       No cost. Enabled automatically; you can't disable it.

Orphan file deletion      No cost. Enabled automatically; you can't disable it.

Snapshot expiry           No cost. Enabled automatically; you can't disable it.

[1] Unlike the other table optimization features, Automatic Clustering is billed separately as a standalone feature.

Automatic Clustering

Automatic Clustering reorganizes data within files or partitions based on frequently queried columns. The file size for Iceberg tables is based on your clustering configuration, unless you set a target file size. If you do, the file size is the specific size you set. For more information, see Set a target file size.

To set Automatic Clustering, specify the CLUSTER BY parameter when you create a Snowflake-managed Iceberg table or modify an existing table. For more information about Automatic Clustering, see Automatic Clustering.
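For example, a minimal sketch that sets a clustering key on a new Snowflake-managed table (the object names and clustering column are placeholders):

```sql
CREATE ICEBERG TABLE my_clustered_iceberg_table (col1 INT, col2 DATE)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_external_volume'
  BASE_LOCATION = 'my_clustered_iceberg_table'
  CLUSTER BY (col2);
```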

Data compaction

Data compaction combines small files into larger, more efficient files to manage storage, maintain an optimal file size, and improve query performance.

In most cases, data compaction doesn't have a significant effect on compute costs, but if these costs are a concern, you can disable compaction. For example, you might want to disable compaction on a table if you rarely query it. To disable or enable data compaction, see Set data compaction.

Note

  • To query data compaction jobs for Iceberg tables, see the ICEBERG_STORAGE_OPTIMIZATION_HISTORY view. This view includes the number of credits that are billed for data compaction.

  • If you have Automatic Clustering enabled, clustering performs data compaction on the table. This is true regardless of whether data compaction is enabled or disabled on the table.

  • You also have the option to set a target file size. For more information, see Set a target file size.

Manifest compaction

Manifest compaction optimizes the metadata layer by reorganizing and combining smaller manifest files. This compaction reduces metadata overhead and improves query performance.

This feature is enabled automatically and you can't disable it.

Orphan file deletion

Orphan file deletion systematically identifies and removes data and metadata files that exist in the underlying storage but are no longer referenced by any valid table snapshot.

This feature is enabled automatically and you can't disable it.

Snapshot expiry

Snapshot expiry systematically deletes old snapshots and their unique data and metadata files from the table's history. This deletion is based on predefined retention policies.

This feature is enabled automatically and you can't disable it.

Maintain tables that use an external catalog

You can perform maintenance operations on Iceberg tables that use an external catalog.

Maintenance operations include the following:

  • Expiring snapshots

  • Removing old metadata files

  • Compacting data files
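As a sketch, the first and last of these operations might look like the following in Spark SQL (the catalog name, table name, and cutoff timestamp are placeholders):

```sql
-- Expire snapshots older than a cutoff and drop their unreferenced files.
CALL my_catalog.system.expire_snapshots(
  table => 'db.sample',
  older_than => TIMESTAMP '2024-01-01 00:00:00.000'
);

-- Compact small data files into larger ones.
CALL my_catalog.system.rewrite_data_files(table => 'db.sample');
```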

Important

To keep your Iceberg tables in sync with external changes, be sure to align your Snowflake refresh schedule with table maintenance. Refresh the table each time you perform a maintenance operation.

For maintenance information about Iceberg tables that aren't managed by Snowflake, see Maintenance (https://iceberg.apache.org/docs/latest/maintenance/) in the Apache Iceberg documentation.

Refresh the table metadata

When you use an external Iceberg catalog, you can refresh the table metadata by using the ALTER ICEBERG TABLE ... REFRESH command. Refreshing the table metadata synchronizes it with the most recent table changes.

Note

We recommend setting up automated refreshes for supported externally managed tables.

Refresh the metadata for a table

The following example manually refreshes the metadata for a table that uses an external catalog (for example, AWS Glue or Delta). Refreshing the table synchronizes it with all of the changes that have occurred in the remote catalog.

With this type of Iceberg table, you don't need to specify a metadata file path in the command.

ALTER ICEBERG TABLE my_iceberg_table REFRESH;

To keep the table automatically updated, you can set up automated refreshes by using the ALTER ICEBERG TABLE command.

For example:

ALTER ICEBERG TABLE my_iceberg_table SET AUTO_REFRESH = TRUE;

Refresh the metadata for a table created from Iceberg files

The following example manually refreshes a table created from Iceberg metadata files in an external cloud storage location, specifying the relative path to the metadata file without a leading forward slash (/). The metadata file defines the data in the table after the refresh.

ALTER ICEBERG TABLE my_iceberg_table REFRESH 'metadata/v1.metadata.json';

Retrieve storage metrics

Snowflake doesn't bill your account for Iceberg table storage. However, you can track how much storage an Iceberg table occupies by querying the TABLE_STORAGE_METRICS and TABLES views in the Snowflake Information Schema or the Account Usage schema.

The following example query joins the ACCOUNT_USAGE.TABLE_STORAGE_METRICS view with the ACCOUNT_USAGE.TABLES view and filters on the TABLES.IS_ICEBERG column.

SELECT metrics.* FROM
  snowflake.account_usage.table_storage_metrics metrics
  INNER JOIN snowflake.account_usage.tables tables
  ON (
    metrics.id = tables.table_id
    AND metrics.table_schema_id = tables.table_schema_id
    AND metrics.table_catalog_id = tables.table_catalog_id
  )
  WHERE tables.is_iceberg='YES';

Set data compaction

You can set data compaction on Snowflake-managed Iceberg tables when you create a database, schema, or table, or run the ALTER command to change the setting for an existing database, schema, or table. You can also set data compaction at the account level by using the ALTER ACCOUNT command. For more information about data compaction, see Data compaction.

The following example uses ENABLE_DATA_COMPACTION to disable data compaction for a Snowflake-managed table:

CREATE ICEBERG TABLE my_iceberg_table (col1 INT)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_external_volume'
  BASE_LOCATION = 'my_iceberg_table'
  ENABLE_DATA_COMPACTION = FALSE;

Alternatively, use ALTER ICEBERG TABLE to disable it for an existing table.

ALTER ICEBERG TABLE my_iceberg_table
  SET ENABLE_DATA_COMPACTION = FALSE;

