Storage for Apache Iceberg™ tables¶
Snowflake tables typically use storage that Snowflake manages. In contrast, Apache Iceberg™ tables in Snowflake use external storage that you configure and maintain.
本主题提供 Iceberg 表存储的概念信息和最佳实践。
外部卷
Note
To connect Snowflake to your external cloud storage for Iceberg tables without using an external volume, use catalog-vended credentials. This option is only available for externally managed Iceberg tables. For more information, see Use catalog-vended credentials for Apache Iceberg™ tables.
An external volume is a named, account-level Snowflake object that you use to connect Snowflake to your external cloud storage for Iceberg tables. An external volume stores an identity and access management (IAM) entity for your storage location. Snowflake uses the IAM entity to securely connect to your storage for accessing table data, Iceberg metadata, and manifest files that store the table schema, partitions, and other metadata.
A single external volume can support one or more Iceberg tables.
Each external volume is associated with a particular Active storage location, and a single external volume can support multiple Iceberg tables. However, the number of external volumes you need depends on how you want to store, organize, and secure your table data.
You can use a single external volume if you want the data and metadata for all of your Snowflake-Iceberg tables in subdirectories under the same storage location (for example, in the same S3 bucket). To configure these directories for Snowflake-managed tables, see Data and metadata directories.
Alternatively, you can create multiple external volumes to secure various storage locations differently. For example, you might create the following external volumes:
- A read-only external volume for externally managed Iceberg tables.
- An external volume configured with read and write access for Snowflake-managed tables.
授予 Snowflake 对存储的访问权限¶
云提供商存储
要授予 Snowflake 对 Iceberg 表云存储位置的访问权限,请使用云提供商的 Identity and Access Management 服务。您授予身份或主体对您存储的有限访问权限,而无需交换密钥。这与 Snowflake 用于其他集成(包括存储集成)的访问模式相同。
Snowflake provisions a principal for your entire Snowflake account when you create an external volume. The principal is as follows, depending on your cloud provider:
| Cloud provider | Snowflake-provisioned principal |
|---|---|
| Amazon Web Services (AWS) | IAM user (https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_elements_principal.html#principal-users) |
| Google Cloud | Service account (https://cloud.google.com/iam/docs/overview#service_account) |
| Azure | Service principal (https://learn.microsoft.com/en-us/entra/identity-platform/app-objects-and-service-principals?tabs=browser#service-principal-object) |
Snowflake 直接与您的存储提供商进行身份验证,Snowflake 提供的主体将承担您指定的角色。该角色必须拥有在存储位置上执行操作的权限。例如,只有当角色拥有从存储位置读取的权限时,Snowflake 才能从该存储位置读取数据。
Snowflake 需要权限才能对 Iceberg 表执行以下操作:
| Snowflake-managed tables | Tables that use an external Iceberg catalog | |
|---|---|---|
| Amazon S3 |
|
|
| Google Cloud Storage |
|
|
| Azure Storage | All allowed actions for the Storage Blob Data Contributor role (https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles/storage#storage-blob-data-contributor) | All allowed actions for the Storage Blob Data Reader role (https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles/storage#storage-blob-data-reader) |
Note
For Iceberg tables created from Delta table files (Delta Direct) with read-only requirements, grant only the read actions for your cloud provider (for example s3:GetObject without s3:PutObject on Amazon S3) and set ALLOW_WRITES to FALSE on the external volume so that permissions match Snowflake behavior. For provider-specific action lists, see Read-only vs write access for Delta Direct on each storage provider.
Note
The s3:PutObject permission grants write access to the external volume location.
To completely configure write access, you must set the ALLOW_WRITES parameter of the external volume to TRUE (the default value).
有关授予 Snowflake 对 Iceberg 表存储的访问权限的完整说明,请参阅以下主题:
- Configure an external volume for Amazon S3
- Configure an external volume for Google Cloud Storage
- Configure an external volume for Azure
兼容 S3 的存储¶
To grant Snowflake access to an S3-compatible storage location for Iceberg tables, you specify an S3-compatible storage endpoint with credentials when you create an external volume.
For instructions, see Configure an external volume for S3-compatible storage.
活动存储位置
Each external volume supports a single active storage location. If you specify multiple storage locations in a CREATE EXTERNAL VOLUME statement, Snowflake assigns one location as the active location. The active location remains the same for the lifetime of the external volume.
首次在 CREATE ICEBERG TABLE 语句中使用外部卷时,系统就会进行分配。Snowflake 使用以下逻辑选择活动位置:
- If the
STORAGE_LOCATIONSlist contains one or more local storage locations, Snowflake uses the first local storage location in the list. A local storage location is one with the same cloud provider and in the same region as your Snowflake account. - If the
STORAGE_LOCATIONSlist does not contain any local storage locations, Snowflake selects the first location in the list.
Note
在 Snowflake 版本 7.44 之前创建的外部卷可能使用不同的逻辑来选择活动位置。
验证存储访问
Note
To verify storage access by using Snowsight, see Verify an external volume by using Snowsight
To check that Snowflake can successfully authenticate to your storage provider, call the SYSTEM$VERIFY_EXTERNAL_VOLUME function.
Note
If you receive the following error, your account administrator must activate AWS STS in the Snowflake deployment region. For instructions, see Manage AWS STS in an AWS Region (https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_enable-regions.html) in the AWS documentation.
For Snowflake-managed tables, Snowflake automatically verifies access to the active storage location on your external volume in the following situations:
- 首次在 Snowflake 管理的表的 CREATE ICEBERG TABLE 语句中指定该外部卷。
- 首次将表转换为使用 Snowflake 作为 Iceberg 目录。
The ALLOW_WRITES property of the external volume must be set to TRUE.
Snowflake 会尝试以下存储操作来验证存储位置。
- 编写测试文件。
- 读取文件。
- 列出文件路径的内容。
- 删除文件。
如果任何一项操作失败,则 CREATE ICEBERG TABLE(或 ALTER ICEBERG TABLE … CONVERT TO MANAGED)语句会失败并收到错误消息。
文件管理
本部分将根据 Iceberg 表的类型说明如何管理存储中的 Iceberg 表文件。
Snowflake 管理的表¶
Important
- 不允许其他工具访问删除或覆盖与 Snowflake 管理的 Iceberg 表相关联的对象。
- Ensure that the Snowflake principal maintains access to your table storage. For more information, see 授予 Snowflake 对存储的访问权限.
虽然 您 可以配置和管理 Iceberg 表的存储位置,但 Snowflake 只对存储中属于 Snowflake 管理的表的对象(数据和元数据文件)进行操作。Snowflake 会定期对这些表对象进行维护,以优化查询性能并清理已删除的数据。
如果其他工具删除或覆盖 Snowflake 管理的表对象,查询可能会失败。同样,如果您撤销 Snowflake 主体对存储的访问权限,对表的查询和 Snowflake 的表维护操作也会失败。
当 Snowflake 删除 Snowflake 管理的表数据或删除表时,Snowflake 会在表保留期到期后删除对象。
To configure replication for Snowflake-managed Iceberg tables, see Configure replication for Snowflake-managed Iceberg tables.
数据和元数据目录
This section describes the data and metadata directories for Snowflake-managed tables.
These directories can either be organized in a flat or hierarchical layout:
Note
To find the data and metadata directories for any Iceberg table, you can use the SHOW ICEBERG TABLES command.
The command output includes a base_location property that indicates the location of each table’s data and metadata files.
Flat layout¶
This section describes the flat layout in Snowflake for data and metadata directories for Snowflake-managed tables.
When you create a Snowflake-managed table that uses the default flat directory layout (PATH_LAYOUT = FLAT), Snowflake writes all Parquet
data files under a single data/ directory and all table metadata data files under a single metadata/ directory. Snowflake also writes
metadata for Delta-based tables.
Snowflake constructs paths using the following patterns, depending on the values specified
for BASE_LOCATION or
the BASE_LOCATION_PREFIX parameter.
If you specify a BASE_LOCATION, Snowflake does not use the BASE_LOCATION_PREFIX in the path.
其中:
STORAGE_BASE_URLis the base URL for the active storage location associated with your external volume.BASE_LOCATIONis the path for a directory where Snowflake should write the table files (specified in CREATE ICEBERG TABLE), relative to your external volume location. Specifying a BASE_LOCATION is required for Delta-based tables.randomIdis a random, Snowflake-generated 8-character string.
| BASE_LOCATION defined | BASE_LOCATION_PREFIX defined | Path |
|---|---|---|
| No | No | STORAGE_BASE_URL/database/schema/table_name.randomId/[data | metadata]/ |
| No | Yes | STORAGE_BASE_URL/BASE_LOCATION_PREFIX/table_name.randomId/[data | metadata]/ |
| Yes | N/A (ignored) | STORAGE_BASE_URL/BASE_LOCATION.randomId/[data | metadata]/ |
| ‘’ (empty string) | N/A (ignored) | STORAGE_BASE_URL/randomId/[data | metadata]/ |
** 使用 BASE_LOCATION 整理表存储**
Note
如果您打算将来重命名表,则不建议使用此选项。
After you create a Snowflake-managed table, the path to its files in external storage does not change, even if you rename the table.
To organize files in storage for multiple Iceberg tables under the same STORAGE_BASE_URL,
consider using the table name as the BASE_LOCATION in your CREATE ICEBERG TABLE statement. This way, Snowflake writes data and
metadata to a directory that includes the name of the table.
例如:
该语句会在外部云存储中产生以下目录结构:
Hierarchical layout¶
This section describes the hierarchical layout in Snowflake for data and metadata directories for Snowflake-managed tables.
When you create a Snowflake-managed table that uses the default flat directory layout (PATH_LAYOUT = HIERARCHICAL), Snowflake writes all
Parquet data files by organizing them in a hierarchical directory structure under the data/ directory that is based on transforms that
you define when you create a
table. For instructions on how to enable this layout, see Partitioning with hierarchical paths. Snowflake writes all
table metadata data files under a single metadata/ directory.
Snowflake constructs paths using the following patterns, depending on the values specified
for BASE_LOCATION or
the BASE_LOCATION_PREFIX parameter.
If you specify a BASE_LOCATION, Snowflake does not use the BASE_LOCATION_PREFIX in the path.
其中:
STORAGE_BASE_URLis the base URL for the active storage location associated with your external volume.BASE_LOCATIONis the path for a directory where Snowflake should write the table files (specified in CREATE ICEBERG TABLE), relative to your external volume location. Specifying a BASE_LOCATION is required for Delta-based tables.randomIdis a random, Snowflake-generated 8-character string.
| BASE_LOCATION defined | BASE_LOCATION_PREFIX defined | Path |
|---|---|---|
| No | No | STORAGE_BASE_URL/database/schema/table_name.randomId/[data/<hierarchical_layout> | metadata]/ |
| No | Yes | STORAGE_BASE_URL/BASE_LOCATION_PREFIX/table_name.randomId/[data/<hierarchical_layout> | metadata]/ |
| Yes | N/A (ignored) | STORAGE_BASE_URL/BASE_LOCATION.randomId/[data/<hierarchical_layout> | metadata]/ |
| ‘’ (empty string) | N/A (ignored) | STORAGE_BASE_URL/randomId/[data/<hierarchical_layout> | metadata]/ |
** 使用 BASE_LOCATION 整理表存储**
Note
如果您打算将来重命名表,则不建议使用此选项。
After you create a Snowflake-managed table, the path to its files in external storage does not change, even if you rename the table.
To organize files in storage for multiple Iceberg tables under the same STORAGE_BASE_URL,
consider using the table name as the BASE_LOCATION in your CREATE ICEBERG TABLE statement. This way, Snowflake writes data and
metadata to a directory that includes the name of the table.
The following example creates customer_region_summary and orders_by_status tables, which each use a hierarchical path layout
for their data files based on the following transforms:
- The
customer_region_summarytable is partitioned byregion - the
orders_by_statustable is partitioned byorder_status
该语句会在外部云存储中产生以下目录结构:
使用外部目录的表
Snowflake doesn’t write or delete storage objects for externally managed Iceberg tables
or on external volumes with the ALLOW_WRITES property set to FALSE.
For external catalogs that you connect to with an external volume, to access your table data and metadata, Snowflake assumes the access control role that you configure for your external volume. You grant the role permission to access a storage location (in a bucket or container). All of your table data and metadata files must be in that location. For example, if your storage location is an S3 bucket, all of your data and metadata files must exist somewhere in that bucket.
For external catalogs that you connect to by using catalog-vended credentials, Snowflake obtains short-lived, scoped credentials from the external catalog that allow Snowflake access only to the paths that store the table’s data and metadata. For more information, see Use catalog-vended credentials for Apache Iceberg™ tables.
Additionally, converting a table does not rewrite any data or metadata files. Snowflake writes to an Iceberg table only after you convert a table to use Snowflake as the catalog.
数据和元数据目录
This section describes the data and metadata directories for externally managed tables that you create in a catalog-linked database. These directories can either be organized in a flat or hierarchical layout:
Note
- To find the data and metadata directories for any Iceberg table that you specified a
base_locationfor when you created it, you can use the SHOW ICEBERG TABLES command. The command output includes abase_locationproperty that indicates the location of each table’s data and metadata files. - For externally managed tables in a standard Snowflake database, Snowflake infers the location of the table from the remote catalog
metadata and then writes to the
/datadirectory for the table.
Catalog-linked database: Flat layout¶
This section describes the flat layout for data and metadata directories for externally managed Iceberg tables that you create in a catalog-linked database.
When you create an externally managed table in a catalog-linked database that uses the default flat directory layout (PATH_LAYOUT = FLAT),
Snowflake writes all Parquet data files under a single data/ directory and all table metadata data files under a
single metadata/ directory.
Snowflake constructs paths using the following patterns, depending on the values specified
for BASE_LOCATION or
the BASE_LOCATION_PREFIX parameter.
If you specify a BASE_LOCATION, Snowflake does not use the BASE_LOCATION_PREFIX in the path.
Note
The BASE_LOCATION_PREFIX parameter is only supported when you use an external volume to connect to your catalog-linked database.
The BASE_LOCATION_PREFIX parameter isn’t supported when you use catalog-vended credentials to connect to your catalog-linked database.
其中:
STORAGE_BASE_URLis the base URL for the active storage location associated with your external volume or vended credentials.BASE_LOCATIONis the path for a directory where Snowflake should write the table files (specified in CREATE ICEBERG TABLE), relative to your external volume location. If you’re using catalog-vended credentials, this must be an absolute path that points to an allowed location defined by the remote catalog. Specifying a BASE_LOCATION is required for Delta-based tables.randomIdis a random, Snowflake-generated 8-character string.
| BASE_LOCATION defined | BASE_LOCATION_PREFIX defined | Path |
|---|---|---|
| No | No | STORAGE_BASE_URL/database/schema/table_name/[data | metadata]/ |
| No | Yes | STORAGE_BASE_URL/BASE_LOCATION_PREFIX/table_name.randomId/[data | metadata]/ |
| Yes | N/A (ignored) | STORAGE_BASE_URL/BASE_LOCATION.randomId/[data | metadata]/ |
** 使用 BASE_LOCATION 整理表存储**
To organize files in storage for multiple Iceberg tables under the same STORAGE_BASE_URL,
consider using the table name as the BASE_LOCATION in your CREATE ICEBERG TABLE statement. This way, Snowflake writes data and
metadata to a directory that includes the name of the table.
例如:
该语句会在外部云存储中产生以下目录结构:
Catalog-linked database: Hierarchical layout¶
This section describes the hierarchical layout for data and metadata directories for externally managed Iceberg tables that you create in a catalog-linked database.
When you create an externally managed Iceberg table in a catalog-linked database that uses the default flat directory layout
(PATH_LAYOUT = HIERARCHICAL), Snowflake writes all
Parquet data files by organizing them in a hierarchical directory structure under the data/ directory that is based on transforms that you
define when you create a table. For instructions on
how to enable this layout, see Partitioning with hierarchical paths. Snowflake writes all table metadata data files
under a single metadata/ directory.
Note
If you set PATH_LAYOUT = HIERARCHICAL without specifying a PARTITION BY clause, Snowflake uses the flat layout for the table. However, if you later enable partitioning on the table, Snowflake begins using a hierarchical layout with partitioned writes. For more information, see Partitioning with hierarchical paths.
For externally managed tables with a hierarchical layout, Snowflake writes Parquet data files and table metadata to your external cloud storage. The Parquet data files are organized in a hierarchical directory structure that is based on transforms that you define when you create a table.
Snowflake constructs paths using the following patterns, depending on the values specified
for BASE_LOCATION or
the BASE_LOCATION_PREFIX parameter.
If you specify a BASE_LOCATION, Snowflake does not use the BASE_LOCATION_PREFIX in the path.
Note
The BASE_LOCATION_PREFIX parameter is only supported when you use an external volume to connect to your catalog-linked database.
The BASE_LOCATION_PREFIX parameter isn’t supported when you use catalog-vended credentials to connect to your catalog-linked database.
其中:
STORAGE_BASE_URLis the base URL for the active storage location associated with your external volume or vended credentials.BASE_LOCATIONis the path for a directory where Snowflake should write the table files (specified in CREATE ICEBERG TABLE), relative to your external volume location. If you’re using catalog-vended credentials, this must be an absolute path that points to an allowed location defined by the remote catalog. Specifying a BASE_LOCATION is required for Delta-based tables.randomIdis a random, Snowflake-generated 8-character string.
| BASE_LOCATION defined | BASE_LOCATION_PREFIX defined | Path |
|---|---|---|
| No | No | STORAGE_BASE_URL/database/schema/table_name.randomId/[data/<hierarchical_layout> | metadata]/ |
| No | Yes | STORAGE_BASE_URL/BASE_LOCATION_PREFIX/table_name.randomId/[data/<hierarchical_layout> | metadata]/ |
| Yes | N/A (ignored) | STORAGE_BASE_URL/BASE_LOCATION.randomId/[data/<hierarchical_layout> | metadata]/ |
** 使用 BASE_LOCATION 整理表存储**
To organize files in storage for multiple Iceberg tables under the same STORAGE_BASE_URL,
consider using the table name as the BASE_LOCATION in your CREATE ICEBERG TABLE statement. This way, Snowflake writes data and
metadata to a directory that includes the name of the table.
The following example creates customer_region_summary and orders_by_status tables, which each use a hierarchical path layout
for their data files based on the following transforms:
- The
customer_region_summarytable is partitioned byregion - the
orders_by_statustable is partitioned byorder_status
该语句会在外部云存储中产生以下目录结构:
启用存储访问日志
要诊断问题并审核对外部卷相关存储位置的访问,可以启用存储日志记录。存储日志可帮助您确定文件丢失或损坏的原因。
启用存储提供商的日志记录。由于您拥有并管理 Iceberg 表的存储,Snowflake 无法在您的 Iceberg 存储位置启用日志或审核。
要了解存储提供商的存储访问日志,请参阅以下外部主题:
- Logging options for Amazon S3 (https://docs.aws.amazon.com/AmazonS3/latest/userguide/logging-with-S3.html)
- Usage logs and storage logs for Google Cloud Storage (https://cloud.google.com/storage/docs/access-logs)
- Azure Storage analytics logging (https://learn.microsoft.com/en-us/azure/storage/common/storage-analytics-logging)
利用版本控制和对象保留功能保护文件
如果您的 Iceberg 表数据位于中央数据存储库(或数据湖)中,并由多个工具和服务操作,则可能会发生意外删除或损坏。要保护 Iceberg 表数据并确保意外删除或覆盖数据的检索,请使用存储提供商提供的存储生命周期管理和版本控制。
通过生命周期管理,您可以为存储对象设置保留和跟踪规则。要了解存储提供商的生命周期管理,请参阅以下外部主题:
- Managing your storage lifecycle for Amazon S3 (https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html)
- Object Lifecycle Management for Google Cloud Storage (https://cloud.google.com/storage/docs/lifecycle)
- Lifecycle management policies in Azure (https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview)
为支持对象恢复,您还可以启用外部云存储的版本控制。
- To enable versioning for Amazon S3, see Enabling versioning on buckets (https://docs.aws.amazon.com/AmazonS3/latest/userguide/manage-versioning-examples.html).
- To enable versioning for Google Cloud Storage, see Use Object Versioning (https://cloud.google.com/storage/docs/using-object-versioning).
- To enable versioning for Azure, see Enable blob versioning (https://learn.microsoft.com/en-us/azure/storage/blobs/versioning-enable?tabs=portal#enable-blob-versioning).
加密表文件
Snowflake 可以读取您使用常见的服务器端加密 (SSE) 方案加密的存储中的 Iceberg 表文件。您应该使用云服务提供商管理加密密钥,如果使用客户管理的密钥,则应授予 Snowflake 主体访问密钥的权限。
对于 Amazon S3,Snowflake 支持以下 SSE 选项:
| SSE option | Configuration |
|---|---|
| SSE with Amazon S3 managed keys (SSE-S3) | Specify ENCRYPTION = ( TYPE = 'AWS_SSE_S3' ) in the CREATE EXTERNAL VOLUME command. |
| SSE with AWS KMS keys (SSE-KMS) | Specify You must also grant privileges required for SSE-KMS encryption. For instructions, see Step 3 in Configure an external volume for Amazon S3. |
对于 Google Cloud Storage,Snowflake 支持以下 SSE 选项:
| SSE option | Configuration |
|---|---|
| SSE using keys stored in Google Cloud KMS | Specify You must also Grant the GCS service account permissions on the Google Cloud Key Management Service keys. |
Configure replication for Snowflake-managed Iceberg tables¶
You can replicate Snowflake-managed Iceberg tables by using a failover or replication group. Snowflake replicates a Snowflake-managed Iceberg table when you add the following objects to a failover or replication group:
- The parent database for the table
- The external volume that the table uses
For more information, see Configure replication for Snowflake-managed Apache Iceberg™ tables.