Using Snowflake stage volumes with services¶
Snowflake supports various storage volume types for your application containers, including internal stage, local storage, memory storage, and block storage volumes. This section explains how to configure volumes and volume mounts for internal stages. An internal stage volume is a volume configured to use a Snowflake stage as persistent storage.
With stage volumes your service can access an internal stage's objects as if they are files on your file system, simplifying your service code compared to using a Snowflake driver and GET and PUT SQL commands to access these objects. Stage volumes can also perform better for scenarios with streaming reads or writes of large data files.
If your file system operations can easily be translated to streaming GET and PUT operations, then stage volumes will work well for your scenario. If your application needs to rename or move files, modify existing files, or perform file-system-based locking, then a stage volume is not a good fit for your workload.
Note
There are currently two implementations of stage volumes: a generally available version and a deprecated version. Snowflake recommends that you use the generally available version for new services and that you migrate your existing applications from the deprecated version.
The stage volume implementation streams file contents directly to and from cloud storage, ensuring that you always get the latest contents. Consider the following points when you use a stage volume:
A stage volume is optimized for large, sequential reads and writes, providing strong performance for these access patterns. For best results, read and write data in large, contiguous chunks.
Reads always return the latest data, which lets data sharing occur between services.
Random writes or file appends aren't supported. Attempting these operations results in an error. Snowflake recommends that you use block storage volumes for these workloads.
Configure a Snowflake stage as a storage volume in a service specification¶
To create a service where service containers use a stage volume, you perform two steps to specify the required settings in the service specification:
Define a stage volume that identifies the Snowflake stage to use as a storage volume.
Specify where to mount the stage volume in your application container.
Step 1: Define a stage volume¶
To define a stage volume, add the spec.volumes field in the service specification as shown in the following example:
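A minimal sketch of such a definition, built only from the fields this section describes. The volume name `stage-volume` and the stage `@my_stage` are illustrative placeholders, not values the document prescribes:

```yaml
spec:
  volumes:
  - name: stage-volume        # illustrative volume name (assumption)
    source: stage             # volume type: a Snowflake stage
    stageConfig:
      name: "@my_stage"       # internal stage to mount; double quotes are required
```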
The following list defines the fields from the example:
name: Provides the name of the volume.
source: Identifies the type of the volume (stage).
stageConfig.name: Identifies the Snowflake internal stage or folder on a stage to mount; for example @my_stage, @my_stage/folder, or @my_db.my_schema.my_stage/folder/nestedfolder. Double quotes must surround this value.
You can include the following optional fields in stageConfig:
stageConfig.resources field: The Snowflake component that provides the mounted stage volume requires CPU and memory resources. Use this field to specify these CPU and memory requirements, similar to the resource specifications for your application containers. For more information, see the containers.resources field. If this field isn't specified, the following default resource settings apply:
resources.requests.cpu: 0
resources.requests.memory: 0.5Gi
resources.limits.cpu: 0.5
resources.limits.memory: 1Gi
For most applications with typical data traffic to stage volumes, you don't need to set this field, because the default resource settings should be sufficient. However, if your application performs a high volume of reads and writes, the default settings might lead to performance constraints or read/write failures. For more information, see Common guidelines for both implementations of stage volumes.
To avoid these problems, check the CPU and memory metrics for the container (stage-mount-v2-sidecar-<stage-volume-name>). Snowflake adds this container to your service to provide the implementation of the stage volume that you configured. These metrics let you see exactly what resources your stage volume is using and determine whether it is constrained by CPU or memory. Use these metrics to update the CPU and memory limits as needed.
stageConfig.metadataCache field: If your application frequently retrieves file metadata or lists files on a Snowflake stage in a loop, and you don't expect the data to change often, you can enable metadata caching for the Snowflake stage storage volume to significantly improve performance. The cache stores this metadata for a specified time period, after which it is cleared. If the application then tries to access the metadata, Snowflake refreshes the cache. Use the hours, minutes, and seconds units to specify the metadataCache value; for example, 90s, 5m, 1h, 1h30m, 1h30m45s. If not specified, there is no caching.
Note
Configure metadata caching only when the data in your Snowflake stage doesn't change for the lifetime of the service; for example, a service with read-only workloads that operate on a static set of data in the stage. Don't configure metadata caching for workloads where data in your Snowflake stage is updated while the service is running.
The following spec.volumes field defines a Snowflake stage volume. The field includes the optional stageConfig fields:
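A sketch of what such a definition might look like, using the optional fields described above. The volume name, stage name, cache duration, and resource values are all illustrative assumptions chosen to differ from the defaults, not recommended settings:

```yaml
spec:
  volumes:
  - name: stage-volume          # illustrative volume name (assumption)
    source: stage
    stageConfig:
      name: "@my_stage"         # illustrative stage (assumption)
      metadataCache: 1h         # cache file metadata for one hour; omit to disable caching
      resources:                # illustrative values, above the documented defaults
        requests:
          cpu: 0.5
          memory: 1Gi
        limits:
          cpu: 1
          memory: 2Gi
```

Raising the limits this way is only worth doing if the sidecar container's CPU and memory metrics show that the stage volume is constrained.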
Step 2: Specify where to mount the stage volume in the container¶
After you define a Snowflake stage storage volume by adding the spec.volumes field, use the spec.containers.volumeMounts field to describe where to mount the stage volume in your application containers, as shown in the following example:
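A minimal sketch of the mount, assuming a container named `main`, a placeholder image, and a volume named `stage-volume` that is defined under spec.volumes (all three names are illustrative):

```yaml
spec:
  containers:
  - name: main                       # illustrative container name (assumption)
    image: <image-url>               # placeholder; substitute your image
    volumeMounts:
    - name: stage-volume             # must match the name of a volume in spec.volumes
      mountPath: /path/to/stage      # directory inside the container where the stage appears
```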
The information you provide in this field is consistent across all supported storage volume types and applies to both implementations of stage volumes.
Examples¶
Create a service with the stage mydb.myschema.ai_models_stage mounted at /path/to/stage in the main container.
Create a service with the stage subpath mydb.myschema.ai_models_stage/subpath mounted at /path/to/stage in the main container.
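The two cases above could be expressed with a specification along the following lines. This is a sketch rather than the original example: the volume name `stage-volume` and the image placeholder are assumptions, and the only difference between the two cases is the value of stageConfig.name:

```yaml
spec:
  containers:
  - name: main
    image: <image-url>               # placeholder; substitute your image
    volumeMounts:
    - name: stage-volume             # must match the volume name below
      mountPath: /path/to/stage
  volumes:
  - name: stage-volume               # illustrative volume name (assumption)
    source: stage
    stageConfig:
      # For the subpath case, use "@mydb.myschema.ai_models_stage/subpath" instead.
      name: "@mydb.myschema.ai_models_stage"
```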
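Access control requirements¶
The owner role of a service is the role that was used to create the service. It is also the role that the service uses when it interacts with Snowflake. The owner role determines the permissions granted to the application containers for accessing a mounted stage. The owner role must have the READ privilege on the stage.
If the owner role doesn't have the WRITE privilege on a stage, the mount for that stage is read-only. That is, the containers can only read files from the stage. The owner role needs the WRITE privilege on the stage for the stage mount to support both reads and writes.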
About the deprecated implementation¶
The deprecated stage-volume implementation uses a shared cache for reads and writes. Although this works well for some scenarios, you can't control whether data is read from the cache or directly from the stage, which might not be suitable for all applications. Additionally, when you use the cache for reads and writes, this can introduce performance overhead.
Migrating code from the deprecated implementation¶
The newer implementation replaces the deprecated implementation, with the following behavioral changes:
The newer stage-volume implementation streams file contents directly to and from cloud storage, ensuring that you always get the latest contents. This provides predictable behavior and significantly faster throughput. The deprecated stage-volume implementation caches chunks of file data, making it difficult to know if you are reading the latest data.
Random read performance might be lower with the new implementation because of the removal of caching. However, without a local disk cache, consistency across volumes is improved. File changes are written directly to the backing stage when the file is closed, with no local disk buffering.
Reads always return the latest data, making this configuration better for sharing data between services.
Random writes or file appends aren't supported. Attempting these operations results in an error. Snowflake recommends that you use block storage volumes for these workloads.
Specify a Snowflake stage volume in a service specification (deprecated)¶
To create a service where service containers use a Snowflake stage volume, specify the required settings in the service specification as shown in the following steps:
To specify the stage volume, use the spec.volumes field as shown in the following example:
The following fields are required:
name: The name of the volume.
source: The Snowflake internal stage or folder on the stage to mount; for example @my_stage, @my_stage/folder. Quotes must surround this value.
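A minimal sketch of the deprecated form, based on the two required fields above. Here the stage is given directly in source, with no stageConfig block; the volume name `stage-volume` and the stage `@my_stage` are illustrative:

```yaml
spec:
  volumes:
  - name: stage-volume        # illustrative volume name (assumption)
    source: "@my_stage"       # stage or stage folder to mount; quotes are required
```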
To describe where to mount the stage volume in your application containers, use the spec.containers.volumeMounts field, as shown in the following example:
The information you provide in this field is consistent across all supported storage volume types and applies to both implementations of stage volumes.
Example (deprecated)¶
In the example service specification, the app container mounts an internal stage @model_stage by using the deprecated stage volume implementation:
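A sketch of such a specification follows. The container name `app` and the stage `@model_stage` come from the description above; the image placeholder, the volume name `models`, and the mount path `/opt/model` are assumptions made for illustration:

```yaml
spec:
  containers:
  - name: app
    image: <image-url>          # placeholder; substitute your image
    volumeMounts:
    - name: models              # must match the volume name below
      mountPath: /opt/model     # illustrative mount path (assumption)
  volumes:
  - name: models                # illustrative volume name (assumption)
    source: "@model_stage"      # deprecated form: the stage is named directly in source
```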
The volumeMounts field specifies where inside the container to mount the stage volume. This specification is the same for both stage volume implementations.
Guidelines when using stage volumes¶
This section provides guidelines to follow when you implement application code in which a container uses a Snowflake stage as a storage volume.
Common guidelines for both implementations of stage volumes¶
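Stage mounts are optimized for sequential reads and writes.
I/O operations on a stage mount might have higher latency than I/O operations on the container's file system and on block storage volumes. Always check the status code of I/O operations to make sure that they succeeded.
Stage mounts upload file updates asynchronously. Changes to files on a stage mount are guaranteed to be persisted to the stage only after the file descriptor is successfully closed or flushed. There might be some delay before changes to files on a stage mount become visible to other containers and to Snowflake.
Each directory in a mounted stage should contain fewer than 100,000 files. Expect readdir latency to increase with the number of files in the directory.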
Guidelines when using the deprecated version of the stage volume implementation¶
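Avoid writing to multiple files concurrently within a stage mount.
A stage mount isn't a network file system. Don't use stage mounts for multi-client coordination.
Don't open multiple handles to the same file at the same time. Use an open file handle for either read or write operations, but not a mix of both. To read a file after writing to it, close the file and then reopen it before reading.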
Guidelines when using the generally available stage volume implementation¶
Concurrent writes to the same file from multiple stage mounts (the same stage volume mounted on different containers) aren't supported.
The absence of a local disk cache improves consistency across mounts. File changes are flushed directly to the backing stage upon closing the file, with no local disk buffering. Reads always return the latest data, making the new stage mount better for sharing data between services.
Read and write data in large, contiguous chunks for optimal performance. Compared with the deprecated stage volume implementation, small reads and writes carry a performance penalty that can offset the performance gains from the new implementation.
Limitations when using stage volumes¶
The following are general limitations. If you encounter any issues with these limits, contact your account representative.
Common limitations for both implementations of stage volumes¶
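You can mount only a stage or a subdirectory in a stage; for example, @my_stage or @my_stage/folder. You can't mount a single file in a stage, such as @my_stage/folder/file.
External stages aren't supported. Only Snowflake internal stages are supported.
A stage mount isn't a fully POSIX-compliant file system. For example:
File and directory renames aren't atomic.
Hard links aren't supported.
The Linux kernel subsystem for monitoring file system changes, inode notify (inotify), doesn't work on stage mounts.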
Limitations when using the deprecated version of the stage volume implementation¶
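A maximum of 5 stage volumes are allowed per service (see spec.volumes).
A maximum of 8 stage volumes are supported per compute pool node. Snowflake manages the per-node stage mount limit similarly to how it manages memory, CPU, and GPU. Starting a new service instance might cause Snowflake to start new nodes when the existing nodes can't support the requested stage mounts.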
The stage volume capabilities vary depending on the cloud platform for your Snowflake account:
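Accounts on AWS support both SNOWFLAKE_FULL and SNOWFLAKE_SSE stage encryption (see Internal stage parameters).
Accounts on Azure currently support stages with SNOWFLAKE_SSE encryption. When you run CREATE STAGE, use the ENCRYPTION parameter to specify the encryption type:
CREATE STAGE my_stage ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');
Accounts on Google Cloud aren't supported.
Concurrent writes to the same file from multiple stage mounts (the same stage volume mounted on different containers) aren't supported.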
Limitations when using the generally available version of the stage volume implementation¶
Hard links aren't supported.
Each mounted stage requires 512 MB of memory. This means that the number of stage volumes that a service can use is limited by the instance size. Mounting the same volume on multiple containers doesn't increase memory consumption.
A maximum of 20 stage volumes are allowed per service. For more information, see spec.volumes.