Best practices for Snowpipe Streaming with classic architecture¶
成本优化
作为最佳实践,对于每秒写入较多数据的 Snowpipe Streaming 客户端,我们建议减少此类客户端的 API 调用。使用 Java 或 Scala 应用程序汇总来自多个来源(例如 IoT 设备或传感器)的数据,然后使用 Snowflake Ingest SDK 调用 API,从而以更高的流速率加载数据。API 可以高效地汇总账户中多个目标表的数据。
一个 Snowpipe Streaming 客户端可以打开多个通道来发送数据,但客户端成本仅按每个活动客户端收取。通道数量不影响客户端成本。因此,我们建议每个客户端使用多个通道来优化性能和成本。
由于预占文件迁移操作的缘故,当您使用相同的表进行批处理和流式数据引入时,您还可以降低 Snowpipe Streaming 计算成本。
Snowpipe Streaming handles and bills all file-migration compute costs for tables with Automatic Clustering enabled, where Snowpipe Streaming is inserting data. This process optimizes and migrates data within the same transaction, incorporating costs previously associated with Automatic Clustering.
性能建议
为了在高吞吐量部署中实现最佳性能,我们建议采取以下措施:
-
If you are loading multiple rows, using
insertRowsis more efficient and cost effective than callinginsertRowmultiple times because less time is spent on locks.- Keep the size of each row batch passed to
insertRowsbelow 16 MB compressed. - 行批次的最佳大小介于 10-16 MB 之间。
- Keep the size of each row batch passed to
-
Pass values for the TIME, DATE, and all TIMESTAMP columns as one of the supported types from the
java.timepackage. -
When you create a channel using
OpenChannelRequest.builder, set theOnErrorOptiontoOnErrorOption.CONTINUE, and manually check the return value frominsertRowsfor potential ingestion errors. This approach currently leads to a better performance than relying on exceptions thrown whenOnErrorOption.ABORTis used. -
将默认日志级别设置为 DEBUG 时,请确保以下日志记录器继续在 INFO 级别记录日志:它们的 DEBUG 输出非常冗长,这可能导致性能显著下降。
net.snowflake.ingest.internal.apache.parquetorg.apache.parquet
-
Channels should be long lived when a client is actively inserting data and should be reused because offset token information is retained. Don’t close channels after inserting data because data inside the channels is automatically flushed based on the time configured in
MAX_CLIENT_LAG.
延迟建议
当您使用 Snowpipe Streaming 时,延迟是指写入通道的数据在 Snowflake 中可供查询的速度。Snowpipe Streaming 每隔一秒自动刷新通道内的数据,这意味着您无需明确关闭通道即可刷新数据。
Configuring latency with MAX_CLIENT_LAG
With Snowflake Ingest SDK versions 2.0.4 and later, you can fine-tune data flush latency by using the MAX_CLIENT_LAG option:
- 标准 Snowflake 表(非 Iceberg):默认 MAX_CLIENT_LAG 为 1 秒。您可以替换此设置,将所需的刷新延迟设置为 1 秒到最多 10 分钟。
- Snowflake-managed Iceberg Tables: Supported by Snowflake Ingest SDK versions 3.0.0 and later, the default
MAX_CLIENT_LAGis 30 seconds. This default helps ensure that optimized Parquet files are created, which is beneficial for query performance. While you can set a lower value, it’s generally not recommended unless you have exceptionally high throughput.
Latency recommendations for optimal performance
Setting MAX_CLIENT_LAG effectively can significantly impact query performance and the internal migration process (where Snowflake compacts small partitions).
对于低吞吐量场景,您可能每秒只发送少量数据(例如 1 行或 1 KB),频繁刷新可能会导致大量小分区。这会增加查询编译时间,因为 Snowflake 必须解析许多小分区,特别是在迁移过程可以压缩查询之前运行查询。
Therefore, you should set MAX_CLIENT_LAG as high as your target latency requirements allow. Buffering inserted rows for a longer duration allows Snowpipe Streaming to create better-sized partitions, which improves query performance and reduces migration overhead.
For example, if you have a task that runs every minute to merge or transform your streamed data, an optimal MAX_CLIENT_LAG might be between 50 and 55 seconds. This ensures data is flushed in larger chunks just before your downstream process runs.
Kafka connector for Snowpipe Streaming It’s important to note that the Kafka connector for Snowpipe Streaming has its own internal buffer. Whenthe Kafka buffer flush time is reached, data is then sent to Snowflake with the standard one-second latency through Snowpipe Streaming. For more information, see buffer.flush.time setting
关于确切传送一次数据的最佳实践
实现确切传送一次数据可能并不容易,在自定义代码中遵守以下原则至关重要:
为了确保适当地从异常、故障或崩溃中恢复,您必须始终重新打开通道,并使用最新提交的偏移令牌重新启动引入。
应用程序可能会维护自己的偏移,但使用 Snowflake 提供的、最新提交的偏移令牌作为事实来源,并相应地重置自己的偏移,这一点至关重要。
唯一应将自己的偏移视为事实来源的情况是,Snowflake 的偏移令牌设置为或重置为 NULL。NULL 偏移令牌通常意味着以下情况之一:
- 这是一个新通道,因此尚未设置偏移令牌。
- 目标表已丢弃并重新创建,因此该通道视为新通道。
- 30 日内未通过通道执行任何引入活动,因此通道被自动清理,偏移令牌信息丢失。
如有必要,可根据最新提交的偏移令牌定期清除已提交的源数据,并推进您自己的偏移。
If the table schema is modified when Snowpipe Streaming channels are active, the channel must be reopened. The Snowflake Kafka connector handles this scenario automatically, but if you use Snowflake Ingest SDK directly, you must reopen the channel yourself.
For more information about how the Kafka connector with Snowpipe Streaming achieves exactly-once delivery, see Exactly-once semantics.