优化存储以提高性能

本主题讨论可以提高查询性能的存储优化,例如将类似数据存储在一起、创建优化的数据结构以及定义专门的数据集。Snowflake 提供三种此类存储策略:自动聚类、搜索优化和物化视图。

通常,这些存储策略并不能实质性地提高已经可在一秒钟或更短的时间内执行的查询的性能。

The strategies discussed in this topic are just one way to boost the performance of queries. For strategies related to the computing resources used to execute a query, refer to Optimizing warehouses for performance.

存储策略简介

自动聚类

Snowflake stores a table’s data in micro-partitions. Among these micro-partitions, Snowflake organizes (i.e. clusters) data based on dimensions of the data. If a query filters, joins, or aggregates along those dimensions, fewer micro-partitions must be scanned to return results, which speeds up the query considerably.

You can set a cluster key to change the default organization of the micro-partitions so data is clustered around specific dimensions (i.e. columns). Choosing a cluster key improves the performance of queries that filter, join, or aggregate by the columns defined in the cluster key.

Snowflake enables Automatic Clustering to maintain the clustering of the table as soon as you define a cluster key. Once enabled, Automatic Clustering updates micro-partitions as new data is added to the table. Learn More

搜索优化服务

搜索优化服务提高了点查找查询(即“大海捞针搜索”)的性能,这些查询使用高选择性筛选器减少从表中返回的行。当非常需要低延迟的点查找查询(例如调查日志搜索、威胁或异常检测以及带有选择性筛选器的关键仪表板)时,搜索优化服务是理想之选。

搜索优化服务通过构建针对特定搜索类型进行优化的持久数据结构来减少点查找查询的延迟。

You can enable the Search Optimization Service for an entire table or for specific columns. As long as they are selective enough, equality searches, substring searches, and geo searches against those columns can be sped up significantly.

The Search Optimization Service supports both structured and semi-structured data (see supported data types).

The Search Optimization Service requires Snowflake Enterprise Edition or higher. Learn More

物化视图

A materialized view is a pre-computed data set derived from a SELECT statement that is stored for later use. Because the data is pre-computed, querying a materialized view is faster than executing a query against the base table on which the view is defined. For example, if you specify SELECT SUM(column1) when creating the materialized view, then a query that returns SUM(column1) from the view executes faster because column1 has already been aggregated.

物化视图旨在提高以下工作负载的查询性能:这些工作负载由常见的重复查询模式组成,这些模式相对于基表返回少量行和/或列。

物化视图不能基于多个表。

Materialized views require Snowflake Enterprise Edition or higher. Learn More

选择优化策略

不同类型的查询可受益于不同的存储策略。您可以通过以下部分了解哪种策略最适合某个工作负载。

自动聚类是最广泛的选项,它可以使访问表中相同列的一系列查询受益。管理员通常根据频率和延迟要求选择最重要的查询,然后选择可最大限度地提高这些查询性能的聚类键。当许多查询过滤、联接或聚合相同的几列时,自动聚类才有意义。

搜索优化服务和物化视图的范围较窄。当特定查询访问表中明确定义的数据子集时,管理员可以使用查询的特征来决定使用搜索优化服务还是物化视图可以提高性能。例如,管理员可以识别重要的点查找查询并为表或列实施搜索优化服务。同样,管理员可以通过创建物化视图来优化特定查询模式。

You can implement more than one of these strategies for a table, and an individual query with multiple filters could potentially benefit from both Automatic Clustering and the Search Optimization Service. However, enabling the Search Optimization Service or creating a materialized view on a clustered table can be more expensive. To learn why this increases compute costs, refer to Ongoing Costs (in this topic).

如果有多种策略有可能提高特定查询的性能,则可能需要从自动聚类或搜索优化服务入手,因为其他具有类似访问模式的查询也可以得到改进。

区分注意事项

以下内容并不是对存储策略的详尽比较,而是说明了区分它们时要考虑的最重要事项。

Automatic Clustering:
  • Biggest performance boost comes from a WHERE clause that filters on a column of the cluster key, but it can also improve the performance of other clauses and functions that act upon that same column (e.g. joins and aggregations).
  • Ideal for range queries or queries with an inequality filter. Also improves an equality filter, but the Search Optimization Service is usually faster for point lookup queries.
  • 在 Snowflake 的 Standard Edition 中可用。
  • There can be only one cluster key. [1] If different queries against a table act upon different columns, consider using the Search Optimization Service or a materialized view instead.
Search Optimization Service:
  • Improves point lookup queries that return a small number of rows. If the query returns more than a few records, consider Automatic Clustering instead.
  • 包括对符合以下条件的点查找查询的支持:
    • 匹配使用 LIKE 和 RLIKE 等谓词的子字符串或正则表达式。
    • 在 VARIANT、 ARRAY、或 OBJECT 列中搜索特定字段。
    • 使用带有 GEOGRAPHY 值的地理空间函数。
Materialized view:
  • 改进了密集和频繁的计算,例如聚合和分析半结构化数据(而不仅仅是筛选)。
  • 通常侧重于特定查询/子查询计算。
  • Improves queries against external tables.

[1] If there is an important reason to define multiple cluster keys, you could create multiple materialized views, each with its own cluster key.

原型查询

以下示例旨在重点介绍哪种类型的查询通常在搭配使用特定存储策略时运行更快。

Prototypical Query for Clustering

Automatic Clustering provides a performance boost for range queries with large table scans. For example, the following query will execute faster if the shipdate column is the table’s cluster key because the WHERE clause scans a lot of data.

SELECT
  SUM(quantity) AS sum_qty,
  SUM(extendedprice) AS sum_base_price,
  AVG(quantity) AS avg_qty,
  AVG(extendedprice) AS avg_price,
  COUNT(*) AS count_order
FROM lineitem
WHERE shipdate >= DATEADD(day, -90, to_date('2023-01-01'));

For an additional example of a query that might run faster if the table was clustered, refer to Benefits of Defining Clustering Keys (for Very Large Tables).

Prototypical Query for Search Optimization

The Search Optimization Service can provide a performance boost for point lookup queries that scan a large table to return a small subset of records. For example, the following query will execute faster with the Search Optimization Service if the sender_ip column has a large number of distinct values.

SELECT error_message, receiver_ip
FROM logs
WHERE sender_ip IN ('198.2.2.1', '198.2.2.2');

要查看搭配使用搜索优化服务可能运行更快的其他查询,请参阅以下示例:

Prototypical Query for Materialized View

A materialized view can provide a performance boost for queries that access a small subset of data using expensive operations like aggregation. As an example, suppose that an administrator aggregated the totalprice column when creating a materialized view mv_view1. The following query against the materialized view will execute faster than it would against the base table.

SELECT
  orderdate,
  SUM(totalprice)
FROM mv_view1
GROUP BY 1;

For more use cases where materialized views can speed up queries, refer to Examples of Use Cases For Materialized Views.

实施和成本注意事项

本节讨论在使用存储策略提高查询性能时的成本注意事项,以及平衡成本和性能时的实施注意事项。

初始投资

Implementing a storage strategy can require a bigger time commitment and upfront financial investment than other types of performance optimizations (e.g. re-writing SQL statements or optimizing the warehouse running the query), but the performance improvements can be significant.

Snowflake uses serverless compute resources to implement each storage strategy, which consumes credits before you can test how well the optimization improves performance. In addition, it can take Snowflake a significant amount of time to fully implement Automatic Clustering and the Search Optimization Service (e.g. a week for a very large table).

搜索优化服务和物化视图还需要 Enterprise Edition 或更高版本,这会提高 credit 的价格。

持续成本

存储策略会产生计算和存储成本。

Compute Costs

当向表中添加新数据时,Snowflake 使用无服务器计算资源来持续进行存储优化。对表做出的更改越多,维护成本越高。如果表不断更新,维护存储优化的成本可能会高得令人望而却步。

当为基础表启用自动聚类时,维护物化视图或搜索优化服务的成本可能会很高。借助自动聚类,Snowflake 不断围绕聚类键的维度重新聚类其微分区。每次重新聚类基表时,Snowflake 都必须使用无服务器计算资源来更新物化视图和搜索优化服务使用的存储空间。因此,除了基表的 DML 命令成本外,基表的自动聚类活动还会触发物化视图和搜索优化服务的维护成本。

Storage Costs
Automatic Clustering

Unlike the Search Optimization Service and materialized views, Automatic Clustering reorganizes existing data rather than creating additional storage. However, reclustering can incur additional storage costs if it increases the size of Fail-safe storage. For details, refer to Credit and Storage Impact of Reclustering.

Search Optimization / Materialized Views

物化视图和搜索优化服务会产生额外存储成本,按标准费率计费。

估算成本

Automatic Clustering

You can run the SYSTEM$ESTIMATE_AUTOMATIC_CLUSTERING_COSTS function to help estimate the cost of enabling Automatic Clustering for a table and maintaining the table in a well-clustered state. This estimate is based on the change history of the table. Actual costs can vary significantly, especially if DML patterns change after enabling Automatic Clustering.

Search Optimization Service

You can run the SYSTEM$ESTIMATE_SEARCH_OPTIMIZATION_COSTS function to help estimate the cost of adding the Search Optimization Service to a column or entire table. The estimated costs are proportional to the number of columns that will be enabled and how much the table has recently changed.

实施策略

由于存储策略的计算成本和存储成本可能很高,因此在进行更广泛的实施之前,您可能需要从小规模开始,仔细跟踪初始和持续成本。例如,您可以只为一两个表选择聚类键并评估成本,然后再为其他表选择键。

在跟踪与存储策略相关的持续成本时,请记住,虚拟仓库仅在运行查询期间消耗 credit,因此查询运行速度越快,成本越低。Snowflake 建议在存储优化之前仔细报告运行查询的成本,并将其与存储优化后运行相同查询的成本进行比较,以便将其纳入成本评估。