Contribution Explorer(Snowflake Cortex ML 函数)¶
Contribution Explorer 旨在简化和改进围绕观察到的指标变化的根本原因分析过程。随着时间的推移,Contribution Explorer 会获取某个指标的值,并分析与该指标相关的数据变化,以确定每次变化是由哪些数据段引起的。
备注
Contribution Explorer 是 Snowflake 的智能、完全托管的 AI 和 ML 服务,是 Snowflake Cortex 的一部分。此功能是 Snowflake Cortex ML 函数套件的一部分。
例如,如果您正在跟踪销售,Contribution Explorer 可以帮助您确定位置、销售人员、客户、行业垂直以及导致收入不足的其他因素。然后,您可以立即采取有针对性的纠正措施。
利用 Contribution Explorer 进行分析的理想候选数据集具有以下特征:
一个或多个严格非负指标。从一行到下一行的指标变化可能是负值,但量度本身绝不能是负值。
一个或多个时间戳。
可用于分割数据的列或维度。这些数据通常是分类数据(地点、细分市场等),但也可能是连续数据(即定量数据,例如温度或出勤率)。
若要在查询和管道中直接使用 Contribution Explorer,请调用 TOP_INSIGHTS (SNOWFLAKE.ML) 表函数。此函数会在数据集中找到最重要的维度,从这些维度构建片段,然后检测这些片段中哪些对指标有影响。
小技巧
要使用 Contribution Explore 分析仪表板查询,您可能需要添加一些维度列,这些列通常不会包含在跨所有维度汇总指标的查询中。
TOP_INSIGHTS 非常适合从具有大量维度的数据集中提取根本原因。此外,它还支持连续维度,并能在结果中指出具有负面条件的维度(例如,“区域不是北美”)。
TOP_INSIGHTS 的结果是导致数据变化的维度。
重要
法律声明。 此 Snowflake Cortex ML 函数由机器学习技术提供支持。机器学习技术和提供的结果可能不准确、不合适或较为片面。基于机器学习输出(包括内置在自动管道中的输出)的决策应有人工监督和审查流程,以确保模型生成的内容准确无误。Snowflake Cortex ML 函数查询将被视为任何其他 SQL 查询,并可能被视为 元数据。
元数据。 当您使用 Snowflake Cortex ML 函数时,除了 元数据字段 中提及的内容外,Snowflake 还会记录 ML 函数返回的通用错误消息。这些错误日志可帮助我们排查出现的问题,并改进这些函数以更好地满足您的需求。
有关更多信息,请参阅 Snowflake AI 信任与安全 FAQ。
更新搜索路径¶
如果您在搜索路径中添加了 SNOWFLAKE.ML,查询时就可以只写入 TOP_INSIGHTS 而无需写入 SNOWFLAKE.ML.TOP_INSIGHTS。以下语句显示了两者的区别:
SELECT SNOWFLAKE.ML.TOP_INSIGHTS(...);
SELECT TOP_INSIGHTS(...);
要将 SNOWFLAKE.ML 添加到搜索路径中,请执行以下命令:
ALTER SESSION SET SEARCH_PATH='$current, $public, SNOWFLAKE.ML';
搜索路径更改在 Snowflake 会话期间仍然有效。要持续更改搜索路径,请在账户级进行设置:
USE ROLE ACCOUNTADMIN;
ALTER ACCOUNT SET SEARCH_PATH='$current, $public, SNOWFLAKE.ML';
示例¶
此示例查询使用表 input_table
。控制组用于训练模型,而测试组是您想要了解的数据。
创建表:
CREATE OR REPLACE TABLE input_table( ds DATE, metric NUMBER, dim_country VARCHAR, dim_vertical VARCHAR);
将控制组和测试组的记录插入到表中:
INSERT INTO input_table SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'usa' AS dim_country, 'tech' AS dim_vertical FROM TABLE(GENERATOR(ROWCOUNT => 365)); INSERT INTO input_table SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'usa' AS dim_country, 'auto' AS dim_vertical FROM TABLE(GENERATOR(ROWCOUNT => 365)); INSERT INTO input_table SELECT DATEADD(day, seq4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'usa' AS dim_country, 'fashion' AS dim_vertical FROM TABLE(GENERATOR(ROWCOUNT => 365)); INSERT INTO input_table SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'usa' AS dim_country, 'finance' AS dim_vertical FROM TABLE(GENERATOR(ROWCOUNT => 365)); INSERT INTO input_table SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'canada' AS dim_country, 'fashion' AS dim_vertical FROM TABLE(GENERATOR(ROWCOUNT => 365)); INSERT INTO input_table SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'canada' AS dim_country, 'finance' AS dim_vertical FROM TABLE(GENERATOR(ROWCOUNT => 365)); INSERT INTO input_table SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'canada' AS dim_country, 'tech' AS dim_vertical FROM TABLE(GENERATOR(ROWCOUNT => 365)); INSERT INTO input_table SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'canada' AS dim_country, 'auto' AS dim_vertical FROM TABLE(GENERATOR(ROWCOUNT => 365)); INSERT INTO input_table SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'france' AS dim_country, 'fashion' AS dim_vertical FROM TABLE(GENERATOR(ROWCOUNT => 365)); INSERT INTO input_table SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'france' AS dim_country, 'finance' AS dim_vertical FROM TABLE(GENERATOR(ROWCOUNT => 365)); INSERT INTO input_table SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'france' AS dim_country, 'tech' AS dim_vertical FROM TABLE(GENERATOR(ROWCOUNT => 365)); INSERT INTO input_table SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'france' AS dim_country, 'auto' AS dim_vertical FROM TABLE(GENERATOR(ROWCOUNT => 365)); -- Data for the test group INSERT INTO input_table SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 8, 1)) AS ds, UNIFORM(300, 320, RANDOM()) AS metric, 'usa' AS dim_country, 'auto' AS dim_vertica FROM TABLE(GENERATOR(ROWCOUNT => 365)); INSERT INTO input_table SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 8, 1)) AS ds, UNIFORM(400, 420, RANDOM()) AS metric, 'usa' AS dim_country, 'finance' AS dim_vertical FROM TABLE(GENERATOR(ROWCOUNT => 365));
示例查询¶
通过上面定义的表 input_table
,下面的示例查询会生成精辟见解。
WITH input AS (
SELECT
{
'country': input_table.dim_country,
'vertical': input_table.dim_vertical
}
AS categorical_dimensions,
{
'length_of_vertical': length(input_table.dim_country)
}
AS continuous_dimensions,
input_table.metric,
IFF(ds BETWEEN '2020-08-01' AND '2020-08-20', TRUE, FALSE) AS label
FROM input_table
WHERE
(ds BETWEEN '2020-05-01' AND '2020-05-20') OR
(ds BETWEEN '2020-08-01' AND '2020-08-20')
)
SELECT res.* from input, TABLE(
SNOWFLAKE.ML.TOP_INSIGHTS(
input.categorical_dimensions,
input.continuous_dimensions,
CAST(input.metric AS FLOAT),
input.label
)
OVER (PARTITION BY 0)
) res ORDER BY res.surprise DESC;
备注
有关函数 TOP_INSIGHTS 的实参信息,请参阅 TOP_INSIGHTS (SNOWFLAKE.ML)。
示例查询的详细信息:
输入表中的列
dim_country
和列dim_vertical
成为维度'country'
和维度'vertical'
。连续维度由输入表中
vertical
列值的长度驱动。此列成为维度length_of_vertical
。指标值位于列
'metric'
中。标签(用于确定给定行是测试数据还是控制数据)是一个布尔表达式,只有日期在 2020 年 8 月 1 日至 2020 年 8 月 20 日之间才是 TRUE。
WHERE 子句将考虑的行限制为控制范围和测试范围。
OVER(PARTITION BY 0)子句确保一起处理表格输入。
查询结果被命名为
res
。结果按其意外情况进行排名,该意外表示根据父区段中的指标变化,区段中的测试指标超过其预期值的量。
查询结果:
+--------------------------------+----------------+-------------+------------------+-----------------+--------------+----------------------+------------------------+---------------------+----------------------+-------------+-----------------+
| CONTRIBUTOR | METRIC_CONTROL | METRIC_TEST | SURPRISE | RELATIVE_CHANGE | GROWTH_RATE | EXPECTED_METRIC_TEST | OVERALL_METRIC_CONTROL | OVERALL_METRIC_TEST | OVERALL_GROWTH_RATE | NEW_IN_TEST | MISSING_IN_TEST |
|--------------------------------+----------------+-------------+------------------+-----------------+--------------+----------------------+------------------------+---------------------+----------------------+-------------+-----------------|
| [ | 105 | 8327 | 7022.967741935 | 6.385578231 | 79.304761905 | 1304.032258065 | 1271 | 15785 | 79.304761905 | False | False |
| "country = usa", | | | | | | | | | | | |
| "vertical = finance" | | | | | | | | | | | |
| ] | | | | | | | | | | | |
| [ | 105 | 8327 | 7022.967741935 | 6.385578231 | 79.304761905 | 1304.032258065 | 1271 | 15785 | 79.304761905 | False | False |
| "not country = canada", | | | | | | | | | | | |
| "length_of_vertical <= 4.5", | | | | | | | | | | | |
| "vertical = finance" | | | | | | | | | | | |
| ] | | | | | | | | | | | |
... (additional rows of output) ...
+--------------------------------+----------------+-------------+------------------+-----------------+--------------+----------------------+------------------------+---------------------+----------------------+-------------+-----------------+
成本注意事项¶
使用 TOP_INSIGHTS 函数会产生计算成本。运行时随维度数量和基数的改变而改变。(如果维度的基数大于 5,则基数会自动减少。)
有关 Snowflake 计算成本的一般信息,请参阅 了解计算成本。