Contribution Explorer(Snowflake Cortex ML 函数)

Contribution Explorer 旨在简化和改进围绕观察到的指标变化的根本原因分析过程。随着时间的推移,Contribution Explorer 会获取某个指标的值,并分析与该指标相关的数据变化,以确定每次变化是由哪些数据段引起的。

备注

Contribution Explorer 是 Snowflake 的智能、完全托管的 AI 和 ML 服务,是 Snowflake Cortex 的一部分。此功能是 Snowflake Cortex ML 函数套件的一部分。

例如,如果您正在跟踪销售,Contribution Explorer 可以帮助您确定位置、销售人员、客户、行业垂直以及导致收入不足的其他因素。然后,您可以立即采取有针对性的纠正措施。

利用 Contribution Explorer 进行分析的理想候选数据集具有以下特征:

  • 一个或多个严格非负指标。从一行到下一行的指标变化可能是负值,但量度本身绝不能是负值。

  • 一个或多个时间戳。

  • 可用于分割数据的列或维度。这些数据通常是分类数据(地点、细分市场等),但也可能是连续数据(即定量数据,例如温度或出勤率)。

若要在查询和管道中直接使用 Contribution Explorer,请调用 TOP_INSIGHTS (SNOWFLAKE.ML) 表函数。此函数会在数据集中找到最重要的维度,从这些维度构建片段,然后检测这些片段中哪些对指标有影响。

小技巧

要使用 Contribution Explore 分析仪表板查询,您可能需要添加一些维度列,这些列通常不会包含在跨所有维度汇总指标的查询中。

TOP_INSIGHTS 非常适合从具有大量维度的数据集中提取根本原因。此外,它还支持连续维度,并能在结果中指出具有负面条件的维度(例如,“区域不是北美”)。

TOP_INSIGHTS 的结果是导致数据变化的维度。

重要

法律声明。 此 Snowflake Cortex ML 函数由机器学习技术提供支持。机器学习技术和提供的结果可能不准确、不合适或较为片面。基于机器学习输出(包括内置在自动管道中的输出)的决策应有人工监督和审查流程,以确保模型生成的内容准确无误。Snowflake Cortex ML 函数查询将被视为任何其他 SQL 查询,并可能被视为 元数据

元数据。 当您使用 Snowflake Cortex ML 函数时,除了 元数据字段 中提及的内容外,Snowflake 还会记录 ML 函数返回的通用错误消息。这些错误日志可帮助我们排查出现的问题,并改进这些函数以更好地满足您的需求。

有关更多信息,请参阅 Snowflake AI 信任与安全 FAQ

更新搜索路径

如果您在搜索路径中添加了 SNOWFLAKE.ML,查询时就可以只写入 TOP_INSIGHTS 而无需写入 SNOWFLAKE.ML.TOP_INSIGHTS。以下语句显示了两者的区别:

SELECT SNOWFLAKE.ML.TOP_INSIGHTS(...);
SELECT TOP_INSIGHTS(...);
Copy

要将 SNOWFLAKE.ML 添加到搜索路径中,请执行以下命令:

ALTER SESSION SET SEARCH_PATH='$current, $public, SNOWFLAKE.ML';
Copy

搜索路径更改在 Snowflake 会话期间仍然有效。要持续更改搜索路径,请在账户级进行设置:

USE ROLE ACCOUNTADMIN;
ALTER ACCOUNT SET SEARCH_PATH='$current, $public, SNOWFLAKE.ML';
Copy

示例

此示例查询使用表 input_table。控制组用于训练模型,而测试组是您想要了解的数据。

  1. 创建表:

    CREATE OR REPLACE TABLE input_table(
      ds DATE, metric NUMBER, dim_country VARCHAR, dim_vertical VARCHAR);
    
    Copy
  2. 将控制组和测试组的记录插入到表中:

    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'usa' AS dim_country,
        'tech' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'usa' AS dim_country,
        'auto' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, seq4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'usa' AS dim_country,
        'fashion' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'usa' AS dim_country,
        'finance' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'canada' AS dim_country,
        'fashion' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'canada' AS dim_country,
        'finance' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'canada' AS dim_country,
        'tech' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'canada' AS dim_country,
        'auto' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'france' AS dim_country,
        'fashion' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'france' AS dim_country,
        'finance' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'france' AS dim_country,
        'tech' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'france' AS dim_country,
        'auto' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    -- Data for the test group
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 8, 1)) AS ds,
        UNIFORM(300, 320, RANDOM()) AS metric,
        'usa' AS dim_country,
        'auto' AS dim_vertica
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 8, 1))  AS ds,
        UNIFORM(400, 420, RANDOM()) AS metric,
        'usa' AS dim_country,
        'finance' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    Copy

示例查询

通过上面定义的表 input_table,下面的示例查询会生成精辟见解。

WITH input AS (
  SELECT
    {
      'country': input_table.dim_country,
      'vertical': input_table.dim_vertical
    }
    AS categorical_dimensions,
    {
         'length_of_vertical': length(input_table.dim_country)
    }
    AS continuous_dimensions,
    input_table.metric,
    IFF(ds BETWEEN '2020-08-01' AND '2020-08-20', TRUE, FALSE) AS label
  FROM input_table
  WHERE
    (ds BETWEEN '2020-05-01' AND '2020-05-20') OR
    (ds BETWEEN '2020-08-01' AND '2020-08-20')
)
SELECT res.* from input, TABLE(
  SNOWFLAKE.ML.TOP_INSIGHTS(
    input.categorical_dimensions,
    input.continuous_dimensions,
    CAST(input.metric AS FLOAT),
    input.label
  )
  OVER (PARTITION BY 0)
) res ORDER BY res.surprise DESC;
Copy

备注

有关函数 TOP_INSIGHTS 的实参信息,请参阅 TOP_INSIGHTS (SNOWFLAKE.ML)

示例查询的详细信息:

  • 输入表中的列 dim_country 和列 dim_vertical 成为维度 'country' 和维度 'vertical'

  • 连续维度由输入表中 vertical 列值的长度驱动。此列成为维度 length_of_vertical

  • 指标值位于列 'metric' 中。

  • 标签(用于确定给定行是测试数据还是控制数据)是一个布尔表达式,只有日期在 2020 年 8 月 1 日至 2020 年 8 月 20 日之间才是 TRUE。

  • WHERE 子句将考虑的行限制为控制范围和测试范围。

  • OVER(PARTITION BY 0)子句确保一起处理表格输入。

  • 查询结果被命名为 res

  • 结果按其意外情况进行排名,该意外表示根据父区段中的指标变化,区段中的测试指标超过其预期值的量。

查询结果:

+--------------------------------+----------------+-------------+------------------+-----------------+--------------+----------------------+------------------------+---------------------+----------------------+-------------+-----------------+
| CONTRIBUTOR                    | METRIC_CONTROL | METRIC_TEST |         SURPRISE | RELATIVE_CHANGE | GROWTH_RATE  | EXPECTED_METRIC_TEST | OVERALL_METRIC_CONTROL | OVERALL_METRIC_TEST | OVERALL_GROWTH_RATE  | NEW_IN_TEST | MISSING_IN_TEST |
|--------------------------------+----------------+-------------+------------------+-----------------+--------------+----------------------+------------------------+---------------------+----------------------+-------------+-----------------|
| [                              |            105 |        8327 |   7022.967741935 |     6.385578231 | 79.304761905 |       1304.032258065 |                   1271 |               15785 |         79.304761905 | False       | False           |
|   "country = usa",             |                |             |                  |                 |              |                      |                        |                     |                      |             |                 |
|   "vertical = finance"         |                |             |                  |                 |              |                      |                        |                     |                      |             |                 |
| ]                              |                |             |                  |                 |              |                      |                        |                     |                      |             |                 |
| [                              |            105 |        8327 |   7022.967741935 |     6.385578231 | 79.304761905 |       1304.032258065 |                   1271 |               15785 |         79.304761905 | False       | False           |
|   "not country = canada",      |                |             |                  |                 |              |                      |                        |                     |                      |             |                 |
|   "length_of_vertical <= 4.5", |                |             |                  |                 |              |                      |                        |                     |                      |             |                 |
|   "vertical = finance"         |                |             |                  |                 |              |                      |                        |                     |                      |             |                 |
| ]                              |                |             |                  |                 |              |                      |                        |                     |                      |             |                 |

... (additional rows of output) ...

+--------------------------------+----------------+-------------+------------------+-----------------+--------------+----------------------+------------------------+---------------------+----------------------+-------------+-----------------+

成本注意事项

使用 TOP_INSIGHTS 函数会产生计算成本。运行时随维度数量和基数的改变而改变。(如果维度的基数大于 5,则基数会自动减少。)

有关 Snowflake 计算成本的一般信息,请参阅 了解计算成本

语言: 中文