分类(Snowflake ML 函数)

分类使用机器学习算法,运用在训练数据中检测到的模式将数据划分为不同类别。支持二元分类(两个类)和多类别分类(两个以上的类)。常见分类用例包括客户流失预测、信用卡欺诈检测和垃圾邮件检测。

Note

分类是 Snowflake 由机器学习提供支持的业务分析工具套件的一部分。

分类涉及到创建分类模型对象,传入对训练数据的引用。该模型会拟合到提供的训练数据中。然后,您可以使用生成的架构级分类模型对象对新数据点进行分类,并通过其评估 APIs 了解模型的准确性。

关于分类模型

The classification function is powered by a gradient boosting machine. For binary classification, the model is trained using an area-under-the-curve loss function. For multi-class classification, the model is trained using a logistic loss function.

适合用于分类的训练数据集包括代表各数据点标注类别的目标列,以及至少一个特征列。

分类模型支持特征和标签的数字、布尔和字符串数据类型。

  • 数字特征被视为连续特征。要将数值特征视为分类特征,请将其转换为字符串。
  • String features are treated as categorical. The classification function supports high-cardinality features (for example, job titles or fruits). It does not support full free text, like sentences or paragraphs.
  • 布尔特征被视为分类特征。
  • Timestamps must be TIMESTAMP_NTZ type. The model creates additional time-based features (epoch, day, week, month), which are used in training and classification. These features appear in SHOW_FEATURE_IMPORTANCE results as derived_features.
  • 标签(目标)列的基数必须大于 1 且小于数据集中的行数。

推理数据与训练数据必须具有相同的特征名称和类型。分类特征值在训练数据集中不存在并非错误。如果推理数据在训练数据集中不存在,相应列将被忽略。

可以评估分类模型,以确定预测质量。在评估过程中会对原始数据训练其他模型,但会保留某些数据点。然后,将使用保留的数据点进行推断,并将预测的类别与实际类别进行比较。

当前限制

  • Training and inference data must be numeric, TIMESTAMP_NTZ, Boolean, or string. Other types must be cast to one of these types.
  • 您不能选择或修改分类算法。
  • 无法手动指定或调整模型参数。
  • In tests, training on a Medium Snowpark-optimized warehouse has succeeded with up to 1,000 columns and 600 million rows. It is possible, but unlikely, to run out of memory below this limit.
  • 目标列包含的非重复类不得超过 255 个。
  • SNOWFLAKE.ML.CLASSIFICATION instances cannot be cloned. When you clone or replicate a database containing a classification model, the model is currently skipped.

准备分类

在使用分类之前,您必须:

You might also want to modify your search path to include the SNOWFLAKE.ML schema.

选择虚拟仓库

A Snowflake virtual warehouse provides the compute resources for training and using classification machine learning models. This section provides general guidance on selecting the best type and size of warehouse for classification, focusing on the training step, the most time-consuming and memory-intensive part of the process.

您应该根据训练数据的大小选择仓库类型。标准仓库的 Snowpark 内存限制较低,更适合行数或特征数量较少的原型设计作业。标准仓库的内存限制不会随仓库大小而增加。

随着行数和特征数量的增加,可以考虑使用 Snowpark 优化型仓库来确保作业可以成功运行。Snowpark 优化型仓库的内存限制不会提高到超过“中型”。

为了尽可能提高性能,使用没有其他并发工作负载的专用仓库训练模型。

为尽量降低成本,我们建议使用超小型标准仓库进行原型设计。对于较大的数据集和生产工作负载,请使用中型 Snowpark 优化型仓库。

授予权限以创建分类模型

训练分类模型会生成架构级对象。因此,用于创建模型的角色必须对将创建模型的架构具有 CREATE SNOWFLAKE.ML.CLASSIFICATION 权限,这样才能将模型存储在该架构中。此权限类似于其他架构权限,如 CREATE TABLE 或 CREATE VIEW。

Snowflake recommends that you create a role named analyst to be used by people who need to create classification models.

In the following example, the admin role is the owner of the schema admin_db.admin_schema. The analyst role needs to create models in this schema.

USE ROLE admin;
GRANT USAGE ON DATABASE admin_db TO ROLE analyst;
GRANT USAGE ON SCHEMA admin_schema TO ROLE analyst;
GRANT CREATE SNOWFLAKE.ML.CLASSIFICATION ON SCHEMA admin_db.admin_schema TO ROLE analyst;

To use this schema, a user assumes the role analyst:

USE ROLE analyst;
USE SCHEMA admin_db.admin_schema;

If the analyst role has CREATE SCHEMA privileges in database analyst_db, the role can create a new schema analyst_db.analyst_schema and create classification models in that schema:

USE ROLE analyst;
CREATE SCHEMA analyst_db.analyst_schema;
USE SCHEMA analyst_db.analyst_schema;

To revoke a role’s model creation privilege on the schema, use REVOKE <privileges> … FROM ROLE:

REVOKE CREATE SNOWFLAKE.ML.CLASSIFICATION ON SCHEMA admin_db.admin_schema FROM ROLE analyst;

训练、使用、查看、删除和更新模型

Note

SNOWFLAKE.ML.CLASSIFICATION runs using limited privileges, so by default, it does not have access to your data. You must therefore pass tables and views as references, which pass along the caller’s privileges. You can also provide a query reference instead of a reference to a table or a view.

See the CLASSIFICATION reference for information about training, inference, and evaluation APIs.

使用 CREATE SNOWFLAKE.ML.CLASSIFICATION 创建和训练模型。

CREATE OR REPLACE SNOWFLAKE.ML.CLASSIFICATION <model_name>(...);

要对数据集进行推断(预测),请使用模型的 PREDICT 方法。

SELECT <model_name>!PREDICT(...);

要评估模型,请调用所提供的评估方法。

CALL <model_name>!SHOW_EVALUATION_METRICS();
CALL <model_name>!SHOW_GLOBAL_EVALUATION_METRICS();
CALL <model_name>!SHOW_THRESHOLD_METRICS();
CALL <model_name>!SHOW_CONFUSION_MATRIX();

要显示模型的特征重要性排名,请调用其 SHOW_FEATURE_IMPORTANCE 方法。

CALL <model_name>!SHOW_FEATURE_IMPORTANCE();

要调查训练期间生成的日志,请使用 SHOW_TRAINING_LOGS 方法。如果没有可用训练日志,此调用将返回 NULL。

CALL <model_name>!SHOW_TRAINING_LOGS();

Tip

For examples of using these methods, see Examples.

要查看所有分类模型,请使用 SHOW 命令。

SHOW SNOWFLAKE.ML.CLASSIFICATION;

要删除分类模型,请使用 DROP 命令。

DROP SNOWFLAKE.ML.CLASSIFICATION <model_name>;

模型是不可变的,不能就地更新。要更新模型,请删除现有模型并训练新模型。针对此目的,CREATE 命令的 CREATE OR REPLACE 变体非常有用。

示例

为示例设置数据

The examples in this topic uses two tables. The first table, training_purchase_data, has two feature columns: a binary label column and a multi-class label column. The second table is called prediction_purchase_data and has two feature columns. Use the SQL below to create these tables.

CREATE OR REPLACE TABLE training_purchase_data AS (
    SELECT
        CAST(UNIFORM(0, 4, RANDOM()) AS VARCHAR) AS user_interest_score,
        UNIFORM(0, 3, RANDOM()) AS user_rating,
        FALSE AS label,
        'not_interested' AS class
    FROM TABLE(GENERATOR(rowCount => 100))
    UNION ALL
    SELECT
        CAST(UNIFORM(4, 7, RANDOM()) AS VARCHAR) AS user_interest_score,
        UNIFORM(3, 7, RANDOM()) AS user_rating,
        FALSE AS label,
        'add_to_wishlist' AS class
    FROM TABLE(GENERATOR(rowCount => 100))
    UNION ALL
    SELECT
        CAST(UNIFORM(7, 10, RANDOM()) AS VARCHAR) AS user_interest_score,
        UNIFORM(7, 10, RANDOM()) AS user_rating,
        TRUE AS label,
        'purchase' AS class
    FROM TABLE(GENERATOR(rowCount => 100))
);

CREATE OR REPLACE table prediction_purchase_data AS (
    SELECT
        CAST(UNIFORM(0, 4, RANDOM()) AS VARCHAR) AS user_interest_score,
        UNIFORM(0, 3, RANDOM()) AS user_rating
    FROM TABLE(GENERATOR(rowCount => 100))
    UNION ALL
    SELECT
        CAST(UNIFORM(4, 7, RANDOM()) AS VARCHAR) AS user_interest_score,
        UNIFORM(3, 7, RANDOM()) AS user_rating
    FROM TABLE(GENERATOR(rowCount => 100))
    UNION ALL
    SELECT
        CAST(UNIFORM(7, 10, RANDOM()) AS VARCHAR) AS user_interest_score,
        UNIFORM(7, 10, RANDOM()) AS user_rating
    FROM TABLE(GENERATOR(rowCount => 100))
);

训练和使用二元分类器

首先,创建一个包含用于训练的二元数据的视图。

CREATE OR REPLACE view binary_classification_view AS
    SELECT user_interest_score, user_rating, label
FROM training_purchase_data;
SELECT * FROM binary_classification_view ORDER BY RANDOM(42) LIMIT 5;

SELECT 语句通过以下形式返回结果。

+---------------------+-------------+-------+
| USER_INTEREST_SCORE | USER_RATING | LABEL |
|---------------------+-------------+-------|
| 5                   |           4 | False |
| 8                   |           8 | True  |
| 6                   |           5 | False |
| 7                   |           7 | True  |
| 7                   |           4 | False |
+---------------------+-------------+-------+

使用此视图创建和训练二元分类模型。

CREATE OR REPLACE SNOWFLAKE.ML.CLASSIFICATION model_binary(
    INPUT_DATA => SYSTEM$REFERENCE('view', 'binary_classification_view'),
    TARGET_COLNAME => 'label'
);

After you’ve created the model, use its PREDICT method to infer labels for the unlabeled purchase data. You can use wildcard expansion in an object literal to create key-value pairs of features for the INPUT_DATA argument.

SELECT model_binary!PREDICT(INPUT_DATA => {*})
    AS prediction FROM prediction_purchase_data;

该模型使用以下格式返回输出。预测对象包括每个类别的预测概率,以及基于最大预测概率的预测类别。预测的返回顺序与原始特征的提供顺序相同。

+-------------------------------------+
| PREDICTION                          |
%-------------------------------------%
| {                                   |
|   "class": "True",                  |
|   "logs": null,                     |
|   "probability": {                  |
|     "False": 1.828038600000000e-03, |
|     "True": 9.981719614000000e-01   |
|   }                                 |
| }                                   |
| {                                   |
|   "class": "False",                 |
|   "logs": null,                     |
|   "probability": {                  |
|     "False": 9.992944771000000e-01, |
|     "True": 7.055229000000000e-04   |
|   }                                 |
| }                                   |
| {                                   |
|   "class": "True",                  |
|   "logs": null,                     |
|   "probability": {                  |
|     "False": 3.429796010000000e-02, |
|     "True": 9.657020399000000e-01   |
|   }                                 |
| }                                   |
| {                                   |
|   "class": "False",                 |
|   "logs": null,                     |
|   "probability": {                  |
|     "False": 9.992687686000000e-01, |
|     "True": 7.312314000000000e-04   |
|   }                                 |
| }                                   |
| {                                   |
|   "class": "False",                 |
|   "logs": null,                     |
|   "probability": {                  |
|     "False": 9.992951615000000e-01, |
|     "True": 7.048385000000000e-04   |
|   }                                 |
| }                                   |
+-------------------------------------+

要将特征和预测压缩在一起,请使用类似下面的查询。

SELECT *, model_binary!PREDICT(INPUT_DATA => {*})
    AS predictions FROM prediction_purchase_data;
+---------------------+-------------+-------------------------------------+
| USER_INTEREST_SCORE | USER_RATING | PREDICTIONS                         |
|---------------------+-------------+-------------------------------------|
| 9                   |           8 | {                                   |
|                     |             |   "class": "True",                  |
|                     |             |   "logs": null,                     |
|                     |             |   "probability": {                  |
|                     |             |     "False": 1.828038600000000e-03, |
|                     |             |     "True": 9.981719614000000e-01   |
|                     |             |   }                                 |
|                     |             | }                                   |
| 3                   |           0 | {                                   |
|                     |             |   "class": "False",                 |
|                     |             |   "logs": null,                     |
|                     |             |   "probability": {                  |
|                     |             |     "False": 9.992944771000000e-01, |
|                     |             |     "True": 7.055229000000000e-04   |
|                     |             |   }                                 |
|                     |             | }                                   |
| 10                  |           7 | {                                   |
|                     |             |   "class": "True",                  |
|                     |             |   "logs": null,                     |
|                     |             |   "probability": {                  |
|                     |             |     "False": 3.429796010000000e-02, |
|                     |             |     "True": 9.657020399000000e-01   |
|                     |             |   }                                 |
|                     |             | }                                   |
| 6                   |           6 | {                                   |
|                     |             |   "class": "False",                 |
|                     |             |   "logs": null,                     |
|                     |             |   "probability": {                  |
|                     |             |     "False": 9.992687686000000e-01, |
|                     |             |     "True": 7.312314000000000e-04   |
|                     |             |   }                                 |
|                     |             | }                                   |
| 1                   |           3 | {                                   |
|                     |             |   "class": "False",                 |
|                     |             |   "logs": null,                     |
|                     |             |   "probability": {                  |
|                     |             |     "False": 9.992951615000000e-01, |
|                     |             |     "True": 7.048385000000000e-04   |
|                     |             |   }                                 |
|                     |             | }                                   |
+---------------------+-------------+-------------------------------------+

训练和使用多类别分类器

创建包含用于训练的二元数据的视图。

CREATE OR REPLACE VIEW multiclass_classification_view AS
    SELECT user_interest_score, user_rating, class
FROM training_purchase_data;
SELECT * FROM multiclass_classification_view ORDER BY RANDOM(42) LIMIT 10;

此 SELECT 语句使用以下形式返回结果。

+---------------------+-------------+-----------------+
| USER_INTEREST_SCORE | USER_RATING | CLASS           |
|---------------------+-------------+-----------------|
| 5                   |           4 | add_to_wishlist |
| 8                   |           8 | purchase        |
| 6                   |           5 | add_to_wishlist |
| 7                   |           7 | purchase        |
| 7                   |           4 | add_to_wishlist |
| 1                   |           1 | not_interested  |
| 2                   |           1 | not_interested  |
| 7                   |           3 | add_to_wishlist |
| 2                   |           0 | not_interested  |
| 0                   |           1 | not_interested  |
+---------------------+-------------+-----------------+

现在,从此视图创建多类别分类模型。

CREATE OR REPLACE SNOWFLAKE.ML.CLASSIFICATION model_multiclass(
    INPUT_DATA => SYSTEM$REFERENCE('view', 'multiclass_classification_view'),
    TARGET_COLNAME => 'class'
);

After you’ve created the model, use its PREDICT method to infer labels for the unlabeled purchase data. Use wildcard expansion in an object literal to automatically create key-value pairs for the INPUT_DATA argument.

SELECT *, model_multiclass!PREDICT(INPUT_DATA => {*})
    AS predictions FROM prediction_purchase_data;

该模型使用以下格式返回输出。预测对象包括每个类别的预测概率,以及基于最大预测概率的预测类别。预测的返回顺序与原始特征的提供顺序相同,并且可以在同一个查询中联接。

+---------------------+-------------+-----------------------------------------------+
| USER_INTEREST_SCORE | USER_RATING | PREDICTIONS                                   |
|---------------------+-------------+-----------------------------------------------|
| 9                   |           8 | {                                             |
|                     |             |   "class": "purchase",                        |
|                     |             |   "logs": null,                               |
|                     |             |   "probability": {                            |
|                     |             |     "add_to_wishlist": 3.529288000000000e-04, |
|                     |             |     "not_interested": 2.259768000000000e-04,  |
|                     |             |     "purchase": 9.994210944000000e-01         |
|                     |             |   }                                           |
|                     |             | }                                             |
| 3                   |           0 | {                                             |
|                     |             |   "class": "not_interested",                  |
|                     |             |   "logs": null,                               |
|                     |             |   "probability": {                            |
|                     |             |     "add_to_wishlist": 3.201690000000000e-04, |
|                     |             |     "not_interested": 9.994749885000000e-01,  |
|                     |             |     "purchase": 2.048425000000000e-04         |
|                     |             |   }                                           |
|                     |             | }                                             |
| 10                  |           7 | {                                             |
|                     |             |   "class": "purchase",                        |
|                     |             |   "logs": null,                               |
|                     |             |   "probability": {                            |
|                     |             |     "add_to_wishlist": 1.271809310000000e-02, |
|                     |             |     "not_interested": 3.992673600000000e-03,  |
|                     |             |     "purchase": 9.832892333000000e-01         |
|                     |             |   }                                           |
|                     |             | }                                             |
| 6                   |           6 | {                                             |
|                     |             |   "class": "add_to_wishlist",                 |
|                     |             |   "logs": null,                               |
|                     |             |   "probability": {                            |
|                     |             |     "add_to_wishlist": 9.999112027000000e-01, |
|                     |             |     "not_interested": 4.612520000000000e-05,  |
|                     |             |     "purchase": 4.267210000000000e-05         |
|                     |             |   }                                           |
|                     |             | }                                             |
| 1                   |           3 | {                                             |
|                     |             |   "class": "not_interested",                  |
|                     |             |   "logs": null,                               |
|                     |             |   "probability": {                            |
|                     |             |     "add_to_wishlist": 2.049559150000000e-02, |
|                     |             |     "not_interested": 9.759854413000000e-01,  |
|                     |             |     "purchase": 3.518967300000000e-03         |
|                     |             |   }                                           |
|                     |             | }                                             |
+---------------------+-------------+-----------------------------------------------+

将结果保存到表中并浏览预测

模型的 PREDICT 方法调用的结果可直接读取到查询中,但将结果保存到表中可以方便地浏览预测。

CREATE OR REPLACE TABLE my_predictions AS
SELECT *, model_multiclass!PREDICT(INPUT_DATA => {*}) AS predictions FROM prediction_purchase_data;

SELECT * FROM my_predictions;

然后可在后续查询中浏览关键列和预测列。以下查询用于浏览预测。

SELECT
    predictions:class AS predicted_class,
    ROUND(predictions:probability:not_interested,4) AS not_interested_class_probability,
    ROUND(predictions['probability']['purchase'],4) AS purchase_class_probability,
    ROUND(predictions['probability']['add_to_wishlist'],4) AS add_to_wishlist_class_probability
FROM my_predictions
LIMIT 5;

上面的查询使用以下形式返回结果。

+-------------------+----------------------------------+----------------------------+-----------------------------------+
| PREDICTED_CLASS   | NOT_INTERESTED_CLASS_PROBABILITY | PURCHASE_CLASS_PROBABILITY | ADD_TO_WISHLIST_CLASS_PROBABILITY |
|-------------------+----------------------------------+----------------------------+-----------------------------------|
| "purchase"        |                           0.0002 |                     0.9994 |                            0.0004 |
| "not_interested"  |                           0.9995 |                     0.0002 |                            0.0003 |
| "purchase"        |                           0.0002 |                     0.9994 |                            0.0004 |
| "purchase"        |                           0.0002 |                     0.9994 |                            0.0004 |
| "not_interested"  |                           0.9994 |                     0.0002 |                            0.0004 |
| "purchase"        |                           0.0002 |                     0.9994 |                            0.0004 |
| "add_to_wishlist" |                           0      |                     0      |                            0.9999 |
| "add_to_wishlist" |                           0.4561 |                     0.0029 |                            0.5409 |
| "purchase"        |                           0.0002 |                     0.9994 |                            0.0004 |
| "not_interested"  |                           0.9994 |                     0.0002 |                            0.0003 |
+-------------------+----------------------------------+----------------------------+-----------------------------------+

使用评估函数

默认情况下,所有实例都会启用评估。但可以使用配置对象实参来手动启用或禁用评估。如果使用 FALSE 值指定了“evaluate”关键字,则评估不可用。

CREATE OR REPLACE SNOWFLAKE.ML.CLASSIFICATION model(
    INPUT_DATA => SYSTEM$REFERENCE('view', 'binary_classification_view'),
    TARGET_COLNAME => 'label',
    CONFIG_OBJECT => {'evaluate': TRUE}
);

启用评估后,可以使用此处所示的评估 APIs 来获取评估指标。

CALL model!SHOW_EVALUATION_METRICS();
CALL model!SHOW_GLOBAL_EVALUATION_METRICS();
CALL model!SHOW_THRESHOLD_METRICS();
CALL model!SHOW_CONFUSION_MATRIX();

See 了解评估指标 for a description of the returned metrics.

我们的多类别模型的评估指标如下。

CALL model_multiclass!SHOW_EVALUATION_METRICS();
+--------------+-----------------+--------------+---------------+------+
| DATASET_TYPE | CLASS           | ERROR_METRIC |  METRIC_VALUE | LOGS |
|--------------+-----------------+--------------+---------------+------|
| EVAL         | add_to_wishlist | precision    |  0.8888888889 | NULL |
| EVAL         | add_to_wishlist | recall       |  1            | NULL |
| EVAL         | add_to_wishlist | f1           |  0.9411764706 | NULL |
| EVAL         | add_to_wishlist | support      | 16            | NULL |
| EVAL         | not_interested  | precision    |  1            | NULL |
| EVAL         | not_interested  | recall       |  0.9090909091 | NULL |
| EVAL         | not_interested  | f1           |  0.9523809524 | NULL |
| EVAL         | not_interested  | support      | 22            | NULL |
| EVAL         | purchase        | precision    |  1            | NULL |
| EVAL         | purchase        | recall       |  1            | NULL |
| EVAL         | purchase        | f1           |  1            | NULL |
| EVAL         | purchase        | support      | 22            | NULL |
+--------------+-----------------+--------------+---------------+------+
CALL model_multiclass!SHOW_GLOBAL_EVALUATION_METRICS();
+--------------+--------------+--------------+---------------+------+
| DATASET_TYPE | AVERAGE_TYPE | ERROR_METRIC |  METRIC_VALUE | LOGS |
|--------------+--------------+--------------+---------------+------|
| EVAL         | macro        | precision    | 0.962962963   | NULL |
| EVAL         | macro        | recall       | 0.9696969697  | NULL |
| EVAL         | macro        | f1           | 0.964519141   | NULL |
| EVAL         | macro        | auc          | 0.9991277911  | NULL |
| EVAL         | weighted     | precision    | 0.9703703704  | NULL |
| EVAL         | weighted     | recall       | 0.9666666667  | NULL |
| EVAL         | weighted     | f1           | 0.966853408   | NULL |
| EVAL         | weighted     | auc          | 0.9991826156  | NULL |
| EVAL         | NULL         | log_loss     | 0.06365200147 | NULL |
+--------------+--------------+--------------+---------------+------+
CALL model_multiclass!SHOW_CONFUSION_MATRIX();
+--------------+-----------------+-----------------+-------+------+
| DATASET_TYPE | ACTUAL_CLASS    | PREDICTED_CLASS | COUNT | LOGS |
|--------------+-----------------+-----------------+-------+------|
| EVAL         | add_to_wishlist | add_to_wishlist |    16 | NULL |
| EVAL         | add_to_wishlist | not_interested  |     0 | NULL |
| EVAL         | add_to_wishlist | purchase        |     0 | NULL |
| EVAL         | not_interested  | add_to_wishlist |     2 | NULL |
| EVAL         | not_interested  | not_interested  |    20 | NULL |
| EVAL         | not_interested  | purchase        |     0 | NULL |
| EVAL         | purchase        | add_to_wishlist |     0 | NULL |
| EVAL         | purchase        | not_interested  |     0 | NULL |
| EVAL         | purchase        | purchase        |    22 | NULL |
+--------------+-----------------+-----------------+-------+------+

Note

For information on threshold metrics, see SHOW_THRESHOLD_METRICS.

我们还可以查看特征重要性。

CALL model_multiclass!SHOW_FEATURE_IMPORTANCE();
+------+---------------------+---------------+---------------+
| RANK | FEATURE             |         SCORE | FEATURE_TYPE  |
|------+---------------------+---------------+---------------|
|    1 | USER_RATING         | 0.9186571982  | user_provided |
|    2 | USER_INTEREST_SCORE | 0.08134280181 | user_provided |
+------+---------------------+---------------+---------------+

模型角色和使用权限

Each classification model instance includes two model roles, mladmin and mlconsumer. These roles are scoped to the model itself: model!mladmin and model!mlconsumer. The owner of the model object (initially, its creator) is automatically granted the model!mladmin and model!mlconsumer roles, and can grant these roles to account roles and database roles.

The mladmin role permits usage of all APIs invocable from the model object, including but not limited to prediction methods and evaluation methods. The mlconsumer role permits usage only on prediction APIs, not other exploratory APIs.

The following SQL example illustrates the grant of classification model roles to other roles. The role r1 can create a classification model, and grants the role r2 the mlconsumer privilege so that the r2 can call that model’s PREDICT method. Then r1 grants the mladmin role to another role, r3, so r3 can call all methods of the model.

First, role r1 creates a model object, making r1 the owner of the model model.

USE ROLE r1;
CREATE OR REPLACE SNOWFLAKE.ML.CLASSIFICATION model(
    INPUT_DATA => SYSTEM$REFERENCE('TABLE', 'test_classification_dataset'),
    TARGET_COLNAME => 'LABEL'
);

You can see by executing the statements below that the role r2 cannot call the model’s PREDICT method.

USE ROLE r2;
SELECT model!PREDICT(1);    -- privilege error

Next, r1 grants r2 the mlconsumer instance role, after which r2 can call the model’s PREDICT method.

USE ROLE r1;
GRANT SNOWFLAKE.ML.CLASSIFICATION ROLE model!mlconsumer TO ROLE r2;

USE ROLE r2;
CALL model!PREDICT(
    INPUT_DATA => system$query_reference(
    'SELECT {*} FROM test_classification_dataset')
);

Similarly, role r3 cannot see the model’s evaluation metrics without the mladmin instance role.

USE ROLE r3;
CALL model!SHOW_EVALUATION_METRICS();   -- privilege error

Role r1 grants the required role to r3, and r3 can then call the model’s SHOW_EVALUATION_METRICS method.

USE ROLE r1;
GRANT SNOWFLAKE.ML.CLASSIFICATION ROLE model!mladmin TO ROLE r3;

USE ROLE r3;
CALL model!SHOW_EVALUATION_METRICS();

您可按如下方式撤消权限。

USE ROLE r1;
REVOKE SNOWFLAKE.ML.CLASSIFICATION ROLE model!mlconsumer FROM ROLE r2;
REVOKE SNOWFLAKE.ML.CLASSIFICATION ROLE model!mladmin FROM ROLE r3;

使用以下命令查看各实例角色已被授予哪些账户角色和数据库角色。

SHOW GRANTS TO SNOWFLAKE.ML.CLASSIFICATION ROLE <model_name>!mladmin;
SHOW GRANTS TO SNOWFLAKE.ML.CLASSIFICATION ROLE <model_name>!mlconsumer;

了解评估指标

Metrics measure how accurately a model predicts new data. The Snowflake classification currently evaluates models by selecting a random sample from the entire dataset. A new model is trained without these rows, and then the rows are used as inference input. The random sample portion can be configured using the test_fraction key in the EVALUATION_CONFIG object.

Metrics in show_evaluation_metrics

show_evaluation_metrics calculates the following values for each class. See SHOW_EVALUATION_METRICS.

  • 正实例:属于感兴趣类别或预测类别的数据实例(行)。
  • 负实例:不属于感兴趣类别或与预测值相反的数据实例(行)。
  • 真阳性 (TP):对阳性实例的正确预测。
  • 真阴性 (TN):对阴性实例的正确预测,
  • 假阳性 (FP):对阳性实例的不正确预测
  • 假阴性 (FN):对阴性实例的不正确预测。

使用上述值,报告各类别的以下指标。对于每个指标,值越大,就表示模型的预测性越强。

  • Precision: The ratio of true positives to the total predicted positives. It measures how many of the predicted positive instances are actually positive.
  • Recall (Sensitivity): The ratio of true positives to the total actual positives. It measures how many of the actual positive instances were correctly predicted.
  • F1 Score: The harmonic mean of precision and recall. It provides a balance between precision and recall, especially when there is an uneven class distribution.

Metrics in show_global_evaluation_metrics

show_global_evaluation_metrics calculates overall (global) metrics for all classes predicted by the model by averaging the per-class metrics calculated by show_evaluation_metrics. See SHOW_GLOBAL_EVALUATION_METRICS.

Currently, macro and weighted averaging is used for the metrics Precision, Recall, F1, AUC.

逻辑损失 (LogLoss) 是针对整个模型计算的。预测的目标是最小化损失函数。

Metrics in show_threshold_metrics

show_threshold_metrics provides raw counts and metrics for a specific threshold for each class. This can be used to plot ROC and PR curves or do threshold tuning if desired. The threshold varies from 0 to 1 for each specific class; a predicted probability is assigned. See SHOW_THRESHOLD_METRICS.

The sample is classified as belonging to a class if the predicted probability of being in that class exceeds the specified threshold. The true and false positives and negatives are computed considering the negative class as every instance that does not belong to the class being considered. The following metrics are then computed.

  • 真阳性率 (TPR):模型正确识别的实际阳性实例的比例(等同于查全率)。
  • 假阳性率 (FPR):被错误预测为阳性、实则为阴性实例的比例。
  • Accuracy: The ratio of correct predictions (both true positives and true negatives) to the total number of predictions, an overall measure of how well the model is performing. This metric can be misleading in unbalanced cases.
  • Support: The number of actual occurrences of a class in the specified dataset. Higher support values indicate a larger representation of a class in the dataset. Support is not itself a metric of the model but a characteristic of the dataset.

Confusion Matrix in show_confusion_matrix

The confusion matrix is a table used to assess the performance of a model by comparing predicted and actual values and evaluating its ability to correctly identify positive and negative instances. The objective is to maximize the number of instances on the diagonal of the matrix while minimizing the number of off-diagonal instances. See SHOW_CONFUSION_MATRIX.

您可按如下方式,在 Snowsight 中对混淆矩阵进行可视化。

CALL model_binary!SHOW_CONFUSION_MATRIX();

结果如下所示。

+--------------+--------------+-----------------+-------+------+
| DATASET_TYPE | ACTUAL_CLASS | PREDICTED_CLASS | COUNT | LOGS |
|--------------+--------------+-----------------+-------+------|
| EVAL         | false        | false           |    37 | NULL |
| EVAL         | false        | true            |     1 | NULL |
| EVAL         | true         | false           |     0 | NULL |
| EVAL         | true         | true            |    22 | NULL |
+--------------+--------------+-----------------+-------+------+

To visualize the confusion matrix, click on Chart, then Chart Type, then Heatgrid. Under Data, for Cell values select NONE, for Rows select PREDICTED_CLASS, and for Columns select ACTUAL_CLASS. The result appears similar to the figure below.

The confusion matrix of a comparison model

了解特征的重要性

分类模型可解释模型中所用全部特征的相对重要性。此信息有助于了解哪些因素真正影响到了您的数据。

The SHOW_FEATURE_IMPORTANCE method counts the number of times the model’s trees used each feature to make a decision. These feature importance scores are then normalized to values between 0 and 1 so that their sum is 1. The resulting scores represent an approximate ranking of the features in your trained model.

得分接近的特征具有相似的重要性。使用彼此非常相似的多个特征可能会降低这些特征的重要性得分。

限制

  • 不能选择用于计算特征重要性的技术。
  • Feature importance scores can be helpful for gaining intuition about which features are important to your model’s accuracy, but the actual values should be considered estimates.

示例

CALL model_binary!SHOW_FEATURE_IMPORTANCE();
+------+---------------------+---------------+---------------+
| RANK | FEATURE             |         SCORE | FEATURE_TYPE  |
|------+---------------------+---------------+---------------|
|    1 | USER_RATING         | 0.9295302013  | user_provided |
|    2 | USER_INTEREST_SCORE | 0.07046979866 | user_provided |
+------+---------------------+---------------+---------------+

成本注意事项

训练和使用分类模型会产生计算和存储成本。

Using any APIs from the Classification feature (training a model, predicting with the model, retrieving metrics) all require an active warehouse. The compute cost of using Classification functions is charged to the warehouse. See Understanding Compute Cost for general information on Snowflake compute costs.

For details on costs for using ML functions in general, see Cost Considerations in the ML functions overview.