The classification function is powered by a
gradient boosting machine.
For binary classification, the model is trained using an
area-under-the-curve
loss function. For multi-class classification, the model is trained using a
logistic loss function.
适合用于分类的训练数据集包括代表各数据点标注类别的目标列,以及至少一个特征列。
分类模型支持特征和标签的数字、布尔和字符串数据类型。
数字特征被视为连续特征。要将数值特征视为分类特征,请将其转换为字符串。
String features are treated as categorical. The classification function supports high-cardinality features (for
example, job titles or fruits). It does not support full free text, like sentences or paragraphs.
布尔特征被视为分类特征。
Timestamps must be TIMESTAMP_NTZ type. The model creates additional
time-based features (epoch, day, week, month), which are used in training and classification. These features appear in
SHOW_FEATURE_IMPORTANCE results as
derived_features.
Training and inference data must be numeric, TIMESTAMP_NTZ, Boolean, or string. Other types must be cast to one of
these types.
您不能选择或修改分类算法。
无法手动指定或调整模型参数。
In tests, training on a Medium Snowpark-optimized warehouse has succeeded with up to 1,000 columns and 600 million
rows. It is possible, but unlikely, to run out of memory below this limit.
目标列包含的非重复类不得超过 255 个。
SNOWFLAKE.ML.CLASSIFICATION instances cannot be cloned. When you clone or replicate a database containing a
classification model, the model is currently skipped.
A Snowflake virtual warehouse provides the compute resources for training and using
classification machine learning models. This section provides general guidance on selecting the best type and size of
warehouse for classification, focusing on the training step, the most time-consuming and memory-intensive part of the
process.
To use this schema, a user assumes the role analyst:
USEROLE analyst;USESCHEMA admin_db.admin_schema;
If the analyst role has CREATE SCHEMA privileges in database analyst_db, the role can create a new schema
analyst_db.analyst_schema and create classification models in that schema:
SNOWFLAKE.ML.CLASSIFICATION runs using limited privileges, so by default, it does not have access to your data. You
must therefore pass tables and views as references,
which pass along the caller’s privileges. You can also provide a query reference
instead of a reference to a table or a view.
See the CLASSIFICATION reference for information about training, inference, and evaluation APIs.
For examples of using these methods, see Examples.
要查看所有分类模型,请使用 SHOW 命令。
SHOWSNOWFLAKE.ML.CLASSIFICATION;
要删除分类模型,请使用 DROP 命令。
DROPSNOWFLAKE.ML.CLASSIFICATION<model_name>;
模型是不可变的,不能就地更新。要更新模型,请删除现有模型并训练新模型。针对此目的,CREATE 命令的 CREATE OR REPLACE 变体非常有用。
示例
为示例设置数据
The examples in this topic uses two tables. The first table, training_purchase_data, has two feature columns: a
binary label column and a multi-class label column. The second table is called
prediction_purchase_data and has two feature columns. Use the SQL below to create these
tables.
After you’ve created the model, use its PREDICT method to infer labels for the unlabeled purchase data. You can use
wildcard expansion in an object literal to create key-value pairs of features
for the INPUT_DATA argument.
SELECT model_binary!PREDICT(INPUT_DATA=>{*})AS prediction FROM prediction_purchase_data;
After you’ve created the model, use its PREDICT method to infer labels for the unlabeled purchase data. Use
wildcard expansion in an object literal to automatically create key-value pairs
for the INPUT_DATA argument.
SELECT*, model_multiclass!PREDICT(INPUT_DATA=>{*})AS predictions FROM prediction_purchase_data;
Each classification model instance includes two model roles, mladmin and mlconsumer. These roles are scoped to the
model itself: model!mladmin and model!mlconsumer. The owner of the model object (initially, its creator) is
automatically granted the model!mladmin and model!mlconsumer roles, and can grant these roles to account roles
and database roles.
The mladmin role permits usage of all APIs invocable from the model object, including but not limited to prediction
methods and evaluation methods. The mlconsumer role permits usage only on prediction APIs, not other exploratory
APIs.
The following SQL example illustrates the grant of classification model roles to other roles. The role r1 can create
a classification model, and grants the role r2 the mlconsumer privilege so that the r2 can call that model’s
PREDICT method. Then r1 grants the mladmin role to another role, r3, so r3 can call all methods of the
model.
First, role r1 creates a model object, making r1 the owner of the model model.
Metrics measure how accurately a model predicts new data. The Snowflake classification currently evaluates models by selecting a
random sample from the entire dataset. A new model is trained without these rows, and then the rows are used as inference input.
The random sample portion can be configured using the test_fraction key in the EVALUATION_CONFIG object.
show_evaluation_metrics calculates the following values for each class. See
SHOW_EVALUATION_METRICS.
正实例:属于感兴趣类别或预测类别的数据实例(行)。
负实例:不属于感兴趣类别或与预测值相反的数据实例(行)。
真阳性 (TP):对阳性实例的正确预测。
真阴性 (TN):对阴性实例的正确预测,
假阳性 (FP):对阳性实例的不正确预测
假阴性 (FN):对阴性实例的不正确预测。
使用上述值,报告各类别的以下指标。对于每个指标,值越大,就表示模型的预测性越强。
Precision: The ratio of true positives to the total predicted positives. It measures how many of the predicted
positive instances are actually positive.
Recall (Sensitivity): The ratio of true positives to the total actual positives. It measures how many of the actual
positive instances were correctly predicted.
F1 Score: The harmonic mean of precision and recall. It provides a balance between precision and recall, especially
when there is an uneven class distribution.
show_global_evaluation_metrics calculates overall (global) metrics for all classes predicted by the model by averaging the per-class metrics
calculated by show_evaluation_metrics. See
SHOW_GLOBAL_EVALUATION_METRICS.
Currently, macro and weighted averaging is used for the metrics Precision, Recall, F1, AUC.
show_threshold_metrics provides raw counts and metrics for a specific threshold for each class. This can be used to
plot ROC and PR curves or do threshold tuning if desired. The threshold varies from 0 to 1 for each specific class; a
predicted probability is assigned. See SHOW_THRESHOLD_METRICS.
The sample is classified as belonging to a class if the predicted probability of being in that class exceeds the
specified threshold. The true and false positives and negatives are computed considering the negative class as
every instance that does not belong to the class being considered. The following metrics are then computed.
真阳性率 (TPR):模型正确识别的实际阳性实例的比例(等同于查全率)。
假阳性率 (FPR):被错误预测为阳性、实则为阴性实例的比例。
Accuracy: The ratio of correct predictions (both true positives and true negatives) to the total number of
predictions, an overall measure of how well the model is performing. This metric can be misleading in
unbalanced cases.
Support: The number of actual occurrences of a class in the specified dataset. Higher support values indicate a larger
representation of a class in the dataset. Support is not itself a metric of the model but a characteristic of the dataset.
The confusion matrix is a table used to assess the performance of a model by comparing predicted and actual values and
evaluating its ability to correctly identify positive and negative instances. The objective is to maximize the number of
instances on the diagonal of the matrix while minimizing the number of off-diagonal instances. See
SHOW_CONFUSION_MATRIX.
To visualize the confusion matrix, click on Chart, then Chart Type, then Heatgrid. Under Data, for
Cell values select NONE, for Rows select PREDICTED_CLASS, and for Columns select ACTUAL_CLASS. The
result appears similar to the figure below.
了解特征的重要性
分类模型可解释模型中所用全部特征的相对重要性。此信息有助于了解哪些因素真正影响到了您的数据。
The SHOW_FEATURE_IMPORTANCE method counts
the number of times the model’s trees used each feature to make a decision. These feature importance scores are then
normalized to values between 0 and 1 so that their sum is 1. The resulting scores represent an approximate ranking of
the features in your trained model.
得分接近的特征具有相似的重要性。使用彼此非常相似的多个特征可能会降低这些特征的重要性得分。
限制
不能选择用于计算特征重要性的技术。
Feature importance scores can be helpful for gaining intuition about which features are important to your model’s
accuracy, but the actual values should be considered estimates.
Using any APIs from the Classification feature (training a model, predicting with the model, retrieving metrics) all
require an active warehouse. The compute cost of using Classification functions is charged to the warehouse. See
Understanding Compute Cost for general information on Snowflake compute
costs.
For details on costs for using ML functions in general, see Cost Considerations in the ML functions overview.