CREATE SNOWFLAKE.ML.ANOMALY_DETECTION
Creates a new anomaly detection model or replaces an existing one using the training data you provide.
Syntax
CREATE [ OR REPLACE ] SNOWFLAKE.ML.ANOMALY_DETECTION <model_name>(
  INPUT_DATA => <reference_to_training_data>,
  [ SERIES_COLNAME => '<series_column_name>', ]
  TIMESTAMP_COLNAME => '<timestamp_column_name>',
  TARGET_COLNAME => '<target_column_name>',
  LABEL_COLNAME => '<label_column_name>',
  [ CONFIG_OBJECT => <config_object> ]
)
[ [ WITH ] TAG ( <tag_name> = '<tag_value>' [ , <tag_name> = '<tag_value>' , ... ] ) ]
[ COMMENT = '<string_literal>' ]
Parameters
model_name
Specifies the identifier (model_name) for the anomaly detector object; must be unique for the schema in which the object is created.
In addition, the identifier must start with an alphabetic character and cannot contain spaces or special characters unless the entire identifier string is enclosed in double quotes (for example, "My object"). Identifiers enclosed in double quotes are also case-sensitive. For more details, see Identifier requirements.
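As an illustration of the identifier rules above, the following sketch creates one model with an unquoted name and one with a double-quoted name. The model names, the view training_data_view, and its columns ts and amount are hypothetical placeholders, not objects defined in this reference.

```sql
-- Hypothetical view: training_data_view with columns ts (TIMESTAMP_NTZ) and amount (FLOAT).

-- Unquoted identifier: starts with a letter and has no spaces or special characters.
CREATE SNOWFLAKE.ML.ANOMALY_DETECTION sales_detector(
  INPUT_DATA => TABLE(training_data_view),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'amount',
  LABEL_COLNAME => ''
);

-- Double-quoted identifier: spaces are allowed, and the name is case-sensitive.
CREATE SNOWFLAKE.ML.ANOMALY_DETECTION "My detector"(
  INPUT_DATA => TABLE(training_data_view),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'amount',
  LABEL_COLNAME => ''
);

-- Later references must repeat the quotes and the exact case.
CALL "My detector"!SHOW_TRAINING_LOGS();
```

Unquoted names are usually easier to work with, because every later reference to a quoted name has to reproduce the quotes and the exact capitalization.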
Constructor arguments
Required:
INPUT_DATA => reference_to_training_data
Specifies a reference to the table, view, or query that returns the training data for the model.
To create this reference, you can use the TABLE keyword with the table name, view name, or query, or you can call the SYSTEM$REFERENCE or SYSTEM$QUERY_REFERENCE function.
TIMESTAMP_COLNAME => 'timestamp_column_name'
Specifies the name of the column containing the timestamps (TIMESTAMP_NTZ) in the time series data.
TARGET_COLNAME => 'target_column_name'
Specifies the name of the column containing the data (NUMERIC or FLOAT) to analyze.
LABEL_COLNAME => 'label_column_name'
Specifies the name of the column containing the labels for the data. Labels are Boolean (true/false) values indicating whether a given row is a known anomaly. If you do not have labeled data, pass an empty string ('') for this argument.
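The required arguments can be combined as in the following sketch, which trains one model on unlabeled data through SYSTEM$REFERENCE and one on labeled data through SYSTEM$QUERY_REFERENCE. The model names, the view training_data_view, the table labeled_sales, and the columns ts, amount, and is_anomaly are hypothetical and chosen only for illustration.

```sql
-- Unlabeled training data: pass an empty string for LABEL_COLNAME.
-- training_data_view is a hypothetical view with columns ts and amount.
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION unlabeled_detector(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'training_data_view'),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'amount',
  LABEL_COLNAME => ''
);

-- Labeled training data supplied by a query.
-- labeled_sales is a hypothetical table; is_anomaly is assumed to be a BOOLEAN column.
-- The cast makes sure the timestamp column is TIMESTAMP_NTZ.
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION labeled_detector(
  INPUT_DATA => SYSTEM$QUERY_REFERENCE(
    'SELECT TO_TIMESTAMP_NTZ(ts) AS ts, amount, is_anomaly FROM labeled_sales'
  ),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'amount',
  LABEL_COLNAME => 'is_anomaly'
);
```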
Optional:
SERIES_COLNAME => 'series_column_name'
Specifies the name of the column containing the series identifier (for multi-series data). This column should be of type VARIANT, because a series identifier can be any kind of value, or an array that combines values from more than one column.
CONFIG_OBJECT => config_object
An OBJECT containing key-value pairs used to configure the model training job.
Key | Type | Default | Description |
---|---|---|---|
aggregation_categorical | STRING | 'MODE' | The aggregation method for categorical features. Supported values are: 'MODE' (the most frequent value), 'FIRST' (the earliest value), 'LAST' (the latest value). |
aggregation_numeric | STRING | 'MEAN' | The aggregation method for numeric features. Supported values are: 'MEAN' (the average of the values), 'MEDIAN' (the middle value), 'MODE' (the most frequent value), 'MIN' (the smallest value), 'MAX' (the largest value), 'SUM' (the total of the values), 'FIRST' (the earliest value), 'LAST' (the latest value). |
aggregation_target | STRING | Same as aggregation_numeric, or 'MEAN' if not specified | The aggregation method for the target value. Supported values are: 'MEAN' (the average of the values), 'MEDIAN' (the middle value), 'MODE' (the most frequent value), 'MIN' (the smallest value), 'MAX' (the largest value), 'SUM' (the total of the values), 'FIRST' (the earliest value), 'LAST' (the latest value). |
evaluate | BOOLEAN | TRUE | Whether evaluation metrics should be generated. If TRUE, additional models are trained for cross-validation using the parameters in evaluation_config. |
evaluation_config | OBJECT | | An optional config object that specifies how out-of-sample evaluation metrics should be generated. See Evaluation configuration. |
frequency | STRING | n/a | The frequency of the time series. If not specified, the model infers the frequency. The value must be a string representing a time period, such as '1 day'. Supported units include seconds, minutes, hours, days, weeks, months, quarters, and years. You may use the singular (“hour”) or plural (“hours”) form of the unit name, but may not abbreviate it. |
lower_bound | FLOAT or NULL | NULL | The lower bound for the target value. If specified, the model will not predict values below this threshold. |
upper_bound | FLOAT or NULL | NULL | The upper bound for the target value. If specified, the model will not predict values above this threshold. |
on_error | STRING | 'ABORT' | String constant that specifies the error handling method for training. This is most useful when training multiple series. Supported values are: 'abort' (abort training if an error is encountered in any time series), 'skip' (skip any time series where training encounters an error, which allows training to succeed for the other time series). To see which series failed during model training, call the model’s <model_name>!SHOW_TRAINING_LOGS method. |
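The following sketch shows how the optional arguments might fit together for multi-series data. It assumes a hypothetical table store_sales with columns store_id, item, sale_date, and units_sold, builds the series identifier as an array of two columns, and passes the configuration as an object constant. The key spellings follow the table above; the chosen values are illustrative only.

```sql
-- Hypothetical source table: store_sales(store_id, item, sale_date, units_sold).
-- Combine two columns into a single array-valued series identifier.
CREATE OR REPLACE VIEW multi_series_training AS
  SELECT
    [store_id, item] AS store_item,
    TO_TIMESTAMP_NTZ(sale_date) AS ts,
    units_sold
  FROM store_sales;

-- Train one detector per (store_id, item) series.
-- The config aggregates the target to daily totals, keeps predictions
-- at or above zero, and skips any series that fails to train.
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION store_item_detector(
  INPUT_DATA => TABLE(multi_series_training),
  SERIES_COLNAME => 'store_item',
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'units_sold',
  LABEL_COLNAME => '',
  CONFIG_OBJECT => {
    'frequency': '1 day',
    'aggregation_target': 'SUM',
    'lower_bound': 0.0,
    'on_error': 'skip'
  }
);

-- With 'skip', check afterwards which series (if any) failed during training.
CALL store_item_detector!SHOW_TRAINING_LOGS();
```

Choosing 'skip' trades completeness for robustness: training finishes even if some series are unusable, so reviewing the training logs afterwards is the only way to learn which series were left out.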
Evaluation configuration
The evaluation_config object contains key-value pairs that configure cross-validation. These parameters are from the scikit-learn TimeSeriesSplit (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) cross-validator.
Key | Type | Default | Description |
---|---|---|---|
n_splits | INTEGER | 5 | Number of splits. |
max_train_size | INTEGER or NULL (no maximum). | NULL | Maximum size for a single training set. |
test_size | INTEGER or NULL. | NULL | Used to limit the size of the test set. |
gap | INTEGER | 0 | Number of samples to exclude from the end of each training set before the test set. |
prediction_interval | FLOAT | 0.95 | The prediction interval used in calculating interval metrics. |
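A nested evaluation_config might be passed as in the sketch below. It assumes that the keys share the names of the scikit-learn TimeSeriesSplit parameters (n_splits, max_train_size, test_size, gap) and that the interval key is named prediction_interval. The view labeled_sales_view and its columns are hypothetical, and the SHOW_EVALUATION_METRICS method is not described in this section, so treat its use here as an assumption.

```sql
-- Hypothetical view: labeled_sales_view(ts TIMESTAMP_NTZ, amount FLOAT, is_anomaly BOOLEAN).
-- Key names inside evaluation_config are assumed, as noted above.
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION evaluated_detector(
  INPUT_DATA => TABLE(labeled_sales_view),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'amount',
  LABEL_COLNAME => 'is_anomaly',
  CONFIG_OBJECT => {
    'evaluate': TRUE,
    'evaluation_config': {
      'n_splits': 3,
      'test_size': 30,
      'gap': 0,
      'prediction_interval': 0.9
    }
  }
);

-- Inspect the out-of-sample metrics produced by cross-validation (assumed method).
CALL evaluated_detector!SHOW_EVALUATION_METRICS();
```

Because evaluation trains additional models for each split, lowering the number of splits or setting 'evaluate' to FALSE is a reasonable way to shorten training when the metrics are not needed.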
Usage notes
If the column names specified by the TIMESTAMP_COLNAME, TARGET_COLNAME, or LABEL_COLNAME arguments do not exist in the table, view, or query specified by the INPUT_DATA argument, an error occurs.
Replication isn’t supported for class instances except for instances of the CUSTOM_CLASSIFIER class.
Examples
For a representative example, see the anomaly detection example.