CREATE SNOWFLAKE.ML.ANOMALY_DETECTION¶
Creates a new anomaly detection model or replaces an existing one using the training data you provide.
Syntax¶
CREATE [ OR REPLACE ] SNOWFLAKE.ML.ANOMALY_DETECTION <model_name>(
  INPUT_DATA => <reference_to_training_data>,
  [ SERIES_COLNAME => '<series_column_name>', ]
  TIMESTAMP_COLNAME => '<timestamp_column_name>',
  TARGET_COLNAME => '<target_column_name>',
  LABEL_COLNAME => '<label_column_name>',
  [ CONFIG_OBJECT => <config_object> ]
)
[ [ WITH ] TAG ( <tag_name> = '<tag_value>' [ , <tag_name> = '<tag_value>' , ... ] ) ]
[ COMMENT = '<string_literal>' ]
Parameters¶
- model_name
- Specifies the identifier (model_name) for the anomaly detector object; must be unique for the schema in which the object is created. In addition, the identifier must start with an alphabetic character and cannot contain spaces or special characters unless the entire identifier string is enclosed in double quotes (for example, "My object"). Identifiers enclosed in double quotes are also case-sensitive. For more details, see Identifier requirements.
Constructor arguments¶
Required:
- INPUT_DATA => reference_to_training_data
- Specifies a reference to the table, view, or query that returns the training data for the model. To create this reference, you can use the TABLE keyword with the table name, view name, or query, or you can call the SYSTEM$REFERENCE or SYSTEM$QUERY_REFERENCE function.
- TIMESTAMP_COLNAME => 'timestamp_column_name'
- Specifies the name of the column containing the timestamps (TIMESTAMP_NTZ) in the time series data. 
- TARGET_COLNAME => 'target_column_name'
- Specifies the name of the column containing the data (NUMERIC or FLOAT) to analyze. 
- LABEL_COLNAME => 'label_column_name'
- Specifies the name of the column containing the labels for the data. Labels are Boolean (true/false) values indicating whether a given row is a known anomaly. If you do not have labeled data, pass an empty string ('') for this argument.
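As an illustration of the required arguments, the following sketch trains a single-series model on unlabeled data, first with the TABLE keyword and then with SYSTEM$REFERENCE. The view `sales_training_v`, its columns `sale_date` and `total_sold`, and the model name `sales_anomaly_model` are hypothetical names chosen for this example; substitute your own objects.

```sql
-- Hypothetical training view: sales_training_v(sale_date TIMESTAMP_NTZ, total_sold FLOAT).
-- LABEL_COLNAME is an empty string because the data has no labeled anomalies.
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION sales_anomaly_model(
  INPUT_DATA => TABLE(sales_training_v),
  TIMESTAMP_COLNAME => 'sale_date',
  TARGET_COLNAME => 'total_sold',
  LABEL_COLNAME => ''
);

-- The same model, with the training data passed as an explicit reference.
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION sales_anomaly_model(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'sales_training_v'),
  TIMESTAMP_COLNAME => 'sale_date',
  TARGET_COLNAME => 'total_sold',
  LABEL_COLNAME => ''
);
```

If the training data does include a Boolean column that marks known anomalies, pass that column's name as LABEL_COLNAME instead of the empty string.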
Optional:
- SERIES_COLNAME => 'series_column_name'
- Specifies the name of the column containing the series identifier (for multi-series data). This column should be a VARIANT, because the identifier can be any kind of value or an array that combines values from more than one column.
- CONFIG_OBJECT => config_object
- An OBJECT containing key-value pairs used to configure the model training job.

| Key | Type | Default | Description |
|---|---|---|---|
| aggregation_categorical | STRING | 'MODE' | The aggregation method for categorical features. Supported values are: 'MODE' (the most frequent value), 'FIRST' (the earliest value), 'LAST' (the latest value). |
| aggregation_numeric | STRING | 'MEAN' | The aggregation method for numeric features. Supported values are: 'MEAN' (the average of the values), 'MEDIAN' (the middle value), 'MODE' (the most frequent value), 'MIN' (the smallest value), 'MAX' (the largest value), 'SUM' (the total of the values), 'FIRST' (the earliest value), 'LAST' (the latest value). |
| aggregation_target | STRING | Same as aggregation_numeric, or 'MEAN' if not specified | The aggregation method for the target value. Supported values are: 'MEAN' (the average of the values), 'MEDIAN' (the middle value), 'MODE' (the most frequent value), 'MIN' (the smallest value), 'MAX' (the largest value), 'SUM' (the total of the values), 'FIRST' (the earliest value), 'LAST' (the latest value). |
| evaluate | BOOLEAN | TRUE | Whether evaluation metrics should be generated. If TRUE, additional models are trained for cross-validation using the parameters in the evaluation_config. |
| evaluation_config | OBJECT | | An optional config object to specify how out-of-sample evaluation metrics should be generated. See Evaluation configuration. |
| frequency | STRING | n/a | The frequency of the time series. If not specified, the model infers the frequency. The value must be a string representing a time period, such as '1 day'. Supported units include seconds, minutes, hours, days, weeks, months, quarters, and years. You may use singular (“hour”) or plural (“hours”) for the interval name, but may not abbreviate. |
| lower_bound | FLOAT or NULL | NULL | The lower bound for the target value. If specified, the model will not predict values below this threshold. |
| upper_bound | FLOAT or NULL | NULL | The upper bound for the target value. If specified, the model will not predict values above this threshold. |
| on_error | STRING | 'ABORT' | String (constant) that specifies the error handling method for training. This is most useful when training multiple series. Supported values are: 'abort' (abort training if an error is encountered in any time series), 'skip' (skip any time series where training encounters an error, which allows training to succeed for other time series). To see which series failed during model training, call the model’s <model_name>!SHOW_TRAINING_LOGS method. |
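The optional arguments can be combined as in the following sketch, which trains one model over several series and skips any series that fails to train. The view `store_sales_training_v`, the columns `store_id`, `sale_date`, and `total_sold`, and the model name `store_anomaly_model` are hypothetical. The configuration keys are written in lowercase as they are listed for CONFIG_OBJECT, which is an assumption about the accepted spelling.

```sql
-- Hypothetical multi-series training view:
-- store_sales_training_v(store_id VARIANT, sale_date TIMESTAMP_NTZ, total_sold FLOAT).
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION store_anomaly_model(
  INPUT_DATA => TABLE(store_sales_training_v),
  SERIES_COLNAME => 'store_id',
  TIMESTAMP_COLNAME => 'sale_date',
  TARGET_COLNAME => 'total_sold',
  LABEL_COLNAME => '',
  CONFIG_OBJECT => {
    'frequency': '1 day',         -- state the frequency instead of letting the model infer it
    'aggregation_target': 'SUM',  -- how target values are aggregated
    'on_error': 'skip'            -- keep training the other series if one fails
  }
);

-- List the series that were skipped because training failed.
CALL store_anomaly_model!SHOW_TRAINING_LOGS();
```

With 'skip', the statement can succeed even when some series fail, so it is worth checking the training logs before relying on the model for every series.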
 
Evaluation configuration¶
The evaluation_config object contains key-value pairs that configure cross-validation. These parameters are from the scikit-learn TimeSeriesSplit (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) cross-validator.
| Key | Type | Default | Description |
|---|---|---|---|
| n_splits | INTEGER | 5 | Number of splits. |
| max_train_size | INTEGER or NULL (no maximum). | NULL | Maximum size for a single training set. |
| test_size | INTEGER or NULL. | NULL | Used to limit the size of the test set. |
| gap | INTEGER | 0 | Number of samples to exclude from the end of each training set before the test set. |
| prediction_interval | FLOAT | 0.95 | The prediction interval used in calculating interval metrics. |
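As a sketch of how an evaluation configuration might be passed, the following statement nests an evaluation_config object inside CONFIG_OBJECT. It assumes that the keys carry the same names as the TimeSeriesSplit parameters `n_splits` and `gap`. The view, column, and model names are hypothetical.

```sql
-- Hypothetical training view: sales_training_v(sale_date TIMESTAMP_NTZ, total_sold FLOAT).
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION sales_anomaly_model(
  INPUT_DATA => TABLE(sales_training_v),
  TIMESTAMP_COLNAME => 'sale_date',
  TARGET_COLNAME => 'total_sold',
  LABEL_COLNAME => '',
  CONFIG_OBJECT => {
    'evaluate': TRUE,       -- generate evaluation metrics (the default)
    'evaluation_config': {
      'n_splits': 3,        -- assumed key name: use 3 cross-validation splits instead of 5
      'gap': 1              -- assumed key name: exclude 1 sample between each training and test set
    }
  }
);
```

Fewer splits mean fewer additional models are trained for cross-validation, so evaluation costs less but the metrics rest on fewer out-of-sample windows.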
Usage notes¶
- If the column names specified by the TIMESTAMP_COLNAME, TARGET_COLNAME, or LABEL_COLNAME arguments do not exist in the table, view, or query specified by the INPUT_DATA argument, an error occurs. 
- Replication is supported only for instances of the CUSTOM_CLASSIFIER class. 
Examples¶
For a representative example, see the anomaly detection example.