CREATE SNOWFLAKE.ML.FORECAST

Creates a new forecasting model from the training data you provide, or replaces the forecasting model with the same name.

Syntax

CREATE [ OR REPLACE ] SNOWFLAKE.ML.FORECAST [ IF NOT EXISTS ] <model_name>(
  INPUT_DATA => <input_data>,
  [ SERIES_COLNAME => '<series_colname>', ]
  TIMESTAMP_COLNAME => '<timestamp_colname>',
  TARGET_COLNAME => '<target_colname>',
  [ CONFIG_OBJECT => <config_object> ]
)
[ [ WITH ] TAG ( <tag_name> = '<tag_value>' [ , <tag_name> = '<tag_value>' , ... ] ) ]
[ COMMENT = '<string_literal>' ]

Note

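Using named arguments makes the order of the arguments irrelevant and results in more readable code. However, you can also use positional arguments, as in the following example: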

CREATE SNOWFLAKE.ML.FORECAST <model_name>(
  <input_data>, '<series_colname>', '<timestamp_colname>', '<target_colname>'
);
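
As a minimal sketch of the named-argument form, the two statements below pass the same arguments in different orders. The view sales_view and the columns ts and sales are hypothetical names chosen for illustration, not objects defined in this reference.

```sql
-- Hypothetical view and column names; substitute your own.
CREATE SNOWFLAKE.ML.FORECAST sales_model_a(
  INPUT_DATA => TABLE(sales_view),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'sales'
);

-- Same arguments in a different order: named arguments are matched by name.
CREATE SNOWFLAKE.ML.FORECAST sales_model_b(
  TARGET_COLNAME => 'sales',
  TIMESTAMP_COLNAME => 'ts',
  INPUT_DATA => TABLE(sales_view)
);
```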

Parameters

model_name

Specifies the identifier for the model; it must be unique for the schema in which the model is created.

If the model identifier is not fully qualified (in the form of db_name.schema_name.name or schema_name.name), the command creates the model in the current schema for the session.

In addition, the identifier must start with an alphabetic character and cannot contain spaces or special characters unless the entire identifier string is enclosed in double quotes (for example, "My object"). Identifiers enclosed in double quotes are also case-sensitive.

For more details, see Identifier requirements.
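
The naming rules above can be sketched as follows. The database my_db, schema my_schema, view sales_view, and columns ts and sales are all hypothetical names used only for illustration.

```sql
-- Unqualified name: the model is created in the session's current schema.
CREATE SNOWFLAKE.ML.FORECAST sales_model(
  INPUT_DATA => TABLE(sales_view),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'sales'
);

-- Fully qualified name: the model is created in my_db.my_schema.
CREATE SNOWFLAKE.ML.FORECAST my_db.my_schema.sales_model(
  INPUT_DATA => TABLE(sales_view),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'sales'
);

-- Double-quoted name: spaces are allowed and the name is case-sensitive.
CREATE SNOWFLAKE.ML.FORECAST "My Sales Model"(
  INPUT_DATA => TABLE(sales_view),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'sales'
);
```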

Constructor arguments

Required:

INPUT_DATA => input_data

A reference to the input data. Using a reference allows the training process, which runs with limited privileges, to use your privileges to access the data. You can use a reference to a table or a view if your data is already in that form, or you can use a query reference to provide the query to be executed to obtain the data.

To create this reference, you can use the TABLE keyword with the table name, view name, or query, or you can call the SYSTEM$REFERENCE or SYSTEM$QUERY_REFERENCE function.
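
The ways of creating a reference described above might look like the following sketch. The view sales_view, the table sales_data, and the columns ts, sales, and store_id are hypothetical names used for illustration.

```sql
-- TABLE keyword with a view (or table) name.
CREATE SNOWFLAKE.ML.FORECAST model_from_table_keyword(
  INPUT_DATA => TABLE(sales_view),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'sales'
);

-- Explicit reference to a view by calling SYSTEM$REFERENCE.
CREATE SNOWFLAKE.ML.FORECAST model_from_reference(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'sales_view'),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'sales'
);

-- Query reference: the query text is passed as a string literal.
CREATE SNOWFLAKE.ML.FORECAST model_from_query_reference(
  INPUT_DATA => SYSTEM$QUERY_REFERENCE(
    'SELECT ts, sales FROM sales_data WHERE store_id = 1'
  ),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'sales'
);
```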

The referenced data is the entire training data consumed by the forecasting model. If input_data contains any columns that are not named as timestamp_colname, target_colname, or series_colname, they are considered exogenous variables (additional features). Order of the columns in the input data is not important.
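
As an illustration of how extra columns become exogenous variables, the sketch below trains on a view with two columns beyond the timestamp and target. The table sales_data and the columns ts, sales, temperature, and is_holiday are hypothetical names.

```sql
-- Hypothetical view: ts and sales are named in the constructor call below,
-- so temperature and is_holiday are treated as exogenous variables.
CREATE OR REPLACE VIEW sales_with_features AS
  SELECT ts, sales, temperature, is_holiday
  FROM sales_data;

CREATE SNOWFLAKE.ML.FORECAST sales_model_exog(
  INPUT_DATA => TABLE(sales_with_features),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'sales'
);
```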

Your input data must have columns with appropriate types for your use case. See Examples for details on each use case.

Use case | Columns and types
Single time series |
Multiple time series |
Single time series with exogenous variables |
Multiple time series with exogenous variables |

TIMESTAMP_COLNAME => 'timestamp_colname'

Name of the column containing the timestamps in input_data.

TARGET_COLNAME => 'target_colname'

Name of the column containing the target (dependent value) in input_data.

Optional:

SERIES_COLNAME => 'series_colname'

For multiple time-series models, the name of the column defining the multiple time series in input_data. This column can be a value of any type, or an array of values from one or more other columns, as shown in Forecast on multiple series.

If you provide arguments positionally, this must be the second argument.
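
A multiple-series model might be set up as in the sketch below, where the series column is an array built from two other columns. The table sales_data and the columns store_id, item, ts, and sales are hypothetical names.

```sql
-- Hypothetical view: each distinct [store_id, item] pair identifies one series.
CREATE OR REPLACE VIEW sales_by_store_item AS
  SELECT [store_id, item] AS store_item, ts, sales
  FROM sales_data;

CREATE SNOWFLAKE.ML.FORECAST multi_series_model(
  INPUT_DATA => TABLE(sales_by_store_item),
  SERIES_COLNAME => 'store_item',
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'sales'
);
```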

CONFIG_OBJECT => config_object

An OBJECT containing key-value pairs used to configure the model training job.

Key | Type | Default | Description

aggregation_categorical | STRING | 'MODE'

Aggregation method for categorical features. Supported values are:

  • 'MODE': The most frequent value.
  • 'FIRST': The earliest value.
  • 'LAST': The latest value.

aggregation_numeric | STRING | 'MEAN'

Aggregation method for numeric features. Supported values are:

  • 'MEAN': The average of the values.
  • 'MEDIAN': The middle value.
  • 'MODE': The most frequent value.
  • 'MIN': The smallest value.
  • 'MAX': The largest value.
  • 'SUM': The total of the values.
  • 'FIRST': The earliest value.
  • 'LAST': The latest value.

aggregation_target | STRING | Same as aggregation_numeric, or 'MEAN' if not specified

Aggregation method for the target value. Supported values are:

  • 'MEAN': The average of the values.
  • 'MEDIAN': The middle value.
  • 'MODE': The most frequent value.
  • 'MIN': The smallest value.
  • 'MAX': The largest value.
  • 'SUM': The total of the values.
  • 'FIRST': The earliest value.
  • 'LAST': The latest value.

aggregation_column | OBJECT | n/a

An object containing key-value pairs (both strings) that specify the aggregation method for specific columns. The key is the column name, and the value is the aggregation method. If a column is not specified, the model uses the method specified by aggregation_numeric or aggregation_categorical, or the default for that column type (MEAN for numeric, MODE for categorical).
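
The aggregation keys above could be combined as in this sketch. The view sales_with_features and the columns ts, sales, and temperature are hypothetical names, and the chosen methods are arbitrary examples.

```sql
CREATE SNOWFLAKE.ML.FORECAST sales_model_agg(
  INPUT_DATA => TABLE(sales_with_features),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'sales',
  CONFIG_OBJECT => {
    -- Numeric features are summed; aggregation_target is not set here,
    -- so per the table above it follows aggregation_numeric.
    'aggregation_numeric': 'SUM',
    -- Categorical features keep the latest value.
    'aggregation_categorical': 'LAST',
    -- Per-column override: temperature uses the median instead of the sum.
    'aggregation_column': {'temperature': 'MEDIAN'}
  }
);
```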

evaluate | BOOLEAN | TRUE

Whether evaluation metrics should be generated. If TRUE, then additional models are trained for cross-validation using the parameters in the evaluation_config.

evaluation_config | OBJECT | See Evaluation configuration below. | An optional config object that specifies how out-of-sample evaluation metrics should be generated.

frequency | STRING | n/a

The frequency of the time series. If not specified, the model infers the frequency. The value must be a string representing a time period, such as '1 day'. Supported units include seconds, minutes, hours, days, weeks, months, quarters, and years. You may use singular (“hour”) or plural (“hours”) for the interval name, but may not abbreviate.

method | STRING | 'best'

String (constant) that specifies the algorithm used to train the model. Supported values are:

  • 'best': Uses an ensemble of models to determine the best algorithm for the data. This ensemble includes Prophet (https://facebook.github.io/prophet/), ARIMA, Exponential Smoothing, and a gradient boosting machine (GBM) based algorithm.
  • 'fast': Uses a single GBM-based algorithm to train the model. This option is faster than the 'best' option, but may not be as accurate. We recommend using 'fast' when your training data has 10,000 or more individual series.

lower_bound | FLOAT or NULL | NULL | The lower bound for the target value. If specified, the model will not predict values below this threshold.

upper_bound | FLOAT or NULL | NULL | The upper bound for the target value. If specified, the model will not predict values above this threshold.

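
As a sketch of the method, lower_bound, and upper_bound keys, the statement below selects the faster algorithm and constrains predictions to a range. The view sales_view, the columns ts and sales, and the bound values are hypothetical.

```sql
CREATE SNOWFLAKE.ML.FORECAST bounded_fast_model(
  INPUT_DATA => TABLE(sales_view),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'sales',
  CONFIG_OBJECT => {
    'method': 'fast',        -- single GBM-based algorithm instead of the ensemble
    'lower_bound': 0.0,      -- illustrative value: no predictions below zero
    'upper_bound': 10000.0   -- illustrative cap chosen for this sketch
  }
);
```
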
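on_error | STRING | 'ABORT'

String (constant) that specifies the error handling method for the model training task. This is most useful when training multiple series. Supported values are: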

  • 'abort': Abort the training operation if an error is encountered in any time series.
  • 'skip': Skip any time series where training encounters an error. This allows model training to succeed for other time series. To see which series failed, use the model’s method.
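
A multiple-series training job that tolerates failures in individual series might look like the sketch below. The view sales_by_store_item and the columns store_item, ts, and sales are hypothetical names.

```sql
CREATE SNOWFLAKE.ML.FORECAST tolerant_multi_series_model(
  INPUT_DATA => TABLE(sales_by_store_item),
  SERIES_COLNAME => 'store_item',
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'sales',
  -- Skip any series that fails to train instead of aborting the whole job.
  CONFIG_OBJECT => {'on_error': 'skip'}
);
```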

Evaluation configuration

The evaluation_config object contains key-value pairs that configure cross-validation. These parameters are from scikit-learn’s TimeSeriesSplit (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html).

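Key | Type | Default | Description
n_splits | INTEGER | 1 | Number of splits.
max_train_size | INTEGER or NULL (no maximum) | NULL | Maximum size of a single training set.
test_size | INTEGER or NULL | NULL | Used to limit the size of the test set.
gap | INTEGER | 0 | Number of samples to exclude from the end of each training set before the test set.
prediction_interval | FLOAT | 0.95 | The prediction interval used to calculate interval metrics.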
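
The evaluation keys above could be passed as a nested object, as in this sketch. The view sales_view, the columns ts and sales, and the specific values are hypothetical and chosen only to show the shape of the configuration.

```sql
CREATE SNOWFLAKE.ML.FORECAST evaluated_model(
  INPUT_DATA => TABLE(sales_view),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'sales',
  CONFIG_OBJECT => {
    'evaluate': TRUE,
    'evaluation_config': {
      'n_splits': 3,               -- three cross-validation splits
      'test_size': 7,              -- limit each test set to 7 samples
      'gap': 0,                    -- no samples excluded before the test set
      'prediction_interval': 0.9   -- interval used for interval metrics
    }
  }
);
```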

Usage notes

Replication is supported only for instances of the CUSTOM_CLASSIFIER class.

Examples

See Examples.