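Time-series forecasting (Snowflake ML functions)
Forecasting uses a machine learning algorithm to predict future numeric data from historical time series data. A common use case is forecasting sales by item for the next two weeks.
Forecasting quickstart
This section gives the quickest way to get started with forecasting.
Prerequisites
To get started, you must do the following:
- Select a database, schema, and virtual warehouse.
- Confirm that you own the schema or have the CREATE SNOWFLAKE.ML.FORECAST privilege in the schema you have chosen.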
- Have a table or view with at least two columns: one timestamp column and one numeric column. Be sure your timestamp column has timestamps at a fixed interval and isn’t missing too many timestamps. The following example shows a dataset with timestamp intervals of one day:
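A minimal sketch of such a dataset, assuming a hypothetical table my_daily_data with a timestamp column my_timestamps and a numeric column my_metric (all names and values are illustrative):

```sql
-- Hypothetical table: one row per day, no missing timestamps.
CREATE OR REPLACE TABLE my_daily_data (my_timestamps TIMESTAMP_NTZ, my_metric FLOAT);

INSERT INTO my_daily_data VALUES
  ('2020-01-01 00:00:00', 2.0),
  ('2020-01-02 00:00:00', 3.0),
  ('2020-01-03 00:00:00', 4.0),
  ('2020-01-04 00:00:00', 5.0),
  ('2020-01-05 00:00:00', 6.0);
```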
Creating forecasts
Once you have the prerequisites, you can use the AI & ML Studio in Snowsight to guide you through setup or you can use the following SQL commands to train a model and start creating forecasts:
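A minimal sketch of those commands, assuming a view named my_view with a timestamp column my_timestamps and a numeric column my_metric (the view, column, and model names are placeholders):

```sql
-- Train a model on the historical data.
CREATE OR REPLACE SNOWFLAKE.ML.FORECAST my_model(
  INPUT_DATA => TABLE(my_view),
  TIMESTAMP_COLNAME => 'my_timestamps',
  TARGET_COLNAME => 'my_metric'
);

-- Forecast the next 7 timestamps.
CALL my_model!FORECAST(FORECASTING_PERIODS => 7);
```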
For more details on syntax and available methods, see the FORECAST (SNOWFLAKE.ML) reference.
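Diving deeper into forecasting
The forecasting function is built to predict any numeric time series data into the future. In addition to the simple case presented in the forecasting quickstart section, you can do the following:
- Forecast multiple series at once. For example, you can forecast the sales of multiple items for the next two weeks.
- Train and forecast using features. Features are additional factors that you believe influence the metric you want to forecast.
- Assess your model's accuracy.
- Understand the relative importance of the features the model was trained on.
- Debug training errors.
The following sections give examples of these scenarios and more detail about how forecasting works.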
Examples
This section shows how to set up your data for forecasting and how to create a forecasting model from your time series data.
Note
Ideally, the training data for a Forecasting model has time steps at equally spaced intervals (for example, daily). However, model training can handle real-world data that has missing, duplicate, or misaligned time steps. For more information, see Dealing with real-world data in Time-Series Forecasting.
Setting up sample data
The example below creates two tables. Views of these tables are included in the examples later in this topic.
The sales_data table contains sales data. Each sale includes a store ID, an item identifier, a timestamp, and
the sales amount. Additional feature columns (temperature, humidity, and holiday) are also included.
The future_features table contains future values of the feature columns, which are necessary when forecasting
using features as part of your prediction process.
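A sketch of the two tables follows. The column names come from the descriptions above; the column types and all row values are illustrative assumptions, generated so that each series has twelve daily rows.

```sql
-- Historical sales with feature columns (values are illustrative).
CREATE OR REPLACE TABLE sales_data (
  store_id NUMBER, item VARCHAR, date TIMESTAMP_NTZ,
  sales FLOAT, temperature NUMBER, humidity FLOAT, holiday VARCHAR
);

-- Series 1: jackets in store 1, twelve daily rows with a linear trend.
INSERT INTO sales_data
  SELECT 1, 'jacket',
         DATEADD('day', n, '2020-01-01'::TIMESTAMP_NTZ),
         2.0 + n, 50 + n, 0.3, IFF(n = 0, 'new year', NULL)
  FROM (SELECT ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1 AS n
        FROM TABLE(GENERATOR(ROWCOUNT => 12)));

-- Series 2: umbrellas in store 2, same dates.
INSERT INTO sales_data
  SELECT 2, 'umbrella',
         DATEADD('day', n, '2020-01-01'::TIMESTAMP_NTZ),
         2.0 + 2 * n, 50 + n, 0.3, IFF(n = 0, 'new year', NULL)
  FROM (SELECT ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1 AS n
        FROM TABLE(GENERATOR(ROWCOUNT => 12)));

-- Future feature values for the two days after the training data ends.
CREATE OR REPLACE TABLE future_features (
  store_id NUMBER, item VARCHAR, date TIMESTAMP_NTZ,
  temperature NUMBER, humidity FLOAT, holiday VARCHAR
);

INSERT INTO future_features VALUES
  (1, 'jacket',   '2020-01-13', 52, 0.3, NULL),
  (1, 'jacket',   '2020-01-14', 53, 0.3, NULL),
  (2, 'umbrella', '2020-01-13', 52, 0.3, NULL),
  (2, 'umbrella', '2020-01-14', 53, 0.3, NULL);
```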
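Forecasting on a single series
This example uses a single time series (that is, all rows belong to one series) with two columns, a timestamp column and a target value column, and no additional features.
First, prepare the example dataset for training the model: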
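A sketch of this step, assuming the sales_data table described in the sample data section. The view name v1 and the choice of the store 1 jacket series are illustrative.

```sql
-- Keep only the timestamp and target columns of one series.
CREATE OR REPLACE VIEW v1 AS
  SELECT date, sales
  FROM sales_data
  WHERE store_id = 1 AND item = 'jacket';

SELECT * FROM v1;
```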
The SELECT statement returns the following:
Now train a forecasting model using this view:
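A sketch of the training call, assuming a view named v1 with columns date and sales. The model name model1 is an illustrative choice.

```sql
-- Train a single-series model: only the timestamp and target columns are named.
CREATE OR REPLACE SNOWFLAKE.ML.FORECAST model1(
  INPUT_DATA => TABLE(v1),
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales'
);
```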
After the model is trained, the following message appears:
Next, use the forecasting model to forecast the next three timestamps:
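A sketch of the forecast call, assuming a trained model named model1 (an illustrative name):

```sql
-- Forecast three steps past the end of the training data.
CALL model1!FORECAST(FORECASTING_PERIODS => 3);
```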
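Output
Note that the model has inferred the interval between timestamps from the training data.
In this example, because the forecast is perfectly linear and has zero error against the actual values, the prediction interval (LOWER_BOUND, UPPER_BOUND) is the same as the FORECAST value.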
To customize the size of the prediction interval, pass prediction_interval as part of a configuration object:
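For instance, assuming a trained model named model1 (an illustrative name), an 80% prediction interval could be requested like this:

```sql
-- 0.8 requests an 80% prediction interval instead of the default 0.95.
CALL model1!FORECAST(
  FORECASTING_PERIODS => 3,
  CONFIG_OBJECT => {'prediction_interval': 0.8}
);
```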
To save your results directly to a table, use CREATE TABLE … AS SELECT … and call the FORECAST method in the FROM clause:
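A sketch of this pattern, assuming a trained model named model1. The target table name my_forecasts is hypothetical.

```sql
-- No CALL here: the method call is wrapped in TABLE(...) inside the FROM clause.
CREATE TABLE my_forecasts AS
  SELECT * FROM TABLE(model1!FORECAST(FORECASTING_PERIODS => 3));
```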
As shown in the example above, when calling the method, omit the CALL command. Instead, put the call in parentheses, preceded by the TABLE keyword.
Forecasting on multiple series
To create a forecasting model for multiple series at once, use the series_colname parameter.
In this example, the data contains store_id and item columns. To forecast sales separately for every store/item
combination in the dataset, create a new column that combines these values, and specify that as the series
column.
The following query creates a new view combining store_id and item into a new column named
store_item:
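One way to write this query, assuming the sales_data table described in the sample data section. Combining the two values into an array is an assumption about how to build the series column, and the view name v3 is illustrative.

```sql
-- store_item identifies each series, for example [1, 'jacket'].
CREATE OR REPLACE VIEW v3 AS
  SELECT [store_id, item] AS store_item, date, sales
  FROM sales_data;

SELECT * FROM v3;
```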
Output
The first five rows of each series in the resulting dataset are:
Now use the forecasting function to train a model for each series, all in one step. Note that the series_colname parameter is set
to store_item:
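A sketch of the multi-series training call, assuming a view named v3 with columns store_item, date, and sales. The view and model names are illustrative.

```sql
-- One model object holds a separately trained model per store_item value.
CREATE OR REPLACE SNOWFLAKE.ML.FORECAST model2(
  INPUT_DATA => TABLE(v3),
  SERIES_COLNAME => 'store_item',
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales'
);
```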
Next, use the model to forecast the next two timestamps for all series:
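A sketch of this call, assuming a trained multi-series model named model2 (an illustrative name):

```sql
-- Returns two forecast rows for every series the model was trained on.
CALL model2!FORECAST(FORECASTING_PERIODS => 2);
```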
Output
You can also forecast a specific series as follows:
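A sketch, assuming a trained multi-series model named model2 whose series column holds arrays such as [2, 'umbrella']. Both the model name and the array form of the series value are assumptions.

```sql
-- SERIES_VALUE selects one series, so only that series is forecast.
CALL model2!FORECAST(
  SERIES_VALUE => [2, 'umbrella'],
  FORECASTING_PERIODS => 2
);
```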
Output
The result shows only the next two steps for store 2's umbrella sales.
Tip
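Specifying one series with the FORECAST method is more efficient than filtering the results of a multi-series forecast down to the series you care about, because only that one series' forecast is generated.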
Forecasting with features
If you want additional features (for example, holidays or weather) to influence your forecasts, you must include these features
in your training data. Here you create a view containing those fields from the sales_data table:
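A sketch of such a view, assuming the sales_data table described in the sample data section. The view name v2 and the choice of the store 1 jacket series are illustrative.

```sql
-- Timestamp, target, and the three feature columns for one series.
CREATE OR REPLACE VIEW v2 AS
  SELECT date, sales, temperature, humidity, holiday
  FROM sales_data
  WHERE store_id = 1 AND item = 'jacket';

SELECT * FROM v2 LIMIT 5;
```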
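Output
These are the first five rows of the SELECT query result.
Now you can use this view to train your model. You need to specify only the timestamp and target column names; any other columns in the input data are assumed to be features for training.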
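A sketch of the training call, assuming a view named v2 with columns date, sales, temperature, humidity, and holiday. The model name model3 is illustrative.

```sql
-- temperature, humidity, and holiday are picked up as features automatically.
CREATE OR REPLACE SNOWFLAKE.ML.FORECAST model3(
  INPUT_DATA => TABLE(v2),
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales'
);
```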
To generate forecasts with this model, you must provide future values for the features to the model: in this case, TEMPERATURE,
HUMIDITY and HOLIDAY. This allows the model to adjust its sales forecasts based on temperature, humidity, and holiday
forecasts.
Now create a view from the future_features table containing this data for future timestamps:
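A sketch of this view, assuming the future_features table described in the sample data section and the same store 1 jacket series:

```sql
-- Future timestamps with their feature values; there is no sales column.
CREATE OR REPLACE VIEW v2_forecast AS
  SELECT date, temperature, humidity, holiday
  FROM future_features
  WHERE store_id = 1 AND item = 'jacket';

SELECT * FROM v2_forecast;
```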
Output
Now you can generate a forecast using this data:
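A sketch of the call, assuming a model named model3 trained with features and a view v2_forecast that holds the future feature values. The model name is illustrative.

```sql
-- The forecast horizon is defined by the rows of v2_forecast.
CALL model3!FORECAST(
  INPUT_DATA => TABLE(v2_forecast),
  TIMESTAMP_COLNAME => 'date'
);
```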
In this variation of the FORECAST method, you do not specify the number of timestamps to predict. Instead, the timestamps
of the forecast come from the v2_forecast view.
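Troubleshooting and model evaluation
You can use the following helper functions to evaluate your model's performance, understand which features have the most impact on your model, and debug any errors that occur during training:
Evaluation metrics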
To get the evaluation metrics for your model, call the <model_name>!SHOW_EVALUATION_METRICS method. By default, the forecasting function evaluates all models it trains using a method called cross-validation. This means that under the hood, in addition to training the final model on all of the training data you provide, the function also trains models on subsets of your training data. Those models are then used to predict your target metric on the withheld data, allowing the function to compare those predictions to actual values in your historical data.
If you don’t need these evaluation metrics, you can set evaluate to FALSE. If you want to control the way cross-validation is run,
you can use the following parameters:
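- n_splits: The number of splits in your data for cross-validation. Defaults to 1.
- max_train_size: The maximum number of rows for a single training set.
- test_size: Limits the number of rows included in each test set.
- gap: The gap between the end of each training set and the start of the test set.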
For complete details on evaluation parameters, see Evaluation configuration.
Note
Small datasets may not have enough data to perform evaluation. The total number of training rows must be equal to or greater
than (n_splits * test_size) + gap. If not enough data is available to train an evaluation model, no evaluation metrics are available
even when evaluate is set to TRUE.
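A sketch of passing these parameters at training time. It assumes that they are nested under an evaluation_config key of the config object; confirm that key against the Evaluation configuration reference. It also assumes a view my_view with columns my_timestamps and my_metric; the view, column, and model names are placeholders.

```sql
-- With n_splits = 2, test_size = 10, and gap = 0, the training data needs
-- at least (2 * 10) + 0 = 20 rows for evaluation to run.
CREATE OR REPLACE SNOWFLAKE.ML.FORECAST my_evaluated_model(
  INPUT_DATA => TABLE(my_view),
  TIMESTAMP_COLNAME => 'my_timestamps',
  TARGET_COLNAME => 'my_metric',
  CONFIG_OBJECT => {
    'evaluate': TRUE,
    'evaluation_config': {'n_splits': 2, 'test_size': 10, 'gap': 0}
  }
);
```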
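When n_splits is 1 (the default), the standard deviation of the evaluation metric values is NULL because only a single validation dataset is used.
Example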
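A sketch of an evaluation run on synthetic data. The table name random_data, the model name eval_model, and the generated columns (ts, exog_a, exog_b, y) are illustrative assumptions.

```sql
-- 500 one-minute steps; y depends on exog_a plus a repeating 0-9 pattern.
CREATE OR REPLACE TABLE random_data AS
  SELECT
    DATEADD('minute', n, '2023-12-01'::TIMESTAMP_NTZ) AS ts,
    UNIFORM(1, 100, RANDOM(0)) AS exog_a,
    UNIFORM(1, 100, RANDOM(1)) AS exog_b,
    MOD(n, 10) + exog_a AS y
  FROM (SELECT ROW_NUMBER() OVER (ORDER BY SEQ4()) AS n
        FROM TABLE(GENERATOR(ROWCOUNT => 500)));

-- Evaluation is on by default, so no config object is needed here.
CREATE OR REPLACE SNOWFLAKE.ML.FORECAST eval_model(
  INPUT_DATA => TABLE(random_data),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'y'
);

CALL eval_model!SHOW_EVALUATION_METRICS();
```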
Output
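Feature importance
To understand the relative importance of the features used in your model, use the <model_name>!EXPLAIN_FEATURE_IMPORTANCE method, which returns the relative feature importance for each feature used by the model.
When you train a forecasting model, the model learns patterns from the data you provide: the timestamps, your target metric, any additional columns you supply (features), and features generated automatically to improve forecast performance. Training detects how important each of these is for making accurate predictions. This helper function reports the relative importance of these features on a scale of 0 to 1.
Under the hood, the helper function counts the number of times the model used each feature to make a decision. These counts are then normalized to values between 0 and 1 that sum to 1. The resulting scores give an approximate ranking of the features in your trained model.
Key considerations for this feature
- Features with similar scores have similar importance.
- For extremely simple series (for example, when the target column has a constant value), all feature importance scores may be zero.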
- Using multiple features that are very similar to each other may result in reduced importance scores for those features. For example, if two features are exactly identical, the model may treat them as interchangeable when making decisions, resulting in feature importance scores that are half of what those scores would be if only one of the identical features were included.
Example
This example uses the data from the evaluation example and calls the feature
importance method. You can see that the exog_a variable that was created is the second most important feature - behind all rolling
averages, which are aggregated under the aggregated_endogenous_trend_features feature name.
Execute the following statement to get the feature importance:
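A sketch of the call, assuming a trained model named eval_model (a hypothetical name; substitute the model trained in the evaluation example):

```sql
-- Returns one row per feature with its normalized importance score.
CALL eval_model!EXPLAIN_FEATURE_IMPORTANCE();
```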
Output
Troubleshooting
When you train multiple series with CONFIG_OBJECT => {'ON_ERROR': 'SKIP'}, individual time series models can
fail to train without the overall training process failing. To understand which time series failed and why, call the
<model_name>!SHOW_TRAINING_LOGS method.
Example
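A sketch of this workflow. The table t_error, its rows, and the model name error_model are hypothetical. Series B reuses a single timestamp for every row, which is assumed here to make that series fail to train while series A succeeds.

```sql
CREATE OR REPLACE TABLE t_error (date TIMESTAMP_NTZ, sales FLOAT, series VARCHAR);

INSERT INTO t_error VALUES
  ('2020-01-01', 2.0, 'A'),
  ('2020-01-02', 3.0, 'A'),
  ('2020-01-03', 3.0, 'A'),
  ('2020-01-04', 7.0, 'A'),
  ('2020-01-06', 10.0, 'B'),  -- same timestamp repeated for series B
  ('2020-01-06', 13.0, 'B'),
  ('2020-01-06', 12.0, 'B'),
  ('2020-01-06', 15.0, 'B');

-- ON_ERROR = SKIP lets training continue past series that fail.
CREATE OR REPLACE SNOWFLAKE.ML.FORECAST error_model(
  INPUT_DATA => TABLE(t_error),
  SERIES_COLNAME => 'series',
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales',
  CONFIG_OBJECT => {'ON_ERROR': 'SKIP'}
);

-- Lists the series that failed to train and why.
CALL error_model!SHOW_TRAINING_LOGS();
```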
Output
Model management
To view a list of your models, use the SHOW SNOWFLAKE.ML.FORECAST command:
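For example:

```sql
-- Lists the forecast models visible to the current role.
SHOW SNOWFLAKE.ML.FORECAST;
```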
To delete a model, use the DROP SNOWFLAKE.ML.FORECAST command:
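For example, assuming a model named my_model (a placeholder name):

```sql
-- Removes the model object from the schema.
DROP SNOWFLAKE.ML.FORECAST my_model;
```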
Models are immutable and cannot be updated in place. Train a new model instead.
Warehouse selection
A Snowflake virtual warehouse provides the compute resources for training and using the machine learning models for this feature. This section provides general guidance on selecting the best type and size of warehouse for this purpose, focusing on the training step, the most time-consuming and memory-intensive part of the process.
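When choosing a warehouse, keep two key factors in mind:
- The number of rows and columns your data contains.
- The number of distinct series your data contains.
You can use the following rules of thumb to choose your warehouse: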
- If you are training on a longer time series (> 5 million rows) or on many columns (many features), consider upgrading to Snowpark-optimized warehouses.
- If you are training on many series, size up. The forecasting function distributes model training across all available nodes in your warehouse when you are training for multiple series at once.
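The following table summarizes the same guidance:
| Series type | < 5 million rows | > 5 million rows and ≤ 100 million rows | > 100 million rows |
|---|---|---|---|
| One series | Standard warehouse; XS | Snowpark-optimized warehouse; XS | Consider aggregating to a less frequent timestamp interval (for example, hourly to daily) |
| Multiple series | Standard warehouse; size up | Snowpark-optimized warehouse; size up | Consider batching training by series into multiple jobs |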
As a rough estimate, training time increases with the number of rows in your time series. For example, on an XS standard warehouse,
with evaluation turned off (CONFIG_OBJECT => {'evaluate': False}), training on a 100,000-row dataset takes about
400 seconds. Training on a 1,000,000-row dataset takes about 850 seconds. With
evaluation turned on, training time increases roughly linearly by the number of splits used.
Algorithm details
The forecasting algorithm is selected with the method parameter of the config object
(CONFIG_OBJECT => {'method': '<method>'}). This parameter defaults to 'best'. When the method is set to 'best', the
algorithm used is an ensemble of multiple models, including Prophet (https://facebook.github.io/prophet/),
ARIMA,
Exponential Smoothing, and a
gradient boosting machine (described further below).
When the method is set to 'fast', the algorithm used is a gradient boosting machine (GBM). Like an ARIMA model,
it uses a differencing transformation to model data with a non-stationary trend and uses auto-regressive lags of the
historical target data as model variables. Additionally, the algorithm uses rolling averages of historical target data
to help predict trends, and automatically produces cyclic calendar variables (such as day of week and week of year) from
timestamp data.
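As an illustration of the method setting, the following sketch trains a model with the 'fast' method. It assumes a view named v1 with columns date and sales; the view and model names are illustrative.

```sql
-- 'fast' uses only the gradient boosting machine; omit 'method' to get 'best'.
CREATE OR REPLACE SNOWFLAKE.ML.FORECAST fast_model(
  INPUT_DATA => TABLE(v1),
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales',
  CONFIG_OBJECT => {'method': 'fast'}
);
```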
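You can fit a model using only historical target and timestamp data, or you can include features (extra columns) that might influence the target value. These exogenous variables can be numeric or categorical and can be NULL (rows containing NULLs in exogenous variables are not dropped).
When training on categorical variables, the algorithm does not rely on one-hot encoding, so you can use categorical data with many distinct values (high cardinality).
If your model includes features, you must provide values for those features at every timestamp of the full forecast horizon when generating a forecast. Appropriate features include weather data (temperature, rainfall), company-specific information (historical and planned company holidays, advertising campaigns, event schedules), or any other external factors you believe may help predict your target variable.
In addition to forecasts, the algorithm produces prediction intervals. A prediction interval is an estimated range of values, between a lower bound and an upper bound, within which a certain percentage of data is likely to fall. For example, a value of 0.95 means that 95% of the data is likely to fall within the interval. You can specify the prediction interval percentage or use the default of 0.95. The lower and upper bounds of the prediction interval are returned as part of the forecast output.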
Important
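From time to time, Snowflake may refine the forecasting algorithm. Such improvements roll out through the regular Snowflake release process. You cannot revert to a previous version of the feature, but models created with a previous version continue to use that version for forecasting until they are deprecated through the behavior change release process.
Current limitations
The current release has the following limitations:
- Apart from the method setting ('best' or 'fast'), you cannot choose or adjust the forecasting algorithm.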
- The minimum number of rows for the main forecasting algorithm is 12 per time series. For time series with between 2 and 11 observations, forecasting produces a “naive” forecast where all forecasted values are equal to the last observed target value.
- The forecasting function does not provide parameters to override trend, seasonality, or seasonal amplitudes; these are inferred from the data.
- The minimum acceptable data granularity is one second. (Timestamps must not be less than one second apart.)
- The minimum granularity of seasonal components is one minute. (The function cannot detect cyclic patterns at smaller time deltas.)
- The “season length” of autoregressive features is tied to the input frequency (24 for hourly data, 7 for daily data, and so on).
- Forecast models, once trained, are immutable. You cannot update existing models with new data; you must train an entirely new model.
- Models do not support versioning. Snowflake recommends retraining a model on a regular cadence, perhaps daily, weekly, or monthly, depending on how frequently you receive new data, allowing the model to adjust to changing patterns and trends.
- You cannot clone models or share models across roles or accounts. When cloning a schema or database, model objects are skipped.
- You cannot replicate an instance of the FORECAST class.
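Granting privileges to create forecast objects
Training a forecasting model creates a schema-level object. Therefore, the role you use to create models must have the CREATE SNOWFLAKE.ML.FORECAST privilege on the schema where the model is created, so that the model can be stored there. This privilege is similar to other schema privileges such as CREATE TABLE or CREATE VIEW.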
Snowflake recommends that you create a role named analyst to be used by people who need to create forecasts.
In the following example, the admin role is the owner of the schema admin_db.admin_schema. The
analyst role needs to create models in this schema.
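A sketch of the grants this scenario calls for. The USAGE grants are an assumption about what the analyst role needs in order to reach the schema.

```sql
USE ROLE admin;

-- Let the analyst role reach the schema.
GRANT USAGE ON DATABASE admin_db TO ROLE analyst;
GRANT USAGE ON SCHEMA admin_db.admin_schema TO ROLE analyst;

-- Let the analyst role create forecast models in it.
GRANT CREATE SNOWFLAKE.ML.FORECAST ON SCHEMA admin_db.admin_schema TO ROLE analyst;
```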
To use this schema, a user assumes the role analyst:
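For example:

```sql
-- Switch to the analyst role and work in the admin-owned schema.
USE ROLE analyst;
USE SCHEMA admin_db.admin_schema;
```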
If the analyst role has CREATE SCHEMA privileges in database analyst_db, the role can create a new schema
analyst_db.analyst_schema and create forecast models in that schema:
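For example:

```sql
-- The analyst role owns the new schema, so it can create models there.
USE ROLE analyst;
CREATE SCHEMA analyst_db.analyst_schema;
USE SCHEMA analyst_db.analyst_schema;
```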
To revoke a role’s forecast model creation privilege on the schema, use REVOKE <privileges> … FROM ROLE:
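For example, to undo the model creation grant on admin_db.admin_schema for the analyst role:

```sql
REVOKE CREATE SNOWFLAKE.ML.FORECAST ON SCHEMA admin_db.admin_schema FROM ROLE analyst;
```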
Cost considerations
For details on costs for using ML functions, see Cost Considerations in the ML functions overview.