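Parallel Hyperparameter Optimization (HPO) on Container Runtime
The Snowflake ML Hyperparameter Optimization (HPO) API is a model-agnostic framework that enables efficient, parallelized hyperparameter tuning of models. You can use any open-source framework or algorithm. You can also use the Snowflake ML APIs.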
You can use HPO in a Snowflake Notebook that’s configured to use the Container Runtime on Snowpark Container Services (SPCS). After you create such a notebook, you can:
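- Train a model using any open-source package, and use this API to distribute the hyperparameter tuning process
- Train a model using the Snowflake ML distributed training APIs, and scale HPO while also scaling each of the training runs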
The HPO workload that you initiate from your notebook executes inside Snowpark Container Services on either CPU or GPU instances. The workload scales out to the CPU or GPU cores that are available on a single node in the SPCS compute pool.
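The parallelized HPO API provides the following benefits:
- A single API that automatically handles all the complexities of distributing the training across multiple resources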
- The ability to train with virtually any framework or algorithm using open-source ML frameworks or the Snowflake ML modeling APIs
- A selection of tuning and sampling options, including Bayesian and random search algorithms along with various continuous and non-continuous sampling functions
- Tight integration with the rest of Snowflake; for example, efficient data ingestion via Snowflake Datasets or DataFrames and automatic ML lineage capture
Note
You can scale the HPO run to use multiple nodes in the SPCS compute pool. For more information, see Running a workload on a multi-node cluster.
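Optimize the hyperparameters of a model
Use the Snowflake ML HPO API to tune a model. The following steps illustrate the process:
- Ingest the data.
- Use the search algorithm to define the strategy used to optimize the hyperparameters.
- Define how the hyperparameters are sampled.
- Configure the tuner.
- Get the hyperparameters and training metrics from each training job.
- Initiate the training job.
- Get the results of the training job.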
The following sections walk through the preceding steps. For an example, see Container Runtime HPO Example (https://github.com/Snowflake-Labs/sf-samples/blob/main/samples/ml/container_runtime_hpo/hpo_example.ipynb).
Ingest the data
Use the dataset_map object to ingest the data into the HPO API. The dataset_map object is a dictionary that pairs
the training or test dataset with its corresponding Snowflake DataConnector object. The dataset_map object is passed to the
training function. The following is an example of a dataset_map object:
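Define the search algorithm
Define the search algorithm used to explore the hyperparameter space. The algorithm uses the results of previous trials to determine how to configure the hyperparameters. You can use the following search algorithms:
- Grid search
Explores a grid of hyperparameter values that you define. The HPO API evaluates every possible combination of hyperparameters. The following is an example of a hyperparameter grid:
In the preceding example, each parameter has two possible values. There are 8 (2 * 2 * 2) possible combinations of hyperparameters.
- Bayesian optimization
Uses a probabilistic model to determine the next set of hyperparameters to evaluate. The algorithm uses the outcomes of previous trials to determine how to configure the hyperparameters. For more information about Bayesian optimization, see Bayesian Optimization (https://github.com/bayesian-optimization/BayesianOptimization).
- Random search
Samples the hyperparameter space randomly. This approach is simple and effective, and it works particularly well with large or mixed (continuous or discrete) search spaces.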
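To make the grid search combination count concrete, the following minimal sketch enumerates a grid in plain Python. It does not use the HPO API. The parameter names and values are hypothetical, chosen only so that each of three parameters has two candidate values.

```python
from itertools import product

# Hypothetical grid: three hyperparameters, two candidate values each.
grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5],
    "learning_rate": [0.01, 0.1],
}

# Grid search evaluates the Cartesian product of all candidate values.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(combinations))  # 2 * 2 * 2 = 8
```

Because the number of combinations is the product of the list lengths, adding one more value to any parameter grows the grid multiplicatively. That growth is the usual reason to prefer random or Bayesian search for larger spaces.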
You can use the following code to define the search algorithm:
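Define the hyperparameter sampling
Use search space functions to define the hyperparameter sampling method during each trial. Use them to describe the range and the type of values that the hyperparameters can take.
The following are the available sampling functions: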
- uniform(lower, upper): Samples a continuous value uniformly between lower and upper. Useful for parameters like dropout rates or regularization strengths.
- loguniform(lower, upper): Samples a value in logarithmic space, ideal for parameters that span several orders of magnitude (e.g., learning rates).
- randint(lower, upper): Samples an integer uniformly between lower (inclusive) and upper (exclusive). Suitable for discrete parameters like the number of layers.
- choice(options): Randomly selects a value from a provided list. Often used for categorical parameters.
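The behavior of the four sampling functions can be illustrated with plain-Python stand-ins built on the standard `random` module. This is a sketch of the sampling semantics described above, not the HPO API itself. The functions below only mirror the documented names, and the search space keys are hypothetical.

```python
import math
import random

rng = random.Random(0)  # seeded so the sketch is repeatable

def uniform(lower, upper):
    """Continuous value drawn uniformly between lower and upper."""
    return rng.uniform(lower, upper)

def loguniform(lower, upper):
    """Value whose logarithm is uniform, so each order of magnitude is equally likely."""
    return math.exp(rng.uniform(math.log(lower), math.log(upper)))

def randint(lower, upper):
    """Integer drawn uniformly from lower (inclusive) to upper (exclusive)."""
    return rng.randrange(lower, upper)

def choice(options):
    """One element picked at random from a list of options."""
    return rng.choice(options)

# Hypothetical search space: each entry is a zero-argument sampler.
search_space = {
    "dropout": lambda: uniform(0.0, 0.5),
    "learning_rate": lambda: loguniform(1e-5, 1e-1),
    "num_layers": lambda: randint(1, 6),
    "activation": lambda: choice(["relu", "tanh"]),
}

# Draw one configuration, as a single trial would receive it.
trial_config = {name: sample() for name, sample in search_space.items()}
print(trial_config)
```

Sampling the learning rate in logarithmic space gives values near 1e-5 the same chance as values near 1e-2, which a plain uniform draw over the same range would almost never produce.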
The following example shows how to define the search space using uniform functions:
Configure the tuner
Use the TunerConfig object to configure the tuner. Within the object, you specify the metric being optimized, the optimization mode, and the other execution parameters. The following are the available configuration options:
- Metric: The performance metric, such as accuracy or loss, that you’re optimizing.
- Mode: Determines whether the objective is to maximize or minimize the metric ("max" or "min").
- Search Algorithm: Specifies the strategy for exploring the hyperparameter space.
- Number of Trials: Sets the total number of hyperparameter configurations to evaluate.
- Concurrency: Defines how many trials can run concurrently.
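The following example code uses the Bayesian optimization library to maximize the model's accuracy over five trials.
Get the hyperparameters and training metrics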
The Snowflake ML HPO API requires the training metrics and hyperparameters from each training run to optimize the hyperparameters effectively. Use the TunerContext object to get the hyperparameters and training metrics. The following example creates a training function to get the hyperparameters and training metrics:
Initiate the training job
Use the Tuner object to initiate the training job. The Tuner object takes the training function, search space, and tuner configuration as arguments. The following is an example of how to initiate the training job:
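The preceding code distributes the training function across the available resources. It collects and summarizes the trial results and identifies the best-performing configuration.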
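The general pattern of running trials with bounded concurrency and then picking the best one can be sketched with the Python standard library. This is a conceptual illustration only. It is not how the Tuner object is implemented, and the toy objective, the `train_func` body, and the parameter values are all assumptions made for the example.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def train_func(config):
    """Toy stand-in for a training run: returns the trial's config and metric."""
    # Hypothetical objective that peaks at learning_rate == 0.1.
    accuracy = 1.0 - (config["learning_rate"] - 0.1) ** 2
    return {"config": config, "accuracy": accuracy}

rng = random.Random(0)
num_trials = 5
max_concurrent_trials = 2

# Random search: draw every trial's configuration up front.
configs = [{"learning_rate": rng.uniform(0.0, 0.3)} for _ in range(num_trials)]

# Run at most max_concurrent_trials trials at the same time.
with ThreadPoolExecutor(max_workers=max_concurrent_trials) as pool:
    results = list(pool.map(train_func, configs))

# Identify the best-performing configuration (optimization mode "max").
best = max(results, key=lambda r: r["accuracy"])
print(best)
```

A search algorithm that learns from earlier trials, such as Bayesian optimization, could not draw all configurations up front as this sketch does. It would instead propose each new configuration after some results have been reported.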
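Get the results of the training jobs
After all trials are completed, the TunerResults object consolidates the outcomes of each trial. It provides structured access to the performance metrics, the best configuration, and the best model. The following are the available attributes of the object:
- results: A Pandas DataFrame containing the evaluation metrics and configuration parameters for every trial.
- best_result: A DataFrame row summarizing the trial with the best performance.
- best_model: The model instance associated with the best trial, if applicable.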
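How a best result relates to the full set of results depends on the optimization mode. The following plain-Python sketch shows that relationship using a list of dictionaries in place of a DataFrame. The `best_result` helper and the trial records are hypothetical and are not part of the TunerResults object.

```python
def best_result(results, metric, mode):
    """Return the row with the best value of metric; mode is "max" or "min"."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    pick = max if mode == "max" else min
    return pick(results, key=lambda row: row[metric])

# Hypothetical trial records: one row per trial with its metrics and configuration.
results = [
    {"trial": 0, "accuracy": 0.81, "loss": 0.52, "learning_rate": 0.010},
    {"trial": 1, "accuracy": 0.93, "loss": 0.31, "learning_rate": 0.050},
    {"trial": 2, "accuracy": 0.90, "loss": 0.27, "learning_rate": 0.100},
]

print(best_result(results, "accuracy", "max")["trial"])  # 1
print(best_result(results, "loss", "min")["trial"])      # 2
```

The same set of trials can yield a different best row depending on which metric and mode are configured, so the metric reported by the training function should match the one named in the tuner configuration.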
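The following code gets the results, the best model, and the best result:
API reference
Tuner
The following is the import statement for the Tuner module:
The Tuner class is the main interface for interacting with the Container Runtime HPO API. To run an HPO job, use the following code to initialize a Tuner object and call the run method with the datasets.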
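SearchSpace
The following is the import statement for the search space:
The following code defines the search space functions:
TunerConfig
The following is the import statement for the TunerConfig module:
Use the following code to define the configuration class for the tuner:
SearchAlgorithm
The following is the import statement for the search algorithm:
The following code creates a Bayesian optimization search algorithm object:
The following code creates a random search algorithm object:
TunerResults
The following code creates a TunerResults object:
get_tuner_context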
The following is the import statement for the get_tuner_context module:
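This helper method is designed to be called inside the training function. It returns a TunerContext object that encapsulates several fields that are useful for running the trial, including:
- The hyperparameters selected by the HPO framework for the current trial.
- The datasets required for training.
- A helper function for reporting metrics, which guides the HPO framework in suggesting the next set of hyperparameters.
The following code creates a tuner context object:
Limitations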
Bayesian optimization requires continuous search spaces and works only with the uniform sampling function. It is incompatible with discrete parameters
sampled using the tune.randint or tune.choice methods. To work around this limitation, either use
tune.uniform and cast the parameter inside the training function, or switch to a search algorithm that handles
both discrete and continuous spaces, such as tune.RandomSearch.
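The casting workaround can be sketched as follows. The example assumes that the training function receives only continuous samples and converts them to the discrete values the model needs. The helper names, the `params` dictionary, and the value ranges are hypothetical.

```python
def to_int(value, lower, upper):
    """Round a continuous sample to the nearest integer and clamp it to [lower, upper]."""
    return max(lower, min(upper, int(round(value))))

def to_choice(value, options):
    """Map a continuous sample in [0, 1] onto one element of options."""
    index = min(int(value * len(options)), len(options) - 1)
    return options[index]

def train_func(params):
    """Hypothetical training function that receives only continuous samples."""
    num_layers = to_int(params["num_layers"], 1, 8)
    optimizer = to_choice(params["optimizer"], ["sgd", "adam", "rmsprop"])
    return num_layers, optimizer

print(train_func({"num_layers": 3.6, "optimizer": 0.5}))  # (4, 'adam')
```

One caveat with this approach is that the optimizer's probabilistic model still treats the parameter as continuous, so nearby samples that round to the same integer look like distinct points to it.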
Troubleshooting
| Error message | Possible causes | Possible solutions |
|---|---|---|
| Invalid search space configuration: BayesOpt requires all sampling functions to be of type ‘Uniform’. | Bayesian optimization works only with uniform sampling, not with discrete samples. (See Limitations above.) | Use tune.uniform and cast the parameter inside the training function, or switch to a search algorithm that handles discrete spaces, such as tune.RandomSearch. |
| Insufficient CPU resources. Required: 16, Available: 8. (The numbers of required and available resources may differ.) | max_concurrent_trials is set to a value higher than the available cores. | Follow the guidance provided by the error message. |
| Insufficient GPU resources. Required: 4, Available: 2. (The message may refer to CPU or GPU. The numbers of required and available resources may differ.) | max_concurrent_trials is set to a value higher than the available cores. | Follow the guidance provided by the error message. |
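One way to avoid the insufficient-resources errors is to cap the requested concurrency before configuring the tuner. The sketch below assumes that each trial needs a fixed number of cores (one by default). That per-trial requirement and the `cap_concurrency` helper are assumptions for illustration, not documented behavior of the HPO API.

```python
import os

def cap_concurrency(requested_trials, available_cores=None, cores_per_trial=1):
    """Return the largest trial concurrency that fits in the available cores."""
    if available_cores is None:
        available_cores = os.cpu_count() or 1  # cpu_count() can return None
    max_fit = max(1, available_cores // cores_per_trial)
    return min(requested_trials, max_fit)

print(cap_concurrency(16, available_cores=8))  # 8
print(cap_concurrency(4, available_cores=8))   # 4
```

The returned value could then be used for max_concurrent_trials so that the request never exceeds what the node offers.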
Next steps
- Container Runtime HPO Example (https://github.com/Snowflake-Labs/sf-samples/blob/main/samples/ml/container_runtime_hpo/hpo_example.ipynb)