Parallel Hyperparameter Optimization (HPO) on Container Runtime

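The Snowflake ML Hyperparameter Optimization (HPO) API is a model-agnostic framework that enables efficient, parallelized hyperparameter tuning of models. You can use any open-source framework or algorithm. You can also use the Snowflake ML APIs.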

You can use HPO in a Snowflake Notebook that’s configured to use the Container Runtime on Snowpark Container Services (SPCS). After you create such a notebook, you can:

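  • Train a model with any open-source package, and use this API to distribute the hyperparameter tuning process
  • Train a model with the Snowflake ML distributed training APIs, and scale HPO while also scaling each training run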

The HPO workload that you initiate from your notebook executes inside Snowpark Container Services on either CPU or GPU instances. The workload scales out to the CPU or GPU cores that are available on a single node in the SPCS compute pool.

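The parallelized HPO API provides the following benefits:

  • A single API that automatically handles all of the complexities of distributing the training across multiple resources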
  • The ability to train with virtually any framework or algorithm using open-source ML frameworks or the Snowflake ML modeling APIs
  • A selection of tuning and sampling options, including Bayesian and random search algorithms along with various continuous and non-continuous sampling functions
  • Tight integration with the rest of Snowflake; for example, efficient data ingestion via Snowflake Datasets or DataFrames and automatic ML lineage capture

Note

You can scale the HPO run to use multiple nodes in the SPCS compute pool. For more information, see Running a workload on a multi-node cluster.

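Optimize the hyperparameters of a model

Use the Snowflake ML HPO API to tune a model. The following steps illustrate the process:

  1. Ingest the data.
  2. Use a search algorithm to define the strategy for optimizing the hyperparameters.
  3. Define how the hyperparameters are sampled.
  4. Configure the tuner.
  5. Get the hyperparameters and training metrics from each training job.
  6. Start the training job.
  7. Get the results of the training job.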

The following sections walk through the preceding steps. For an example, see Container Runtime HPO Example (https://github.com/Snowflake-Labs/sf-samples/blob/main/samples/ml/container_runtime_hpo/hpo_example.ipynb).

Ingest the data

Use the dataset_map object to ingest the data into the HPO API. The dataset_map object is a dictionary that pairs the training or test dataset with its corresponding Snowflake DataConnector object. The dataset_map object is passed to the training function. The following is an example of a dataset_map object:

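from snowflake.ml.data.data_connector import DataConnector

# session is an active Snowpark session; X_train and X_test hold the training and test data.
dataset_map = {
  "train": DataConnector.from_dataframe(session.create_dataframe(X_train)),
  "test": DataConnector.from_dataframe(session.create_dataframe(X_test)),
}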

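Define the search algorithm

Define the search algorithm used to explore the hyperparameter space. The algorithm uses the outcomes of previous trials to determine how to configure the hyperparameters. You can use the following search algorithms:

  • Grid search

Searches a grid of hyperparameter values that you define. The HPO API evaluates every possible combination of hyperparameters. The following is an example of a hyperparameter grid: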

search_space = {
    "n_estimators": [50, 51],
    "max_depth": [4, 5],
    "learning_rate": [0.01, 0.3],
}

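In the preceding example, each parameter has two possible values. There are 8 (2 * 2 * 2) possible combinations of hyperparameters.

  • Bayesian optimization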

    Uses a probabilistic model to determine the next set of hyperparameters to evaluate. The algorithm uses the outcomes of previous trials to determine how to configure the hyperparameters. For more information about Bayesian optimization, see Bayesian Optimization (https://github.com/bayesian-optimization/BayesianOptimization).

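  • Random search

Samples the hyperparameter space randomly. This approach is simple and effective, and works particularly well with large or mixed (continuous and discrete) search spaces.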
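To make the grid search arithmetic above concrete, the following sketch enumerates every combination in the example grid using only the Python standard library. It illustrates how a grid expands and is not the HPO API's implementation; the helper name expand_grid is hypothetical.

```python
from itertools import product


def expand_grid(search_space):
    """Return one dict per combination of the listed hyperparameter values."""
    names = list(search_space)
    # product() yields the Cartesian product of the value lists, in order.
    return [dict(zip(names, values)) for values in product(*search_space.values())]


search_space = {
    "n_estimators": [50, 51],
    "max_depth": [4, 5],
    "learning_rate": [0.01, 0.3],
}

combinations = expand_grid(search_space)
print(len(combinations))  # 2 * 2 * 2 = 8
print(combinations[0])    # {'n_estimators': 50, 'max_depth': 4, 'learning_rate': 0.01}
```

Because the number of combinations is the product of the list lengths, adding one more value to any parameter in this grid raises the trial count from 8 to 12.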

You can use the following code to define the search algorithm:

from snowflake.ml.modeling.tune.search import BayesOpt, RandomSearch, GridSearch

# Choose one of the following search algorithms.
search_alg = BayesOpt()
search_alg = RandomSearch()
search_alg = GridSearch()

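Define the hyperparameter sampling

Use the search space functions to define how hyperparameters are sampled in each trial. Use them to describe the range and type of values that each hyperparameter can take.

The following are the available sampling functions: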

  • uniform(lower, upper): Samples a continuous value uniformly between lower and upper. Useful for parameters like dropout rates or regularization strengths.
  • loguniform(lower, upper): Samples a value in logarithmic space, ideal for parameters that span several orders of magnitude (e.g., learning rates).
  • randint(lower, upper): Samples an integer uniformly between lower (inclusive) and upper (exclusive). Suitable for discrete parameters like the number of layers.
  • choice(options): Randomly selects a value from a provided list. Often used for categorical parameters.
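The sampling behavior described in the list above can be approximated with Python's standard library. The following is a minimal sketch for building intuition, not the implementation used by the tune module; the sample_* function names are hypothetical.

```python
import math
import random


def sample_uniform(rng, lower, upper):
    # Every value between lower and upper is equally likely.
    return rng.uniform(lower, upper)


def sample_loguniform(rng, lower, upper):
    # Draw uniformly in log space, then map back, so each order of
    # magnitude between lower and upper is equally likely.
    return math.exp(rng.uniform(math.log(lower), math.log(upper)))


def sample_randint(rng, lower, upper):
    # randrange excludes the upper bound, matching "upper (exclusive)".
    return rng.randrange(lower, upper)


def sample_choice(rng, options):
    # Each option is equally likely.
    return rng.choice(options)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
trial = {
    "dropout": sample_uniform(rng, 0.1, 0.5),
    "learning_rate": sample_loguniform(rng, 1e-5, 1e-1),
    "num_layers": sample_randint(rng, 1, 5),
    "activation": sample_choice(rng, ["relu", "tanh", "sigmoid"]),
}
print(trial)
```

The log-space draw is the reason loguniform suits learning rates: a plain uniform draw between 1e-5 and 1e-1 would land above 1e-2 about 90% of the time and rarely explore the smaller magnitudes.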

The following example shows how to define a search space with the uniform function:

from snowflake.ml.modeling import tune

search_space = {
    "n_estimators": tune.uniform(50, 200),
    "max_depth": tune.uniform(3, 10),
    "learning_rate": tune.uniform(0.01, 0.3),
}

Configure the tuner

Use the TunerConfig object to configure the tuner. Within the object, you specify the metric being optimized, the optimization mode, and the other execution parameters. The following are the available configuration options:

  • Metric: The performance metric, such as accuracy or loss, that you’re optimizing.
  • Mode: Determines whether the objective is to maximize or minimize the metric ("max" or "min").
  • Search Algorithm: Specifies the strategy for exploring the hyperparameter space.
  • Number of Trials: Sets the total number of hyperparameter configurations to evaluate.
  • Concurrency: Defines how many trials can run concurrently.

The following example code uses the Bayesian optimization library to maximize the model's accuracy over five trials.

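from snowflake.ml.modeling import tune
from snowflake.ml.modeling.tune.search import BayesOpt

tuner_config = tune.TunerConfig(
  metric="accuracy",
  mode="max",
  # Bayesian optimization with an upper confidence bound (UCB) acquisition function.
  search_alg=BayesOpt(
      utility_kwargs={"kind": "ucb", "kappa": 2.5, "xi": 0.0}
  ),
  num_trials=5,
  max_concurrent_trials=1,
)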
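The metric and mode settings together decide which trial counts as the best one. The following sketch shows that selection on plain dictionaries. The trial values are made up for illustration, and the helper pick_best is hypothetical rather than part of the HPO API.

```python
def pick_best(trials, metric, mode):
    """Return the trial whose reported metric is best under the given mode."""
    if mode not in ("min", "max"):
        raise ValueError('mode must be "min" or "max"')
    chooser = max if mode == "max" else min
    return chooser(trials, key=lambda trial: trial["metrics"][metric])


# Made-up trial reports, for illustration only.
trials = [
    {"config": {"max_depth": 4}, "metrics": {"accuracy": 0.81, "loss": 0.52}},
    {"config": {"max_depth": 6}, "metrics": {"accuracy": 0.88, "loss": 0.35}},
    {"config": {"max_depth": 9}, "metrics": {"accuracy": 0.85, "loss": 0.31}},
]

best_by_accuracy = pick_best(trials, metric="accuracy", mode="max")
best_by_loss = pick_best(trials, metric="loss", mode="min")
print(best_by_accuracy["config"])  # {'max_depth': 6}
print(best_by_loss["config"])      # {'max_depth': 9}
```

The two calls pick different trials from the same reports, which is why the metric name must match a key that the training function reports and the mode must match the direction in which that metric improves.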

Get the hyperparameters and training metrics

The Snowflake ML HPO API requires the training metrics and hyperparameters from each training run to optimize the hyperparameters effectively. Use the TunerContext object to get the hyperparameters and training metrics. The following example creates a training function to get the hyperparameters and training metrics:

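from snowflake.ml.modeling.tune import get_tuner_context

def train_func():
  # Get the context for the current trial.
  tuner_context = get_tuner_context()
  config = tuner_context.get_hyper_params()
  dm = tuner_context.get_dataset_map()
  ...  # Train and evaluate the model with config and dm to produce model and accuracy.
  tuner_context.report(metrics={"accuracy": accuracy}, model=model)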

Start the training job

Use the Tuner object to initiate the training job. The Tuner object takes the training function, search space, and tuner configuration as arguments. The following is an example of how to initiate the training job:

from snowflake.ml.modeling import tune
tuner = tune.Tuner(train_func, search_space, tuner_config)
tuner_results = tuner.run(dataset_map=dataset_map)

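The preceding code distributes the training function across the available resources. It collects and summarizes the trial results, and identifies the best-performing configuration.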
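Conceptually, a tuning run repeats one loop: pick a configuration, run the training function, record the reported metric, and keep the best result. The following sequential sketch shows that loop with a random search over one continuous parameter. It runs trials one at a time in plain Python and does not reflect how the Tuner schedules work across resources; run_trials and objective are hypothetical names.

```python
import random


def objective(config):
    # Toy metric that peaks when learning_rate is 0.1.
    return 1.0 - abs(config["learning_rate"] - 0.1)


def run_trials(search_space, num_trials, seed=0):
    rng = random.Random(seed)
    results = []
    for trial_id in range(num_trials):
        # Sample each parameter uniformly from its (lower, upper) range.
        config = {
            name: rng.uniform(lower, upper)
            for name, (lower, upper) in search_space.items()
        }
        results.append(
            {"trial": trial_id, "config": config, "score": objective(config)}
        )
    # Higher is better for this toy metric.
    best = max(results, key=lambda result: result["score"])
    return results, best


results, best = run_trials({"learning_rate": (0.01, 0.3)}, num_trials=5)
print(len(results))  # 5
print(best["config"], best["score"])
```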

Get the results of the training job

After all trials are completed, the TunerResults object consolidates the outcomes of each trial. It provides structured access to the performance metrics, the best configuration, and the best model.

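The following are the available attributes of the object:

  • results: A pandas DataFrame that contains the evaluation metrics and configuration parameters of each trial.
  • best_result: A DataFrame row that summarizes the best-performing trial.
  • best_model: The model instance associated with the best trial, if applicable.

The following code gets the results, the best model, and the best result: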

print(tuner_results.results)
print(tuner_results.best_model)
print(tuner_results.best_result)

API reference

Tuner

The following is the import statement for the Tuner module:

from snowflake.ml.modeling.tune import Tuner

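The Tuner class is the main interface for interacting with the Container Runtime HPO API. To run an HPO job, initialize a Tuner object and call its run method with the datasets, as shown in the following code.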

class Tuner:
  def __init__(
      self,
      train_func: Callable,
      search_space: SearchSpace,
      tuner_config: TunerConfig,
  )

  def run(
      self, dataset_map: Optional[Dict[str, DataConnector]] = None
  ) -> TunerResults

SearchSpace

The following is the import statement for the search space functions:

from snowflake.ml.modeling.tune import uniform, choice, loguniform, randint

The following code defines the search space functions:

def uniform(lower: float, upper: float) -> float:
    """
    Sample a float value uniformly between lower and upper.

    Use for parameters where all values in range are equally likely to be optimal.
    Examples: dropout rates (0.1 to 0.5), batch normalization momentum (0.1 to 0.9).
    """


def loguniform(lower: float, upper: float) -> float:
    """
    Sample a float value uniformly in log space between lower and upper.

    Use for parameters spanning several orders of magnitude.
    Examples: learning rates (1e-5 to 1e-1), regularization strengths (1e-4 to 1e-1).
    """


def randint(lower: int, upper: int) -> int:
    """
    Sample an integer value uniformly between lower(inclusive) and upper(exclusive).

    Use for discrete parameters with a range of values.
    Examples: number of layers, number of epochs, number of estimators.
    """



def choice(options: List[Union[float, int, str]]) -> Union[float, int, str]:
    """
    Sample a value uniformly from the given options.

    Use for categorical parameters or discrete options.
    Examples: activation functions ['relu', 'tanh', 'sigmoid']
    """

TunerConfig

The following is the import statement for the TunerConfig module:

from snowflake.ml.modeling.tune import TunerConfig

The following code defines the configuration class for the tuner:

class TunerConfig:
  """
  Configuration class for the tuning process.

  Attributes:
    metric (str): The name of the metric to optimize. This should correspond
        to a key in the metrics dictionary reported by the training function.

    mode (str): The optimization mode for the metric. Must be either "min"
        for minimization or "max" for maximization.

    search_alg (SearchAlgorithm): The search algorithm to use for
        exploring the hyperparameter space. Defaults to random search.

    num_trials (int): The maximum number of parameter configurations to
        try. Defaults to 5.

    max_concurrent_trials (Optional[int]): The maximum number of concurrently
        running trials per node. If not specified, it defaults to the total
        number of nodes in the cluster. This value must be a positive
        integer if provided.

  Example:
      >>> from snowflake.ml.modeling.tune import TunerConfig
      >>> config = TunerConfig(
      ...     metric="accuracy",
      ...     mode="max",
      ...     num_trials=5,
      ...     max_concurrent_trials=1
      ... )
  """

SearchAlgorithm

The following is the import statement for the search algorithms:

from snowflake.ml.modeling.tune.search import BayesOpt, RandomSearch, GridSearch

The following code defines the Bayesian optimization search algorithm class:

@dataclass
class BayesOpt():
    """
    Bayesian Optimization class that encapsulates parameters for the acquisition function.

    This class is designed to facilitate Bayesian optimization by configuring
    the acquisition function through a dictionary of keyword arguments.

    Attributes:
        utility_kwargs (Optional[Dict[str, Any]]):
            A dictionary specifying parameters for the utility (acquisition) function.
            If not provided, it defaults to:
                {
                    'kind': 'ucb',   # Upper Confidence Bound acquisition strategy
                    'kappa': 2.576,  # Exploration parameter for UCB
                    'xi': 0.0      # Exploitation parameter
                }
    """
    utility_kwargs: Optional[Dict[str, Any]] = None

The following code defines the random search algorithm class:

@dataclass
class RandomSearch():
    """
    The default and most basic way to do hyperparameter search is via random search.

    Attributes:
        random_state: Seed or NumPy random generator for reproducible results.
            If set to None (default), the global generator (np.random) is used.
    """
    random_state: Optional[int] = None

TunerResults

The following code defines the TunerResults class:

@dataclass
class TunerResults:
    results: pd.DataFrame
    best_result: pd.DataFrame
    best_model: Optional[Any]

get_tuner_context

The following is the import statement for the get_tuner_context function:

from snowflake.ml.modeling.tune import get_tuner_context

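This helper method is designed to be called inside the training function. It returns a TunerContext object that encapsulates several fields that are useful for running a trial, including:

  • The hyperparameters that the HPO framework selected for the current trial.
  • The datasets needed for training.
  • A helper function for reporting metrics, which guides the HPO framework in suggesting the next set of hyperparameters.

The following code defines the TunerContext class: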

class TunerContext:
    """
    A centralized context class for managing trial configuration, reporting, and dataset information.
    """

    def get_hyper_params(self) -> Dict[str, Any]:
        """
        Retrieve the configuration dictionary.

        Returns:
            Dict[str, Any]: The configuration dictionary for the trial.
        """
        return self._hyper_params

    def report(self, metrics: Dict[str, Any], model: Optional[Any] = None) -> None:
        """
        Report metrics and optionally the model if provided.

        This method is used to report the performance metrics of a model and, if provided, the model itself.
        The reported metrics will be used to guide the next set of hyperparameters selection in the
        optimization process.

        Args:
            metrics (Dict[str, Any]): A dictionary containing the performance metrics of the model.
                The keys are metric names, and the values are the corresponding metric values.
            model (Optional[Any], optional): The trained model to be reported. Defaults to None.

        Returns:
            None: This method doesn't return anything.
        """

    def get_dataset_map(self) -> Optional[Dict[str, Type[DataConnector]]]:
        """
        Retrieve the dataset mapping.

        Returns:
            Optional[Dict[str, Type[DataConnector]]]: A mapping of dataset names to DataConnector types, if available.
        """
        return self._dataset_map

Limitations

Bayesian optimization requires continuous search spaces and works only with the uniform sampling function. It is incompatible with discrete parameters sampled using the tune.randint or tune.choice methods. To work around this limitation, either use tune.uniform and cast the parameter inside the training function, or switch to a search algorithm that handles both discrete and continuous spaces, such as RandomSearch.
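The first workaround can look like the following inside a training function. This sketch shows only the casting step, applied to a plain dictionary of sampled values; the helper cast_hyper_params, the parameter names, and the sampled numbers are illustrative assumptions.

```python
def cast_hyper_params(config):
    """Convert continuous samples into the discrete values a model expects."""
    return {
        # Round to the nearest integer so that, for example, 7.6 becomes 8
        # rather than being truncated to 7.
        "n_estimators": int(round(config["n_estimators"])),
        "max_depth": int(round(config["max_depth"])),
        # Already continuous, so it is passed through unchanged.
        "learning_rate": config["learning_rate"],
    }


# Values as they might arrive from tune.uniform (illustrative numbers).
sampled = {"n_estimators": 127.6, "max_depth": 7.6, "learning_rate": 0.042}
print(cast_hyper_params(sampled))
# {'n_estimators': 128, 'max_depth': 8, 'learning_rate': 0.042}
```

A categorical parameter can be handled in a similar spirit by sampling a continuous index and using the cast value to look up an entry in a list of options.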

Troubleshooting

Error message: Invalid search space configuration: BayesOpt requires all sampling functions to be of type ‘Uniform’.
Possible causes: Bayesian optimization works only with uniform sampling, not with discrete samples. (See Limitations above.)
Possible solutions:
  • Use tune.uniform and cast the result in your training function.
  • Switch to the RandomSearch algorithm, which accepts both discrete and non-discrete samples.

Error message: Insufficient CPU resources. Required: 16, Available: 8. (The numbers of required and available resources may differ.)
Possible causes: max_concurrent_trials is set to a value higher than the available cores.
Possible solutions: Follow the guidance provided by the error message.

Error message: Insufficient GPU resources. Required: 4, Available: 2. (May refer to CPU or GPU. The numbers of required and available resources may differ.)
Possible causes: max_concurrent_trials is set to a value higher than the available cores.
Possible solutions: Follow the guidance provided by the error message.

Next steps