Parallel Hyperparameter Optimization (HPO) on Container Runtime for ML¶

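The Snowflake ML Hyperparameter Optimization (HPO) API is a model-agnostic framework for efficient, parallelized hyperparameter tuning. You can use any open-source framework or algorithm. You can also use the Snowflake ML APIs.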

You can use HPO in a Snowflake Notebook that's configured to use the Container Runtime on Snowpark Container Services (SPCS). After you create such a notebook, you can:

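Train a model using any open-source package, and use this API to distribute the hyperparameter tuning process
Train a model using the Snowflake ML distributed training APIs, and scale HPO while also scaling each training run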

The HPO workload that you initiate from your notebook executes inside Snowpark Container Services on either CPU or GPU instances. The workload scales out to the CPU or GPU cores that are available on a single node in the SPCS compute pool.

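The parallelized HPO API provides the following benefits:

A single API that automatically handles the complexity of distributing training across multiple resources
The ability to train with virtually any framework or algorithm by using open-source ML frameworks or the Snowflake ML modeling APIs
A selection of tuning and sampling options, including Bayesian and random search algorithms, along with various continuous and non-continuous sampling functions
Tight integration with the rest of Snowflake; for example, efficient data ingestion through Snowflake Datasets or DataFrames, and automatic ML lineage capture

Note

You can scale an HPO run to use multiple nodes in the SPCS compute pool. For more information, see Running a workload on a multi-node cluster.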

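Optimize the hyperparameters of a model¶

Use the Snowflake ML HPO API to tune a model. The following steps illustrate the process:

Ingest the data.
Use a search algorithm to define the strategy for optimizing the hyperparameters.
Define how the hyperparameters are sampled.
Configure the tuner.
Get the hyperparameters and training metrics from each training job.
Initiate the training job.
Get the results of the training job.

The following sections walk through the preceding steps. For an example, see Container Runtime HPO Example (https://github.com/Snowflake-Labs/sf-samples/blob/main/samples/ml/container_runtime_hpo/hpo_example.ipynb).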

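Ingest the data¶

Use the dataset_map object to ingest the data into the HPO API. The dataset_map object is a dictionary that pairs each training or test dataset with its corresponding Snowflake DataConnector object. The dataset_map object is passed to the training function. The following is an example of a dataset_map object:

from snowflake.ml.data.data_connector import DataConnector

# Assumes an active Snowpark session and existing X_train and X_test data.
dataset_map = {
  "train": DataConnector.from_dataframe(session.create_dataframe(X_train)),
  "test": DataConnector.from_dataframe(session.create_dataframe(X_test)),
}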


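Define the search algorithm¶

Define the search algorithm used to explore the hyperparameter space. Depending on the algorithm, it can use the results of previous trials to decide how to configure the hyperparameters. You can use the following search algorithms:

Grid search

Explores a grid of hyperparameter values that you define. The HPO API evaluates every possible combination of hyperparameters. The following is an example of a hyperparameter grid:
```
search_space = {
    "n_estimators": [50, 51],
    "max_depth": [4, 5],
    "learning_rate": [0.01, 0.3],
}
```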
In the preceding example, each parameter has two possible values, so there are 8 (2 * 2 * 2) possible combinations of hyperparameters.
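Bayesian optimization

Uses a probabilistic model to determine the next set of hyperparameters to evaluate. The algorithm uses the results of previous trials to decide how to configure the hyperparameters. For more information about Bayesian optimization, see Bayesian Optimization (https://github.com/bayesian-optimization/BayesianOptimization).
Random search

Randomly samples the hyperparameter space. This approach is simple and effective, and it works particularly well with large or mixed (continuous or discrete) search spaces.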
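
To make the grid search combination count concrete, the following sketch expands the grid from the grid search example using only the Python standard library. It illustrates how a grid multiplies out, not how the HPO API enumerates it internally; the names `names` and `combinations` are introduced here for illustration.

```python
from itertools import product

# Same grid as the grid search example above.
search_space = {
    "n_estimators": [50, 51],
    "max_depth": [4, 5],
    "learning_rate": [0.01, 0.3],
}

# Build one dict per combination: the Cartesian product of all value lists.
names = list(search_space)
combinations = [
    dict(zip(names, values)) for values in product(*search_space.values())
]

print(len(combinations))  # 8
print(combinations[0])    # {'n_estimators': 50, 'max_depth': 4, 'learning_rate': 0.01}
```

Because the count is the product of the list lengths, adding a third value to each of the three parameters would raise it from 8 to 27, which is worth checking before you set the number of trials.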

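You can use the following code to define the search algorithm:

from snowflake.ml.modeling.tune.search import BayesOpt, RandomSearch, GridSearch

# Each line shows one option. Keep only the algorithm that you want to use.
search_alg = BayesOpt()
search_alg = RandomSearch()
search_alg = GridSearch()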


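Define hyperparameter sampling¶

Use the search space functions to define how hyperparameters are sampled in each trial. Use them to describe the range and type of values that each hyperparameter can take.

The following sampling functions are available:

uniform(lower, upper): Samples a continuous value uniformly between lower and upper. Useful for parameters such as dropout rates or regularization strengths.
loguniform(lower, upper): Samples a value in logarithmic space. Well suited to parameters that span several orders of magnitude, such as learning rates.
randint(lower, upper): Samples an integer uniformly between lower (inclusive) and upper (exclusive). Suitable for discrete parameters, such as the number of layers.
choice(options): Randomly selects a value from the provided list. Often used for categorical parameters.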
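
As a rough illustration of what each sampling function draws, the following sketch reimplements the four distributions with Python's standard random module. The `sample_*` functions are hypothetical local stand-ins written for this example; they aren't the Snowflake implementations, and the real functions describe a search space rather than return a value immediately.

```python
import math
import random

rng = random.Random(0)  # seeded so repeated runs draw the same values


def sample_uniform(lower: float, upper: float) -> float:
    """Continuous value; every point between lower and upper is equally likely."""
    return rng.uniform(lower, upper)


def sample_loguniform(lower: float, upper: float) -> float:
    """Uniform in log space, so each order of magnitude is equally likely."""
    return math.exp(rng.uniform(math.log(lower), math.log(upper)))


def sample_randint(lower: int, upper: int) -> int:
    """Integer in [lower, upper): lower is included, upper is excluded."""
    return rng.randrange(lower, upper)


def sample_choice(options: list):
    """One element picked at random from options."""
    return rng.choice(options)


trial = {
    "dropout": sample_uniform(0.1, 0.5),
    "learning_rate": sample_loguniform(1e-5, 1e-1),
    "num_layers": sample_randint(1, 5),
    "activation": sample_choice(["relu", "tanh", "sigmoid"]),
}
print(trial)
```

The log-space draw is the one that most often surprises: with plain uniform sampling between 1e-5 and 1e-1, about 90% of the draws would land above 1e-2, whereas the log-uniform draw spreads them evenly across the four decades.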

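The following example shows how to define a search space using the uniform function:

from snowflake.ml.modeling import tune

# uniform samples floating-point values; cast integer-valued parameters
# such as n_estimators and max_depth in the training function.
search_space = {
    "n_estimators": tune.uniform(50, 200),
    "max_depth": tune.uniform(3, 10),
    "learning_rate": tune.uniform(0.01, 0.3),
}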


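Configure the tuner¶

Use the TunerConfig object to configure the tuner. In this object, you specify the metric to optimize, the optimization mode, and the other execution parameters. The following configuration options are available:

Metric: The performance metric that you're optimizing, such as accuracy or loss.
Mode: Determines whether the objective is to maximize or minimize the metric ("max" or "min").
Search algorithm: Specifies the strategy for exploring the hyperparameter space.
Number of trials: Sets the total number of hyperparameter configurations to evaluate.
Concurrency: Defines how many trials can run at the same time.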
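
The way the metric and mode options work together can be sketched in plain Python. The `pick_best` helper and the trial records below are hypothetical and made up for illustration; the sketch only shows the selection rule that "max" and "min" imply, not the tuner's own code.

```python
from typing import Any, Dict, List


def pick_best(trials: List[Dict[str, Any]], metric: str, mode: str) -> Dict[str, Any]:
    """Return the trial whose metric is highest ("max") or lowest ("min")."""
    if mode not in ("min", "max"):
        raise ValueError('mode must be "min" or "max"')
    selector = max if mode == "max" else min
    return selector(trials, key=lambda trial: trial[metric])


# Made-up trial records, for illustration only.
trials = [
    {"max_depth": 4, "accuracy": 0.81, "loss": 0.52},
    {"max_depth": 6, "accuracy": 0.86, "loss": 0.41},
    {"max_depth": 8, "accuracy": 0.84, "loss": 0.39},
]

best_by_accuracy = pick_best(trials, metric="accuracy", mode="max")
best_by_loss = pick_best(trials, metric="loss", mode="min")
print(best_by_accuracy["max_depth"], best_by_loss["max_depth"])  # 6 8
```

The two calls pick different trials from the same records, which is why the metric name must match a key that the training function reports and the mode must match the direction in which that metric improves.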

The following example code uses Bayesian optimization to maximize the accuracy of a model over five trials.

from snowflake.ml.modeling import tune
from snowflake.ml.modeling.tune.search import BayesOpt

tuner_config = tune.TunerConfig(
  metric="accuracy",
  mode="max",
  search_alg=BayesOpt(
      utility_kwargs={"kind": "ucb", "kappa": 2.5, "xi": 0.0}
  ),
  num_trials=5,
  max_concurrent_trials=1,
)


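Get the hyperparameters and training metrics¶

The Snowflake ML HPO API requires the training metrics and hyperparameters from each training run to optimize the hyperparameters effectively. Use the TunerContext object to get the hyperparameters and report the training metrics. The following example creates a training function that gets the hyperparameters and reports the training metrics:

from snowflake.ml.modeling.tune import get_tuner_context

def train_func():
  tuner_context = get_tuner_context()
  config = tuner_context.get_hyper_params()  # hyperparameters chosen for this trial
  dm = tuner_context.get_dataset_map()       # the dataset_map passed to the tuner
  ...
  tuner_context.report(metrics={"accuracy": accuracy}, model=model)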
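
One way to keep a training function easy to test is to put the trial logic in a plain function and leave only the HPO calls in train_func. The following is a minimal sketch under several assumptions: `run_trial`, `knn_predict`, the toy one-dimensional nearest-neighbor model, and the column names "X" and "Y" are all hypothetical, and the use of `to_pandas()` on each DataConnector is an assumption about how you read the data rather than something the text above prescribes.

```python
from typing import Any, Dict, List, Tuple

Row = Tuple[float, int]  # (feature value, label): a toy data layout


def knn_predict(train: List[Row], x: float, k: int) -> int:
    """Majority vote among the k training rows closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0


def run_trial(config: Dict[str, Any], train: List[Row], test: List[Row]) -> Dict[str, float]:
    """Evaluate one hyperparameter configuration and return its metrics."""
    k = int(round(config["n_neighbors"]))  # cast in case the value was sampled as a float
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return {"accuracy": correct / len(test)}


def train_func() -> None:
    # Imported here so this module also loads where snowflake-ml-python is absent.
    from snowflake.ml.modeling.tune import get_tuner_context

    tuner_context = get_tuner_context()
    config = tuner_context.get_hyper_params()
    dm = tuner_context.get_dataset_map()
    # Assumption: each connector yields a pandas DataFrame with
    # hypothetical columns "X" (feature) and "Y" (label).
    train_df = dm["train"].to_pandas()
    test_df = dm["test"].to_pandas()
    train = list(zip(train_df["X"], train_df["Y"]))
    test = list(zip(test_df["X"], test_df["Y"]))
    tuner_context.report(metrics=run_trial(config, train, test))


# Local check of the trial logic, with no tuner involved.
train_rows = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
test_rows = [(0.15, 0), (0.85, 1)]
print(run_trial({"n_neighbors": 3.0}, train_rows, test_rows))  # {'accuracy': 1.0}
```

With this split, you can unit test run_trial on small in-memory data before spending compute pool time, and the key in the returned dictionary stays aligned with the metric name that you give to the tuner configuration.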


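Initiate the training job¶

Use the Tuner object to initiate the training job. The Tuner object takes the training function, search space, and tuner configuration as arguments. The following example shows how to initiate the training job: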

from snowflake.ml.modeling import tune
tuner = tune.Tuner(train_func, search_space, tuner_config)
tuner_results = tuner.run(dataset_map=dataset_map)


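The preceding code distributes the training function across the available resources. It collects and summarizes the trial results and identifies the best-performing configuration.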
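
For intuition about what running trials with a concurrency limit looks like, the following is a toy, single-machine analogue built on the standard library. It isn't how the Tuner is implemented: `objective`, `run_trials`, and the thread pool are assumptions made for this sketch, and the objective is a stand-in formula rather than real training.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def objective(config: Dict[str, float]) -> float:
    """Toy metric that peaks at learning_rate = 0.1 (a stand-in for training)."""
    return 1.0 - abs(config["learning_rate"] - 0.1)


def run_trials(
    num_trials: int, max_concurrent_trials: int, seed: int = 0
) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    rng = random.Random(seed)
    # Random search needs no feedback, so every configuration is drawn up front.
    configs = [{"learning_rate": rng.uniform(0.01, 0.3)} for _ in range(num_trials)]
    # At most max_concurrent_trials trials run at the same time.
    with ThreadPoolExecutor(max_workers=max_concurrent_trials) as pool:
        scores = list(pool.map(objective, configs))
    results = [dict(cfg, accuracy=score) for cfg, score in zip(configs, scores)]
    best = max(results, key=lambda row: row["accuracy"])
    return results, best


results, best_result = run_trials(num_trials=5, max_concurrent_trials=2)
print(len(results))  # 5
print(best_result)
```

The sketch mirrors the three things the text describes: fan the trials out, gather one row of metrics and configuration per trial, and pick the best row. A feedback-driven algorithm such as Bayesian optimization couldn't draw all configurations up front like this, because each suggestion depends on earlier results.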

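Get the results of the training job¶

After all the trials finish, the TunerResults object consolidates the results of each trial. It provides structured access to the performance metrics, the best configuration, and the best model.

The following are the available attributes of the object:

results: A Pandas DataFrame that contains the evaluation metrics and configuration parameters for every trial.
best_result: A DataFrame row that summarizes the best-performing trial.
best_model: The model instance associated with the best trial, if applicable.

The following code prints the results, the best model, and the best result: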

print(tuner_results.results)
print(tuner_results.best_model)
print(tuner_results.best_result)


API reference¶

Tuner¶

The following is the import statement for the Tuner class:

from snowflake.ml.modeling.tune import Tuner


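The Tuner class is the main interface for interacting with the Container Runtime HPO API. To run an HPO job, initialize a Tuner object and call its run method with the datasets. The class has the following outline: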

class Tuner:
  def __init__(
      self,
      train_func: Callable,
      search_space: SearchSpace,
      tuner_config: TunerConfig,
  )

  def run(
      self, dataset_map: Optional[Dict[str, DataConnector]] = None
  ) -> TunerResults


SearchSpace¶

The following is the import statement for the search space functions:

from snowflake.ml.modeling.tune import uniform, choice, loguniform, randint


The following code defines the search space functions:

def uniform(lower: float, upper: float) -> float:
    """
    Sample a float value uniformly between lower and upper.

    Use for parameters where all values in range are equally likely to be optimal.
    Examples: dropout rates (0.1 to 0.5), batch normalization momentum (0.1 to 0.9).
    """


def loguniform(lower: float, upper: float) -> float:
    """
    Sample a float value uniformly in log space between lower and upper.

    Use for parameters spanning several orders of magnitude.
    Examples: learning rates (1e-5 to 1e-1), regularization strengths (1e-4 to 1e-1).
    """


def randint(lower: int, upper: int) -> int:
    """
    Sample an integer value uniformly between lower(inclusive) and upper(exclusive).

    Use for discrete parameters with a range of values.
    Examples: number of layers, number of epochs, number of estimators.
    """



def choice(options: List[Union[float, int, str]]) -> Union[float, int, str]:
    """
    Sample a value uniformly from the given options.

    Use for categorical parameters or discrete options.
    Examples: activation functions ['relu', 'tanh', 'sigmoid']
    """


TunerConfig¶

The following is the import statement for the TunerConfig class:

from snowflake.ml.modeling.tune import TunerConfig


The following code defines the configuration class for the tuner:

class TunerConfig:
  """
  Configuration class for the tuning process.

  Attributes:
    metric (str): The name of the metric to optimize. This should correspond
        to a key in the metrics dictionary reported by the training function.

    mode (str): The optimization mode for the metric. Must be either "min"
        for minimization or "max" for maximization.

    search_alg (SearchAlgorithm): The search algorithm to use for
        exploring the hyperparameter space. Defaults to random search.

    num_trials (int): The maximum number of parameter configurations to
        try. Defaults to 5.

    max_concurrent_trials (Optional[int]): The maximum number of concurrently
        running trials per node. If not specified, it defaults to the total
        number of nodes in the cluster. This value must be a positive
        integer if provided.


  Example:
      >>> from snowflake.ml.modeling.tune import  TunerConfig
      >>> config = TunerConfig(
      ...     metric="accuracy",
      ...     mode="max",
      ...     num_trials=5,
      ...     max_concurrent_trials=1
      ... )
  """


SearchAlgorithm¶

The following is the import statement for the search algorithms:

from snowflake.ml.modeling.tune.search import BayesOpt, RandomSearch, GridSearch


The following code defines the Bayesian optimization search algorithm class:

@dataclass
class BayesOpt():
    """
    Bayesian Optimization class that encapsulates parameters for the acquisition function.

    This class is designed to facilitate Bayesian optimization by configuring
    the acquisition function through a dictionary of keyword arguments.

    Attributes:
        utility_kwargs (Optional[Dict[str, Any]]):
            A dictionary specifying parameters for the utility (acquisition) function.
            If not provided, it defaults to:
                {
                    'kind': 'ucb',   # Upper Confidence Bound acquisition strategy
                    'kappa': 2.576,  # Exploration parameter for UCB
                    'xi': 0.0      # Exploitation parameter
                }
    """
    utility_kwargs: Optional[Dict[str, Any]] = None
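
The 'ucb' kind named in the defaults refers to the Upper Confidence Bound acquisition strategy. In its common textbook form, it scores a candidate as the surrogate model's predicted mean plus kappa times the predicted standard deviation, so xi plays no part in this particular formula. The following sketch shows how kappa shifts the balance between exploitation and exploration; `ucb_score` and the candidate numbers are hypothetical and aren't taken from the library's code.

```python
def ucb_score(mean: float, std: float, kappa: float = 2.576) -> float:
    """Upper Confidence Bound: predicted mean plus kappa times predicted std."""
    return mean + kappa * std


# Two hypothetical candidates as (predicted mean, predicted standard deviation).
well_explored = (0.85, 0.01)  # higher predicted score, little uncertainty
unexplored = (0.80, 0.05)     # lower predicted score, more uncertainty

for kappa in (0.5, 2.576):
    scores = [ucb_score(mean, std, kappa) for mean, std in (well_explored, unexplored)]
    winner = "well_explored" if scores[0] > scores[1] else "unexplored"
    print(kappa, winner)
```

With the small kappa, the well-explored candidate scores higher; with the default kappa of 2.576, the uncertain candidate wins. A larger kappa therefore pushes the search toward regions that it knows less about.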


The following code defines the random search algorithm class:

@dataclass
class RandomSearch():
    """
    The default and most basic way to do hyperparameter search is via random search.

    Attributes:
        random_state: Seed or NumPy random generator for reproducible results.
            If set to None (default), the global generator (np.random) is used.
    """
    random_state: Optional[int] = None


TunerResults¶

The following code defines the TunerResults class:

@dataclass
class TunerResults:
    results: pd.DataFrame
    best_result: pd.DataFrame
    best_model: Optional[Any]


get_tuner_context¶

The following is the import statement for the get_tuner_context function:

from snowflake.ml.modeling.tune import get_tuner_context


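This helper method is designed to be called from within the training function. It returns a TunerContext object, which encapsulates several fields that are useful for running a trial, including:

The hyperparameters that the HPO framework selected for the current trial.
The datasets needed for training.
A helper function for reporting metrics, which guides the HPO framework in suggesting the next set of hyperparameters.

The following code defines the tuner context class: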

class TunerContext:
    """
    A centralized context class for managing trial configuration, reporting, and dataset information.
    """

    def get_hyper_params(self) -> Dict[str, Any]:
        """
        Retrieve the configuration dictionary.

        Returns:
            Dict[str, Any]: The configuration dictionary for the trial.
        """
        return self._hyper_params

    def report(self, metrics: Dict[str, Any], model: Optional[Any] = None) -> None:
        """
        Report metrics and optionally the model if provided.

        This method is used to report the performance metrics of a model and, if provided, the model itself.
        The reported metrics will be used to guide the next set of hyperparameters selection in the
        optimization process.

        Args:
            metrics (Dict[str, Any]): A dictionary containing the performance metrics of the model.
                The keys are metric names, and the values are the corresponding metric values.
            model (Optional[Any], optional): The trained model to be reported. Defaults to None.

        Returns:
            None: This method doesn't return anything.
        """

    def get_dataset_map(self) -> Optional[Dict[str, Type[DataConnector]]]:
        """
        Retrieve the dataset mapping.

        Returns:
            Optional[Dict[str, Type[DataConnector]]]: A mapping of dataset names to DataConnector types, if available.
        """
        return self._dataset_map


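Limitations¶

Bayesian optimization requires a continuous search space and works only with the uniform sampling function. It's incompatible with discrete parameters sampled using the tune.randint or tune.choice methods. To work around this limitation, either use tune.uniform and cast the parameter inside the training function, or switch to a search algorithm that handles both discrete and continuous spaces, such as RandomSearch.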
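
The casting workaround can be sketched as a small decoding step at the top of the training function. The helper name `decode_hyper_params`, the parameter names, and the `ACTIVATIONS` list are hypothetical; the sketch assumes a search space defined entirely with tune.uniform, for example tune.uniform(3, 10) for max_depth and tune.uniform(0, 3) for an index into the categorical options.

```python
from typing import Any, Dict

ACTIVATIONS = ["relu", "tanh", "sigmoid"]  # hypothetical categorical options


def decode_hyper_params(raw: Dict[str, float]) -> Dict[str, Any]:
    """Convert continuous samples into the discrete values a model expects."""
    # Clamp the index so that a sample equal to the upper bound stays in range.
    index = min(int(raw["activation_index"]), len(ACTIVATIONS) - 1)
    return {
        "max_depth": int(round(raw["max_depth"])),  # float -> nearest integer
        "activation": ACTIVATIONS[index],           # float -> category
        "learning_rate": raw["learning_rate"],      # already continuous
    }


# Example of values that a uniform-only search space might produce.
raw_sample = {"max_depth": 6.7, "activation_index": 1.4, "learning_rate": 0.05}
print(decode_hyper_params(raw_sample))
# {'max_depth': 7, 'activation': 'tanh', 'learning_rate': 0.05}
```

Keep in mind that the optimizer still sees the raw floating-point values, so several nearby samples can decode to the same discrete setting. Treating an unordered category as a number is also only a rough fit for a model that assumes smoothness, which is one reason to prefer random search when most parameters are categorical.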

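Troubleshooting¶

Error message	Possible causes	Possible solutions
Invalid search space configuration: BayesOpt requires all sampling functions to be of type 'Uniform'.	Bayesian optimization works only with uniform sampling, not with discrete sampling. (See Limitations above.)	Use `tune.uniform` and cast the result in your training function, or switch to the `RandomSearch` algorithm, which accepts both discrete and continuous samples.
Insufficient CPU resources. Required: 16, Available: 8. (The required and available numbers in your message might differ.)	`max_concurrent_trials` is set to a value higher than the number of available cores.	Follow the guidance provided by the error message.
Insufficient GPU resources. Required: 4, Available: 2. (The message can refer to CPU or GPU. The required and available numbers in your message might differ.)	`max_concurrent_trials` is set to a value higher than the number of available cores.	Follow the guidance provided by the error message.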
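
One way to avoid the insufficient-resources errors is to cap the requested concurrency before you build the tuner configuration. The following is a minimal sketch, assuming that each trial needs a fixed number of cores and that os.cpu_count() reflects the cores the node exposes; the helper `safe_max_concurrent_trials` and its `cores_per_trial` parameter are hypothetical. The "Available" number in the error message remains the authoritative figure if the two disagree.

```python
import os


def safe_max_concurrent_trials(requested: int, cores_per_trial: int = 1) -> int:
    """Cap the requested concurrency at what the local core count supports."""
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(requested, available // cores_per_trial))


# Pass the result as max_concurrent_trials when you create the tuner configuration.
print(safe_max_concurrent_trials(16))
```

The result is never lower than 1 and never higher than the requested value, so the helper only ever reduces concurrency.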

Next steps¶

Container Runtime HPO Example (https://github.com/Snowflake-Labs/sf-samples/blob/main/samples/ml/container_runtime_hpo/hpo_example.ipynb)