Snowpark Checkpoints library: Hypothesis

Hypothesis unit tests

Hypothesis is a powerful testing library for Python designed to enhance traditional unit testing by automatically generating a wide range of input data. It uses property-based testing: instead of specifying individual test cases, you describe the expected behavior of your code with properties or conditions, and Hypothesis generates examples to test those properties thoroughly. This approach helps uncover edge cases and unexpected behaviors, making it especially effective for complex functions. For more information, see Hypothesis (https://hypothesis.readthedocs.io/en/latest/).
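The idea behind property-based testing can be illustrated without any library at all. The following stdlib-only sketch (the run_property_test helper is an illustrative invention, not part of Hypothesis) generates many random inputs and asserts that a property holds for each one, which is essentially what Hypothesis automates, along with smarter generation and shrinking of failing examples:

```python
import random

def run_property_test(property_fn, generate_input, n_examples=200):
    """Minimal sketch of property-based testing: rather than a few
    hand-picked cases, generate many random inputs and check that the
    property holds for every one of them."""
    for _ in range(n_examples):
        value = generate_input()
        assert property_fn(value), f"property failed for input: {value!r}"

# Property under test: sorting preserves length and is idempotent.
def sorted_preserves_length(xs):
    return len(sorted(xs)) == len(xs) and sorted(sorted(xs)) == sorted(xs)

run_property_test(
    sorted_preserves_length,
    lambda: [random.randint(-100, 100) for _ in range(random.randint(0, 20))],
)
```

A real Hypothesis test replaces the hand-rolled generator with a declarative strategy and the loop with the @given decorator.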

The snowpark-checkpoints-hypothesis package extends the Hypothesis library to generate synthetic Snowpark DataFrames for testing purposes. By leveraging Hypothesis' ability to generate diverse and randomized test data, you can create Snowpark DataFrames with varied schemas and values to simulate real-world scenarios and uncover edge cases, ensuring robust code and verifying the correctness of complex transformations.

The Hypothesis strategy for Snowpark relies on Pandera to generate synthetic data. The dataframe_strategy function uses the specified schema to generate a Pandas DataFrame that conforms to the schema and then converts it into a Snowpark DataFrame.
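Conceptually, the schema-driven generation step can be sketched with a plain-Python stand-in. The GENERATORS table and generate_rows helper below are illustrative inventions, not part of the package's API; they mimic how a strategy produces rows whose values conform to a column-to-type mapping before the result is converted into a DataFrame:

```python
import random

# Hypothetical per-dtype value generators; real strategies draw from
# Pandera/Hypothesis rather than the random module.
GENERATORS = {
    "int64": lambda: random.randint(-2**31, 2**31),
    "float64": lambda: random.uniform(-1e6, 1e6),
    "bool": lambda: random.choice([True, False]),
    "string": lambda: "".join(random.choices("abcdefgh", k=5)),
}

def generate_rows(schema, size):
    """Return `size` rows (dicts) whose values match the column -> dtype
    mapping given in `schema`."""
    return [
        {col: GENERATORS[dtype]() for col, dtype in schema.items()}
        for _ in range(size)
    ]

rows = generate_rows({"id": "int64", "active": "bool"}, size=3)
```

In the real package, this role is played by Pandera's schema-aware strategies, and the generated Pandas DataFrame is then turned into a Snowpark DataFrame using the provided session.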

Function signature

def dataframe_strategy(
  schema: Union[str, DataFrameSchema],
  session: Session,
  size: Optional[int] = None
) -> SearchStrategy[DataFrame]

Function parameters

schema: The schema of the DataFrame to generate. It can be the path to a JSON schema file generated by the collect_dataframe_checkpoint function of the snowpark-checkpoints-collectors package, or a Pandera DataFrameSchema object.

session: The Snowpark session used to create the DataFrames.

size: The number of rows to generate for each DataFrame. If not provided, the strategy generates DataFrames of different sizes.

Function output

Returns a Hypothesis SearchStrategy (https://github.com/HypothesisWorks/hypothesis/blob/904bdd967ca9ff23475aa6abe860a30925149da7/hypothesis-python/src/hypothesis/strategies/_internal/strategies.py#L221) that generates Snowpark DataFrames.

Supported and unsupported data types

The dataframe_strategy function supports generating Snowpark DataFrames with different data types. The data types supported by the strategy vary depending on the type of the schema argument passed to the function. Note that the strategy raises an exception if it encounters an unsupported data type.

The following table shows the PySpark data types supported and not supported by the dataframe_strategy function when a JSON file is passed as the schema argument.

PySpark data type

Supported

Array (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.ArrayType.html)

Boolean (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.BooleanType.html)

Char (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.CharType.html)

Date (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.DateType.html)

DayTimeIntervalType (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.DayTimeIntervalType.html)

Decimal (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.DecimalType.html)

Map (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.MapType.html)

Null (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.NullType.html)

Byte (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.ByteType.html), Short (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.ShortType.html), Integer (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.IntegerType.html), Long (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.LongType.html), Float (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.FloatType.html), Double (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.DoubleType.html)

String (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.StringType.html)

Struct (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.StructType.html)

Timestamp (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.TimestampType.html)

TimestampNTZ (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.TimestampNTZType.html)

Varchar (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.VarcharType.html)

YearMonthIntervalType (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.YearMonthIntervalType.html)

The following table shows the Pandera data types supported by the dataframe_strategy function when a DataFrameSchema object is passed as the schema argument, and the Snowpark data types they map to.

Pandera data type      Snowpark data type
int8                   ByteType
int16                  ShortType
int32                  IntegerType
int64                  LongType
float32                FloatType
float64                DoubleType
string                 StringType
bool                   BooleanType
datetime64[ns, tz]     TimestampType(TZ)
datetime64[ns]         TimestampType(NTZ)
date                   DateType
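The mapping above can be expressed as a plain dict, which is handy if a test suite wants to assert which dtypes a DataFrameSchema may safely use. The snowpark_type_for helper below is an illustrative sketch, not part of the snowpark-checkpoints-hypothesis API; raising for unknown dtypes mirrors the strategy's documented behavior for unsupported types:

```python
# Pandera dtype -> Snowpark type name, transcribed from the table above.
PANDERA_TO_SNOWPARK = {
    "int8": "ByteType",
    "int16": "ShortType",
    "int32": "IntegerType",
    "int64": "LongType",
    "float32": "FloatType",
    "float64": "DoubleType",
    "string": "StringType",
    "bool": "BooleanType",
    "datetime64[ns, tz]": "TimestampType(TZ)",
    "datetime64[ns]": "TimestampType(NTZ)",
    "date": "DateType",
}

def snowpark_type_for(pandera_dtype: str) -> str:
    """Look up the Snowpark type for a Pandera dtype; raise ValueError for
    dtypes outside the supported set."""
    try:
        return PANDERA_TO_SNOWPARK[pandera_dtype]
    except KeyError:
        raise ValueError(f"Unsupported Pandera dtype: {pandera_dtype!r}") from None
```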

Examples

The typical workflow for generating Snowpark DataFrames using the Hypothesis library is as follows:

  1. Create a standard Python test function with the different assertions or conditions that your code should satisfy for all inputs.

  2. Add the Hypothesis @given decorator to your test function and pass the dataframe_strategy function as an argument. For more information about the @given decorator, see hypothesis.given (https://hypothesis.readthedocs.io/en/latest/details.html#hypothesis.given).

  3. Run the test function. When the test executes, Hypothesis will automatically provide the generated inputs as arguments to the test.

Example 1: Generate Snowpark DataFrames from a JSON file

The following example shows how to generate Snowpark DataFrames from a JSON schema file generated by the collect_dataframe_checkpoint function of the snowpark-checkpoints-collectors package.

from hypothesis import given

from snowflake.hypothesis_snowpark import dataframe_strategy
from snowflake.snowpark import DataFrame, Session


@given(
    df=dataframe_strategy(
        schema="path/to/file.json",
        session=Session.builder.getOrCreate(),
        size=10,
    )
)
def test_my_function_from_json_file(df: DataFrame):
    # Test a particular function using the generated Snowpark DataFrame
    ...

Example 2: Generate Snowpark DataFrames from a Pandera DataFrameSchema object

import pandera as pa

from hypothesis import given

from snowflake.hypothesis_snowpark import dataframe_strategy
from snowflake.snowpark import DataFrame, Session


@given(
    df=dataframe_strategy(
        schema=pa.DataFrameSchema(
            {
                "boolean_column": pa.Column(bool),
                "integer_column": pa.Column("int64", pa.Check.in_range(0, 9)),
                "float_column": pa.Column(pa.Float32, pa.Check.in_range(10.5, 20.5)),
            }
        ),
        session=Session.builder.getOrCreate(),
        size=10,
    )
)
def test_my_function_from_dataframeschema_object(df: DataFrame):
    # Test a particular function using the generated Snowpark DataFrame
    ...

The example above shows how to generate Snowpark DataFrames from a Pandera DataFrameSchema instance. For more information, see Pandera DataFrameSchema (https://pandera.readthedocs.io/en/latest/dataframe_schemas.html).

Example 3: Customize the Hypothesis behavior

You can also customize the behavior of your test with the Hypothesis @settings decorator, which lets you control configuration parameters such as the maximum number of test cases, the deadline for each test execution, and verbosity levels:

from datetime import timedelta

from hypothesis import given, settings
from snowflake.snowpark import DataFrame, Session

from snowflake.hypothesis_snowpark import dataframe_strategy


@given(
    df=dataframe_strategy(
        schema="path/to/file.json",
        session=Session.builder.getOrCreate(),
    )
)
@settings(
    deadline=timedelta(milliseconds=800),
    max_examples=25,
)
def test_my_function(df: DataFrame):
    # Test a particular function using the generated Snowpark DataFrame
    ...

For more information, see Hypothesis settings (https://hypothesis.readthedocs.io/en/latest/settings.html).