snowflake.snowpark.functions.pandas_udtf¶
- snowflake.snowpark.functions.pandas_udtf(handler: Optional[Callable] = None, *, output_schema: Union[StructType, List[str], PandasDataFrameType], input_types: Optional[List[DataType]] = None, input_names: Optional[List[str]] = None, name: Optional[Union[str, Iterable[str]]] = None, is_permanent: bool = False, stage_location: Optional[str] = None, imports: Optional[List[Union[str, Tuple[str, str]]]] = None, packages: Optional[List[Union[str, module]]] = None, replace: bool = False, if_not_exists: bool = False, session: Optional[Session] = None, parallel: int = 4, statement_params: Optional[Dict[str, str]] = None, strict: bool = False, secure: bool = False, external_access_integrations: Optional[List[str]] = None, secrets: Optional[Dict[str, str]] = None, immutable: bool = False, max_batch_size: Optional[int] = None, comment: Optional[str] = None, **kwargs) Union[UserDefinedTableFunction, partial][source] (https://github.com/snowflakedb/snowpark-python/blob/v1.30.0/snowpark-python/src/snowflake/snowpark/functions.py#L10080-L10215)¶
- Registers a Python class as a vectorized Python UDTF and returns the UDTF. - The arguments, return value and usage of this function are exactly the same as - udtf(), but this function can only be used for registering vectorized UDTFs. See examples in- UDTFRegistration.- See also 
- UDTFRegistration.register()
 - Compared to the default row-by-row processing pattern of a normal UDTF, which sometimes is inefficient, vectorized Python UDTFs (user-defined table functions) enable seamless partition-by-partition processing by operating on partitions as pandas DataFrames (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) and returning results as pandas DataFrames (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) or lists of pandas arrays (https://pandas.pydata.org/docs/reference/api/pandas.array.html) or pandas Series (https://pandas.pydata.org/docs/reference/series.html). - In addition, vectorized Python UDTFs allow for easy integration with libraries that operate on pandas DataFrames or pandas arrays. - A vectorized UDTF handler class: - defines an - end_partitionmethod that takes in a DataFrame argument and returns a- pandas.DataFrameor a tuple of- pandas.Seriesor- pandas.arrayswhere each array is a column. - does NOT define a- processmethod. - optionally defines a handler class with an- __init__method which will be invoked before processing each partition.- You can use - udtf(),- register()or- pandas_udtf()to create a vectorized UDTF by providing appropriate return and input types. If you would like to use- register_from_file()to create a vectorized UDTF, you need to explicitly mark the handler method as vectorized using either the decorator @vectorized(input=pandas.DataFrame) or setting <class>.end_partition._sf_vectorized_input = pandas.DataFrame- Note: A vectorized UDTF must be called with ~snowflake.snowpark.Window.partition_by to build the partitions. - Example::
- >>> from snowflake.snowpark.types import PandasSeriesType, PandasDataFrameType, IntegerType >>> class multiply: ... def __init__(self): ... self.multiplier = 10 ... def end_partition(self, df): ... df.col1 = df.col1*self.multiplier ... df.col2 = df.col2*self.multiplier ... yield df >>> multiply_udtf = pandas_udtf( ... multiply, ... output_schema=PandasDataFrameType([StringType(), IntegerType(), FloatType()], ["id_", "col1_", "col2_"]), ... input_types=[PandasDataFrameType([StringType(), IntegerType(), FloatType()])], ... input_names=['"id"', '"col1"', '"col2"'] ... ) >>> df = session.create_dataframe([['x', 3, 35.9],['x', 9, 20.5]], schema=["id", "col1", "col2"]) >>> df.select(multiply_udtf("id", "col1", "col2").over(partition_by=["id"])).sort("col1_").show() ----------------------------- |"ID_" |"COL1_" |"COL2_" | ----------------------------- |x |30 |359.0 | |x |90 |205.0 | ----------------------------- 
 - Example: - >>> @pandas_udtf( ... output_schema=PandasDataFrameType([StringType(), IntegerType(), FloatType()], ["id_", "col1_", "col2_"]), ... input_types=[PandasDataFrameType([StringType(), IntegerType(), FloatType()])], ... input_names=['"id"', '"col1"', '"col2"'] ... ) ... class _multiply: ... def __init__(self): ... self.multiplier = 10 ... def end_partition(self, df): ... df.col1 = df.col1*self.multiplier ... df.col2 = df.col2*self.multiplier ... yield df >>> df.select(multiply_udtf("id", "col1", "col2").over(partition_by=["id"])).sort("col1_").show() ----------------------------- |"ID_" |"COL1_" |"COL2_" | ----------------------------- |x |30 |359.0 | |x |90 |205.0 | -----------------------------