Vectorized Python UDFs¶
This topic introduces vectorized Python UDFs.
Overview¶
Vectorized Python UDFs let you define Python functions that receive batches of input rows as Pandas DataFrames (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) and return batches of results as Pandas arrays (https://pandas.pydata.org/docs/reference/api/pandas.array.html) or Series (https://pandas.pydata.org/docs/reference/series.html). You call vectorized Python UDFs the same way you call other Python UDFs.
Advantages of using vectorized Python UDFs compared to the default row-by-row processing pattern include:
The potential for better performance if your Python code operates efficiently on batches of rows.
Less transformation logic required if you are calling into libraries that operate on Pandas DataFrames or Pandas arrays.
When you use vectorized Python UDFs:
You do not need to change how you write queries using Python UDFs. All batching is handled by the UDF framework rather than your own code.
As with non-vectorized UDFs, there is no guarantee of which instances of your handler code will see which batches of input.
Getting started with vectorized Python UDFs¶
To create a vectorized Python UDF, use one of the supported mechanisms for annotating your handler function.
Using the vectorized
decorator¶
The _snowflake
module is exposed to Python UDFs that execute within Snowflake. In your Python code, import the _snowflake
module,
and use the vectorized
decorator to specify that your handler expects to receive a Pandas DataFrame by setting the input
parameter to pandas.DataFrame
.
create function add_one_to_inputs(x number(10, 0), y number(10, 0))
returns number(10, 0)
language python
RUNTIME_VERSION = 3.9
packages = ('pandas')
handler = 'add_one_to_inputs'
as $$
import pandas
from _snowflake import vectorized
@vectorized(input=pandas.DataFrame)
def add_one_to_inputs(df):
return df[0] + df[1] + 1
$$;
Using a function attribute¶
Rather than importing the _snowflake module and using the vectorized
decorator, you can set the special _sf_vectorized_input
attribute on your handler function.
create function add_one_to_inputs(x number(10, 0), y number(10, 0))
returns number(10, 0)
language python
RUNTIME_VERSION = 3.9
packages = ('pandas')
handler = 'add_one_to_inputs'
as $$
import pandas
def add_one_to_inputs(df):
return df[0] + df[1] + 1
add_one_to_inputs._sf_vectorized_input = pandas.DataFrame
$$;
Setting a target batch size¶
Calls to the Python handler function must execute within a time limit,
which is 180 seconds, and each DataFrame passed as input to the handler function may currently contain
up to a few thousand rows. In order to stay within the time limit, you may want to set the target batch
size for your handler function, which imposes a maximum number of rows per input DataFrame.
Note that setting a larger value does not guarantee that Snowflake will encode batches with the specified number of rows.
You can set the target batch size using either the vectorized
decorator or an attribute on the function.
Note
Using max_batch_size
is only meant as a mechanism to limit the number of rows that UDF can handle per single batch.
For example, if the UDF is written in a way that can only process at most 100 rows at a time, then max_batch_size
should be set to 100.
Setting max_batch_size
is not meant to be used as a mechanism to specify arbitrary large batch sizes.
If the UDF is able to process batches of any size, it is recommended to leave this parameter unset.
Using the vectorized
decorator¶
To set the target batch size using the vectorized
decorator, pass a positive integer value for the argument named max_batch_size
.
As an example, this statement creates a vectorized Python UDF and limits each Dataframe to a maximum of 100 rows:
create function add_one_to_inputs(x number(10, 0), y number(10, 0))
returns number(10, 0)
language python
RUNTIME_VERSION = 3.9
packages = ('pandas')
handler = 'add_one_to_inputs'
as $$
import pandas
from _snowflake import vectorized
@vectorized(input=pandas.DataFrame, max_batch_size=100)
def add_one_to_inputs(df):
return df[0] + df[1] + 1
$$;
Using a function attribute¶
To set the target batch size using a function attribute, set a positive integer value for the _sf_max_batch_size
attribute on your handler function.
As an example, this statement creates a vectorized Python UDF and limits each DataFrame to a maximum of 100 rows:
create function add_one_to_inputs(x number(10, 0), y number(10, 0))
returns number(10, 0)
language python
RUNTIME_VERSION = 3.9
packages = ('pandas')
handler = 'add_one_to_inputs'
as $$
import pandas
def add_one_to_inputs(df):
return df[0] + df[1] + 1
add_one_to_inputs._sf_vectorized_input = pandas.DataFrame
add_one_to_inputs._sf_max_batch_size = 100
$$;
DataFrame encoding¶
Batches of arguments to the UDF are encoded as arrays in the input Pandas DataFrames, and the number of rows in each DataFrame may vary. For more information, see Setting a target batch size. Arguments can be accessed in the DataFrame by their index, i.e. the first argument has an index of 0, the second has an index of 1, and so on. The Pandas array or Series that the UDF handler returns must have the same length as that of the input DataFrame.
To illustrate, suppose that you define a vectorized Python UDF as follows:
create or replace function add_inputs(x int, y float)
returns float
language python
RUNTIME_VERSION = 3.9
packages = ('pandas')
handler = 'add_inputs'
as $$
import pandas
from _snowflake import vectorized
@vectorized(input=pandas.DataFrame)
def add_inputs(df):
return df[0] + df[1]
$$;
This UDF uses df[0]
to access the Pandas array for the first argument, and df[1]
for the second. df[0] + df[1]
results in a Pandas array with the pairwise sums of corresponding elements from the two arrays. After creating the UDF, you might call it with some input rows:
select add_inputs(x, y)
from (
select 1 as x, 3.14::float as y union all
select 2, 1.59 union all
select 3, -0.5
);
+------------------+
| ADD_INPUTS(X, Y) |
|------------------|
| 4.14 |
| 3.59 |
| 2.5 |
+------------------+
Here the add_inputs
Python function receives a DataFrame analogous to one created with the following Python code:
>>> import pandas
>>> df = pandas.DataFrame({0: pandas.array([1, 2, 3]), 1: pandas.array([3.14, 1.59, -0.5])})
>>> df
0 1
0 1 3.14
1 2 1.59
2 3 -0.50
The line return df[0] + df[1]
in the handler function results in an array similar to the following Python code:
>>> df[0] + df[1]
0 4.14
1 3.59
2 2.50
dtype: float64
Type support¶
Vectorized Python UDFs support the following SQL types for arguments and return values. The table reflects how each SQL argument is encoded as a Pandas array of a particular dtype (https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes).
SQL Type |
Pandas dtype |
Notes |
---|---|---|
NUMBER |
|
To ensure that an input argument to a UDF is interpreted as not nullable, pass a column from a table created using the |
FLOAT |
|
NULL values are encoded as NaN values. In the output, NaN values are interpreted as NULLs. |
BOOLEAN |
|
|
VARCHAR |
|
Both Snowflake SQL and Pandas represent strings using UTF-8 encoding. |
BINARY |
|
|
DATE |
|
Each value is encoded as a |
VARIANT |
|
Each variant row is converted to a Python type dynamically for arguments and vice versa for return values. The following types are converted to strings rather than native Python types: |
OBJECT |
|
|
ARRAY |
|
|
TIME |
|
Each value is encoded as an offset from midnight. NULL values are encoded as |
TIMESTAMP_LTZ |
|
Uses the local time zone to encode each value as a nanosecond-scale |
TIMESTAMP_NTZ |
|
Encodes each value as a nanosecond-scale |
TIMESTAMP_TZ |
|
Encodes each value as a nanosecond-scale |
GEOGRAPHY |
|
Formats each value as GeoJSON and then converts it to a Python |
The following types are accepted as output: Pandas Series
or array
, NumPy array
, regular
Python list
, and any iterable sequence that contains the expected types described
in Type support. It is efficient to use Pandas Series
and array
and NumPy array
where the dtype is bool
, boolean
,
int16
, int32
, int64
, Int16
, Int32
, Int64
, or float64
because they expose their contents as memoryviews
. This means that the contents can be copied rather than each value
being read sequentially.