Using third-party packages

Stages can be used to import third-party packages. You can also specify Anaconda packages to install when you create Python UDFs.

Importing packages through a Snowflake stage

Snowflake stages can be used to import packages. You can bring in any Python code that follows guidelines defined in General limitations. For more information, see Creating a Python UDF with code uploaded from a stage.

You can upload pure Python packages, or packages with native code, through a Snowflake stage. If your uploaded package has a dependency on the x86 CPU architecture, then you must use a Snowpark-optimized warehouse and set the RESOURCE_CONSTRAINT warehouse property to an x86 CPU architecture.

As an example, you can use the following SQL, which creates a warehouse named so_warehouse that has x86 CPU architecture:

CREATE WAREHOUSE so_warehouse WITH
   WAREHOUSE_SIZE = 'LARGE'
   WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
   RESOURCE_CONSTRAINT = 'MEMORY_16X_X86';

To install a package with native code by importing it from a stage, use the following example:

create or replace function native_module_test_zip()
returns string
language python
runtime_version=3.9
handler='compute'
imports=('@mystage/mycustompackage.zip')
as
$$
import os
import shutil
import sys

def compute():
   from zipfile import ZipFile
   # Extract the imported archive into a writable directory.
   if os.path.exists('/tmp/mycustompackage'):
      shutil.rmtree('/tmp/mycustompackage')
   os.mkdir('/tmp/mycustompackage')
   zippath = os.path.join(sys._xoptions["snowflake_import_directory"], 'mycustompackage.zip')
   ZipFile(zippath).extractall('/tmp/mycustompackage')
   # Make the extracted package importable.
   sys.path.insert(0, '/tmp/mycustompackage')
   import mycustompackage
   return mycustompackage.mycustompackage()
$$;

Using third-party packages from Anaconda

For convenience, a number of popular open source third-party Python packages that are built and provided by Anaconda are made available to use out of the box inside Snowflake virtual warehouses. There is no additional cost for this use of the Anaconda packages apart from Snowflake’s standard consumption-based pricing. With the exception of Snowflake Notebooks, Anaconda packages are currently not eligible for use within Snowpark Container Services (SPCS). To use Python packages in a container image for SPCS, you can install them from PyPI using pip.

To view the list of third-party packages from Anaconda, see the Anaconda Snowflake channel (https://repo.anaconda.com/pkgs/snowflake).

To request the addition of new packages, go to the Snowflake Ideas (https://community.snowflake.com/s/ideas/) page in the Snowflake Community. Select the Python Packages & Libraries category and check if someone has already submitted a request. If so, vote on it. Otherwise, click New Idea and submit your suggestion.

Getting started

Before you start using the packages provided by Anaconda inside Snowflake, you must acknowledge the External Offerings Terms.

Note

You must use the ORGADMIN role to accept the terms. You only need to accept the External Offerings Terms once for your Snowflake account. If you do not have access to the ORGADMIN role, see Enabling the ORGADMIN role in an account.

  1. Sign in to Snowsight.

  2. Select Admin » Billing & Terms.

  3. In the Anaconda section, select Enable.

  4. In the Anaconda Packages dialog, click the link to review the External Offerings Terms page.

  5. If you agree to the terms, select Acknowledge & Continue.

If you encounter an error when attempting to accept the External Offerings Terms, it may be due to missing information in your user profile, such as a first name, last name, or email address. If you have administrator privileges, see Add user details to your user profile to update your profile using Snowsight. Otherwise, contact an administrator to update your account.

Displaying and using packages

Displaying available packages

You can display all packages available and their version information by querying the PACKAGES view in the Information Schema.

select * from information_schema.packages where language = 'python';

To display version information about a specific package, for example numpy, use this command:

select * from information_schema.packages where (package_name = 'numpy' and language = 'python');

Note

Some packages in the Anaconda Snowflake channel are not intended for use inside Snowflake UDFs because UDFs are executed within a restricted engine. For more information, see Following good security practices.

When queries that call Python UDFs are executed inside a Snowflake warehouse, Anaconda packages are installed seamlessly and cached on the virtual warehouse on your behalf.

Displaying imported packages

You can display a list of the packages and modules a UDF or UDTF is using by executing the DESCRIBE FUNCTION command. For a UDF whose handler is implemented in Python, the command returns the values of several properties, including the list of imported modules and packages, the list of installed packages, the function signature, and its return type.

When specifying the identifier for the UDF, be sure to include function parameter types, if any.

desc function stock_sale_average(varchar, number, number);

Using Anaconda packages

For an example of how to use an imported Anaconda package in a Python UDF, refer to Importing a package in an in-line handler.
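
For instance, a minimal sketch of a UDF that uses the Anaconda-provided numpy package might look like the following (the function name py_log2 and its logic are illustrative only):

create or replace function py_log2(x float)
returns float
language python
runtime_version=3.9
packages=('numpy')
handler='compute'
as
$$
import numpy as np

def compute(x):
   # numpy is installed from the Anaconda channel because it is
   # listed in the packages clause; no stage upload is needed.
   return float(np.log2(x))
$$;

Calling select py_log2(8) would then return 3.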

Setting packages policies

You can use a packages policy to set allowlists and blocklists for third-party Python packages from Anaconda at the account level. This lets you meet stricter auditing and security requirements and gives you more fine-grained control over which packages are available or blocked in your environment. For more information, see Packages policies.
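
As an illustrative sketch (the policy name and package lists below are hypothetical; see Packages policies for the authoritative syntax), a packages policy can be created and then applied at the account level:

create packages policy my_packages_policy
   language python
   allowlist = ('numpy', 'pandas')
   blocklist = ('xgboost');

alter account set packages policy my_packages_policy;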

Performance on cold warehouses

For more efficient resource management, newly provisioned virtual warehouses do not preinstall Anaconda packages. Instead, Anaconda packages are installed on-demand the first time a UDF is used. The packages are cached for future UDF execution on the same warehouse. The cache is dropped when the warehouse is suspended. This may result in slower performance the first time a UDF is used or after the warehouse is resumed. The additional latency could be approximately 30 seconds.

Local development and testing

To help you create a conda environment on your local machine for development and testing, Anaconda has created a Snowflake channel which mirrors a subset of the packages and versions that are supported in the Snowflake Python UDF environment. You may use the Snowflake conda channel for local testing and development at no cost under the Supplemental Embedded Software Terms to Anaconda’s Terms of Service.

For example, to create a new conda environment locally using the Snowflake channel, type something like this on the command line:

conda create --name py39_env -c https://repo.anaconda.com/pkgs/snowflake python=3.9 numpy pandas

Note that because of platform differences, your local conda environment may not be exactly the same as the server environment.

Best practices

Within the create function statement, the package specification (for example, packages = ('numpy','pandas')) should only specify the top-level packages that the UDF is using directly. Anaconda performs dependency management of packages and will install the required dependencies automatically. Because of this, you should not specify dependency packages.
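
For example, if a handler uses only pandas, which itself depends on numpy, the package specification names just pandas (the function below is a hypothetical sketch):

create or replace function mean_of_two(a float, b float)
returns float
language python
runtime_version=3.9
packages=('pandas')
handler='compute'
as
$$
import pandas as pd

def compute(a, b):
   # numpy is available too, because Anaconda installs it
   # automatically as a dependency of pandas.
   return float(pd.Series([a, b]).mean())
$$;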

Anaconda will install the most up-to-date version of the package and its dependencies if you don’t specify a package version. Generally, it isn’t necessary to specify a particular package version. Note that version resolution is performed once, when the UDF is created using the create function command. After that, the resulting version resolution is frozen and the same set of packages will be used when this particular UDF executes.
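
If you do need a particular version, pin it in the package specification when you create the function (the version shown below is illustrative):

create or replace function mean_of_two(a float, b float)
returns float
language python
runtime_version=3.9
packages=('pandas==1.5.3')
handler='compute'
as
$$
import pandas as pd

def compute(a, b):
   # The pinned pandas version is resolved once, at creation time,
   # and reused for every execution of this UDF.
   return float(pd.Series([a, b]).mean())
$$;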

For an example of how to use the package specification within the create function statement, see Importing a package in an in-line handler.

Known issues with third-party packages

Performance with single row prediction

Some data science frameworks, such as scikit-learn and TensorFlow, might be slow when doing single-row ML prediction. To improve performance, do batch prediction instead of single-row prediction. To do this, use vectorized Python UDFs, which let you define Python functions that receive input rows in batches, the form on which machine learning and data science libraries are optimized to operate. For more information, see Vectorized Python UDFs.
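
As a minimal sketch of the mechanism (the function name is hypothetical; a real prediction UDF would load its model once at module level and call a batch prediction method inside the handler), a vectorized UDF receives a batch of rows as a pandas DataFrame:

create or replace function add_inputs(x number, y number)
returns number
language python
runtime_version=3.9
packages=('pandas')
handler='compute'
as
$$
import pandas
from _snowflake import vectorized

@vectorized(input=pandas.DataFrame)
def compute(df):
   # df holds a batch of input rows: column 0 is x, column 1 is y.
   return df[0] + df[1]
$$;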

Downloading data on demand from data science libraries

Some data science libraries, such as NLTK (https://www.nltk.org/data.html), Keras (https://www.tensorflow.org/api_docs/python/tf/keras/datasets), and spaCy (https://spacy.io), provide functionality to download additional corpora, data, or models on demand.

However, on-demand downloading does not work with Python UDFs due to Snowflake security constraints, which disable some capabilities, such as network access and writing to files.

To work around this issue, download the data to your local environment, upload it to a Snowflake stage, and then provide it to the UDF via the stage.
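
A minimal sketch, assuming you have uploaded a data file named my_corpus.txt to a stage named mystage (both names are hypothetical):

create or replace function read_corpus()
returns string
language python
runtime_version=3.9
handler='compute'
imports=('@mystage/my_corpus.txt')
as
$$
import os
import sys

def compute():
   # Files listed in imports are available in the import directory,
   # which is readable even though general file writes are disabled.
   path = os.path.join(sys._xoptions["snowflake_import_directory"], 'my_corpus.txt')
   with open(path) as f:
      return f.read()[:100]
$$;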

XGBoost

When using XGBoost in a UDF or UDTF for parallel prediction or training, set the concurrency for each XGBoost instance to 1. This ensures that XGBoost is configured for optimal performance when executing in the Snowflake environment.

Examples:

import xgboost as xgb
# Limit the Booster API to a single thread.
model = xgb.Booster()
model.set_param('nthread', 1)
model.load_model(...)

import xgboost as xgb
# Limit the scikit-learn API to a single job.
model = xgb.XGBRegressor(n_jobs=1)

TensorFlow/Keras

When using TensorFlow/Keras for prediction, use Model.predict_on_batch and not Model.predict.

Example:

import keras
import numpy as np

model = keras.models.load_model(...)
# predict_on_batch runs the whole batch in a single call.
model.predict_on_batch(np.array([input]))