Snowflake ML Model Development¶
Note
The Snowflake ML Modeling API is Generally Available as of snowflake-ml-python package version 1.1.1.
The Snowflake ML Modeling API uses familiar Python frameworks such as scikit-learn, LightGBM, and XGBoost for preprocessing data, feature engineering, and training models inside Snowflake.
Benefits of developing models with Snowflake ML Modeling include:
Feature engineering and preprocessing: Improve performance and scalability with distributed execution for frequently-used scikit-learn preprocessing functions.
Model training: Accelerate training for scikit-learn, XGBoost and LightGBM models without the need to manually create stored procedures or user-defined functions (UDFs), leveraging distributed hyperparameter optimization.
Tip
See Introduction to Machine Learning (https://quickstarts.snowflake.com/guide/intro_to_machine_learning_with_snowpark_ml_for_python/#0) for an example of an end-to-end ML workflow, including the modeling API.
Note
This topic assumes that snowflake-ml-python and its modeling dependencies are already installed. See Using Snowflake ML Locally.
Developing models¶
With Container Runtime for ML, available in Notebooks on Container Runtime, you can use popular open-source ML packages with your Snowflake data, leveraging one or more GPU nodes within the Snowflake cloud, which keeps the entire ML workflow under Snowflake security and governance. The included data loading and training APIs are automatically distributed across all available CPUs or GPUs on a node, accelerating model training with large datasets.
For more information, see Getting Started with Snowflake Notebook Container Runtime (https://quickstarts.snowflake.com/guide/notebook-container-runtime/index.html#0), which presents a simple ML workflow leveraging the capabilities of the Container Runtime for ML.
Along with the flexibility and power of the Container Runtime for ML, the Snowflake ML Modeling API provides estimators and transformers that have APIs similar to those in the scikit-learn, xgboost, and lightgbm libraries. You can use these APIs to build and train machine learning models that can be used with Snowflake ML Operations such as the Snowpark Model Registry.
Examples¶
Review the following examples to get a sense of the similarities of the Snowflake Modeling API to the machine learning libraries you might be familiar with.
Preprocessing¶
This example illustrates the use of Snowflake Modeling data preprocessing and transformation functions. The two preprocessing functions used in the example (MinMaxScaler and OrdinalEncoder) use Snowflake’s distributed processing engine to provide significant performance improvements over client-side or stored procedure implementations. For details, see Distributed Preprocessing.
import numpy as np
import pandas as pd
import random
import string
from sklearn.datasets import make_regression
from snowflake.ml.modeling.preprocessing import MinMaxScaler, OrdinalEncoder
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.snowpark import Session
# Create a session with your preferred method
# session =
NUMERICAL_COLS = ["X1", "X2", "X3"]
CATEGORICAL_COLS = ["C1", "C2", "C3"]
FEATURE_COLS = NUMERICAL_COLS + CATEGORICAL_COLS
CATEGORICAL_OUTPUT_COLS = ["C1_OUT", "C2_OUT", "C3_OUT"]
FEATURE_OUTPUT_COLS = ["X1_FEAT_OUT", "X2_FEAT_OUT", "X3_FEAT_OUT", "C1_FEAT_OUT", "C2_FEAT_OUT", "C3_FEAT_OUT"]
# Create a dataset with numerical and categorical features
X, _ = make_regression(
    n_samples=1000,
    n_features=3,
    noise=0.1,
    random_state=0,
)
X = pd.DataFrame(X, columns=NUMERICAL_COLS)
def generate_random_string(length):
    return "".join(random.choices(string.ascii_uppercase, k=length))
categorical_feature_length = 2
categorical_features = {}
for c in CATEGORICAL_COLS:
    categorical_column = [generate_random_string(categorical_feature_length) for _ in range(X.shape[0])]
    categorical_features[c] = categorical_column
X = X.assign(**categorical_features)
features_df = session.create_dataframe(X)
# Fit a pipeline with OrdinalEncoder and MinMaxScaler on Snowflake
pipeline = Pipeline(
    steps=[
        (
            "OE",
            OrdinalEncoder(
                input_cols=CATEGORICAL_COLS,
                output_cols=CATEGORICAL_OUTPUT_COLS,
            )
        ),
        (
            "MMS",
            MinMaxScaler(
                input_cols=NUMERICAL_COLS + CATEGORICAL_OUTPUT_COLS,
                output_cols=FEATURE_OUTPUT_COLS,
            )
        ),
    ]
)
pipeline.fit(features_df)
# Use the pipeline to transform a dataset.
result = pipeline.transform(features_df)
Data Loading¶
This example shows how to load data from a Snowflake table into a pandas DataFrame or a PyTorch Dataset using the DataConnector API, which distributes data ingestion over multiple cores or GPUs to speed up loading.
Note
The DataConnector API is available in the Container Runtime for ML and can be used from Snowsight notebooks running on Snowpark Container Services (SPCS).
from snowflake.ml.data.data_connector import DataConnector
# Retrieve data from a Snowflake table
table_name = 'LARGE_TABLE_MULTIPLE_GBs'
snowpark_df = session.table(table_name)
# Create a DataConnector from the Snowpark DataFrame
data_connector = DataConnector.from_dataframe(snowpark_df)
# Materialize it into a pandas DataFrame using DataConnector
pandas_df = data_connector.to_pandas()
# Materialize it into a PyTorch dataset using DataConnector
torch_dataset = data_connector.to_torch_dataset(batch_size=1024)
Training¶
This example shows how to train a simple xgboost classifier model using Snowflake ML Modeling, then run predictions. The API is similar to xgboost here, with only a few differences in how the columns are specified. For details on these differences, see General API Differences.
import pandas as pd
from sklearn.datasets import make_classification
from snowflake.ml.modeling.xgboost import XGBClassifier
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
from snowflake.snowpark import Session
# Create a session with your preferred method
# session =
FEATURE_COLS = ["X1", "X2", "X3", "X4", "X5", "X6"]
LABEL_COLS = ["Y"]
OUTPUT_COLS = ["PREDICTIONS"]
# Set up data.
X, y = make_classification(
    n_samples=40000,
    n_features=6,
    n_informative=4,
    n_redundant=1,
    random_state=0,
    shuffle=True,
)
X = pd.DataFrame(X, columns=FEATURE_COLS)
y = pd.DataFrame(y, columns=LABEL_COLS)
features_pandas = pd.concat([X, y], axis=1)
features_df = session.create_dataframe(features_pandas)
# Train an XGBoost model on snowflake.
xgboost_model = XGBClassifier(
    input_cols=FEATURE_COLS,
    label_cols=LABEL_COLS,
    output_cols=OUTPUT_COLS
)
xgboost_model.fit(features_df)
# Use the model to make predictions.
predictions = xgboost_model.predict(features_df)
predictions[OUTPUT_COLS].show()
Feature Preprocessing and Training on Non-Synthetic Data¶
This example uses the high-energy gamma particle data from a ground-based atmospheric Cherenkov telescope. The telescope observes high energy gamma particles, taking advantage of the radiation emitted by charged particles produced in the electromagnetic showers initiated by the gamma rays. The detector records the Cherenkov radiation (of visible to ultraviolet wavelengths) that leaks through the atmosphere, allowing reconstruction of the gamma shower parameters. The telescope also detects hadron rays that are abundant in cosmic showers and produce signals that mimic gamma rays.
The goal is to develop a classification model for distinguishing between gamma rays and hadron rays. The model enables scientists to filter out background noise and focus on the genuine gamma-ray signals. Gamma rays allow scientists to observe cosmic events like the birth and death of stars, cosmic explosions, and the behavior of matter in extreme conditions.
The particle data is available for download from MAGIC Gamma Telescope (https://archive.ics.uci.edu/dataset/159/magic+gamma+telescope).
Download and unzip the data, set the DATA_FILE_PATH variable to point to the data file, and run the code below to load it to Snowflake.
DATA_FILE_PATH = "~/Downloads/magic+gamma+telescope/magic04.data"
# Setup
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
from snowflake.snowpark import Session
import posixpath
import os
##
# Note: Create session https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/api/snowflake.snowpark.Session
##
session = Session.builder.configs(SnowflakeLoginOptions()).create()
session.sql("""
CREATE OR REPLACE TABLE Gamma_Telescope_Data(
F_LENGTH FLOAT,
F_WIDTH FLOAT,
F_SIZE FLOAT,
F_CONC FLOAT,
F_CONC1 FLOAT,
F_ASYM FLOAT,
F_M3_LONG FLOAT,
F_M3_TRANS FLOAT,
F_ALPHA FLOAT,
F_DIST FLOAT,
CLASS VARCHAR(10))
""").collect()
session.sql("CREATE OR REPLACE STAGE SNOWPARK_ML_TEST_DATA_STAGE").collect()
session.file.put(
    DATA_FILE_PATH,
    "SNOWPARK_ML_TEST_DATA_STAGE/magic04.data",
    auto_compress=False,
    overwrite=True,
)
session.sql("""
COPY INTO Gamma_Telescope_Data FROM @SNOWPARK_ML_TEST_DATA_STAGE/magic04.data
FILE_FORMAT = (TYPE = 'CSV' field_optionally_enclosed_by='"',SKIP_HEADER = 0);
""").collect()
session.sql("select * from Gamma_Telescope_Data limit 5").collect()
Once you have loaded the data, use the following code to train a model and make predictions. The code performs the following steps:
Preprocess the data:
Replace missing values with the mean.
Center the data using a standard scaler.
Train an xgboost classifier to determine the type of events.
Test the accuracy of the model on both training and test datasets.
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
from snowflake.snowpark import Session, DataFrame
from snowflake.ml.modeling.preprocessing import StandardScaler
from snowflake.ml.modeling.impute import SimpleImputer
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.xgboost import XGBClassifier
from snowflake.ml.modeling.metrics import accuracy_score
##
# Note: Create session https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/api/snowflake.snowpark.Session
##
session = Session.builder.configs(SnowflakeLoginOptions()).create()
# Step 1: Create train and test dataframes
all_data = session.sql("select *, IFF(CLASS = 'g', 1.0, 0.0) as LABEL from Gamma_Telescope_Data").drop("CLASS")
train_data, test_data = all_data.random_split(weights=[0.9, 0.1], seed=0)
# Step 2: Construct training pipeline with preprocessing and modeling steps
FEATURE_COLS = [c for c in train_data.columns if c != "LABEL"]
LABEL_COLS = ["LABEL"]
pipeline = Pipeline(steps = [
    ("impute", SimpleImputer(input_cols=FEATURE_COLS, output_cols=FEATURE_COLS)),
    ("scaler", StandardScaler(input_cols=FEATURE_COLS, output_cols=FEATURE_COLS)),
    ("model", XGBClassifier(input_cols=FEATURE_COLS, label_cols=LABEL_COLS))
])
# Step 3: Train
pipeline.fit(train_data)
# Step 4: Eval
predict_on_training_data = pipeline.predict(train_data)
training_accuracy = accuracy_score(df=predict_on_training_data, y_true_col_names=["LABEL"], y_pred_col_names=["OUTPUT_LABEL"])
predict_on_test_data = pipeline.predict(test_data)
eval_accuracy = accuracy_score(df=predict_on_test_data, y_true_col_names=["LABEL"], y_pred_col_names=["OUTPUT_LABEL"])
print(f"Training accuracy: {training_accuracy} \nEval accuracy: {eval_accuracy}")
Distributed Hyperparameter Optimization¶
This example shows how to run distributed hyperparameter optimization using Snowflake’s implementation of scikit-learn’s GridSearchCV. The individual runs are executed in parallel using distributed warehouse compute resources. For details on distributed hyperparameter optimization, see Distributed Hyperparameter Optimization.
import pandas as pd
from sklearn.datasets import make_classification
from snowflake.ml.modeling.model_selection.grid_search_cv import GridSearchCV
from snowflake.ml.modeling.xgboost import XGBClassifier
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
from snowflake.snowpark import Session, DataFrame
FEATURE_COLS = ["X1", "X2", "X3", "X4", "X5", "X6"]
LABEL_COLS = ["Y"]
OUTPUT_COLS = ["PREDICTIONS"]
# Create a session using your favorite login option.
# In this example we use a session builder with `SnowflakeLoginOptions`.
session = Session.builder.configs(SnowflakeLoginOptions()).create()
# Set up data.
def set_up_data(session: Session, n_samples: int) -> DataFrame:
    X, y = make_classification(
        n_samples=n_samples,
        n_features=6,
        n_informative=2,
        n_redundant=0,
        random_state=0,
        shuffle=True,
    )
    X = pd.DataFrame(X, columns=FEATURE_COLS)
    y = pd.DataFrame(y, columns=LABEL_COLS)
    features_pandas = pd.concat([X, y], axis=1)
    features_pandas.head()
    features_df = session.create_dataframe(features_pandas)
    return features_df
features_df = set_up_data(session, 10**4)
# Create a warehouse to use for the tuning job.
session.sql(
"""
CREATE or replace warehouse HYPERPARAM_WH
WITH WAREHOUSE_SIZE = 'X-SMALL'
WAREHOUSE_TYPE = 'Standard'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE
INITIALLY_SUSPENDED = FALSE;"""
).collect()
session.use_warehouse("HYPERPARAM_WH")
# Tune an XGB Classifier model using sklearn GridSearchCV.
DISTRIBUTIONS = dict(
    n_estimators=[10, 50],
    learning_rate=[0.01, 0.1, 0.2],
)
estimator = XGBClassifier()
grid_search_cv = GridSearchCV(estimator=estimator, param_grid=DISTRIBUTIONS, input_cols=FEATURE_COLS, label_cols=LABEL_COLS, output_cols=OUTPUT_COLS)
grid_search_cv.fit(features_df)
# Use the best model to make predictions.
predictions = grid_search_cv.predict(features_df)
predictions[OUTPUT_COLS].show()
# Retrieve sklearn model, and print the best score
sklearn_grid_search_cv = grid_search_cv.to_sklearn()
print(sklearn_grid_search_cv.best_score_)
To really see the power of distributed optimization, train on a million rows of data.
large_features_df = set_up_data(session, 10**6)
# Scale up the warehouse for a faster fit. This takes 2m15s to run on an L warehouse versus 4m5s on an XS warehouse.
session.sql(f"ALTER WAREHOUSE {session.get_current_warehouse()} SET WAREHOUSE_SIZE='LARGE'").collect()
grid_search_cv.fit(large_features_df)
print(grid_search_cv.to_sklearn().best_score_)
Snowflake Modeling Classes¶
All Snowflake modeling and preprocessing classes are in the snowflake.ml.modeling namespace. The snowflake-ml-python modules have the same name as the corresponding modules from the sklearn namespace. For example, the module corresponding to sklearn.calibration is snowflake.ml.modeling.calibration. The xgboost and lightgbm modules correspond to snowflake.ml.modeling.xgboost and snowflake.ml.modeling.lightgbm, respectively.
The modeling API provides wrappers for underlying scikit-learn, xgboost, and lightgbm classes, the majority of which are executed as stored procedures (running on a single warehouse node) in the virtual warehouse. Not all of the classes from scikit-learn are supported. See the Python API Reference for a list of the classes currently available.
Some classes (including preprocessing and metrics classes) support distributed execution and may provide significant performance benefits compared to running the same operations locally. For more information, see Distributed Preprocessing and Distributed Hyperparameter Optimization. The table below lists the specific classes that support distributed execution.
|
Distributed classes |
---|---|
|
|
|
|
|
|
|
|
General API Differences¶
Tip
See the API Reference for complete details of the modeling API.
Snowflake modeling classes include data preprocessing, transformation, and prediction algorithms based on scikit-learn, xgboost, and lightgbm. The Snowpark Python classes are replacements for the corresponding classes from the original packages, with similar signatures. However, these APIs are designed to work with Snowpark DataFrames instead of NumPy arrays.
Although the API is similar to scikit-learn, there are some key differences. This section explains how to call the __init__ (constructor), fit, and predict methods for the Snowflake estimator and transformer classes.
The constructor of all Snowflake model classes accepts five additional parameters (input_cols, output_cols, sample_weight_col, label_cols, and drop_input_cols) in addition to the parameters accepted by the equivalent classes in scikit-learn, xgboost, or lightgbm. Except for drop_input_cols, which is a Boolean, these are strings or sequences of strings that specify the names of the input columns, output columns, sample weight column, and label columns in a Snowpark or Pandas DataFrame. If some of the datasets you use have different names, you can change these names after instantiation using one of the provided setter methods, such as set_input_cols.
Because you specify column names when instantiating the class (or afterward, using setter methods), the fit and predict methods accept a single DataFrame instead of separate arrays for inputs, weights, and labels. The provided column names are used to access the appropriate column from the DataFrame in fit or predict. See fit and predict.
By default, the transform and predict methods return a DataFrame containing all of the columns from the DataFrame passed to the method, with the output from the prediction stored in additional columns. (You can transform in place by specifying output column names that match the input column names, or drop the input columns by passing drop_input_cols = True.) The scikit-learn, xgboost, and lightgbm equivalents return arrays containing only the results.
Snowpark Python transformers do not have a fit_transform method. However, as with scikit-learn, parameter validation is only performed in the fit method, so you should call fit at some point before transform, even when the transformer does not do any fitting. fit returns the transformer, so the method calls may be chained; for example, Binarizer(threshold=0.5).fit(df).transform(df).
Snowflake transformers do not currently have an inverse_transform method. In many use cases, this method is unnecessary because the input columns are retained in the output DataFrame by default.
You can convert any Snowflake modeling object to the corresponding scikit-learn, xgboost, or lightgbm object, allowing you to use all the methods and attributes of the underlying type. See Retrieving the Underlying Model.
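The calling convention described above can be sketched with a toy, pure-Python stand-in. ToyBinarizer and its list-of-dicts "DataFrame" are hypothetical and exist only for illustration; this is not the Snowflake implementation, only a minimal model of how column names set at construction let fit and transform take a single table and return it with output columns appended.

```python
from typing import Dict, List, Optional

Row = Dict[str, float]


class ToyBinarizer:
    """Hypothetical stand-in that mimics the column-based calling convention."""

    def __init__(
        self,
        threshold: float = 0.0,
        input_cols: Optional[List[str]] = None,
        output_cols: Optional[List[str]] = None,
        drop_input_cols: bool = False,
    ) -> None:
        self.threshold = threshold
        self.input_cols = list(input_cols or [])
        self.output_cols = list(output_cols or [])
        self.drop_input_cols = drop_input_cols

    def set_input_cols(self, cols: List[str]) -> "ToyBinarizer":
        # Setter in the spirit of set_input_cols: rename columns after construction.
        self.input_cols = list(cols)
        return self

    def fit(self, rows: List[Row]) -> "ToyBinarizer":
        # Nothing is learned here; fit only validates the configuration.
        if len(self.input_cols) != len(self.output_cols):
            raise ValueError("input_cols and output_cols must have the same length")
        return self  # returning self allows fit(...).transform(...) chaining

    def transform(self, rows: List[Row]) -> List[Row]:
        result = []
        for row in rows:
            out = dict(row)  # keep every input column by default
            for src, dst in zip(self.input_cols, self.output_cols):
                out[dst] = 1.0 if row[src] > self.threshold else 0.0
            if self.drop_input_cols:
                for src in self.input_cols:
                    if src not in self.output_cols:
                        out.pop(src, None)
            result.append(out)
        return result


rows = [{"ID": 1.0, "X1": 0.2}, {"ID": 2.0, "X1": 0.9}]
binarizer = ToyBinarizer(threshold=0.5, input_cols=["X1"], output_cols=["X1_BIN"])
transformed = binarizer.fit(rows).transform(rows)
print(transformed)
```

Passing the same name in input_cols and output_cols overwrites the column in place, and drop_input_cols=True removes the original column from the result, mirroring the behavior described in the text.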
Constructing a Model¶
In addition to the parameters accepted by individual scikit-learn model classes, all modeling classes accept the following additional parameters at instantiation.
These parameters are all technically optional, but you will often want to specify input_cols, output_cols, or both. label_cols and sample_weight_col are required in specific situations noted in the table, but can be omitted in other cases.
Tip
All column names must follow the Snowflake identifier requirements. To preserve case or use special characters (besides dollar sign and underscore) when creating a table, column names must be wrapped in double quotes. Use all-caps column names whenever possible to maintain compatibility with case-sensitive Pandas DataFrames.
from snowflake.ml.modeling.preprocessing import MinMaxScaler
from snowflake.snowpark import Session
# Snowflake identifiers are not case sensitive by default.
# These column names will be automatically updated to ["COLUMN_1", "COLUMN_2", "COLUMN_3"] by the Snowpark DataFrame.
schema = ["column_1", "column_2", "column_3"]
df = session.create_dataframe([[1, 2, 3]], schema = schema)
df.show()
--------------------------------------
|"COLUMN_1" |"COLUMN_2" |"COLUMN_3"|
--------------------------------------
|1 |2 |3 |
--------------------------------------
# Identify the column names using the Snowflake identifier.
input_cols = ["COLUMN_1", "COLUMN_2", "COLUMN_3"]
mms = MinMaxScaler(input_cols=input_cols)
mms.fit(df)
# To maintain lower case column names, include a double quote within the string.
schema = ['"column_1"', '"column_2"', '"column_3"']
df = session.create_dataframe([[1, 2, 3]], schema = schema)
df.show()
----------------------------------------
|'"column_1"'|'"column_2"'|'"column_3"'|
----------------------------------------
|1 |2 |3 |
----------------------------------------
# Since no conversion took place, the schema labels can be used as the column identifiers.
mms = MinMaxScaler(input_cols=schema)
mms.fit(df)
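The identifier behavior shown in the example above can be summarized in a few lines. The helper below is a hypothetical illustration (resolve_identifier is not part of snowflake-ml-python or Snowpark); it assumes only the two rules the example demonstrates: unquoted names are stored in upper case, and names wrapped in double quotes keep their case.

```python
def resolve_identifier(name: str) -> str:
    """Return the column name as it would be stored, under the two rules above."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        # Double-quoted: case is preserved, the quotes themselves are not part of the name.
        return name[1:-1]
    # Unquoted: folded to upper case.
    return name.upper()


for raw in ["column_1", '"column_1"', "COLUMN_1"]:
    print(raw, "->", resolve_identifier(raw))
```

A check like this can help decide which spelling to pass in input_cols when the same column names are shared between a pandas DataFrame and a Snowpark DataFrame.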
Parameter | Description |
---|---|
input_cols | A string or list of strings representing column names that contain features. If you omit this parameter, all columns in the input DataFrame, except the columns specified by label_cols, sample_weight_col, and passthrough_cols, are considered input columns. |
label_cols | A string or list of strings representing the names of columns that contain labels. You must specify label columns for supervised estimators because inferring these columns is not possible. These label columns are used as targets for model predictions and should be clearly distinguished from input_cols. |
output_cols | A string or list of strings representing the names of columns that will store the output of predict and transform operations. If you omit this parameter, output column names are derived by adding an OUTPUT_ prefix to the label column names. To transform in place, pass the same names for input_cols and output_cols. |
passthrough_cols | A string or a list of strings indicating names of columns to exclude from training, transformation, and inference. Passthrough columns remain untouched between the input and output DataFrames. This option is helpful where you want to avoid using specific columns, such as index columns, during training or inference, but do not pass input_cols. |
sample_weight_col | A string representing the column name containing the examples’ weights. This argument is required for weighted datasets. |
drop_input_cols | A Boolean value indicating whether the input columns are removed from the result DataFrame. The default is False. |
Example¶
The DecisionTreeClassifier constructor does not have any required arguments in scikit-learn; all arguments have default values. So in scikit-learn, you might write:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
In Snowflake’s version of this class, you must specify the column names (or accept the defaults by not specifying them). In this example, they are explicitly specified.
You can initialize a DecisionTreeClassifier by passing the arguments directly to the constructor or by setting them as attributes of the model after instantiation. (The attributes may be changed at any time.)
As constructor arguments:
from snowflake.ml.modeling.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(
    input_cols=feature_column_names,
    label_cols=label_column_names,
    sample_weight_col=weight_column_name,
    output_cols=expected_output_column_names
)
By setting model attributes:
from snowflake.ml.modeling.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.set_input_cols(feature_column_names)
model.set_label_cols(label_column_names)
model.set_sample_weight_col(weight_column_name)
model.set_output_cols(output_column_names)
fit¶
The fit method of a Snowflake classifier takes a single Snowpark or Pandas DataFrame containing all columns, including features, labels, and weights. This is different from scikit-learn’s fit method, which takes separate inputs for features, labels, and weights.
In scikit-learn, the DecisionTreeClassifier.fit method call looks like this:
model.fit(
    X=df[feature_column_names], y=df[label_column_names], sample_weight=df[weight_column_name]
)
In Snowflake’s fit, you only need to pass the DataFrame. You have already set the input, label, and weight column names at initialization or by using setter methods, as shown in Constructing a Model.
model.fit(df)
predict¶
The predict method also takes a single Snowpark or Pandas DataFrame containing all feature columns. The result is a DataFrame that contains all the columns in the input DataFrame unchanged and the output columns appended. You must extract the output columns from this DataFrame. This is different from the predict method in scikit-learn, which returns only the results.
Example¶
In scikit-learn, predict returns only the prediction results:
prediction_results = model.predict(X=df[feature_column_names])
To get only the prediction results in Snowflake’s predict, extract the output columns from the returned DataFrame. Here, output_column_names is a list containing the names of the output columns:
prediction_results = model.predict(df)[output_column_names]
Distributed Training and Inference with SPCS¶
When running in a Snowflake Notebook on Snowpark Container Services (SPCS), model training and inference for these modeling classes are executed on the underlying compute cluster, not in a warehouse, and are transparently distributed across all nodes in the cluster to employ all available compute capability.
Preprocessing and metrics operations are pushed down to the warehouse. Many preprocessing classes support distributed execution when run in the warehouse; see Distributed Preprocessing.
Distributed Preprocessing¶
Many Snowflake data preprocessing and transformation functions are implemented using Snowflake’s distributed execution engine, which yields significant performance benefit compared to single-node execution (that is, stored procedures). To find out which functions support distributed execution, see Snowflake Modeling Classes.
The chart below shows illustrative performance numbers on large public datasets, running in a medium Snowpark-optimized warehouse, comparing scikit-learn running in stored procedures to Snowflake’s distributed implementations. In many scenarios, your code can run 25 to 50 times faster when using Snowflake modeling classes.
How Fits Are Distributed¶
The fit method of a Snowflake preprocessing transformer accepts a Snowpark or pandas DataFrame, fits the dataset, and returns the fitted transformer.
For Snowpark DataFrames, distributed fitting uses the SQL engine. The transformer generates SQL queries to compute the necessary states (such as mean, maximum, or count). These queries are then executed by Snowflake, and the results are materialized locally. For complex states that cannot be computed in SQL, the transformer fetches intermediate results from Snowflake and performs local computations over metadata.
For complex transformers that require temporary state tables during transformation (for example, OneHotEncoder or OrdinalEncoder), these tables are represented locally using pandas DataFrames.
pandas DataFrames are fitted locally, similar to fitting with scikit-learn. The transformer creates a corresponding scikit-learn transformer with the provided parameters. Then the scikit-learn transformer is fitted, and the Snowflake transformer derives necessary states from the scikit-learn object.
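The idea of fitting through the SQL engine can be sketched as follows. This is a minimal illustration, not the SQL that Snowflake actually generates: the helper names are hypothetical, and an in-memory SQLite database stands in for the warehouse so the sketch runs anywhere. It shows a scaler-style fit that pushes the aggregation to the database and brings back only a handful of numbers.

```python
import sqlite3
from typing import Dict, List, Tuple


def build_min_max_query(table: str, cols: List[str]) -> str:
    """Build one aggregate query that returns MIN and MAX for every column."""
    parts = []
    for col in cols:
        parts.append(f'MIN("{col}")')
        parts.append(f'MAX("{col}")')
    return f'SELECT {", ".join(parts)} FROM "{table}"'


def fit_min_max(conn: sqlite3.Connection, table: str, cols: List[str]) -> Dict[str, Tuple[float, float]]:
    """Run the aggregate in the database; only one small row is materialized locally."""
    row = conn.execute(build_min_max_query(table, cols)).fetchone()
    return {col: (row[2 * i], row[2 * i + 1]) for i, col in enumerate(cols)}


# SQLite stands in for the SQL engine in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "FEATURES" ("X1" REAL, "X2" REAL)')
conn.executemany(
    'INSERT INTO "FEATURES" VALUES (?, ?)',
    [(1.0, 10.0), (3.0, 30.0), (5.0, 20.0)],
)

state = fit_min_max(conn, "FEATURES", ["X1", "X2"])
print(state)
```

The fitted state is tiny regardless of table size, which is why this style of fit scales with the engine rather than with client memory.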
How Transforms Are Distributed¶
The transform method of a preprocessing transformer accepts a Snowpark or Pandas DataFrame, transforms the dataset, and returns a transformed dataset.
For Snowpark DataFrames, distributed transformation is performed using the SQL engine. The fitted transformer generates a Snowpark DataFrame with underlying SQL queries representing the transformed dataset. The transform method performs lazy evaluation for simple transforms (for example, StandardScaler or MinMaxScaler), so that no transform is actually performed during the transform method.
However, certain complex transforms involve execution. This includes transformers that require temporary state tables (such as OneHotEncoder and OrdinalEncoder) during transformation. For such a transformer, the transformer creates a temporary table from the Pandas DataFrame (which stores the state of the object) for joins and other operations.
Furthermore, when certain parameters are set, for example when the transformer is set to handle unknown values found during transformation by raising errors, the transformer materializes the data, including columns, unknown values, and so forth.
Pandas DataFrames are transformed locally, similar to transformation with scikit-learn. The transformer creates a corresponding scikit-learn transformer using the to_sklearn API and performs the transform in memory.
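Lazy evaluation of a simple transform can be illustrated with a small sketch. LazyQuery and min_max_transform are hypothetical names, and SQLite again stands in for the SQL engine; the point is only that transform builds a query, and nothing runs until the result is requested. The scaling expression is the usual min-max formula, (x - min) / (max - min).

```python
import sqlite3
from typing import Dict, List, Tuple


class LazyQuery:
    """Holds SQL text; nothing is executed until collect() is called."""

    def __init__(self, conn: sqlite3.Connection, sql: str) -> None:
        self.conn = conn
        self.sql = sql

    def collect(self) -> List[tuple]:
        return self.conn.execute(self.sql).fetchall()


def min_max_transform(
    conn: sqlite3.Connection, table: str, state: Dict[str, Tuple[float, float]]
) -> LazyQuery:
    """Build (but do not run) a query that appends a scaled copy of each column."""
    # Assumes hi != lo for every column; a real implementation must handle constant columns.
    exprs = [
        f'("{col}" - {lo}) / ({hi} - {lo}) AS "{col}_OUT"'
        for col, (lo, hi) in state.items()
    ]
    return LazyQuery(conn, f'SELECT *, {", ".join(exprs)} FROM "{table}"')


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "FEATURES" ("X1" REAL)')
conn.executemany('INSERT INTO "FEATURES" VALUES (?)', [(1.0,), (3.0,), (5.0,)])

state = {"X1": (1.0, 5.0)}  # (min, max) learned during an earlier fit
lazy = min_max_transform(conn, "FEATURES", state)

# A row added after the transform call still appears, because the query has not run yet.
conn.execute('INSERT INTO "FEATURES" VALUES (2.0)')
rows = sorted(lazy.collect())
print(rows)
```

Transforms that need a state table, such as the encoders mentioned above, cannot stay purely lazy in this way because the state must first be written somewhere the join can reach.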
Distributed Hyperparameter Optimization¶
Hyperparameter tuning is an integral part of the data science workflow. The Snowflake API provides distributed implementations of the scikit-learn GridSearchCV and RandomizedSearchCV APIs to enable efficient hyperparameter tuning on both single-node and multiple-node warehouses.
Tip
Snowflake enables distributed hyperparameter optimization by default. To disable it, use the following Python import.
import snowflake.ml.modeling.parameters.disable_distributed_hpo
The smallest Snowflake virtual warehouse (XS) or Snowpark-optimized warehouse (M) has one node. Each successively larger size doubles the number of nodes.
For single-node (XS) warehouses, the full capacity of the node is utilized by default using scikit-learn’s joblib multiprocessing framework.
Tip
Each fit operation requires its own copy of the training dataset loaded into RAM. To process extremely large datasets, disable distributed hyperparameter optimization (with import snowflake.ml.modeling.parameters.disable_distributed_hpo) and set the n_jobs parameter to 1 to minimize concurrency.
For multiple-node warehouses, the fit operations within your cross-validation tuning job are distributed across the nodes. No code changes are required to scale up. Estimator fits are executed in parallel across all available cores on all nodes in the warehouse.
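The doubling rule stated above can be turned into a quick estimate of how many nodes a tuning job can spread across. The function below is a hypothetical helper that applies only the rule given in this section (one node at the smallest size, doubling with each size step); the size abbreviations are chosen for this sketch.

```python
# Size labels ordered from smallest to largest (abbreviations chosen for this sketch).
SIZES = ["XS", "S", "M", "L", "XL", "2XL", "3XL", "4XL"]


def node_count(size: str, snowpark_optimized: bool = False) -> int:
    """Apply the doubling rule: the smallest size has one node, each step up doubles it."""
    smallest = "M" if snowpark_optimized else "XS"
    steps = SIZES.index(size) - SIZES.index(smallest)
    if steps < 0:
        raise ValueError(f"{size} is below the smallest size considered here ({smallest})")
    return 2 ** steps


print(node_count("XS"))                            # standard warehouse, smallest size
print(node_count("L"))                             # three steps above XS
print(node_count("XL", snowpark_optimized=True))   # two steps above M
```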
As an illustration, consider the California housing dataset (https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset) provided with the scikit-learn library. The data includes 20,640 rows of data with the following information:
MedInc: Median income in the block group
HouseAge: Median house age in the block group
AveRooms: Average number of rooms per household
AveBedrms: Average number of bedrooms per household
Population: The block group population
AveOccup: Average number of household members
Latitude and Longitude
The target of the dataset is the median house value, expressed in hundreds of thousands of dollars.
In this example, we do grid search cross-validation on a random forest regressor to find the best hyperparameter combination to predict the median house value.
from snowflake.ml.modeling.ensemble.random_forest_regressor import RandomForestRegressor
from snowflake.ml.modeling.model_selection.grid_search_cv import GridSearchCV
from snowflake.snowpark import DataFrame
from sklearn import datasets
# Assumes an existing Snowpark session named `session`.
def load_housing_data() -> DataFrame:
    input_df_pandas = datasets.fetch_california_housing(as_frame=True).frame
    # Set the columns to be upper case for consistency with Snowflake identifiers.
    input_df_pandas.columns = [c.upper() for c in input_df_pandas.columns]
    input_df = session.create_dataframe(input_df_pandas)
    return input_df
input_df = load_housing_data()
# Use all the columns besides the median value as the features
input_cols = [c for c in input_df.columns if not c.startswith("MEDHOUSEVAL")]
# Set the target median value as the only label columns
label_cols = [c for c in input_df.columns if c.startswith("MEDHOUSEVAL")]
DISTRIBUTIONS = dict(
    max_depth=[80, 90, 100, 110],
    min_samples_leaf=[1,3,10],
    min_samples_split=[1.0, 3,10],
    n_estimators=[100,200,400]
)
estimator = RandomForestRegressor()
n_folds = 5
clf = GridSearchCV(estimator=estimator, param_grid=DISTRIBUTIONS, cv=n_folds, input_cols=input_cols, label_cols=label_cols)
clf.fit(input_df)
This example runs in just over 7 minutes on a Medium (single node) Snowpark-optimized warehouse, and takes just 3 minutes to run on an X-Large warehouse.
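To get a feel for how much work this grid search distributes, the following sketch counts the candidate combinations and the resulting number of cross-validation fits. It uses only the standard library and the grid from the example above; it assumes exhaustive search, where every combination is fitted once per fold (a final refit on the full data, if enabled, would add one more fit).

```python
from itertools import product

# Same grid and fold count as in the example above.
param_grid = {
    "max_depth": [80, 90, 100, 110],
    "min_samples_leaf": [1, 3, 10],
    "min_samples_split": [1.0, 3, 10],
    "n_estimators": [100, 200, 400],
}
n_folds = 5

# Every combination of values is one candidate; each candidate is fitted once per fold.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)
```

Because these fits are independent of one another, they are the unit of work that can be spread across cores and nodes.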
Deploying and Running Your Model¶
The result of training a model is a Python model object. You can use the trained model to make predictions by calling the model’s predict method. This creates a temporary user-defined function to run the model in your Snowflake virtual warehouse. This function is automatically deleted at the end of your Snowflake session (for example, when your script ends or when you close your notebook).
To keep the user-defined function after your session ends, you can create it manually. See the Quickstart (https://github.com/Snowflake-Labs/sfguide-getting-started-machine-learning/blob/main/hol/2_1_DEMO_model_building_scoring.ipynb) on the topic for further information.
The Snowflake model registry also supports persistent models and makes finding and deploying them easier. See Snowflake Model Registry.
Partitioned Custom Models¶
The model registry also supports a special type of custom model where fit and inference are executed in parallel for a set of partitions. This can be a performant way to create many models at once from one dataset and execute inference immediately. Please see Snowflake Model Registry: Partitioned Models for more details.
Pipeline for Multiple Transformations¶
With scikit-learn, it is common to run a series of transformations using a pipeline. scikit-learn pipelines do not work with Snowflake classes, so a Snowflake version of sklearn.pipeline.Pipeline is provided for running a series of transformations. This class is in the snowflake.ml.modeling.pipeline package, and it works the same as the scikit-learn version.
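A minimal sketch of such a pipeline is shown below. It assumes snowflake-ml-python is installed and that the MinMaxScaler and XGBRegressor classes from the modeling API suit your data; the function name build_pipeline, the step names, and the column names in the usage comment are hypothetical.

```python
def build_pipeline(feature_cols, label_col):
    """Return an unfitted pipeline that scales the features, then fits a regressor."""
    # Imported inside the function so that this sketch can be loaded even
    # where snowflake-ml-python is not installed.
    from snowflake.ml.modeling.pipeline import Pipeline
    from snowflake.ml.modeling.preprocessing import MinMaxScaler
    from snowflake.ml.modeling.xgboost import XGBRegressor

    return Pipeline(
        steps=[
            # Scale each feature column in place (output columns reuse the input names).
            ("scaler", MinMaxScaler(input_cols=feature_cols, output_cols=feature_cols)),
            # Train the regressor on the scaled feature columns.
            ("regressor", XGBRegressor(input_cols=feature_cols, label_cols=[label_col])),
        ]
    )


# Hypothetical usage with a Snowpark DataFrame named input_df:
# pipeline = build_pipeline(["COLUMN_1", "COLUMN_2"], "TARGET")
# pipeline.fit(input_df)
# predictions = pipeline.predict(input_df)
```

As with scikit-learn, the steps run in order, so the final estimator sees the output of the preceding transformers.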
Retrieving the Underlying Model¶
Snowflake ML models can be “unwrapped,” that is, converted to the underlying third-party model types, with the following methods (depending on the library):
to_sklearn
to_xgboost
to_lightgbm
All attributes and methods of the underlying model can then be accessed and run locally against the estimator. For example, in the GridSearchCV example, we convert the grid search estimator to a scikit-learn object in order to retrieve the best score.
best_score = grid_search_cv.to_sklearn().best_score_
Known Limitations¶
Snowflake estimators and transformers do not currently support sparse inputs or sparse responses. If you have sparse data, convert it to a dense format before passing it to Snowflake’s estimators or transformers.
The snowflake-ml-python package does not currently support matrix data types. Any operation on estimators and transformers that would produce a matrix as a result fails.
The order of rows in result data is not guaranteed to match the order of rows in input data.
Snowflake ML does not yet support pandas on Snowflake DataFrames. Convert the pandas on Snowflake DataFrame to a Snowpark DataFrame to use it with the Snowflake modeling classes. The following example converts a DataFrame that has been read from a Snowflake table:
import modin.pandas as pd
import snowflake.snowpark.modin.plugin
from snowflake.ml.modeling.xgboost import XGBClassifier

# Read the table into a pandas on Snowflake DataFrame
snowpark_pandas_df: pd.DataFrame = pd.read_snowflake('MY_TABLE')

# Converting to a Snowpark DataFrame adds an index column
index_label_name = "_INDEX"
snowpark_df = snowpark_pandas_df.to_snowpark(index=True, index_label=index_label_name)
snowpark_df.show()
The resulting Snowpark DataFrame is as follows:
--------------------------------------------------
|"COLUMN_1"  |"COLUMN_2"  |"TARGET"  |"_INDEX"   |
--------------------------------------------------
|1           |2           |3         |1          |
--------------------------------------------------
The DataFrame can then be used to train an XGBoost classifier as follows:
# Identify the feature columns using their Snowflake identifiers
input_cols = ["COLUMN_1", "COLUMN_2"]

# Pass through the _INDEX column rather than using it for training
xgb_clf = XGBClassifier(input_cols=input_cols, passthrough_cols=index_label_name, label_cols="TARGET")
xgb_clf.fit(snowpark_df)
Troubleshooting¶
Adding More Detail to Logging¶
The Snowflake modeling library uses Snowpark Python’s logging. By default, snowflake-ml-python logs INFO level messages to standard output. To get more detailed logs, you can change the level to one of the supported levels (https://docs.python.org/3/library/logging.html#logging-levels).
DEBUG produces the most detailed logs. To set the logging level to DEBUG:
import logging, sys
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
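Setting DEBUG on the root logger also makes every other library verbose. As a hedged alternative, the sketch below raises the level only for loggers in the "snowflake" namespace. It assumes the Snowflake libraries name their loggers after their modules (for example "snowflake.snowpark"), which is common Python practice but is not confirmed on this page.

```python
import logging
import sys

# Send log records to standard output, keeping other libraries at WARNING.
logging.basicConfig(stream=sys.stdout, level=logging.WARNING)

# Assumption: Snowflake libraries create their loggers under the "snowflake"
# namespace, so setting the level on this parent logger covers all of them.
snowflake_logger = logging.getLogger("snowflake")
snowflake_logger.setLevel(logging.DEBUG)

# A child logger with no level of its own inherits DEBUG from the parent.
child_logger = logging.getLogger("snowflake.snowpark")
print(logging.getLevelName(child_logger.getEffectiveLevel()))
```

If a library sets an explicit level on its own logger, that setting takes precedence over the inherited one.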
Solutions to Common Issues¶
The following table provides some suggestions for solving possible problems with Snowflake ML Modeling.
Problem or error message | Possible cause | Resolution |
---|---|---|
NameError, such as “name x is not defined,” ImportError, or ModuleNotFoundError | Typographical error in module or class name, or | Refer to the modeling classes table for the correct module and class name. Ensure that |
KeyError (“not in index” or “none of [Index[..]] are in the [columns]”) | Incorrect column name. | Check and correct the column name. |
SnowparkSQLException, “does not exist or not authorize” | Table does not exist, or you do not have sufficient privileges on the table. | Ensure that the table exists and that the user’s role has the privileges. |
SnowparkSQLException, “invalid identifier PETALLENGTH” | Incorrect number of columns (usually a missing column). | Check the number of columns specified when you created the model class, and ensure that you are passing the right number. |
InvalidParameterError | An inappropriate type or value has been passed as a parameter. | Check the class’s or method’s help using the |
TypeError, “unexpected keyword argument” | Typographical error in named argument. | Check the class’s or method’s help using the |
ValueError, “array with 0 sample(s)” | The dataset that was passed in is empty. | Ensure that the dataset is not empty. |
SnowparkSQLException, “authentication token has expired” | The session has expired. | If you’re using a Jupyter notebook, restart the kernel to create a new session. |
ValueError, such as “cannot convert string to float” | Data type mismatch. | Check the class’s or method’s help using the |
SnowparkSQLException, “cannot create temporary table” | A model class is being used inside a stored procedure that doesn’t run with the caller’s rights. | Create the stored procedure with the caller’s rights instead of with the owner’s rights. |
SnowparkSQLException, “function available memory exceeded” | Your data set is larger than 5 GB in a standard warehouse. | Switch to a Snowpark-optimized warehouse. |
OSError, “no space left on device” | Your model is larger than about 500 MB in a standard warehouse. | Switch to a Snowpark-optimized warehouse. |
Incompatible xgboost version or error when importing xgboost | You installed using | Upgrade or downgrade the package as requested by the error message. |
AttributeError involving | An attempt to use one of these methods on a model of a different type. | Use |
Jupyter notebook kernel crashes on an arm-based Mac (M1 or M2 chip): “The Kernel crashed while executing code in the current cell or a previous cell.” | XGBoost or another library is installed with the incorrect architecture. | Recreate new conda environment with |
“lightgbm.basic.LightGBMError: (0000) Do not support special JSON characters in feature name.” | LightGBM doesn’t support double quoted column names in | Rename the columns in your Snowpark DataFrames. Replacing non-alphanumeric characters with underscores is sufficient in most cases. The following Python helper function may be useful: def fix_values(F, column): return F.upper(F.regexp_replace(F.col(column), "[^a-zA-Z0-9]+", "_")) |
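For the LightGBM feature-name error in the last row of the table, one possible approach is to compute cleaned column names in plain Python before renaming. The sketch below is illustrative only: sanitize_column_name and build_rename_map are hypothetical helpers, not part of snowflake-ml-python, and they operate on column names rather than on column values.

```python
import re


def sanitize_column_name(name: str) -> str:
    """Replace each run of non-alphanumeric characters with one underscore."""
    cleaned = re.sub(r"[^a-zA-Z0-9]+", "_", name)
    # Drop underscores left at the ends by surrounding quotes or brackets.
    return cleaned.strip("_").upper()


def build_rename_map(columns):
    """Map each column that needs renaming to its sanitized name."""
    mapping = {}
    seen = set()
    for old_name in columns:
        new_name = sanitize_column_name(old_name)
        if new_name in seen:
            # Two different columns would end up with the same name.
            raise ValueError(f"Sanitized name {new_name!r} is not unique")
        seen.add(new_name)
        if new_name != old_name:
            mapping[old_name] = new_name
    return mapping


rename_map = build_rename_map(["SEPAL_WIDTH", '"petal length (cm)"'])
print(rename_map)

# Hypothetical usage with a Snowpark DataFrame named df:
# for old_name, new_name in rename_map.items():
#     df = df.with_column_renamed(old_name, new_name)
```

The helper raises an error rather than silently merging two columns whose cleaned names collide; names that consist only of special characters would need separate handling.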
Further Reading¶
See the documentation of the original libraries for complete information on their functionality.
Scikit-Learn (https://scikit-learn.org/stable/modules/classes.html)
XGBoost (https://xgboost.readthedocs.io/en/stable/python/index.html)
LightGBM (https://lightgbm.readthedocs.io/en/stable/Python-API.html)
Acknowledgement¶
Some parts of this document are derived from the Scikit-learn documentation, which is licensed under the BSD-3 “New” or “Revised” license and Copyright © 2007-2023 The scikit-learn developers. All rights reserved.
Some parts of this document are derived from the XGBoost documentation, which is covered by Apache License 2.0, January 2004 and Copyright © 2019. All rights reserved.
Some parts of this document are derived from the LightGBM documentation, which is MIT-licensed and Copyright © Microsoft Corp. All rights reserved.