Engineer features

Snowflake ML lets you transform raw data into features that machine learning models can consume efficiently. You can transform data using several approaches, each suited to different scales and requirements:

  • Open Source Software (OSS) preprocessors - For small to medium datasets and quick prototyping, use familiar Python ML libraries that run locally or on single nodes within Container Runtime.
  • Snowflake ML Preprocessors - For larger datasets, use Snowflake ML’s preprocessing APIs that execute natively on the Snowflake platform. These APIs distribute the processing across warehouse compute resources.
  • Ray map_batches - For highly customizable, large-scale processing, especially with unstructured data, use Ray's map_batches API for parallel, resource-managed execution across single-node or multi-node Container Runtime environments.

Choose the approach that best matches your data size, performance requirements, and need for custom transformation logic.

The following table compares the three main approaches to feature engineering in Snowflake ML:

| Feature/Aspect | OSS (including scikit-learn) | Snowflake ML preprocessors | Ray map_batches |
| --- | --- | --- | --- |
| Scale | Small and medium datasets | Large/distributed data | Large/distributed data |
| Execution Environment | In memory | Pushdown to the default warehouse that you’re using to run SQL queries | Across nodes in a compute pool |
| Compute Resources | Snowpark Container Services (Compute Pool) | Warehouse | Snowpark Container Services (Compute Pool) |
| Integration | Standard Python ML ecosystem | Integrates natively with Snowflake ML | Both with Python ML and Snowflake |
| Performance | Fast for local, in-memory workloads; scale limited and non-distributed | Designed for scalable, distributed feature engineering | Highly parallel and resource-managed; excels on large/unstructured data |
| Use Case Suitability | Quick prototyping and experimentation | Production workflows with large datasets | Large data workflows that require custom resource controls |

The following examples demonstrate how to implement feature transformations using each approach:

Use the following code to build a scikit-learn preprocessing pipeline for local, in-memory workflows:

import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Load your data locally into a Pandas DataFrame
df = pd.DataFrame({
    'age': [34, 23, 54, 31],
    'city': ['SF', 'NY', 'SF', 'LA'],
    'income': [120000, 95000, 135000, 99000]
})

# Define preprocessing steps
numeric_features = ['age', 'income']
numeric_transformer = StandardScaler()

categorical_features = ['city']
categorical_transformer = OneHotEncoder()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor)
])

# Preprocess the data
X_processed = pipeline.fit_transform(df)
print(X_processed)
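
To run the same kinds of transformations at scale, you can use Snowflake ML's distributed preprocessors from snowflake.ml.modeling.preprocessing, which push processing down to the warehouse. The following is a minimal sketch, assuming an active Snowpark session (obtained here with Session.builder.getOrCreate()); the table and column names are illustrative:

from snowflake.snowpark import Session
from snowflake.ml.modeling.preprocessing import StandardScaler, OneHotEncoder

# Reuse an existing Snowpark session; connection configuration is environment-specific
session = Session.builder.getOrCreate()

# Load a Snowflake table as a Snowpark DataFrame (table name is illustrative)
df = session.table("CUSTOMER_FEATURES")

# Scale numeric columns; the computation is pushed down to the warehouse
scaler = StandardScaler(
    input_cols=["AGE", "INCOME"],
    output_cols=["AGE_SCALED", "INCOME_SCALED"]
)
df = scaler.fit(df).transform(df)

# One-hot encode the categorical column
encoder = OneHotEncoder(
    input_cols=["CITY"],
    output_cols=["CITY_ENCODED"]
)
df = encoder.fit(df).transform(df)

df.show()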
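
For custom large-scale transformations in Container Runtime, you can apply a user-defined function to batches of rows in parallel with Ray's map_batches. The following sketch builds a small in-memory Ray Dataset for illustration; in practice the data would come from a much larger source, and the derived feature shown is a hypothetical example of custom transformation logic:

import pandas as pd
import ray

# Build a Ray Dataset; in practice this would come from a much larger source
df = pd.DataFrame({
    'age': [34, 23, 54, 31],
    'city': ['SF', 'NY', 'SF', 'LA'],
    'income': [120000, 95000, 135000, 99000]
})
ds = ray.data.from_pandas(df)

def add_features(batch: pd.DataFrame) -> pd.DataFrame:
    # Row-wise feature derivation; replace with your own transformation logic
    batch['income_per_year_of_age'] = batch['income'] / batch['age']
    return batch

# Apply the function to batches in parallel across the available nodes
ds = ds.map_batches(add_features, batch_format="pandas")
print(ds.to_pandas())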