snowflake.ml.modeling.preprocessing.OneHotEncoder¶

class snowflake.ml.modeling.preprocessing.OneHotEncoder(*, categories: Union[str, list[Union[numpy.ndarray[Any, numpy.dtype[numpy.int64]], numpy.ndarray[Any, numpy.dtype[numpy.float64]], numpy.ndarray[Any, numpy.dtype[numpy.str_]], numpy.ndarray[Any, numpy.dtype[numpy.bool_]]]], dict[str, Union[numpy.ndarray[Any, numpy.dtype[numpy.int64]], numpy.ndarray[Any, numpy.dtype[numpy.float64]], numpy.ndarray[Any, numpy.dtype[numpy.str_]], numpy.ndarray[Any, numpy.dtype[numpy.bool_]]]]] = 'auto', drop: Optional[Union[_SupportsArray[dtype[Any]], _NestedSequence[_SupportsArray[dtype[Any]]], bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]]] = None, sparse: bool = False, handle_unknown: str = 'error', min_frequency: Optional[Union[int, float]] = None, max_categories: Optional[int] = None, input_cols: Optional[Union[str, Iterable[str]]] = None, output_cols: Optional[Union[str, Iterable[str]]] = None, passthrough_cols: Optional[Union[str, Iterable[str]]] = None, drop_input_cols: Optional[bool] = False)¶

Bases: BaseTransformer

Encode categorical features as a one-hot numeric array.

The feature is converted to a matrix containing a column for each category. For each row, a column is 1 if the row's value belongs to that category, or 0 otherwise. The categories can be detected from the data, or you can provide them. If you provide the categories, you can handle unknown categories in one of several different ways (see handle_unknown parameter below).
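The encoding described above can be sketched in plain Python. This is only an illustration of the concept, not the library's implementation; the helper name one_hot_dense is hypothetical, and sorting the detected categories is an assumption made for readability.

```python
def one_hot_dense(values, categories=None):
    """Return (categories, rows); each row holds one 0/1 entry per category."""
    # Detect categories from the data when none are provided,
    # loosely analogous to categories='auto'.
    if categories is None:
        categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # a KeyError here means the category is unknown
        rows.append(row)
    return categories, rows


cats, encoded = one_hot_dense(["red", "green", "red", "blue"])
print(cats)     # ['blue', 'green', 'red']
print(encoded)  # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```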

Categories that do not appear frequently in a feature may be consolidated into a pseudo-category called “infrequent.” The threshold below which a category is considered “infrequent” is configurable using the min_frequency parameter.

It is useful to drop one category from features in situations where perfectly collinear features cause problems, such as when feeding the resulting data into an unregularized linear regression model. However, dropping a category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models. You can choose from a handful of strategies for specifying the category to be dropped. See drop parameter below.

The results of one-hot encoding can be represented in two ways.

- Dense representation creates a binary column for each category. For each row and input feature, at most one column contains a 1 (exactly one when no category is dropped and no unknown category is ignored).
- Sparse representation creates a compressed sparse row (CSR) matrix that indicates which columns contain a nonzero value in each row. Because nearly all columns in a row contain zeroes, this is an efficient way to represent the results.
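To make the difference between the two representations concrete, here is a minimal sketch that converts a small dense one-hot matrix into the three arrays of the CSR layout (data, indices, indptr). The helper dense_to_csr is hypothetical and written in plain Python for illustration; it is not how the library builds its sparse output.

```python
def dense_to_csr(rows):
    """Convert a list of dense rows into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)   # the nonzero value itself
                indices.append(col)  # the column where it occurs
        indptr.append(len(data))     # where the next row starts in `data`
    return data, indices, indptr


dense = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
data, indices, indptr = dense_to_csr(dense)
print(data)     # [1, 1, 1]
print(indices)  # [2, 1, 0]
print(indptr)   # [0, 1, 2, 3]
```

The dense form stores nine values here, while the CSR form stores one value and one column index per row plus the row pointers, which is why it scales better when a feature has many categories.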

The order of the input columns is preserved as the order of features.

For more details on what this transformer does, see sklearn.preprocessing.OneHotEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

Parameters:

categories –
‘auto’, list of array-like, or dict {column_name: np.ndarray([category])}, default=‘auto’
Categories (unique values) per feature:
- ‘auto’: Determine categories automatically from the training data.
- list: categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.
- dict: categories[column_name] holds the categories expected in the column provided. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.
The used categories can be found in the categories_ attribute.
drop –
{‘first’, ‘if_binary’} or an array-like of shape (n_features,), default=None
Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into an unregularized linear regression model. However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.
- None: retain all features (the default).
- ‘first’: drop the first category in each feature. If only one category is present, the feature will be dropped entirely.
- ‘if_binary’: drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact.
- array: drop[i] is the category in feature input_cols[i] that should be dropped.
When max_categories or min_frequency is configured to group infrequent categories, the dropping behavior is handled after the grouping.
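The drop strategies listed above can be illustrated for a single feature with a short plain-Python sketch. The helpers drop_index and encode are hypothetical, and the explicit-category case takes one value rather than an array because only one feature is modeled; this is an assumption for brevity, not the library's code.

```python
def drop_index(categories, drop):
    """Return the index of the category to drop, or None to keep all of them."""
    if drop is None:
        return None
    if drop == "first":
        return 0
    if drop == "if_binary":
        return 0 if len(categories) == 2 else None
    # Otherwise treat `drop` as the explicit category value to remove.
    return categories.index(drop)


def encode(value, categories, drop=None):
    """One-hot encode `value`, omitting the dropped category's column."""
    idx = drop_index(categories, drop)
    kept = [c for i, c in enumerate(categories) if i != idx]
    return [1 if value == c else 0 for c in kept]


print(encode("b", ["a", "b", "c"], drop="first"))      # [1, 0]
print(encode("a", ["a", "b", "c"], drop="first"))      # [0, 0]
print(encode("no", ["no", "yes"], drop="if_binary"))   # [0]
print(encode("a", ["a", "b", "c"], drop="if_binary"))  # [1, 0, 0]
print(encode("c", ["a", "b", "c"], drop="b"))          # [0, 1]
```

Note that a row whose value is the dropped category encodes to all zeros, which is what removes the perfect collinearity between the output columns.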
sparse – bool, default=False If True, returns a single column with a sparse representation; otherwise, returns a separate column for each category.
handle_unknown –
{‘error’, ‘ignore’}, default=‘error’
Specifies the way unknown categories are handled during transform().
- ‘error’: Raise an error if an unknown category is present during transform.
- ‘ignore’: When an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros.
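The two handle_unknown modes can be sketched for a single value as follows. The function encode_value is a hypothetical plain-Python stand-in, and the exception type it raises is an assumption; the library may raise a different error.

```python
def encode_value(value, categories, handle_unknown="error"):
    """One-hot encode `value` against a fixed list of known categories."""
    if value not in categories:
        if handle_unknown == "error":
            raise ValueError(f"Unknown category: {value!r}")
        if handle_unknown == "ignore":
            # Unknown categories produce an all-zero row for this feature.
            return [0] * len(categories)
        raise ValueError(f"Unsupported handle_unknown: {handle_unknown!r}")
    return [1 if value == c else 0 for c in categories]


known = ["cat", "dog"]
print(encode_value("dog", known))                           # [0, 1]
print(encode_value("bird", known, handle_unknown="ignore"))  # [0, 0]
try:
    encode_value("bird", known)
except ValueError as exc:
    print(exc)  # Unknown category: 'bird'
```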
min_frequency –
int or float, default=None
Specifies the minimum frequency below which a category will be considered infrequent.
- If int, categories that occur fewer than min_frequency times will be considered infrequent.
- If float, categories that occur fewer than min_frequency * n_samples times will be considered infrequent.
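As a rough illustration of the int and float forms of min_frequency, the following plain-Python sketch counts occurrences and flags categories that fall below the threshold. The helper infrequent_categories is hypothetical and only mirrors the rule stated above.

```python
from collections import Counter


def infrequent_categories(values, min_frequency):
    """Return the sorted categories whose count is below the threshold."""
    counts = Counter(values)
    if isinstance(min_frequency, float):
        # A float is interpreted as a fraction of the number of samples.
        threshold = min_frequency * len(values)
    else:
        # An int is interpreted as an absolute count.
        threshold = min_frequency
    return sorted(cat for cat, n in counts.items() if n < threshold)


values = ["a"] * 6 + ["b"] * 3 + ["c"] * 1  # 10 samples in total
print(infrequent_categories(values, 2))    # ['c']
print(infrequent_categories(values, 0.4))  # ['b', 'c']
```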
max_categories – int, default=None Specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If there are infrequent categories, max_categories includes the category representing the infrequent categories along with the frequent categories. If None, there is no limit to the number of output features.
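One way to picture the max_categories limit is shown below: when a feature has more categories than the limit, the most frequent max_categories - 1 are kept and the rest share the single infrequent slot. The helper limit_categories is hypothetical, and its tie-breaking between equally frequent categories is an assumption that may differ from the library's behavior.

```python
from collections import Counter


def limit_categories(values, max_categories=None):
    """Split categories into (frequent, infrequent) under a max_categories cap."""
    counts = Counter(values)
    if max_categories is None or len(counts) <= max_categories:
        return sorted(counts), []  # no limit applies
    keep = max_categories - 1      # one output slot is reserved for "infrequent"
    ranked = [cat for cat, _ in counts.most_common()]
    return sorted(ranked[:keep]), sorted(ranked[keep:])


values = ["a"] * 6 + ["b"] * 3 + ["c", "d"]
frequent, infrequent = limit_categories(values, max_categories=3)
print(frequent)    # ['a', 'b']
print(infrequent)  # ['c', 'd']
```

With max_categories=3 the feature yields three output columns: one each for a and b, plus one for the grouped infrequent categories.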
input_cols – Optional[Union[str, List[str]]], default=None The name(s) of one or more columns in the input DataFrame containing feature(s) to be encoded. Input columns must be specified before fit with this argument or after initialization with the set_input_cols method. This argument is optional for API consistency.
output_cols –
Optional[Union[str, List[str]]], default=None The prefix to be used for encoded output for each input column. The number of output column prefixes specified must match the number of input columns. Output column prefixes must be specified before transform with this argument or after initialization with the set_output_cols method.

Note: Dense output column names are case-sensitive and resolve identifiers following Snowflake rules, e.g. “PREFIX_a”, PREFIX_A, “prefix_A”. Therefore, there is no need to provide double-quoted column names as that would result in invalid identifiers.
passthrough_cols – Optional[Union[str, List[str]]] A string or a list of strings indicating column names to be excluded from any operations (such as train, transform, or inference). The specified column(s) remain untouched throughout the process. This option is helpful when input_cols are inferred automatically but specific columns, such as index columns, must not be used during training or inference.
drop_input_cols – Optional[bool], default=False If True, removes the input columns from the output.

categories_¶: dict {column_name: ndarray([category])} The categories of each feature determined during fitting.

drop_idx_¶

ndarray([index]) of shape (n_features,)
- drop_idx_[i] is the index in _categories_list[i] of the category to be dropped for each feature.
- drop_idx_[i] = None if no category is to be dropped from the feature with index i, e.g. when drop=‘if_binary’ and the feature isn’t binary.
- drop_idx_ = None if all the transformed features will be retained.

If infrequent categories are enabled by setting min_frequency or max_categories to a non-default value and drop_idx_[i] corresponds to an infrequent category, then the entire infrequent category is dropped.

infrequent_categories_¶: list [ndarray([category])] Defined only if infrequent categories are enabled by setting min_frequency or max_categories to a non-default value. infrequent_categories_[i] holds the infrequent categories for feature input_cols[i]. If the feature input_cols[i] has no infrequent categories, infrequent_categories_[i] is None.


Methods

fit(dataset: Union[DataFrame, DataFrame]) → BaseEstimator¶: Runs the logic common to all fit implementations.

get_input_cols() → list[str]¶

Input columns getter.

Returns: Input columns.

get_label_cols() → list[str]¶

Label column getter.

Returns: Label column(s).

get_output_cols() → list[str]¶

Output columns getter.

Returns: Output columns.

get_params(deep: bool = True) → dict[str, Any]¶

Get the snowflake-ml parameters for this transformer.

Parameters: deep – If True, will return the parameters for this transformer and contained subobjects that are transformers.
Returns: Parameter names mapped to their values.

get_passthrough_cols() → list[str]¶

Passthrough columns getter.

Returns: Passthrough column(s).

get_sample_weight_col() → Optional[str]¶

Sample weight column getter.

Returns: Sample weight column.

get_sklearn_args(default_sklearn_obj: Optional[object] = None, sklearn_initial_keywords: Optional[Union[str, Iterable[str]]] = None, sklearn_unused_keywords: Optional[Union[str, Iterable[str]]] = None, snowml_only_keywords: Optional[Union[str, Iterable[str]]] = None, sklearn_added_keyword_to_version_dict: Optional[dict[str, str]] = None, sklearn_added_kwarg_value_to_version_dict: Optional[dict[str, dict[str, str]]] = None, sklearn_deprecated_keyword_to_version_dict: Optional[dict[str, str]] = None, sklearn_removed_keyword_to_version_dict: Optional[dict[str, str]] = None) → dict[str, Any]¶: Modified snowflake.ml.framework.base.Base.get_sklearn_args with sparse and sparse_output handling.

set_drop_input_cols(drop_input_cols: Optional[bool] = False) → None¶

set_input_cols(input_cols: Optional[Union[str, Iterable[str]]]) → Base¶

Input columns setter.

Parameters: input_cols – A single input column or multiple input columns.
Returns: self

set_label_cols(label_cols: Optional[Union[str, Iterable[str]]]) → Base¶

Label column setter.

Parameters: label_cols – A single label column, or multiple label columns for multi-task learning.
Returns: self

set_output_cols(output_cols: Optional[Union[str, Iterable[str]]]) → Base¶

Output columns setter.

Parameters: output_cols – A single output column or multiple output columns.
Returns: self

set_params(**params: Any) → None¶

Set the parameters of this transformer.

The method works on simple transformers as well as on sklearn compatible pipelines with nested objects, once the transformer has been fit. Nested objects have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
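The <component>__<parameter> naming convention can be illustrated with a small plain-Python sketch that splits such a key into its parts. The helper split_param_key and the component name encoder are hypothetical; this shows only the naming scheme, not how set_params is implemented.

```python
def split_param_key(key):
    """Split 'component__parameter' into (component, parameter).

    Keys without a double underscore belong to the object itself,
    so the component is returned as None.
    """
    component, sep, parameter = key.partition("__")
    if not sep:
        return None, key
    return component, parameter


print(split_param_key("handle_unknown"))           # (None, 'handle_unknown')
print(split_param_key("encoder__handle_unknown"))  # ('encoder', 'handle_unknown')
```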

Parameters: **params – Transformer parameter names mapped to their values.
Raises: SnowflakeMLException – Invalid parameter keys.

set_passthrough_cols(passthrough_cols: Optional[Union[str, Iterable[str]]]) → Base¶

Passthrough columns setter.

Parameters: passthrough_cols – Column(s) that should not be used or modified by the estimator/transformer. The estimator/transformer passes these columns through without modification.
Returns: self

set_sample_weight_col(sample_weight_col: Optional[str]) → Base¶

Sample weight column setter.

Parameters: sample_weight_col – A single column that represents sample weight.
Returns: self

to_lightgbm() → Any¶

to_sklearn() → Any¶

to_xgboost() → Any¶

transform(dataset: Union[DataFrame, DataFrame]) → Union[DataFrame, DataFrame, csr_matrix]¶

Transform dataset using one-hot encoding.

Parameters:

dataset – Input dataset.

Returns:

Output dataset. The output type depends on the input dataset type:
- If input is a Snowpark DataFrame, returns a Snowpark DataFrame.
- If input is a pd.DataFrame and self.sparse=True, returns a csr_matrix.
- If input is a pd.DataFrame and self.sparse=False, returns a pd.DataFrame.

Attributes

infrequent_categories_¶: Infrequent categories for each feature.