snowflake.ml.dataset.DatasetReader

class snowflake.ml.dataset.DatasetReader(session: Session, sources: List[DataSource])

Bases: object

Snowflake Dataset abstraction which provides application integration connectors.

Initialize a DatasetReader object.

Parameters:
  • session – Snowpark Session to interact with Snowflake backend.

  • sources – Data sources to read from.

Raises:

ValueError – sources arg was empty or null
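
A DatasetReader is usually not constructed directly. The sketch below assumes the common pattern of obtaining one from an existing Dataset version via snowflake.ml.dataset.load_dataset() and the Dataset's read property, which belong to the wider Dataset API rather than this class; the dataset name and version are placeholders.

>>> from snowflake.ml import dataset
>>> ds = dataset.load_dataset(session, "MYDB.MYSCHEMA.MYDATASET", "v1")  # placeholder name and version
>>> reader = ds.read  # DatasetReader over the version's data sources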

Methods

files() List[str]

Get the list of remote file paths for the current DatasetVersion.

The file paths follow the snow protocol.

Returns:

A list of remote file paths

Example:

>>> dsv.files()
----
["snow://dataset/mydb.myschema.mydataset/versions/test/data_0_0_0.snappy.parquet",
 "snow://dataset/mydb.myschema.mydataset/versions/test/data_0_0_1.snappy.parquet"]

filesystem() SnowFileSystem

Return an fsspec FileSystem which can be used to load the DatasetVersion's files().
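
A hedged sketch of pairing filesystem() with files(): it reads one Parquet data file through the returned fsspec filesystem using pyarrow. pyarrow and the reader variable are assumptions for illustration, not part of this API.

>>> import pyarrow.parquet as pq
>>> fs = reader.filesystem()  # fsspec-compatible filesystem for snow:// paths
>>> table = pq.read_table(reader.files()[0], filesystem=fs)  # read a single data file
>>> table.num_rows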

to_pandas() DataFrame

Retrieve the DatasetVersion contents as a pandas DataFrame.
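
A minimal usage sketch, assuming a DatasetReader named reader; the "LABEL" column name is a placeholder.

>>> df = reader.to_pandas()  # materializes all data files into local memory
>>> df.shape
>>> df["LABEL"].value_counts()  # placeholder column name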

to_snowpark_dataframe(only_feature_cols: bool = False) DataFrame

Convert the DatasetVersion to a Snowpark DataFrame.

Parameters:

only_feature_cols – If True, drops exclude_cols and label_cols from returned DataFrame. The original DatasetVersion is unaffected.

Returns:

A Snowpark dataframe that contains the data of this DatasetVersion.

Note: The dataframe generated by this method might not have the same schema as the original one. Specifically,
  • NUMBER type with scale != 0 will become float.

  • Unsupported types (see comments of Dataset.create_version()) have no guarantees. For example, an OBJECT column may be scanned back as a STRING column.
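
A short sketch of the lazy, in-warehouse path, assuming a DatasetReader named reader; the schema inspection simply illustrates the note above.

>>> sp_df = reader.to_snowpark_dataframe(only_feature_cols=True)
>>> sp_df.schema  # NUMBER columns with scale != 0 surface as float types
>>> sp_df.limit(5).show()  # computation stays in Snowflake until collected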

to_tf_dataset(*, batch_size: int, shuffle: bool = False, drop_last_batch: bool = True) Any

Transform the Snowflake data into a ready-to-use TensorFlow tf.data.Dataset.

Parameters:
  • batch_size – Size of each data batch yielded by the resulting dataset.

  • shuffle – Whether the data will be shuffled. If True, files are shuffled and rows within each file are also shuffled.

  • drop_last_batch – Whether to drop the last batch if its size is smaller than the given batch_size.

Returns:

A tf.data.Dataset that yields batched tf.Tensors.

Examples:

>>> dp = dataset.to_tf_dataset(batch_size=1)
>>> for data in dp:
>>>     print(data)
----
{'_COL_1': <tf.Tensor: shape=(1,), dtype=int64, numpy=[10]>}
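
A hedged sketch of adapting the dict-of-tensors batches for Keras training; the feature and label column names are placeholders, and mapping the dict to an (x, y) tuple is one possible convention rather than part of this API.

>>> import tensorflow as tf
>>> ds = dataset.to_tf_dataset(batch_size=32, shuffle=True, drop_last_batch=True)
>>> def to_xy(batch):
>>>     features = tf.stack([batch["FEATURE_1"], batch["FEATURE_2"]], axis=1)  # shape (32, 2)
>>>     return tf.cast(features, tf.float32), batch["LABEL"]
>>> train_ds = ds.map(to_xy)
>>> # model.fit(train_ds, epochs=3)  # with a compiled tf.keras model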

to_torch_datapipe(*, batch_size: int, shuffle: bool = False, drop_last_batch: bool = True) Any

Transform the Snowflake data into a ready-to-use PyTorch datapipe.

Return a PyTorch datapipe which iterates over rows of data.

Parameters:
  • batch_size – Size of each data batch yielded by the resulting datapipe.

  • shuffle – Whether the data will be shuffled. If True, files are shuffled and rows within each file are also shuffled.

  • drop_last_batch – Whether to drop the last batch if its size is smaller than the given batch_size.

Returns:

A PyTorch iterable datapipe that yields batched data.

Examples:

>>> dp = dataset.to_torch_datapipe(batch_size=1)
>>> for data in dp:
>>>     print(data)
----
{'_COL_1': [10]}
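
A hedged sketch of feeding the datapipe to a PyTorch DataLoader; batch_size=None is passed because to_torch_datapipe() already batches the rows, and the dataset variable matches the example above.

>>> from torch.utils.data import DataLoader
>>> dp = dataset.to_torch_datapipe(batch_size=32, shuffle=True, drop_last_batch=True)
>>> loader = DataLoader(dp, batch_size=None, num_workers=0)  # datapipe is already batched
>>> for batch in loader:
>>>     pass  # each batch is a dict mapping column names to value arrays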

Attributes

data_sources