snowflake.ml.dataset.DatasetReader¶
- class snowflake.ml.dataset.DatasetReader(session: Session, sources: List[DataSource])¶
Bases: object
Snowflake Dataset abstraction that provides application integration connectors.
Initialize a DatasetReader object.
- Parameters:
session – Snowpark Session to interact with Snowflake backend.
sources – Data sources to read from.
- Raises:
ValueError – The sources argument was empty or None.
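A DatasetReader is not usually constructed directly; it is typically obtained from an existing Dataset. A minimal sketch of that path (the dataset name "mydb.myschema.mydataset" and version "v1" are placeholders):
>>> from snowflake.ml import dataset
>>> ds = dataset.load_dataset(session, "mydb.myschema.mydataset", "v1")
>>> reader = ds.read  # a DatasetReader over the version's data sources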
Methods
- files() → List[str]¶
Get the list of remote file paths for the current DatasetVersion.
The file paths follow the snow:// protocol.
- Returns:
A list of remote file paths
Example:
>>> dsv.files()
----
["snow://dataset/mydb.myschema.mydataset/versions/test/data_0_0_0.snappy.parquet",
 "snow://dataset/mydb.myschema.mydataset/versions/test/data_0_0_1.snappy.parquet"]
- filesystem() → SnowFileSystem¶
Return an fsspec FileSystem that can be used to load the DatasetVersion's files().
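Since the returned object is fsspec-compatible, standard fsspec idioms apply. A hedged sketch that opens the first file from files() with pyarrow (pyarrow, and the reader variable from the earlier example, are assumptions here):
>>> import pyarrow.parquet as pq
>>> fs = reader.filesystem()
>>> path = reader.files()[0]
>>> with fs.open(path, "rb") as f:
...     table = pq.read_table(f)  # read one Parquet file into an Arrow table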
- to_pandas() → DataFrame¶
Retrieve the DatasetVersion contents as a Pandas DataFrame.
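Example usage (a minimal sketch; reader is the DatasetReader from the earlier example):
>>> df = reader.to_pandas()
>>> df.head()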
- to_snowpark_dataframe(only_feature_cols: bool = False) → DataFrame¶
Convert the DatasetVersion to a Snowpark DataFrame.
- Parameters:
only_feature_cols – If True, drops exclude_cols and label_cols from returned DataFrame. The original DatasetVersion is unaffected.
- Returns:
A Snowpark dataframe that contains the data of this DatasetVersion.
- Note: The dataframe generated by this method might not have the same schema as the original one. Specifically:
- NUMBER type with scale != 0 will become float.
- Unsupported types (see the comments of Dataset.create_version()) will not have any guarantee. For example, an OBJECT column may be scanned back as a STRING column.
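A short sketch of the flag's effect (assumes the dataset version was created with exclude_cols or label_cols; with only_feature_cols=True those columns are dropped from the returned DataFrame):
>>> sp_df = reader.to_snowpark_dataframe(only_feature_cols=True)
>>> sp_df.show()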
- to_tf_dataset(*, batch_size: int, shuffle: bool = False, drop_last_batch: bool = True) → Any¶
Transform the Snowflake data into a ready-to-use TensorFlow tf.data.Dataset.
- Parameters:
batch_size – Specifies the size of each data batch yielded by the resulting dataset.
shuffle – Specifies whether the data will be shuffled. If True, files will be shuffled, and rows in each file will also be shuffled.
drop_last_batch – Whether the last batch of data should be dropped. If set to True, the last batch will be dropped if its size is smaller than the given batch_size.
- Returns:
A tf.data.Dataset that yields batched tf.Tensors.
Examples:
>>> dp = dataset.to_tf_dataset(batch_size=1)
>>> for data in dp:
...     print(data)
----
{'_COL_1': <tf.Tensor: shape=(1,), dtype=int64, numpy=[10]>}
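Because each batch is a dict of column name to tf.Tensor, training code typically maps batches into (features, label) tuples before fitting a model. A hedged sketch with hypothetical column names FEATURE_1, FEATURE_2, and LABEL:
>>> import tensorflow as tf
>>> ds = dataset.to_tf_dataset(batch_size=32, shuffle=True)
>>> # Hypothetical schema: stack two feature columns into a (batch, 2) tensor.
>>> ds = ds.map(lambda b: (tf.stack([b["FEATURE_1"], b["FEATURE_2"]], axis=1), b["LABEL"]))
>>> # model.fit(ds)  # any tf.keras model accepting this shape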
- to_torch_datapipe(*, batch_size: int, shuffle: bool = False, drop_last_batch: bool = True) → Any¶
Transform the Snowflake data into a ready-to-use PyTorch datapipe.
Return a PyTorch datapipe which iterates over rows of data.
- Parameters:
batch_size – Specifies the size of each data batch yielded by the resulting datapipe.
shuffle – Specifies whether the data will be shuffled. If True, files will be shuffled, and rows in each file will also be shuffled.
drop_last_batch – Whether the last batch of data should be dropped. If set to True, the last batch will be dropped if its size is smaller than the given batch_size.
- Returns:
A PyTorch iterable datapipe that yields data.
Examples:
>>> dp = dataset.to_torch_datapipe(batch_size=1)
>>> for data in dp:
...     print(data)
----
{'_COL_1': [10]}
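The returned datapipe already batches rows, so when wrapping it in a torch DataLoader, pass batch_size=None to avoid re-batching. A minimal sketch (assumes torch is installed):
>>> from torch.utils.data import DataLoader
>>> dp = dataset.to_torch_datapipe(batch_size=32, shuffle=True)
>>> for batch in DataLoader(dp, batch_size=None):
...     pass  # each batch is a dict of column name to a sequence of values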
Attributes
- data_sources¶