Snowpark ML Framework Connectors¶
Snowpark ML includes support for secure, scalable data provisioning for the PyTorch and TensorFlow frameworks, both of which expect data in their own specific formats. To simplify this workflow, the Snowpark ML library provides convenient methods, built on top of the FileSet API, that supply data from a FileSet as PyTorch Tensors or TensorFlow Records. (A FileSet represents an immutable snapshot of the result of a SQL query in the form of files in an internal server-side encrypted stage.)
Note
This topic assumes that the Snowpark ML module is installed. If it isn’t, see Using Snowflake ML Locally.
Creating a FileSet from a Query¶
Refer to Creating and Using a FileSet for information on creating a FileSet from the data you want to use with PyTorch or TensorFlow.
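For example, a FileSet might be created from a query along these lines (a minimal sketch; the stage location, connection object, and query are placeholders for your own):

from snowflake.ml.fileset import fileset

# Materialize the query result as encrypted files in an internal stage.
# Stage path, connection, and query below are hypothetical placeholders.
fileset_df = fileset.FileSet.make(
    target_stage_loc="@MYDB.MYSCHEMA.MYSTAGE/",
    name="my_fileset",
    sf_connection=sf_connection,  # an open Snowflake connection
    query="SELECT * FROM MYDATA LIMIT 1000000",
    shuffle=True,  # see Shuffling Data in FileSets below
)

Then continue to one of the following sections: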
Feeding a FileSet to PyTorch¶
From a Snowflake FileSet, you can get a PyTorch DataPipe, which can be passed to a PyTorch DataLoader. The DataLoader iterates over the FileSet data and yields batched PyTorch tensors. Create the DataPipe using the FileSet's to_torch_datapipe method, and then pass the DataPipe to PyTorch's DataLoader:
from torch.utils.data import DataLoader

# See later sections about shuffling and batching
pipe = fileset_df.to_torch_datapipe(
    batch_size=4,
    shuffle=True,
    drop_last_batch=True)

for batch in DataLoader(pipe, batch_size=None, num_workers=0):
    print(batch)
    break
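Each batch yielded by the DataLoader is a dictionary mapping column names to tensors. A training loop can therefore assemble feature and label tensors by column name, roughly as follows (a sketch reusing the pipe above; the column names are hypothetical, and the exact tensor shapes depend on your data):

import torch
from torch.utils.data import DataLoader

# FEATURE1, FEATURE2, and LABEL are hypothetical column names.
for batch in DataLoader(pipe, batch_size=None, num_workers=0):
    features = torch.stack([batch["FEATURE1"], batch["FEATURE2"]], dim=-1)
    labels = batch["LABEL"]
    # ... forward pass, loss computation, optimizer step ...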
Feeding a FileSet to TensorFlow¶
You can get a TensorFlow Dataset from a Snowflake FileSet using the FileSet's to_tf_dataset method:
import tensorflow as tf

# See following sections about shuffling and batching
ds = fileset_df.to_tf_dataset(
    batch_size=4,
    shuffle=True,
    drop_last_batch=True)

for batch in ds:
    print(batch)
    break
Iterating over the Dataset yields batched tensors.
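Here, too, each batch is a dictionary mapping column names to tensors. To train a Keras model, a common pattern is to map the Dataset into (features, label) tuples first. A rough sketch, with hypothetical column names:

import tensorflow as tf

# FEATURE1, FEATURE2, and LABEL are hypothetical column names.
def split_features_label(batch):
    features = tf.stack([batch["FEATURE1"], batch["FEATURE2"]], axis=-1)
    return features, batch["LABEL"]

train_ds = ds.map(split_features_label)
# model.fit(train_ds, epochs=3)  # with a compiled Keras model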
Shuffling Data in FileSets¶
It is often valuable to shuffle the training data to avoid overfitting and other issues. For a discussion of the value of shuffling, see Why should the data be shuffled for machine learning tasks? (https://datascience.stackexchange.com/questions/24511/why-should-the-data-be-shuffled-for-machine-learning-tasks)
If your query does not already shuffle your data sufficiently, a FileSet can shuffle data at two points:
- When the FileSet is created by using FileSet.make. All rows in your query are shuffled before they are written to the FileSet. This is a high-quality global shuffle and can be expensive with large datasets. Therefore, it is performed only once, when materializing the FileSet. Pass shuffle=True as a keyword argument to FileSet.make.
- When you create a PyTorch DataPipe or a TensorFlow Dataset from a FileSet. At this point, the order of the files in the FileSet is randomized, as is the order of the rows within each file. This can be considered an "approximate" global shuffle. It is of lower quality than a true global shuffle, but it is much less expensive. To shuffle at this stage, pass shuffle=True as a keyword argument to the FileSet's to_torch_datapipe or to_tf_dataset method.
For best results, shuffle twice: when creating the FileSet and when feeding the data to PyTorch or TensorFlow.
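Putting the two together, a double shuffle might look like the following sketch (the stage path, connection, and query are placeholders):

from snowflake.ml.fileset import fileset

# First shuffle: a high-quality global shuffle when the FileSet is materialized.
fileset_df = fileset.FileSet.make(
    target_stage_loc="@MYDB.MYSCHEMA.MYSTAGE/",
    name="shuffled_fileset",
    sf_connection=sf_connection,  # an open Snowflake connection
    query="SELECT * FROM MYDATA",
    shuffle=True,
)

# Second shuffle: an approximate shuffle of file order and within-file row order.
pipe = fileset_df.to_torch_datapipe(
    batch_size=4,
    shuffle=True,
    drop_last_batch=True)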
Batching Data in FileSets¶
FileSets have a batching feature that works the same as the batching functionality in PyTorch and TensorFlow but is more efficient. Snowflake recommends that you use the batch_size parameter in the FileSet's to_torch_datapipe and to_tf_dataset methods instead of having PyTorch or TensorFlow do the batching. With PyTorch, to disable its batching functionality, you must explicitly pass batch_size=None when instantiating DataLoader.
You can also drop the last batch if it is incomplete by passing drop_last_batch=True to to_torch_datapipe or to_tf_dataset.
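For example, letting the FileSet batch the data while keeping the DataLoader's own batching disabled might look like this sketch (reusing the hypothetical fileset_df from above):

from torch.utils.data import DataLoader

# Let the FileSet produce batches of 32 rows and drop an incomplete final batch.
pipe = fileset_df.to_torch_datapipe(
    batch_size=32,
    shuffle=True,
    drop_last_batch=True)

# batch_size=None tells the DataLoader not to re-batch the already-batched data.
loader = DataLoader(pipe, batch_size=None, num_workers=0)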