File I/O with Snowpark Connect for Spark¶
With Snowpark Connect for Spark, you can read and write data in several file formats using the standard Spark DataFrame reader and writer APIs. You can use internal stages, external stages, and cloud storage locations as sources and destinations. For how to configure stages and cloud URIs, see External data sources with Snowpark Connect for Spark.
When you set Spark-style options (such as header, delimiter, or dateFormat), Snowpark Connect for Spark translates them into the
corresponding Snowflake file format options before executing the operation. The
tables in the Format-specific options section show exactly how each Spark option maps to its Snowflake counterpart.
Supported formats¶
| Format | Read | Write |
|---|---|---|
| CSV | Supported | Supported |
| JSON | Supported | Supported |
| Parquet | Supported | Supported |
| Text | Supported | Supported |
| XML | Supported | Not supported |
| Avro | Not supported | Not supported |
| ORC | Not supported | Not supported |
Reading data¶
Use the standard Spark read API with format shorthands or .format(). Paths can be Snowflake stage notation
(for example, @my_stage/path/), cloud URIs configured for your account, or local file paths on the machine
where Snowpark Connect for Spark is running. Chain .option() calls to control format behavior.
Note
The Java and Scala clients for Snowpark Connect for Spark are preview features.
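For example, the following Python sketch reads CSV files from a hypothetical stage path @my_stage/path/, assuming a Snowpark Connect for Spark session is already available as spark:

```python
# Read CSV files from a stage, with Spark options that Snowpark Connect
# for Spark translates into Snowflake file format options.
df = (
    spark.read
    .option("header", "true")
    .option("delimiter", ",")
    .csv("@my_stage/path/")
)

# The same read expressed with .format() and load().
df = (
    spark.read.format("csv")
    .option("header", "true")
    .load("@my_stage/path/")
)
```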
Reading files with SQL identifiers¶
In addition to the DataFrameReader API, you can read files directly from SQL using Spark’s format-prefix
identifier syntax with spark.sql():
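A minimal Python sketch, assuming a spark session and a hypothetical stage directory @my_stage/events/ that contains Parquet files (the event_date column is illustrative):

```python
# Query staged Parquet files directly with the format-prefix identifier syntax.
df = spark.sql("SELECT * FROM parquet.`@my_stage/events/`")

# File reads compose with the rest of SQL, such as aggregations.
daily_counts = spark.sql("""
    SELECT event_date, COUNT(*) AS n
    FROM parquet.`@my_stage/events/`
    GROUP BY event_date
""")
```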
Supported prefixes: csv.`...`, json.`...`, parquet.`...`, text.`...`.
The path inside the backticks is treated the same as it would be by spark.read.<format>(path).
Snowflake stage paths (@stage/...), cloud URIs, and local paths all work. Default format options
apply (no header for CSV, automatic schema inference, and so on). If you need non-default options,
use spark.read instead because the SQL identifier form doesn’t accept per-call options.
You can combine these file reads with the rest of your SQL: joins, aggregations, CTEs, and subqueries all work on top of them.
Writing data¶
Use the DataFrame write API with the same format shorthands.
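For example, assuming a DataFrame df and hypothetical stage paths:

```python
# Write CSV files with a header row to a stage location.
(
    df.write
    .option("header", "true")
    .csv("@my_stage/output/csv/")
)

# Equivalent write using .format() and save(), this time as Parquet.
df.write.format("parquet").save("@my_stage/output/parquet/")
```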
Reading from and writing to Snowflake tables¶
You can treat Snowflake tables as Spark DataFrame sources and sinks.
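A sketch using hypothetical table names:

```python
# Read an existing Snowflake table as a Spark DataFrame.
orders = spark.read.table("MY_DB.MY_SCHEMA.ORDERS")

# Write a DataFrame back to a Snowflake table, appending to existing rows.
orders.write.mode("append").saveAsTable("MY_DB.MY_SCHEMA.ORDERS_COPY")
```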
Save modes¶
Set mode on the writer to control how writes interact with existing data. Not all modes are available for every
format or destination type.
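For example (stage path hypothetical):

```python
# Replace whatever is at the target path.
df.write.mode("overwrite").parquet("@my_stage/output/parquet/")

# Skip the write if data already exists (Parquet file writes and table writes only).
df.write.mode("ignore").parquet("@my_stage/output/parquet/")
```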
| Mode | Spark name | Behavior in Snowpark Connect for Spark |
|---|---|---|
| Error if exists | errorifexists (also error) | Default. Fails if data already exists at the target path or table. |
| Overwrite | overwrite | Removes existing files at the target path before writing. For table writes, replaces the table. |
| Append | append | Adds new files alongside existing ones. Uses a random filename prefix to avoid conflicts. |
| Ignore | ignore | Skips the write if data already exists. Supported for Parquet file writes and table writes only. CSV, JSON, and text file writes raise an error for this mode. |
Controlling output file count and size¶
By default, Snowflake decides how to split your output into files. You can control this with standard Spark APIs and Snowpark Connect for Spark-specific options.
coalesce(1) for a single output file¶
The most common Spark idiom for producing one output file is honored:
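For example (path hypothetical):

```python
# Collapse the write into a single output file.
(
    df.coalesce(1)
    .write
    .option("header", "true")
    .csv("@my_stage/output/single_file/")
)
```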
When coalesce(1) is used without partitionBy, Snowpark Connect for Spark routes the write into a single
Snowflake file.
repartition(n) for multiple output files¶
When you request n output files, Snowpark Connect for Spark produces n files with Spark-compatible names:
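For example, requesting four output files (path hypothetical):

```python
# Produce four output files with rows spread roughly evenly across them.
df.repartition(4).write.parquet("@my_stage/output/four_files/")
```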
Rows are distributed across the n target files so each file gets roughly the same amount of
data. When you combine repartition(n) with partitionBy, the per-partition directory layout
takes precedence and Snowflake controls the file count inside each col=value/ directory.
single and snowflake_max_file_size write options¶
Snowpark Connect for Spark provides two additional write options for explicit control over output layout:
| Option | Description |
|---|---|
| single | Set to true to write the output as a single file. |
| snowflake_max_file_size | Maximum size in bytes per output file. Larger values reduce the file count; smaller values produce more files. When single-file behavior is active and this option isn’t set, Snowpark Connect for Spark defaults the cap to 1 GB. |
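A sketch of each option on the write path; the stage paths and the 64 MB cap are illustrative:

```python
# Route the entire write into one output file.
df.write.option("single", "true").csv("@my_stage/output/one_file/")

# Let Snowflake split the output, but cap each file at roughly 64 MB.
(
    df.write
    .option("snowflake_max_file_size", 64 * 1024 * 1024)
    .parquet("@my_stage/output/capped/")
)
```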
Compression¶
Set compression through .option("compression", "<codec>"). For writes, the default is NONE for CSV, JSON,
and text, and SNAPPY for Parquet. For reads, compression is auto-detected if not specified.
For best performance, use splittable compression formats such as BZ2 or SNAPPY. Splittable formats allow
Snowflake to decompress and process file chunks in parallel, which is significantly faster for large files than
non-splittable formats like GZIP.
Snowpark Connect for Spark normalizes codec names: UNCOMPRESSED becomes NONE, and Spark’s BZIP2 becomes BZ2
for CSV, JSON, and text. Compression can also be inferred from file extensions (.gz, .bz2,
.snappy, .deflate).
| CSV / JSON / Text | Parquet |
|---|---|
| GZIP, BZ2, BROTLI, ZSTD, DEFLATE, RAW_DEFLATE, NONE | SNAPPY, LZO, NONE |
Note
Compression isn’t supported for XML reads.
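For example, on writes (paths hypothetical):

```python
# Write gzip-compressed CSV files.
df.write.option("compression", "gzip").csv("@my_stage/output/csv_gz/")

# Spark's BZIP2 codec name is normalized to Snowflake's BZ2 for CSV, JSON, and text.
df.write.option("compression", "bzip2").csv("@my_stage/output/csv_bz2/")
```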
Parallel file reads¶
Multiple files¶
When you read from a directory or pass a list of file paths, Snowpark Connect for Spark reads the files in parallel automatically. No additional configuration is required.
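For example (paths hypothetical):

```python
# Reading a directory fans out across its files in parallel.
sales = spark.read.option("header", "true").csv("@my_stage/sales/")

# A list of file paths is also read in parallel.
events = spark.read.json([
    "@my_stage/events/part-0.json",
    "@my_stage/events/part-1.json",
])
```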
Single large file¶
Snowpark Connect for Spark can also split a single large file into chunks and read them in parallel, so one large file doesn’t become a bottleneck. This is supported for CSV, JSON, and XML.
CSV: The file must be uncompressed (compression set to none) and multiLine must be
false (the default):
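For example (path hypothetical):

```python
# A single large, uncompressed CSV file is split into chunks and read in
# parallel, as long as multiLine stays at its default of false.
df = (
    spark.read
    .option("header", "true")
    .option("compression", "none")
    .csv("@my_stage/big/large_file.csv")
)
```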
JSON: The file must be uncompressed (compression set to none) or BZ2-compressed, and multiLine
must be false (the default):
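For example (path hypothetical):

```python
# A single large JSON file (uncompressed or BZ2-compressed) is read in
# parallel when multiLine is false (the default).
df = spark.read.json("@my_stage/big/large_file.json")
```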
XML: Parallel reads are enabled by default:
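For example, assuming the row element in the hypothetical file is named record:

```python
# XML reads are parallelized by default; rowTag selects the element
# that becomes a DataFrame row.
df = (
    spark.read
    .format("xml")
    .option("rowTag", "record")
    .load("@my_stage/big/large_file.xml")
)
```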
Automatic exclusion of metadata files¶
When you read from a directory, Snowpark Connect for Spark automatically skips files that Spark also skips, so you don’t have to filter them out manually:
| Excluded pattern | Reason |
|---|---|
| _SUCCESS | Spark write-completion marker |
| _metadata, _common_metadata | Parquet metadata sidecars |
| *.crc | Hadoop checksum files |
| Dot-prefixed files (for example, .DS_Store) | macOS and hidden files |
| Any file starting with _ | General Hadoop convention |
This exclusion applies to CSV, JSON, Parquet, and XML reads at any directory depth, including inside
partition subdirectories (for example, year=2025/_SUCCESS is skipped).
Snowpark Connect for Spark also anchors read paths so that files outside your requested prefix are never picked up.
For example, reading from @stage/sales/ won’t accidentally include files from @stage/sales_archive/.
If you set pathGlobFilter to a pattern that explicitly matches hidden or metadata files
(for example, _*), that pattern takes precedence and those files are included.
Format-specific options¶
Spark options set through .option() are translated into Snowflake
file format options before Snowpark Connect for Spark executes the read or write
operation. The tables below show how each Spark option maps to its Snowflake counterpart and note any behavioral
differences.
Options that don’t have a Snowflake equivalent are ignored. Snowpark Connect for Spark logs a warning for each unsupported option.
CSV options¶
| Spark option | Snowflake file format option | Notes |
|---|---|---|
|  |  | Default |
|  |  | When |
|  |  | Default |
|  |  | A single |
|  |  | Spark defaults to |
|  |  | Default |
|  |  | Default |
|  |  | Default |
|  |  | Java |
|  |  | Default |
| compression | COMPRESSION | See Compression. |
|  |  | Default |
|  |  | Read only. Default 20000. Sets the number of rows sampled per file when inferring the schema. Reduce this value to speed up reads when the data is uniform. Requires |
|  |  | Controls error handling on read. |
| ignoreLeadingWhiteSpace / ignoreTrailingWhiteSpace | TRIM_SPACE | If either option is true, TRIM_SPACE is enabled, which trims whitespace from both sides. |
|  | (internal) | When |
|  | (internal) | When |
| pathGlobFilter |  | Filters which files to read based on a glob pattern. |
In addition, Snowpark Connect for Spark always sets the following on CSV reads:
- ESCAPE_UNENCLOSED_FIELD = NONE
- ERROR_ON_COLUMN_COUNT_MISMATCH = False
- SKIP_BLANK_LINES = True
JSON options¶
| Spark option | Snowflake file format option | Notes |
|---|---|---|
|  |  | Both Snowflake options are set to the same value. When |
|  |  | Default |
|  |  | Default |
|  |  | Default |
|  |  | Controls error handling on read. |
|  | (schema inference) | When |
|  |  | Read only. Sets the number of rows sampled per file when inferring the schema. Reduce this value to speed up reads when the data is uniform. |
|  | (internal) | When |
| pathGlobFilter |  | Filters which files to read. |
| compression | COMPRESSION | See Compression. |
When writing JSON, Snowpark Connect for Spark converts the DataFrame into a single VARIANT column using OBJECT_CONSTRUCT
before unloading. On the write path, nullValue is mapped to NULL_IF and compression is mapped to
COMPRESSION. Other Spark JSON writer options (such as dateFormat) aren’t applied.
Parquet options¶
| Spark option | Snowflake file format option | Notes |
|---|---|---|
| pathGlobFilter |  | Filters which files to read. |
|  |  | Default |
|  | (internal) | When |
|  | (internal) | Read only. Default 20000. Controls the sample size used by Snowpark Connect for Spark when discovering complex types (STRUCT, MAP, ARRAY) inside Parquet VARIANT columns. |
| nullValue | NULL_IF | Write only. Sets the string that represents null values in the output files. |
Snowpark Connect for Spark also sets these options automatically based on session configuration:
- BINARY_AS_TEXT = False (always)
- USE_LOGICAL_TYPE: controlled by snowpark.connect.parquet.useLogicalType
When writing Parquet, structured complex types (ARRAY, MAP, STRUCT) are cast to VARIANT before
unload so that Snowflake’s COPY INTO can produce valid Parquet files.
Text options¶
Text I/O uses a CSV file format internally with FIELD_DELIMITER = NONE.
| Spark option | Snowflake file format option | Notes |
|---|---|---|
|  |  | Default |
|  |  | When |
|  |  | Default |
Text writes always set ESCAPE_UNENCLOSED_FIELD = NONE and FILE_EXTENSION = txt. The DataFrame must
contain exactly one string column.
XML options (read only)¶
XML supports schema inference. If no schema is provided, Snowpark Connect for Spark infers the schema from the data by reading the files and merging field types across all input.
| Spark option | Internal mapping | Notes |
|---|---|---|
| rowTag |  | Specifies the XML element that maps to a DataFrame row. Defaults to |
|  |  | Default |
|  |  | Default |
|  |  | Default |
|  |  | Default |
|  |  | Default |
|  |  | Default |
|  |  | Renamed internally. Default |
|  |  | Renamed internally. Default |
|  |  | Renamed internally. Default |
|  |  | Default |
|  |  | Default |
| pathGlobFilter |  | Filters which files to read. |
| rowValidationXSDPath |  | Path to an XSD file on a stage (for example, |
Date and timestamp format conversion¶
For CSV reads and writes, Snowpark Connect for Spark automatically converts Java SimpleDateFormat patterns (used by Spark)
to Snowflake file format tokens. Common conversions:
| Java/Spark pattern | Snowflake token | Example |
|---|---|---|
| yyyy | YYYY | Four-digit year |
| yy | YY | Two-digit year |
| MM | MM | Zero-padded month |
| MMM | MON | Abbreviated month name (Jan, Feb) |
| dd | DD | Zero-padded day of month |
| HH | HH24 | 24-hour clock (00-23) |
| hh | HH12 | 12-hour clock (01-12) |
| mm | MI | Minutes |
| ss | SS | Seconds |
| SSS | FF3 | Fractional seconds (milliseconds) |
| SSSSSS | FF6 | Fractional seconds (microseconds) |
| a | AM | AM/PM marker |
Some patterns don’t have exact Snowflake equivalents:
- Unpadded values: Spark’s single-letter patterns (d for day, h for hour) produce unpadded output (for example, 9), but Snowflake always zero-pads (DD produces 09).
- Full day name: Spark’s EEEE (Thursday) maps to abbreviated DY (Thu) in Snowflake.
- Timezone offsets: Spark’s Z (-0500) maps to TZHTZM (-0500) without a colon separator. The colon-separated TZH:TZM format (-05:00) applies to XXX and xxx patterns.
Important
This conversion applies only to CSV. For JSON, date and timestamp format strings are passed directly to
Snowflake without conversion. Use Snowflake-compatible format tokens (such as YYYY-MM-DD) when setting
dateFormat or timestampFormat on JSON readers.
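A sketch of the difference, with hypothetical stage paths:

```python
# CSV: Spark-style SimpleDateFormat patterns are converted to Snowflake tokens.
csv_df = (
    spark.read
    .option("header", "true")
    .option("dateFormat", "yyyy-MM-dd")
    .csv("@my_stage/dated_csv/")
)

# JSON: pass Snowflake-style tokens directly; they are not converted.
json_df = (
    spark.read
    .option("dateFormat", "YYYY-MM-DD")
    .json("@my_stage/dated_json/")
)
```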
Partitioned data¶
Snowpark Connect for Spark supports Hive-style col=value/ directory layouts for CSV, JSON, and Parquet formats.
Writing partitioned data¶
Use the standard partitionBy(...) API. Snowpark Connect for Spark produces the same col=value/ directory tree that Spark
does:
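For example, assuming the DataFrame has year and month columns (path hypothetical):

```python
# Write Parquet partitioned by year and month; the output is laid out as
# year=<value>/month=<value>/ directories under the target path.
(
    df.write
    .partitionBy("year", "month")
    .parquet("@my_stage/output/partitioned/")
)
```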
Note
Unlike open-source Spark, Snowpark Connect for Spark includes the partition columns in the written data files themselves, not only in the directory structure. The output files contain all columns of the DataFrame, including the ones used for partitioning.
Reading partitioned data¶
Point the reader at the root directory and Snowpark Connect for Spark discovers partition columns from the directory names automatically:
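For example (path hypothetical):

```python
# Read the partitioned tree from its root; year and month are discovered
# from the directory names and appended to the end of the schema.
df = spark.read.parquet("@my_stage/output/partitioned/")
df.printSchema()
```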
Partition discovery behavior:
- Partition columns appear at the end of the schema, in the order they appear in the directory tree.
- Partition value types are inferred from the observed values: all integers become IntegerType, all floating-point values become DoubleType, otherwise StringType. You can override a partition column’s type by supplying a .schema(...) that includes it.
- null values in a partition column are written as the directory segment __HIVE_DEFAULT_PARTITION__ and read back as null.
- Mixed-depth or conflicting layouts (for example, the same key at two different depths) raise an error rather than producing incorrect values.
Important
Filters on partition columns don’t reduce the number of files read. Snowpark Connect for Spark reads all files in the directory subtree and applies filters after loading.
Dynamic partition overwrite¶
When writing partitioned data to a stage, you can overwrite only the partitions that appear in the current
DataFrame instead of deleting all existing data. Set the spark.sql.sources.partitionOverwriteMode session
configuration to dynamic:
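For example (columns and path hypothetical):

```python
# Only the partitions present in the DataFrame are replaced; other
# partitions under the target path are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(
    df.write
    .mode("overwrite")
    .partitionBy("year", "month")
    .parquet("@my_stage/output/partitioned/")
)
```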
This removes only the partition subdirectories that match the DataFrame. See Snowpark Connect for Spark properties for more on this configuration property.
Note
The .option("overwrite-mode", "dynamic") writer option is supported only for
Iceberg table writes. For stage-based file writes, use the session configuration shown above.
Known limitations¶
- Save mode support varies by format: ignore mode is supported only for Parquet file writes and table writes. CSV, JSON, and text file writes raise an error for ignore mode.
- ORC and Avro: These formats aren’t supported for read or write.
- Bucketed writes: bucketBy and sortBy aren’t supported for file or table writes.
- One-sided whitespace trimming: Spark’s ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace both map to TRIM_SPACE, which trims from both sides. Trimming only leading or only trailing whitespace isn’t possible.
- CSV quote semantics: Spark’s quote option and Snowflake’s FIELD_OPTIONALLY_ENCLOSED_BY have different semantics. Snowflake treats enclosure as optional, while Spark applies mandatory quoting rules.
- Text column constraint: Text writes require exactly one string column in the DataFrame.
- XML MapType: MapType isn’t supported for XML reads. Use StructType to represent key-value structures.
- XML via Spark SQL: Reading XML files with spark.sql() (for example, SELECT * FROM xml.`@stage/file.xml`) isn’t supported. Use the DataFrameReader API instead.