Snowpark Connect for Spark known differences from Apache Spark¶
Snowpark Connect for Spark lets you run existing PySpark and Spark Connect workloads on Snowflake without rewriting code. There are two categories of behavioral differences to be aware of when migrating workloads:
Spark Connect protocol differences: Snowpark Connect for Spark implements the Spark Connect protocol (https://spark.apache.org/docs/latest/spark-connect-overview.html), not the Spark Classic (in-process) execution model. If you’re migrating from Spark Classic, you’ll encounter behavioral differences inherent to the protocol itself. These differences apply to all Spark Connect implementations, not just Snowpark Connect for Spark. See Spark Connect vs Spark Classic on this page.
Snowflake execution engine differences: Because Snowflake is the underlying execution engine rather than Apache Spark, there are additional semantic and behavioral differences in data types, SQL translation, file I/O, and more.
For detailed API compatibility information, including lists of supported and unsupported APIs, see DataFrame support for Snowpark Connect for Spark and Dataset support for Snowpark Connect for Spark (Java/Scala).
Spark Connect vs Spark Classic¶
Snowpark Connect for Spark uses the Spark Connect protocol (https://spark.apache.org/docs/latest/spark-connect-overview.html), which introduces a gRPC boundary between the client (where you build DataFrame plans) and the server (where plans are analyzed and executed). Spark Classic runs everything in a single JVM process, so the plan is analyzed the moment you construct it. In the Spark Connect architecture, the client sends unresolved plans to the server, and resolution happens only when an action forces it.
This architectural shift has practical consequences for temporary views, error handling, UDF behavior, and schema access. If you’re migrating code that was written for Spark Classic, the issues in this section apply regardless of whether the server is Snowpark Connect for Spark, open-source Spark Connect, or any other Spark Connect implementation.
Deferred plan resolution¶
In Spark Classic, every transformation (`filter`, `select`, `join`) immediately validates column names, data types, and references against the catalog. An invalid column name raises `AnalysisException` on the line where it's used.
With Spark Connect, the client builds a lightweight representation of the plan without contacting the server. Validation only happens when an action (`.collect()`, `.show()`, `.write()`) ships the plan to the server. Until then, mistakes in column names or incompatible types go undetected.
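A short PySpark sketch of the difference (the connection URL and column names are illustrative):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "label"])

# Typo in the column name. Spark Classic would raise AnalysisException
# on this line; under Spark Connect, building the plan succeeds.
bad = df.select("labl")

# The error surfaces only when an action ships the plan to the server.
bad.show()  # AnalysisException is raised here
```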
Temporary view rebinding¶
This is the single most common source of subtle bugs when moving from Spark Classic to Spark Connect.
In Spark Classic, `createOrReplaceTempView("v")` snapshots the DataFrame's fully resolved plan into the catalog. Any DataFrame that references `v` already has the plan baked in, so overwriting or dropping `v` later has no effect on existing DataFrames.
In Spark Connect, the client records only the name `"v"` in the DataFrame's unresolved plan. The name is looked up on the server at execution time. This has two consequences:
- Overwriting a view changes the definition that existing DataFrames see, which can produce wrong results or duplicate columns.
- Dropping a view causes any DataFrame that still references it by name to fail at execution time, even if the DataFrame was created before the view was dropped.
Overwriting a view¶
Consider a pipeline that progressively enriches a dataset:
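For example (a hedged sketch; assumes an active `spark` session):

```
from pyspark.sql.functions import lit

spark.range(3).createOrReplaceTempView("v")
step1 = spark.table("v").withColumn("a", lit(1))

# Overwriting "v" changes what any DataFrame referencing it will see.
step1.createOrReplaceTempView("v")
step2 = spark.table("v").withColumn("b", lit(2))

# Spark Classic: step1 prints its snapshotted plan (columns id, a).
# Spark Connect: step1 re-resolves the name "v", which now points at
# step1's own definition -- this can duplicate columns or fail outright.
step1.show()
```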
Dropping a view¶
In Spark Classic, you can drop a temporary view as soon as you’ve built a DataFrame from it, because the plan is already resolved. In Spark Connect, the view must remain available until every DataFrame that references it has been executed:
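A sketch of the failure mode (assumes an active `spark` session):

```
spark.range(3).createOrReplaceTempView("staging")
df = spark.table("staging").filter("id > 0")

spark.catalog.dropTempView("staging")

# Spark Classic: succeeds -- df's plan was resolved when it was built.
# Spark Connect: fails -- "staging" no longer resolves on the server.
df.collect()
```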
Defer `dropTempView()` calls until no outstanding DataFrames depend on the view.
Workarounds¶
Use distinct view names at each stage. Append a counter, timestamp, or UUID to avoid reusing the same name:
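For example (a sketch; the naming scheme is illustrative):

```
import uuid
from pyspark.sql.functions import lit

# A fresh name per stage means later stages can't rebind earlier references.
name1 = f"stage_{uuid.uuid4().hex[:8]}"
spark.range(3).createOrReplaceTempView(name1)
step1 = spark.table(name1).withColumn("a", lit(1))

name2 = f"stage_{uuid.uuid4().hex[:8]}"
step1.createOrReplaceTempView(name2)
step2 = spark.table(name2)  # step1 still safely resolves via name1
```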
Re-read from the view after overwriting it to reset the plan reference:
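For example (sketch):

```
from pyspark.sql.functions import lit

df.createOrReplaceTempView("v")
step1 = spark.table("v").withColumn("a", lit(1))

step1.createOrReplaceTempView("v")  # overwrite

# Re-read so downstream code references the *new* definition explicitly,
# rather than relying on a stale name binding inside step1.
step1 = spark.table("v")
```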
UDF serialization timing¶
In Spark Classic, a Python UDF’s closure (the external variables it captures) is serialized at registration time. Changing those variables afterward doesn’t affect the UDF.
In Spark Connect, serialization is deferred until the plan is executed. The UDF captures whatever values the external variables hold at that moment, not at definition time. This can produce confusing results when variables are mutated between UDF definition and DataFrame evaluation.
What to do: Bind the value at definition time using a factory function:
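The capture semantics are plain Python, independent of Spark; a minimal sketch (the same factory pattern applies when wrapping `pyspark.sql.functions.udf`):

```python
threshold = 10

def make_checker(limit):
    # Factory: `limit` is bound to the value passed in at definition time.
    def check(x):
        return x > limit
    return check

# Captures the *variable* `threshold`, resolved when called -- which is
# effectively what deferred UDF serialization does.
late_bound = lambda x: x > threshold
early_bound = make_checker(threshold)

threshold = 0  # mutated between definition and execution

print(late_bound(5))   # True  -- sees the mutated threshold (0)
print(early_bound(5))  # False -- still uses 10
```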
Schema access cost¶
`df.columns`, `df.schema`, and `df.dtypes` are free in Spark Classic because the plan is already resolved locally. In Spark Connect, each of these properties sends the unresolved plan to the server for analysis and waits for the response. A single call is fine, but using these properties inside a loop that also creates new DataFrames on every iteration multiplies the round-trips:
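For example (sketch; column names are illustrative):

```
from pyspark.sql.functions import lit

# Anti-pattern: one analysis round-trip per iteration.
result = df
for i in range(10):
    if f"col_{i}" in result.columns:        # round-trip every time
        result = result.withColumn(f"flag_{i}", lit(i))

# Better: analyze once, track the column set locally.
cols = set(df.columns)                       # single round-trip
result = df
for i in range(10):
    if f"col_{i}" in cols:
        result = result.withColumn(f"flag_{i}", lit(i))
```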
What to do:
- Read `df.columns` or `df.schema` once, store the result in a local variable, and reuse it.
- When inspecting nested types, access the `StructType` fields from the cached schema rather than building intermediate DataFrames just to call `.columns` on them.
APIs not available in Spark Connect¶
Spark Connect exposes the DataFrame and SQL APIs over gRPC but doesn’t support low-level Spark Classic APIs that require direct access to the driver JVM, executors, or the cluster manager. Code that uses any of the following APIs needs to be rewritten before it can run on Snowpark Connect for Spark or any other Spark Connect endpoint.
| Spark Classic API | Alternative in Spark Connect |
|---|---|
| `SparkContext` | Use `SparkSession` |
| RDD API | Rewrite as DataFrame operations |
| Broadcast variables | Join against a small DataFrame; the optimizer handles broadcast automatically |
| Accumulators | Use DataFrame aggregations or write intermediate counts to a table |
| File distribution (`addFile`) | Stage files and reference them via a stage |
| `DataFrame.rdd` | Stay in the DataFrame API |
| Custom partitioners | Not available; use DataFrame-level partitioning instead |
| RDD-based MLlib | Snowpark ML or third-party libraries |
| Structured Streaming | Not supported; use batch reads/writes |
| JVM interop | Not available; use SQL or DataFrame API equivalents |
Data types and numeric behavior¶
Integral type representation¶
Snowpark Connect for Spark implicitly represents `ByteType`, `ShortType`, and `IntegerType` as `LongType` in many contexts. While the Spark Connect protocol carries distinct type tags for each integral width, Snowflake's storage and compute engine uses `NUMBER(p,0)` internally. Snowpark Connect for Spark maps Snowflake's reported precisions back to Spark types via integral emulation (`snowpark.connect.integralTypesEmulation`), but the mapping is lossy:
- `NUMBER(19,0)` maps to `LongType`, `NUMBER(10,0)` to `IntegerType`, `NUMBER(5,0)` to `ShortType`, and `NUMBER(3,0)` to `ByteType`.
- Any other precision (for example, `NUMBER(38,0)` from a `COUNT(*)`) becomes `DecimalType(p,0)`, not a Spark integral type at all.
- Arithmetic operations, aggregations, and joins can change the Snowflake-side precision, causing the returned type to shift unexpectedly.
- `.printSchema()` may show `long` where native Spark shows `integer`, or `double` where Spark shows `float`.
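The mapping can be summarized as a small lookup; this sketch mirrors the documented behavior and is not Snowpark Connect internals:

```python
def spark_type_for(precision: int) -> str:
    """Spark type reported for a Snowflake NUMBER(precision, 0) column
    under integral emulation, per the mapping above."""
    exact = {19: "LongType", 10: "IntegerType", 5: "ShortType", 3: "ByteType"}
    return exact.get(precision, f"DecimalType({precision},0)")

print(spark_type_for(19))  # LongType
print(spark_type_for(38))  # DecimalType(38,0) -- e.g. the result of COUNT(*)
```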
What to do:
The best solution is to enable `snowpark.connect.integralTypesEmulation`. If that isn't an option:
- Avoid branching on specific integer sub-types (`ByteType` vs `IntegerType` vs `LongType`). Treat all integer types as interchangeable where possible.
- If precise integral widths matter, use explicit `.cast()` calls to normalize output types.
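For example (a sketch; the configuration key comes from the text above, and the value format is an assumption):

```
# Enable integral-type emulation for the session.
spark.conf.set("snowpark.connect.integralTypesEmulation", "true")

# Where exact widths still matter, normalize explicitly:
df = df.withColumn("id", df["id"].cast("int"))
```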
Floating-point precision¶
`FloatType` and `DoubleType` are maintained as distinct types in the protocol and Arrow transport, but Snowflake's `FLOAT` and `DOUBLE` are synonymous internally. Operations that touch the Snowflake engine may return `DoubleType` where Spark would return `FloatType`. Additionally, floating-point division and aggregation may produce results at higher or lower default precision than Spark.
What to do:
- Use explicit `round()` calls where exact precision matching is required.
- Avoid relying on exact float equality comparisons across systems.
Integer overflow¶
Spark uses fixed-width two's-complement integers (32-bit `int`, 64-bit `long`). Overflow wraps silently in non-ANSI mode: adding 1 to `Integer.MAX_VALUE` produces `Integer.MIN_VALUE`. Snowflake uses arbitrary-precision `NUMBER`, which either succeeds with the mathematically correct result or errors, but never wraps.
By default (`handleIntegralOverflow=false`), Snowpark Connect for Spark doesn't emulate Spark's wrapping semantics. Arithmetic on integral types goes through Snowflake's `NUMBER` math and is cast to the target type. This means:
- Operations that would silently wrap in Spark may produce mathematically correct (but Spark-incompatible) results, or may error if the cast overflows Snowflake's internal representation.
- `-Long.MIN_VALUE` produces a positive result in Snowflake but wraps back to `Long.MIN_VALUE` in Spark.
- `abs(Long.MIN_VALUE)` returns `Long.MIN_VALUE` in Spark (wrapping). In Snowpark Connect for Spark it may produce a positive value or error.
You can opt in to overflow emulation:
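For example (the configuration key comes from the text above; the value format is an assumption):

```
# Emulate Spark's two's-complement wrapping semantics.
spark.conf.set("snowpark.connect.handleIntegralOverflow", "true")

# Optionally, error on overflow instead of wrapping:
spark.conf.set("spark.sql.ansi.enabled", "true")
```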
With this enabled, Snowpark Connect for Spark emulates two's-complement wrapping via SQL `MOD` operations on an offset range. When combined with `spark.sql.ansi.enabled=true`, it raises `ArithmeticException` on overflow instead of wrapping.
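The wrapping arithmetic itself can be illustrated in plain Python (a sketch of the idea, not the generated SQL):

```python
def wrap64(v: int) -> int:
    # Fold an arbitrary-precision result back into the signed 64-bit
    # range using modular arithmetic on an offset range -- the same idea
    # as the SQL MOD-based emulation.
    return (v + 2**63) % 2**64 - 2**63

LONG_MAX = 2**63 - 1
print(wrap64(LONG_MAX + 1))  # -9223372036854775808 (wraps to Long.MIN_VALUE)
print(wrap64(LONG_MAX))      # 9223372036854775807  (in range: unchanged)
```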
Trade-offs:
- Enabling overflow emulation adds SQL complexity to every arithmetic expression (extra `WHEN`/`OTHERWISE` clauses), which has a measurable performance cost.
- The overflow wrapping logic is not applied to windowed `SUM` operations. Windowed integral sums may produce different results than non-windowed equivalents on the same data when overflow occurs.
Execution model and performance¶
Plan re-execution¶
Each Spark action (`.collect()`, `.show()`, `.write()`) translates to a separate Snowflake query. Unlike Spark, intermediate results are not automatically cached between actions. Calling `.show()` followed by `.write()` on the same DataFrame executes the full query plan twice.
What to do:
- Call `df.cache()` on DataFrames that are used by multiple downstream actions. This materializes the data as a Snowflake temporary table.
- All Spark storage levels (`MEMORY_ONLY`, `MEMORY_AND_DISK`, and others) map to temp table storage. There is no in-memory-only caching.
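For example (sketch; table names are illustrative):

```
df = spark.table("orders").groupBy("region").count()

# Without cache(), show() and write() would each run the full plan as a
# separate Snowflake query.
cached = df.cache()   # materialized as a Snowflake temporary table
cached.show()
cached.write.saveAsTable("region_counts")
```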
Data distribution and parallelism¶
Snowflake manages query parallelism internally through warehouse sizing. Spark’s distribution APIs behave differently:
| Spark API | Behavior in Snowpark Connect for Spark |
|---|---|
| `repartition(n)` / `coalesce(n)` | No effect on query parallelism. For writes, produces N sequential write operations. |
| `broadcast()` hint | Ignored. Snowflake's optimizer determines join strategies. |
| Other distribution hints | Ignored. No Snowflake equivalent. |
SQL and function compatibility¶
SQL dialect¶
Snowpark Connect for Spark translates Spark SQL to Snowflake SQL. Most standard SQL works as expected, but some constructs have differences:
- Some correlated subquery forms aren't supported.
- Multi-column `UNPIVOT` has limitations.
- Window frames require an explicit `ORDER BY` clause for bounded frames; Spark doesn't always enforce this.
- Column resolution in `ORDER BY` after column renames may behave differently.
Functions with UDF-based implementations¶
Some Spark built-in functions don’t have a direct Snowflake SQL equivalent. Snowpark Connect for Spark implements these functions as Python or Java UDFs that are registered automatically on first usage. You can call them exactly as you would in Spark, and they produce correct results, but they run slower than native SQL functions because of per-row UDF overhead.
Functions implemented this way include `split` (regex path), `from_csv`, `to_csv`, `map_concat`, `map_entries`, `posexplode`, `inline`, `xxhash64`, higher-order functions, and XPath functions, among others.
A smaller set of aggregate functions (`try_sum`, `histogram_numeric`, `percentile`, `count_min_sketch`) are implemented as Python UDAFs. These are significantly slower than native aggregates because they serialize and deserialize state for every row.
What to do: If you notice slow performance on a query that uses one of these functions, check whether the same logic can be expressed with functions that have native Snowflake SQL implementations.
Regular expressions¶
Snowflake uses POSIX Extended Regular Expressions (ERE), while Spark uses Java's `java.util.regex`. This affects `rlike`, `regexp_extract`, `regexp_replace`, and the regex path of `split`:
- Embedded flags (`(?i)`, `(?s)`) aren't supported.
- Lookaheads and lookbehinds aren't available.
- Unicode character class support differs.
- Group index behavior differs (empty string vs. error on non-matching groups).
This isn’t fixable in Snowpark Connect for Spark because it’s a fundamental engine-level difference.
What to do:
- Test regex patterns against Snowflake's regex engine. Rewrite patterns that rely on Java-specific features.
- For case-insensitive matching, use `LOWER()`/`UPPER()` instead of `(?i)`.
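For example (sketch; the column and pattern are illustrative):

```
from pyspark.sql.functions import col, lower

# Instead of the Java-only embedded flag:
#   df.filter(col("name").rlike("(?i)acme"))
# normalize case before matching:
df.filter(lower(col("name")).rlike("acme"))
```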
Structured types (ARRAY, MAP, STRUCT)¶
STRUCT field ordering¶
Snowflake alphabetizes STRUCT fields, while Spark preserves definition order. This can affect hash-based comparisons (for example, SCD type 2 logic) where field sequence matters for reproducible hash values.
What to do:
- Compute hashes on individual fields in a deterministic (for example, alphabetical) order rather than hashing the entire STRUCT.
- Define STRUCT fields in alphabetical order in both source and target schemas for consistency.
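For example (sketch; the struct column `s` and its fields are illustrative):

```
from pyspark.sql.functions import sha2, concat_ws, col

# Hash individual fields in a fixed (alphabetical) order instead of the
# whole STRUCT, so engine-side field reordering can't change the hash.
hashed = df.withColumn(
    "row_hash",
    sha2(concat_ws("||", col("s.a"), col("s.b"), col("s.c")), 256),
)
```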
Timestamps and timezones¶
- `cast(..., TimestampType())` behavior depends on Snowflake's `TIMESTAMP_TYPE_MAPPING` account parameter, which Snowpark Connect for Spark doesn't control. This can silently change semantics across Snowflake accounts.
- Spark's distinction between `TIMESTAMP_NTZ` and `TIMESTAMP_LTZ` maps imperfectly to Snowflake's timestamp types.
- Parquet timestamp rebase modes (pre-Gregorian calendar handling) are unsupported. Pre-Gregorian dates may be silently corrupted.
- Session timezone interactions between Spark configs and Snowflake session parameters create subtle edge cases.
- Timestamp overflow guards exist for microsecond/millisecond/second conversions (`int64` range), but the error behavior differs from Spark's `ArithmeticException`.
What to do:
- Explicitly set the session timezone: `spark.conf.set("spark.sql.session.timeZone", "UTC")`.
- Use `TIMESTAMP_NTZ` consistently when timezone-awareness isn't needed.
- Test timestamp-heavy workloads with representative data before production migration.
Session management¶
Session isolation¶
All Spark Connect client sessions share a single underlying Snowflake session. While per-session state (configs, temp views, UDF registrations) is tracked separately in memory, the underlying Snowflake session parameters (warehouse, role, timezone) aren’t isolated per Spark session.
Implications:
- Setting a warehouse or role in one session may affect other concurrent sessions.
- Objects created in Snowflake may be shared between Spark sessions.
What to do:
- For fully isolated workloads, use separate Snowpark Connect for Spark server instances.
- Avoid changing Snowflake session parameters (warehouse, role) at runtime if multiple clients share the same endpoint.
Concurrency¶
The gRPC server uses a `ThreadPoolExecutor` (default 10 workers) to handle concurrent requests. CPU-bound plan translation holds the Python GIL, which means concurrent requests may serialize during the translation phase. Query execution on Snowflake is fully parallel.
What to do:
- Batch DataFrame operations to reduce the number of server round-trips.
- For high-concurrency deployments, size the server appropriately and consider multiple server instances.
Catalog and metadata¶
`spark.catalog.*` APIs return Snowflake's metadata model, which differs structurally from Spark's:
- Snowflake tables don't have Spark-style partition columns. Partition-based operations (`recoverPartitions`, `writeTo` with `overwritePartitions`) fail or behave differently.
- `listDatabases`, `listTables`, and `listFunctions` return different metadata structures.
- Schema nullability metadata isn't preserved.
- `refreshTable` and `refreshByPath` have no Snowflake equivalent.
Observability and debugging¶
- `explain()` produces Snowflake execution plans, not Spark logical/physical plans.
- `observe()` / `collect_metrics` are no-ops. Spark's observability framework has no equivalent.
- `interrupt()` for cancelling long-running queries isn't implemented.
- Error messages originate from Snowflake and may differ in format from Spark's exceptions. Snowflake may redact values in error messages for security.
What to do:
- Use Snowflake's Query History to monitor and debug query execution.
- Set `query_tag` via Spark config to correlate Snowpark Connect for Spark queries with application-level context.
Java and Scala limitations¶
Note
The Java and Scala APIs for Snowpark Connect for Spark are in public preview. See Preview features.
- Only Java 11 and Java 17 are supported.
- Only Scala 2.12 and Scala 2.13 are supported.
- Java/Scala UDTFs and UDAFs aren't supported.
- Interval types aren't supported inside user-defined functions.