Snowpark Connect for Spark compatibility guide
This guide documents the compatibility between the Snowpark Connect for Spark implementation of the Spark DataFrame APIs and native Apache Spark. It is intended to help users understand the key differences, unsupported features, and migration considerations when moving Spark workloads to Snowpark Connect for Spark.
Snowpark Connect for Spark aims to provide a familiar Spark DataFrame API experience on top of the Snowflake execution engine. However, some compatibility gaps remain. This guide describes those gaps to help you plan and adapt your migration. They might be addressed in a future release.
DataTypes
Unsupported data types
- DayTimeIntervalType (https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DayTimeIntervalType.html)
- YearMonthIntervalType (https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/YearMonthIntervalType.html)
- UserDefinedTypes (https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/UserDefinedType.html)
Implicit data type conversion
When using Snowpark Connect for Spark, keep in mind how data types are handled. Snowpark Connect for Spark implicitly represents ByteType,
ShortType, and IntegerType as LongType. This means that while you might define columns or data with
ByteType, ShortType, or IntegerType, the data will be represented and returned by Snowpark Connect for Spark as
LongType. Similarly, implicit conversion might also occur for FloatType and
DoubleType depending on the specific operations and context. The Snowflake execution engine will internally handle data
type compression and may in fact store the data as Byte or Short, but these are considered implementation details and not exposed to the
end user.
Semantically, this representation does not affect the correctness of Spark queries.
| Data type from native PySpark | Data type from Snowpark Connect for Spark |
|---|---|
| ByteType | LongType |
| ShortType | LongType |
| IntegerType | LongType |
| LongType | LongType |
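The widening in the table above matters mostly when a migration test compares schemas produced by the two engines. The following is a minimal sketch, not part of the product: `normalize_dtypes` is a hypothetical helper, and the sample column lists are made up. It assumes the type strings are the ones `DataFrame.dtypes` reports (`tinyint`, `smallint`, `int`, `bigint`).

```python
# Integral type names as they appear in DataFrame.dtypes. Snowpark Connect for
# Spark reports all of them as LongType ("bigint").
WIDENED_TO_BIGINT = {"tinyint", "smallint", "int", "bigint"}


def normalize_dtypes(dtypes):
    """Map (column, type) pairs to a form that ignores integer width."""
    return [
        (name, "bigint" if dtype in WIDENED_TO_BIGINT else dtype)
        for name, dtype in dtypes
    ]


# Hypothetical dtypes, shaped like the output of DataFrame.dtypes.
spark_dtypes = [("id", "int"), ("flag", "tinyint"), ("name", "string")]
connect_dtypes = [("id", "bigint"), ("flag", "bigint"), ("name", "string")]

# The schemas differ only by integer width, so they compare equal once normalized.
schemas_match = normalize_dtypes(spark_dtypes) == normalize_dtypes(connect_dtypes)
print(schemas_match)
```

Comparing normalized schemas keeps such a test focused on real differences, such as a column that changes from an integer to a string.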
The following example shows a difference in how Spark and Snowpark Connect for Spark handle data types in query results.
Query
Spark
Snowpark Connect for Spark
NullType nuance
Snowpark Connect for Spark doesn't support the NullType (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.NullType.html)
data type, which Spark supports. This changes behavior when you use NULL or None in DataFrames.
In Spark, a literal NULL (for example, lit(None)) is automatically inferred as NullType. In Snowpark Connect for Spark, it is inferred as
StringType during schema inference.
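One way to avoid depending on how each engine infers a NULL literal is to state the type explicitly. The sketch below is an illustration under assumptions: `with_typed_null` is a hypothetical helper, `df` is assumed to be a PySpark DataFrame, and explicit casts are assumed to be honored by Snowpark Connect for Spark as they are in Spark.

```python
def with_typed_null(df, column_name, data_type="string"):
    """Add an all-NULL column whose type is stated explicitly.

    `df` is assumed to be a PySpark DataFrame; `data_type` is any type string
    accepted by Column.cast.
    """
    # Imported lazily so the helper can be defined without a Spark session.
    from pyspark.sql.functions import lit

    # Casting the literal avoids relying on how each engine infers NULL.
    return df.withColumn(column_name, lit(None).cast(data_type))
```

For example, `with_typed_null(df, "comment")` would add a NULL string column named `comment`, so the resulting schema no longer depends on inference.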
Structured data types in ArrayType, MapType, and ObjectType
Structured type support is not available by default in Snowpark Connect for Spark. Without it, ARRAY, MAP, and OBJECT data types are
treated as generic, untyped collections. This means there is no enforcement of element types, field names, schema, or nullability, unlike
what structured type support would provide.
If you depend on this support, work with your account team to enable the feature for your account.
Unsupported Spark APIs
The following APIs are supported by classic Spark and Spark Connect but are not supported in Snowpark Connect for Spark.
- DataFrame.hint (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.hint.html): Snowpark Connect for Spark ignores any hint that is set on a DataFrame. The Snowflake query optimizer automatically determines the most efficient execution strategy.
- DataFrame.repartition (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.repartition.html): This is a no-op in Snowpark Connect for Spark. Snowflake automatically manages data distribution and partitioning across its distributed computing infrastructure.
- pyspark.RDD (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.html): RDD APIs are not supported in Spark Connect (including Snowpark Connect for Spark).
- pyspark.ml (https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html)
- pyspark streaming (https://spark.apache.org/docs/latest/streaming-programming-guide.html)
UDF differences
StructType differences
When Spark passes a StructType value to a user-defined function (UDF), it converts the value to a tuple type in Python. Snowpark Connect for Spark converts a
StructType into a dict type in Python. As a result, element access and output differ fundamentally.
- Spark accesses elements by index, such as 0, 1, 2, and 3.
- Snowpark Connect for Spark accesses elements by key, such as '_1', '_2', and so on.
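A UDF body that has to run on both engines can hide this difference behind a small accessor. The following is a minimal sketch, assuming the dict keys follow the '_1', '_2' pattern described above. `struct_field` and `full_name` are hypothetical names used only for illustration.

```python
def struct_field(struct_value, position):
    """Return the field at a zero-based position from a struct passed to a UDF."""
    if isinstance(struct_value, dict):
        # Snowpark Connect for Spark: dict keyed '_1', '_2', ... (one-based).
        return struct_value[f"_{position + 1}"]
    # Spark: tuple-like value indexed 0, 1, 2, ...
    return struct_value[position]


def full_name(person):
    """Example UDF body that joins the first two fields of a struct."""
    return f"{struct_field(person, 0)} {struct_field(person, 1)}"


# The same logic works for both representations of the struct.
print(full_name(("Ada", "Lovelace")))              # tuple, as in Spark
print(full_name({"_1": "Ada", "_2": "Lovelace"}))  # dict, as in Snowpark Connect for Spark
```

Both calls print `Ada Lovelace`. Keeping the access logic in one place means only `struct_field` needs to change if the key naming differs in your workload.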
Iterator types in UDFs
Iterators are not supported as a return type or as an input type.
Importing files into Python UDFs
With Snowpark Connect for Spark, you can specify external libraries and files in Python UDFs. Snowflake includes Python files and archives in your code’s execution context. You can import functions from these included files in a UDF without additional steps. This dependency-handling behavior works as described in Creating a Python UDF with code uploaded from a stage.
To include external libraries and files, you provide stage paths to the files as the value of the configuration setting
snowpark.connect.udf.imports. The configuration value should be an array of stage paths to the files, where the paths are
separated by commas.
The following example code includes two files in the UDF's execution context. The UDF imports functions from these files and uses them in its logic.
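A minimal sketch of such a configuration, assuming the setting takes a bracketed, comma-separated list of stage paths. The stage name, the file names, and the imported functions (`first_function`, `second_function`) are hypothetical, and `spark` is assumed to be an existing Snowpark Connect for Spark session.

```python
def format_udf_imports(stage_paths):
    """Join stage paths into one comma-separated, bracketed config value.

    The bracketed form is an assumption based on the description of the
    setting as an array of comma-separated stage paths.
    """
    return "[" + ", ".join(stage_paths) + "]"


def configure_udf_imports(spark, stage_paths):
    """Set snowpark.connect.udf.imports on an existing Spark session."""
    spark.conf.set("snowpark.connect.udf.imports", format_udf_imports(stage_paths))


def import_example(value):
    """UDF body that uses functions from the two included files."""
    # These imports resolve only inside the UDF's execution context, where the
    # staged files are available. Both functions are hypothetical.
    from library import first_function
    from other_lib.custom import second_function

    return first_function(value) + second_function(value)


# Hypothetical stage paths; the files must already be uploaded to the stage.
imports = ["@my_stage/library.py", "@my_stage/other_lib.zip"]
print(format_udf_imports(imports))
```

After calling `configure_udf_imports(spark, imports)`, `import_example` could be registered as a UDF in the usual way, for example with `pyspark.sql.functions.udf`.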
You can use the snowpark.connect.udf.imports setting to include other kinds of files as well, such as those with data your code
needs to read. Note that when you do this, your code should only read from the included files; any writes to such files will be lost after
the function’s execution ends.
Lambda function limitations
User-defined functions (UDFs) are not supported in lambda expressions. This includes custom UDFs and certain built-in functions whose underlying implementation relies on Snowflake UDFs. Attempting to use a UDF inside a lambda expression results in an error.
Using path-sensitive modules
If the Python UDF body imports a module that requires a precise path, you need to take additional steps. When loading dependencies for UDFs, Snowflake puts all of the files in the working directory without preserving the original path. To preserve the original structure, you must zip the dependencies and then add the archive as an import for Snowpark Connect for Spark by using either addArtifacts or the snowpark.connect.udf.python.imports configuration setting.
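The zipping step can be sketched with the Python standard library. This is an illustration under assumptions rather than a prescribed procedure: `zip_package` is a hypothetical helper, and the directory and archive names in the comments are made up. The archive should be written outside the directory being zipped.

```python
import os
import zipfile


def zip_package(source_dir, zip_path):
    """Zip source_dir so that archive entries keep their relative paths."""
    parent = os.path.dirname(os.path.abspath(source_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for file_name in files:
                full_path = os.path.abspath(os.path.join(root, file_name))
                # Store paths relative to the parent, e.g. "my_pkg/sub/mod.py",
                # so the package layout survives inside the archive.
                archive.write(full_path, os.path.relpath(full_path, parent))
    return zip_path


# Usage with an existing session (not executed here; names are hypothetical):
#   archive = zip_package("my_pkg", "my_pkg.zip")
#   spark.addArtifacts(archive, pyfile=True)
```

Because entries are stored relative to the parent directory, a module at `my_pkg/sub/mod.py` keeps that path in the archive instead of being flattened.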
Data sources
| Data source | Compatibility issues compared with PySpark |
|---|---|
| Avro | File type is not supported. |
| CSV | Save mode is not supported for the following: The following are known limitations: The following options are not supported: |
| JSON | Save mode is not supported for the following: The following are known limitations: The following options are not supported: |
| ORC | File type is not supported. |
| Parquet | Save mode is not supported for the following: The following are known limitations: The following options are not supported: |
| Text | Save mode is not supported for the following: The following are known limitations: |
| XML | Save mode is not supported for the following: The following are known limitations: The following options are not supported: |
| Snowflake table | Writing to a table does not require a provider format. Bucketing and partitioning are not supported. Storage format and versioning are not supported. |
Catalog
Snowflake Horizon Catalog provider support
- Only Snowflake is supported as a catalog provider.
Unsupported catalog APIs
- registerFunction
- listFunctions
- getFunction
- functionExists
- createExternalTable
Partially supported catalog APIs
- createTable (no external table support)
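Iceberg
Snowflake-managed Iceberg tables
Snowpark Connect for Spark works with Apache Iceberg™ tables, including externally managed Iceberg tables and catalog-linked databases.
Read
Time Travel is not supported, including historical snapshots, branches, and incremental reads.
Write
- Creating tables with Spark SQL is not supported.
- Schema merging is not supported.
- To create the table, you must:
  - Create an external volume.
  - Associate the required external volume with table creation in either of the following ways:
    - Set EXTERNAL_VOLUME on the database.
    - Set snowpark.connect.iceberg.external_volume in the Spark configuration.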
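The Spark configuration option can be set on the session before the write. The following is a minimal sketch: `use_iceberg_external_volume` is a hypothetical helper, the volume name is a placeholder, and `spark` is assumed to be an existing Snowpark Connect for Spark session.

```python
def use_iceberg_external_volume(spark, volume_name):
    """Point Iceberg table creation at an existing external volume."""
    # The external volume must already exist in the Snowflake account.
    spark.conf.set("snowpark.connect.iceberg.external_volume", volume_name)


# Usage with an existing session (the volume name is a placeholder):
#   use_iceberg_external_volume(spark, "my_external_volume")
```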
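Externally managed Iceberg tables
Read
- A Snowflake unmanaged table entity must be created.
- Time Travel is not supported, including historical snapshots, branches, and incremental reads.
Write
- Creating tables is not supported.
- Writing to existing Iceberg tables is supported.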