Apache Iceberg tables with Snowpark Connect for Spark

Snowpark Connect for Spark supports reading and writing Apache Iceberg™ tables through the standard Spark DataFrame read/write APIs. The same APIs work for both Snowflake-managed and externally managed Iceberg tables. For general information about Iceberg tables in Snowflake, see Apache Iceberg™ tables.

Setup

Before you can create or write to Iceberg tables, configure an external volume and link it to your session in one of the following ways:

  • Set the EXTERNAL_VOLUME property on the database.

  • Set snowpark.connect.iceberg.external_volume in your Spark configuration:

spark.conf.set("snowpark.connect.iceberg.external_volume", "<your_volume>")
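The database-level option can be set with a SQL statement run through the session. A sketch, assuming an active Snowpark Connect for Spark session named `spark`, a hypothetical database `my_database`, and a hypothetical external volume name:

```python
# Sketch only: sets the EXTERNAL_VOLUME property on a database via SQL.
# `my_database` and `my_external_volume` are placeholder names.
spark.sql("ALTER DATABASE my_database SET EXTERNAL_VOLUME = 'my_external_volume'")
```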

Reading Iceberg tables

Use the standard Spark table reader. The argument to load() is a Snowflake table identifier (for example, my_database.my_schema.my_table), not a file path or catalog URI:

df = spark.read.table("my_iceberg_table")

df = spark.read.format("iceberg").load("my_iceberg_table")

For externally managed tables, you must first create a Snowflake table entity that points to the external catalog before you can read from it. See Create an Apache Iceberg™ table in Snowflake.

Writing with the V1 API (DataFrameWriter)

The V1 DataFrameWriter API is the recommended way to write Iceberg tables. For the full list of modes, see the Spark DataFrameWriter documentation (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.html).

df.write.format("iceberg").saveAsTable("my_iceberg_table")

For dynamic partition overwrite, combine mode("overwrite") with option("overwrite-mode", "dynamic") and partitionBy().
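Put together, a dynamic partition overwrite looks like the following sketch, which assumes an active session, a DataFrame `df`, and a hypothetical partition column `region`:

```python
# Sketch only: dynamic partition overwrite with the V1 DataFrameWriter.
# `df`, `region`, and the table name are placeholder assumptions.
(
    df.write.format("iceberg")
      .mode("overwrite")                    # overwrite mode is required
      .option("overwrite-mode", "dynamic")  # replace only partitions present in df
      .partitionBy("region")
      .saveAsTable("my_database.my_schema.my_iceberg_table")
)
```

Partitions not represented in `df` are left untouched; a plain `mode("overwrite")` without the option would replace the whole table.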

Standard SQL statements such as INSERT INTO and INSERT OVERWRITE also work.
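For example (a sketch; the table and source names are assumptions):

```python
# Sketch only: standard SQL inserts issued through spark.sql().
# Table and source names are placeholders.
spark.sql("INSERT INTO my_database.my_schema.my_iceberg_table SELECT * FROM staging_table")
spark.sql("INSERT OVERWRITE my_database.my_schema.my_iceberg_table SELECT * FROM staging_table")
```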

Write options

  • base_location: Base storage location for table data. Aliases: location, iceberg.base_location.

  • catalog: Catalog to use (defaults to SNOWFLAKE). Alias: iceberg.catalog.

  • storage_serialization_policy: Controls how data files are serialized. Alias: iceberg.storage_serialization_policy.

  • write.target-file-size: Target size for written data files. Valid values: AUTO, 16MB, 32MB, 64MB, 128MB. Alias: target_file_size.

  • overwrite-mode: Set to dynamic with mode("overwrite") and partitionBy() for dynamic partition overwrite.

  • mergeSchema: Set to true to enable schema evolution. Can also be set at the session level with spark.conf.set("spark.sql.iceberg.merge-schema", "true").
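Options are passed through the standard option() method on the writer. A sketch combining several of the options above, assuming a session, a DataFrame `df`, and placeholder names:

```python
# Sketch only: passing write options on a V1 Iceberg write.
# The base location, file size, and table name are placeholder choices.
(
    df.write.format("iceberg")
      .option("base_location", "warehouse/sales/")  # storage path under the external volume
      .option("target_file_size", "64MB")           # one of AUTO, 16MB, 32MB, 64MB, 128MB
      .option("mergeSchema", "true")                # allow new top-level columns
      .mode("append")
      .saveAsTable("my_database.my_schema.sales")
)
```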

Writing with the V2 API (DataFrameWriterV2)

The DataFrameWriterV2 API is partially supported. The following operations work: create, append, replace, createOrReplace, overwrite, and overwritePartitions. However, several methods behave differently than in open-source Spark.
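A sketch of the supported V2 operations, assuming a session, a DataFrame `df`, and a placeholder table name:

```python
# Sketch only: DataFrameWriterV2 calls reported as working.
# `df` and the table name are placeholder assumptions.
df.writeTo("my_database.my_schema.my_iceberg_table").using("iceberg").create()
df.writeTo("my_database.my_schema.my_iceberg_table").append()
df.writeTo("my_database.my_schema.my_iceberg_table").createOrReplace()
```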

Tip

For the most predictable behavior, use the V1 DataFrameWriter API for Iceberg writes. The V1 API has full support for write modes, schema evolution, dynamic partition overwrite, and write options.

Snowflake-managed vs. externally managed tables

The read and write APIs are the same for both table types. The differences are in table creation and overwrite behavior:

  • Snowflake-managed tables: You can create tables through Snowpark Connect for Spark using the DataFrame write APIs. See Create an Apache Iceberg™ table in Snowflake.

  • Externally managed tables: An externally managed Iceberg table is one whose metadata lives in an external catalog (for example, AWS Glue or Polaris) rather than in Snowflake’s own catalog. The following applies only to those tables:

    • Create the table outside of Spark. Snowpark Connect for Spark can’t create externally managed tables on your behalf. Create the table once with SQL before any Spark code reads from or writes to it. See Create an Apache Iceberg™ table in Snowflake.

    • Overwrite uses an automatic fallback. A standard Spark overwrite re-creates the table, which Snowflake rejects for externally managed tables. Snowpark Connect for Spark instead truncates the table and inserts the new data, preserving the table definition and its catalog integration.

    • Schema evolution (mergeSchema):

      • Append: Supported. New columns in your DataFrame are added to the table with ALTER ICEBERG TABLE … ADD COLUMN, provided Snowflake has permission to alter the table through the external catalog.

      • Overwrite: Not supported. The DataFrame schema must already match the table. If it doesn’t, evolve the schema in the external catalog first or use a Snowflake-managed Iceberg table.

Known limitations

  • Using Spark SQL to create Iceberg tables isn’t supported. Use the DataFrame write APIs instead.

  • Schema evolution within STRUCT columns isn’t supported yet. mergeSchema can add new top-level columns but can’t add or modify nested fields inside an existing STRUCT column.

  • Time travel isn’t supported for either table type, including historical snapshot reads, branch reads, and incremental reads.