Snowflake Catalog SDK¶
The Snowflake Catalog SDK is available for Apache Iceberg™ versions 1.2.0 or later.
With the Snowflake Catalog SDK, you can query Iceberg tables using a third-party engine such as Apache Spark™ or Trino.
Supported catalog operations¶
The SDK supports the following commands for browsing Iceberg metadata in Snowflake:
SHOW NAMESPACES
USE NAMESPACE
SHOW TABLES
USE DATABASE
USE SCHEMA
The SDK currently supports read operations (SELECT statements) only.
Install and connect¶
To install the Snowflake Catalog SDK, download the latest version of the Iceberg libraries (https://iceberg.apache.org/releases/).
Before you can use the Snowflake Catalog SDK, you need a Snowflake database with one or more Iceberg tables. To create an Iceberg table, see Create an Apache Iceberg™ table in Snowflake.
After you establish a connection and the SDK confirms that Iceberg metadata is present, Snowflake accesses your Parquet data using the external volume that is associated with your Iceberg tables.
Examples using Spark¶
Note
To learn about using Trino with the Snowflake Catalog SDK, see the Trino documentation (https://trino.io/docs/current/object-storage/metastores.html#iceberg-snowflake-catalog).
To read table data with the SDK, start by configuring the following properties for your Spark cluster:
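As a minimal sketch of such a configuration, the Python helper below assembles the catalog properties as a dictionary and renders them as --conf flags. The catalog name snowflake_catalog matches the name used elsewhere on this page. The account identifier, user, and password values are hypothetical placeholders. The io-impl value is an assumption that depends on where your table data is stored. The sketch also assumes that the Iceberg Spark runtime and the Snowflake JDBC driver are already on the Spark classpath.

```python
def snowflake_catalog_conf(account_identifier, user, password,
                           catalog_name="snowflake_catalog"):
    """Return Spark properties that register an Iceberg catalog backed by Snowflake."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Register the catalog through the standard Iceberg SparkCatalog adapter.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Use the Snowflake catalog implementation that ships with Iceberg.
        f"{prefix}.catalog-impl": "org.apache.iceberg.snowflake.SnowflakeCatalog",
        # JDBC URI that the catalog uses to communicate with Snowflake.
        f"{prefix}.uri": f"jdbc:snowflake://{account_identifier}.snowflakecomputing.com",
        # Credentials used for Iceberg metadata lookups.
        f"{prefix}.jdbc.user": user,
        f"{prefix}.jdbc.password": password,
        # Assumption: Hadoop-based file access; adjust for your cloud storage.
        f"{prefix}.io-impl": "org.apache.iceberg.hadoop.HadoopFileIO",
    }


def to_conf_flags(conf):
    """Render a property dictionary as command-line --conf flags."""
    return [f"--conf {key}={value}" for key, value in conf.items()]


if __name__ == "__main__":
    # Hypothetical placeholder values; replace them with your own.
    for flag in to_conf_flags(snowflake_catalog_conf("myaccount", "my_user", "my_password")):
        print(flag)
```

You can pass the resulting flags to spark-shell or spark-submit, or apply the same key-value pairs when you build a Spark session.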
Note
You can use any Snowflake-supported JDBC driver connection parameter
in your configuration by using the following syntax: --conf spark.sql.catalog.snowflake_catalog.jdbc.property-name=property-value
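The following Python sketch shows how this syntax expands for one parameter. The helper name jdbc_conf_flag and the role value ANALYST are hypothetical and used only for illustration.

```python
def jdbc_conf_flag(name, value, catalog_name="snowflake_catalog"):
    """Build a --conf flag that forwards one JDBC connection parameter to the catalog."""
    return f"--conf spark.sql.catalog.{catalog_name}.jdbc.{name}={value}"


if __name__ == "__main__":
    # Hypothetical example: connect with a specific Snowflake role.
    print(jdbc_conf_flag("role", "ANALYST"))
```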
After you configure your Spark cluster, you can check which tables are available to query. For example:
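A minimal Python sketch of this browsing step follows. It assumes that spark is an existing Spark session configured with a catalog named snowflake_catalog, and that my_database and my_schema are hypothetical names that you replace with your own. The statements use the commands listed under Supported catalog operations.

```python
def browse_statements(database, schema, catalog_name="snowflake_catalog"):
    """Return SQL statements that walk from the catalog down to its tables."""
    return [
        # List the databases that the catalog exposes as top-level namespaces.
        f"SHOW NAMESPACES IN {catalog_name}",
        # List the schemas inside one database.
        f"SHOW NAMESPACES IN {catalog_name}.{database}",
        # Make the schema the current namespace for this session.
        f"USE {catalog_name}.{database}.{schema}",
        # List the Iceberg tables in the current namespace.
        "SHOW TABLES",
    ]


def browse(spark, database, schema, catalog_name="snowflake_catalog"):
    """Run each browsing statement on an existing Spark session and print the result."""
    for statement in browse_statements(database, schema, catalog_name):
        spark.sql(statement).show()
```

For example, calling browse(spark, "my_database", "my_schema") prints the namespaces and then the tables in that schema.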
Then you can select a table to query.
You can use the DataFrame structure with languages like Python and Scala to query data.
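The sketch below shows both approaches in Python: a SELECT statement and the DataFrame API. It assumes that spark is an existing Spark session configured with a catalog named snowflake_catalog. The database, schema, and table names are hypothetical, and the helper assumes simple identifiers that need no quoting.

```python
def qualified_table(database, schema, table, catalog_name="snowflake_catalog"):
    """Build the fully qualified table name that Spark uses to reach the table."""
    return f"{catalog_name}.{database}.{schema}.{table}"


def preview(spark, database, schema, table, limit=10):
    """Show the first rows of a table with SQL, then with the DataFrame API."""
    name = qualified_table(database, schema, table)

    # Read-only SQL query against the Iceberg table.
    spark.sql(f"SELECT * FROM {name} LIMIT {limit}").show()

    # The same table loaded as a DataFrame.
    df = spark.table(name)
    df.limit(limit).show()
    return df
```

For example, preview(spark, "my_database", "my_schema", "my_table") prints the first ten rows twice and returns the DataFrame for further processing.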
Note
If you receive vectorized read errors while running queries, you can disable vectorized reads for your session
by setting spark.sql.iceberg.vectorization.enabled=false. To keep using vectorized reads,
you can set the STORAGE_SERIALIZATION_POLICY parameter instead.
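As a minimal sketch, the session-level setting can be applied from Python as shown below. It assumes that spark is an existing Spark session. The helper names are hypothetical.

```python
VECTORIZATION_KEY = "spark.sql.iceberg.vectorization.enabled"


def vectorization_override(enabled):
    """Return the session property that turns Iceberg vectorized reads on or off."""
    return {VECTORIZATION_KEY: "true" if enabled else "false"}


def apply_overrides(spark, overrides):
    """Apply each property to the runtime configuration of an existing Spark session."""
    for key, value in overrides.items():
        spark.conf.set(key, value)
```

For example, apply_overrides(spark, vectorization_override(False)) disables vectorized reads for the current session only.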
Query caching¶
When you issue a query, Snowflake caches the result for a certain time frame (90 seconds by default).
As a result, changes might take up to that long to appear in query results. If you plan to access data programmatically for comparison purposes,
you can set the spark.sql.catalog.<catalog_name>.cache-enabled property to false to disable caching.
If your application is designed to tolerate a specific amount of latency, you can use the following property
to specify the latency period: spark.sql.catalog.<catalog_name>.cache.expiration-interval-ms.
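The Python sketch below builds these two caching properties. It assumes that the properties are scoped to the catalog name, as Iceberg catalog options in Spark are, and it uses the catalog name snowflake_catalog from the earlier examples. The helper name cache_conf and the example interval are hypothetical.

```python
def cache_conf(catalog_name="snowflake_catalog", enabled=True, expiration_ms=None):
    """Return the caching properties for one Iceberg catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    conf = {f"{prefix}.cache-enabled": "true" if enabled else "false"}
    # The expiration interval is only meaningful while caching is enabled.
    if enabled and expiration_ms is not None:
        conf[f"{prefix}.cache.expiration-interval-ms"] = str(int(expiration_ms))
    return conf


if __name__ == "__main__":
    # Disable caching entirely, for example when comparing data programmatically.
    print(cache_conf(enabled=False))
    # Keep caching, but tolerate at most 10 seconds of staleness (example value).
    print(cache_conf(expiration_ms=10_000))
```

Set these properties when you start the Spark session, together with the other catalog properties.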
Limitations¶
The following limitations apply to the Snowflake Catalog SDK and are subject to change:
The SDK currently supports read operations (SELECT statements) only.
Only Apache Spark and Trino are supported for reading Iceberg tables.
You cannot use the SDK to access non-Iceberg Snowflake tables.