Query Parquet files directly in your data lake¶
Note
Snowflake also supports loading data from Parquet files directly into a Snowflake-managed Iceberg table. This option is generally available. For more information, see Load data into Apache Iceberg™ tables and Example: Load Iceberg-compatible Parquet files.
This topic covers how to create a read-only Apache Iceberg™ table for Snowflake from Parquet files that you manage in object storage, otherwise known as Parquet Direct. This option allows you to query Parquet data directly in your data lake, so you don’t have to make a copy of the data or pay for data ingestion. Parquet Direct presents multiple benefits:
Cost: Significantly lower cost compared to full ingestion, and no per-file refresh charge as with external tables
Seamless Syncing: Automatically refresh tables in Snowflake to reflect changes made to files in storage (add, delete, upsert, schema changes)
Hive-style Partitioning: Full support for Hive-style partitioning (for example, key=value), which makes it easy for you to modernize legacy datasets
Read-Only Permission Model: The permission model doesn't require write access to your storage, enabling its use in security-conscious and regulated verticals
Performance: Iceberg-grade query performance, unlike external tables
Note
Parquet Direct tables are read-only, so you can't perform the following DML operations on these tables through Snowflake:
Insert
Update
Delete
Instead, you can use other engines to perform DML operations directly on the files in cloud storage. For the full list of limitations in this private preview, see Limitations for querying Parquet files directly in your data lake.
Partitioned tables¶
To improve query performance, we strongly recommend that you partition tables created from Parquet source files by using partition columns. Query response time is faster when Snowflake processes only a small part of the data instead of having to scan the entire data set. An Iceberg table definition can include multiple partition columns, which impose a multi-dimensional structure on the external data.
To partition a table, your data must be organized using logical paths.
When you create an Iceberg table, you define partition columns as expressions that parse the path or filename information stored in the METADATA$FILENAME pseudo-column. A partition consists of all data files that match the path and/or filename in the expression for the partition column.
Snowflake computes and adds partitions based on the defined partition column expressions when an Iceberg table is refreshed. For an example of creating a partitioned Iceberg table, see Example: Create an Iceberg table from Parquet files, specifying a partition column.
Workflow¶
Use the workflow in this section to create an Iceberg table from Parquet source files.
Note
If you store your Parquet files in Amazon S3 or Microsoft Azure, you can create a table that supports automatically refreshing the table data. To learn more, see the following sections:
Step 1: Create an external volume¶
To create an external volume, complete the instructions for your cloud storage service:
Step 2: Create a catalog integration¶
When you create a catalog integration for this use case, you specify the following properties for the catalog integration:
There isn’t a table format
There isn’t a catalog
The tables reside in object storage
Create a catalog integration by using the CREATE CATALOG INTEGRATION command.
To indicate that the catalog integration is for Iceberg tables created from Parquet source files, set the CATALOG_SOURCE
parameter equal to OBJECT_STORE and the TABLE_FORMAT parameter equal to NONE.
Note
Snowflake does not support creating Iceberg tables from Parquet-based table definitions in the AWS Glue Data Catalog.
The following example creates a catalog integration for Parquet files in object storage.
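A minimal sketch of such a statement, assuming a hypothetical integration name of my_parquet_catalog_int:

```sql
-- Hypothetical integration name. CATALOG_SOURCE and TABLE_FORMAT take the
-- values this topic requires for Parquet files in object storage.
CREATE OR REPLACE CATALOG INTEGRATION my_parquet_catalog_int
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = NONE
  ENABLED = TRUE;
```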
Step 3: Create an Iceberg table¶
Create an Iceberg table by using the CREATE ICEBERG TABLE command.
Example: Create an Iceberg table from Parquet files, specifying data columns
Example: Create an Iceberg table from Parquet files, specifying a partition column
Example: Create an Iceberg table from Parquet files using automatic schema inference
Syntax¶
Where:
For additional inline constraint details, see CREATE | ALTER TABLE … CONSTRAINT.
For additional out-of-line constraint details, see CREATE | ALTER TABLE … CONSTRAINT.
Required parameters¶
table_name
Specifies the identifier (name) for the table; must be unique for the schema in which the table is created.
In addition, the identifier must start with an alphabetic character and cannot contain spaces or special characters unless the entire identifier string is enclosed in double quotes (for example, "My object"). Identifiers enclosed in double quotes are also case-sensitive.
For more details, see Identifier requirements.
BASE_LOCATION = 'relative_path_from_external_volume'
Specifies a relative path from the table's EXTERNAL_VOLUME location to a directory where Snowflake can access your Parquet files and write table metadata. The base location must point to a directory and cannot point to a single Parquet file.
Optional parameters¶
col_name
Specifies the column identifier (name). All the requirements for table identifiers also apply to column identifiers.
For more details, see Identifier requirements and Reserved & limited keywords.
Note
In addition to the standard reserved keywords, the following keywords cannot be used as column identifiers because they are reserved for ANSI-standard context functions:
CURRENT_DATE, CURRENT_ROLE, CURRENT_TIME, CURRENT_TIMESTAMP, CURRENT_USER
For the list of reserved keywords, see Reserved & limited keywords.
col_type
Specifies the data type for the column.
For details about the data types that can be specified for table columns, see Data type mapping and SQL data types reference.
expr
String that specifies the expression for the column. When queried, the column returns results derived from this expression.
A column can be a virtual column, which is defined using an explicit expression.
- METADATA$FILENAME:
A pseudo-column that identifies the name of each Parquet data file included in the table, relative to its path on the external volume.
For example:
If the external volume location is s3://bucket-name/data/warehouse/ and the BASE_LOCATION of the table is default_db/schema_name/table_name/, the absolute location of the Parquet file is s3://bucket-name/data/warehouse/default_db/schema_name/table_name/ds=2023-01-01/file1.parquet. As a result, the METADATA$FILENAME value for this file is default_db/schema_name/table_name/ds=2023-01-01/file1.parquet.
CONSTRAINT ...
Defines an inline or out-of-line constraint for the specified column(s) in the table.
For syntax details, see CREATE | ALTER TABLE … CONSTRAINT. For more information about constraints, see Constraints.
COLLATE 'collation_specification'
Specifies the collation to use for column operations such as string comparison. This option applies only to text columns (VARCHAR, STRING, TEXT, etc.). For more details, see Collation specifications.
MASKING POLICY = policy_name
Specifies the masking policy to set on a column.
EXTERNAL_VOLUME = 'external_volume_name'
Specifies the identifier (name) for the external volume where Snowflake can access your Parquet data files.
You must specify an external volume if you have not set one at the database or schema level. Otherwise, the Iceberg table defaults to the external volume for the schema, database, or account. The schema takes precedence over the database, and the database takes precedence over the account.
CATALOG = 'catalog_integration_name'
Specifies the identifier (name) of the catalog integration for this table.
You must specify a catalog integration if you have not set one at the database or schema level. Otherwise, the Iceberg table defaults to the catalog integration for the schema, database, or account. The schema takes precedence over the database, and the database takes precedence over the account.
INFER_SCHEMA = '{ TRUE | FALSE }'
Specifies whether to automatically detect and evolve the table schema (based on the fields in the Parquet files) in order to retrieve column definitions and partition values.
TRUE: Snowflake detects the table schema to retrieve column definitions and detect partition values. With this option, Snowflake automatically creates virtual columns for partition values in the Parquet file path. If you manually or automatically refresh the table and the Parquet file schema changes, Snowflake automatically evolves the table schema by creating newly identified columns as visible table columns.
FALSE: Snowflake does not detect the table schema. You must include column definitions in your CREATE ICEBERG TABLE statement.
Default: TRUE if you don’t provide a column definition; otherwise, FALSE.
AUTO_REFRESH = '{ TRUE | FALSE }'
Specifies whether the table data is automatically refreshed. This parameter is required only when you create an Iceberg table from Parquet files that supports automatic refreshes. For more information, see Refresh an Iceberg table automatically for Amazon S3.
PATTERN = '{regex_pattern}'
A regular expression pattern string, enclosed in single quotes, specifying the filenames and paths on the external stage to match.
If you manage your Parquet source files in Amazon S3, you can use this parameter to avoid reaching the AWS limit for the number of SNS topics that can be created per account. To avoid reaching this limit, do the following:
Create one SNS topic at the bucket level
Create tables, which each have their own regular expression pattern, to logically group them
For more information on this SNS topic limit, see the AWS documentation (https://docs.aws.amazon.com/general/latest/gr/sns.html#limits_sns_resource).
Tip
For the best performance, don’t apply patterns that filter on a large number of files.
REPLACE_INVALID_CHARACTERS = { TRUE | FALSE }
Specifies whether to replace invalid UTF-8 characters with the Unicode replacement character (�) in query results. You can only set this parameter for tables that use an external Iceberg catalog.
TRUE: Replaces invalid UTF-8 characters with the Unicode replacement character.
FALSE: Leaves invalid UTF-8 characters unchanged. Snowflake returns a user error message when it encounters invalid UTF-8 characters in a Parquet data file.
If not specified, the Iceberg table defaults to the parameter value for the schema, database, or account. The schema takes precedence over the database, and the database takes precedence over the account.
Default: FALSE
ROW ACCESS POLICY policy_name ON ( col_name [ , col_name ... ] )
Specifies the row access policy to set on a table.
TAG ( tag_name = 'tag_value' [ , tag_name = 'tag_value' , ... ] )
Specifies the tag name and the tag string value.
The tag value is always a string, and the maximum number of characters for the tag value is 256.
For information about specifying tags in a statement, see Tag quotas.
COMMENT 'string_literal'
Specifies a comment for the column or the table.
Comments can be specified at the column level or the table level. The syntax for each is slightly different.
Partitioning parameters¶
Use these parameters to partition your Iceberg table.
part_col_name col_type AS part_expr
Defines one or more partition columns in the Iceberg table.
A partition column must evaluate as an expression that parses the path and/or filename information in the METADATA$FILENAME pseudo-column. A partition consists of all data files that match the path and/or filename in the expression for the partition column.
part_col_name
String that specifies the partition column identifier (that is, its name). All the requirements for table identifiers also apply to column identifiers.
col_type
String (constant) that specifies the data type for the column. The data type must match the result of part_expr for the column.
part_expr
String that specifies the expression for the column. The expression must include the METADATA$FILENAME pseudo-column.
Iceberg tables currently support the following subset of operators in partition expressions:
Comparison operators: =, <>, >, >=, <, <=
Concatenation: ||
Arithmetic operators: +, -, - (negate), *
Logical operators: AND, OR, NOT
[ PARTITION BY ( part_col_name [, part_col_name ... ] ) ]
Specifies any partition columns to evaluate for the Iceberg table.
Usage:
When querying an Iceberg table, include one or more partition columns in a WHERE clause. For example:
... WHERE part_col_name = 'filter_value'
A common practice is to partition the data files based on increments of time; or, if the data files are staged from multiple sources, to partition by a data source identifier and date or timestamp.
Example: Create an Iceberg table from Parquet files, specifying data columns¶
The following example creates an Iceberg table from Parquet files in object storage.
The example specifies the external volume and catalog integration created previously in this workflow,
and provides a value for the required BASE_LOCATION parameter.
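As a sketch, assuming a hypothetical external volume named my_ext_vol and a catalog integration named my_parquet_catalog_int were created in the previous steps, such a statement might look like the following:

```sql
-- Table, column, and path names are illustrative placeholders.
CREATE ICEBERG TABLE customer_parquet_table (
    first_name STRING,
    last_name STRING,
    amount NUMBER(10, 2)
  )
  EXTERNAL_VOLUME = 'my_ext_vol'
  CATALOG = 'my_parquet_catalog_int'
  BASE_LOCATION = 'customer_data/';
```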
Example: Create an Iceberg table from Parquet files, specifying a partition column¶
The following example creates an Iceberg table from Parquet files in object storage and defines a partition column
named sr_returned_date_sk.
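A sketch of such a statement, with hypothetical object names. The partition column is defined as an expression over the METADATA$FILENAME pseudo-column so that Snowflake can derive the partition value from a Hive-style key=value path segment; see Partitioning parameters for the constructs allowed in partition expressions:

```sql
-- Hypothetical names; the partition expression parses a Hive-style
-- path such as 'sr_returned_date_sk=2451123/file1.parquet'.
CREATE ICEBERG TABLE store_returns (
    sr_return_time STRING,
    sr_item_sk NUMBER(10, 0),
    sr_returned_date_sk NUMBER(10, 0) AS
      (TO_NUMBER(SPLIT_PART(SPLIT_PART(METADATA$FILENAME, '=', 2), '/', 1)))
  )
  PARTITION BY (sr_returned_date_sk)
  EXTERNAL_VOLUME = 'my_ext_vol'
  CATALOG = 'my_parquet_catalog_int'
  BASE_LOCATION = 'store_returns/';
```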
Example: Create an Iceberg table from Parquet files using automatic schema inference¶
The following example creates an Iceberg table from Parquet files using automatic schema inference without including a column definition.
Alternatively, you can include a column definition to provide information about certain columns.
Snowflake uses the definition to create those columns, then automatically detects other table columns.
In this scenario, you must specify INFER_SCHEMA = TRUE since you include a column definition.
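A sketch that relies on schema inference, using hypothetical object names; with no column list, INFER_SCHEMA defaults to TRUE:

```sql
-- No column list: Snowflake infers columns from the Parquet files
-- and creates virtual columns for Hive-style partition values.
CREATE ICEBERG TABLE events_inferred
  EXTERNAL_VOLUME = 'my_ext_vol'
  CATALOG = 'my_parquet_catalog_int'
  BASE_LOCATION = 'events/';
```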
Refresh the table¶
After you create an Iceberg table from Parquet files, you can refresh the table data. Refreshing synchronizes the table with the most recent changes to your Parquet files in object storage. You can either automatically refresh the table data or manually refresh the table data.
Note
We recommend setting up automatic refresh for the Parquet source files.
Schema evolution¶
With the INFER_SCHEMA parameter equal to TRUE, table refresh synchronizes your table with the following
schema changes to the Parquet source files:
New columns
Type widening for the following scenarios to adhere to the Apache Iceberg specification:
int to long
float to double
decimal(p1,s) to decimal(p2,s), where precision increases and scale is unchanged
Primitive columns¶
| Original Type | Widened Type | Notes |
|---|---|---|
| int | long | NUMBER(10,0) → NUMBER(19,0) |
| float | double | Both map to Snowflake FLOAT |
| decimal(p1, s) | decimal(p2, s) | Precision can increase (for example, decimal(5,2) → decimal(7,2)) |
Struct fields¶
Type widening applies recursively to struct fields:
| Original Type | Widened Type |
|---|---|
| struct<field:int> | struct<field:long> |
| struct<field:float> | struct<field:double> |
| struct<field:decimal(p1,s)> | struct<field:decimal(p2,s)> |
The following example shows type widening for a struct field:
Array elements¶
Type widening applies to array element types:
| Original Type | Widened Type |
|---|---|
| array<int> | array<long> |
| array<float> | array<double> |
| array<decimal(p1,s)> | array<decimal(p2,s)> |
Map keys and values¶
Type widening applies to both map keys and values:
| Original Type | Widened Type |
|---|---|
| map<int, V> | map<long, V> |
| map<K, float> | map<K, double> |
| map<K, decimal(p1,s)> | map<K, decimal(p2,s)> |
The following example shows type widening for map keys and values:
Automatically refresh tables¶
For information on how to set up automatic refresh, see the instructions for your cloud provider:
These instructions include a step for creating a table from Parquet files with automated refresh enabled on the table.
Manually refresh tables¶
Important
When auto refresh is enabled on a table, you can’t perform a manual refresh on the table. To perform a manual refresh on the table, auto refresh must be disabled.
After you create an Iceberg table from Parquet files, you can refresh the table data using the ALTER ICEBERG TABLE command.
Where:
relative_path
Optional path to a Parquet file or a directory of Parquet files that you want to refresh.
Note
If you specify a relative path that does not exist, the table refresh proceeds as if no relative path was specified.
Example: Refresh all of the files in a table’s BASE_LOCATION¶
To manually refresh all of the files in the table’s BASE_LOCATION, omit the relative path argument:
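For example (the table name is a placeholder):

```sql
ALTER ICEBERG TABLE my_parquet_table REFRESH;
```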
Example: Refresh the files in a subpath from the BASE_LOCATION¶
To manually refresh a set of Parquet files in a directory, specify a relative path to that directory from the table’s BASE_LOCATION:
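For example, with placeholder table and directory names:

```sql
-- Refresh only the files under the 'daily/' subdirectory of BASE_LOCATION.
ALTER ICEBERG TABLE my_parquet_table REFRESH 'daily/';
```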
Example: Refresh a particular file¶
To manually refresh a particular Parquet file, specify a relative path to that file from the BASE_LOCATION:
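For example, with placeholder table and file names:

```sql
-- Refresh a single file, relative to BASE_LOCATION.
ALTER ICEBERG TABLE my_parquet_table REFRESH 'daily/file1.parquet';
```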
Example: Refresh a particular partition¶
To manually refresh a particular partition, specify a relative path to that partition from the BASE_LOCATION:
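For example, with a placeholder table name and a Hive-style partition path:

```sql
-- Refresh a single partition directory, relative to BASE_LOCATION.
ALTER ICEBERG TABLE my_parquet_table REFRESH 'ds=2023-01-01/';
```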
Refresh an Iceberg table automatically for Amazon S3¶
If you manage your Parquet source files in Amazon S3, you can create an Iceberg table that uses Amazon SNS (Simple Notification Service) for automatic refresh.
This section provides instructions for creating an Iceberg table that automatically refreshes the Parquet source files.
Prerequisite: Create an Amazon SNS topic and subscription¶
Create an SNS topic in your AWS account to handle all messages for the Snowflake external volume location on your S3 bucket.
Subscribe your target destinations for the S3 event notifications (for example, other SQS queues or AWS Lambda workloads) to this topic. SNS publishes event notifications for your bucket to all subscribers to the topic.
For full instructions, see the SNS documentation (https://aws.amazon.com/documentation/sns/).
Step 1: Subscribe the Snowflake SQS queue to your SNS topic¶
Log in to the AWS Management Console.
From the home dashboard, select Simple Notification Service (SNS).
In the left-hand navigation pane, select Topics.
Locate the topic for your S3 bucket. Note the topic ARN.
Using a Snowflake client, query the SYSTEM$GET_AWS_SNS_IAM_POLICY system function with your SNS topic ARN:
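For example, with a placeholder topic ARN:

```sql
SELECT SYSTEM$GET_AWS_SNS_IAM_POLICY(
  'arn:aws:sns:us-west-2:001234567890:my_bucket_topic');
```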
The function returns an IAM policy that grants a Snowflake SQS queue permission to subscribe to the SNS topic.
Return to the AWS Management console. In the left-hand navigation pane, select Topics.
Select the topic for your S3 bucket, then select Edit. The Edit page opens.
Select Access policy - Optional to expand this area of the page.
Merge the IAM policy addition from the SYSTEM$GET_AWS_SNS_IAM_POLICY function results into the JSON document.
To allow S3 to publish event notifications for the bucket to the SNS topic, add an additional policy grant.
For example:
Merged IAM policy:
Select Save changes.
Step 2: Create an external volume with your AWS SNS topic¶
To configure an external volume, complete the instructions for Configure an external volume for Amazon S3.
In Step 4: Creating an external volume in Snowflake, specify the following additional parameter:
AWS_SNS_TOPIC = '<sns_topic_arn>'
Specifies the Amazon Resource Name (ARN) of the Amazon SNS topic that handles all messages for your external volume location.
For example:
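A sketch of the statement, with placeholder names and ARNs:

```sql
-- All names, bucket paths, and ARNs below are placeholders.
CREATE EXTERNAL VOLUME my_s3_ext_vol
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-s3-location'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://my-bucket/data/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::001234567890:role/my_snowflake_role'
      AWS_SNS_TOPIC = 'arn:aws:sns:us-west-2:001234567890:my_bucket_topic'
    )
  );
```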
Step 3: Create a catalog integration¶
Create a catalog integration by using the CREATE CATALOG INTEGRATION command. To indicate that the catalog integration is for
Iceberg tables created from Parquet source files, set the CATALOG_SOURCE
parameter equal to OBJECT_STORE and the TABLE_FORMAT parameter equal to NONE.
Note
Snowflake does not support creating Iceberg tables from Parquet-based table definitions in the AWS Glue Data Catalog.
The following example creates a catalog integration for Parquet files in object storage.
Step 4: Create an Iceberg table¶
Create an Iceberg table by using the CREATE ICEBERG TABLE command, setting the AUTO_REFRESH parameter equal to TRUE.
Example: Create an Iceberg table from Parquet files using automatic schema inference and evolution with auto refresh¶
The following example creates an Iceberg table from Parquet files using:
Automatic schema inference without including a column definition
Automatic schema evolution in auto refresh
Alternatively, you can include a column definition to provide information about certain columns.
Snowflake uses the definition to create those columns, then automatically detects other table columns.
In this scenario, you must specify INFER_SCHEMA = TRUE since you include a column definition.
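A sketch of such a statement, assuming hypothetical objects named my_s3_ext_vol and my_parquet_catalog_int from the previous steps:

```sql
-- No column list, so the schema is inferred; AUTO_REFRESH keeps the
-- table in sync as files change in S3.
CREATE ICEBERG TABLE events_auto
  EXTERNAL_VOLUME = 'my_s3_ext_vol'
  CATALOG = 'my_parquet_catalog_int'
  BASE_LOCATION = 'events/'
  AUTO_REFRESH = TRUE;
```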
Troubleshoot¶
To track the status of automatic refreshes for your Iceberg table, use the SYSTEM$ICEBERG_TABLE_AUTO_REFRESH_STATUS function.
For example:
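For example, with a placeholder table name:

```sql
SELECT SYSTEM$ICEBERG_TABLE_AUTO_REFRESH_STATUS('events_auto');
```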
Refresh an Iceberg table automatically for Azure Blob Storage¶
If you manage your Parquet source files in Microsoft Azure, you can create an Iceberg table that uses Azure Event Grid for automatic refresh.
This section provides instructions for creating an Iceberg table that automatically refreshes the Parquet source files.
Supported accounts, APIs, and schemas¶
Snowflake supports the following types of blob storage accounts:
Blob storage
Data Lake Storage Gen2
General-purpose v2
Automatic refresh of your Iceberg table from Parquet files isn't supported for Microsoft Fabric OneLake. For OneLake Iceberg tables from Parquet files, you must manually refresh a table by using ALTER ICEBERG TABLE … REFRESH.
Note
Only Microsoft.Storage.BlobCreated and Microsoft.Storage.BlobDeleted events trigger the refreshing of the Parquet source files.
Adding new objects to blob storage triggers these events. Renaming a directory or object doesn’t trigger these events. Snowflake
recommends that you only send supported events for Iceberg tables from Parquet files to reduce costs, event noise, and latency.
For cloud platform support, triggering automated refreshes of the Parquet source files using Azure Event Grid messages is available only for Snowflake accounts hosted on Microsoft Azure (Azure).
Snowflake supports the following Microsoft.Storage.BlobCreated APIs:
CopyBlob, PutBlob, PutBlockList, FlushWithClose, SftpCommit
Snowflake supports the following Microsoft.Storage.BlobDeleted APIs:
DeleteBlob, DeleteFile, SftpRemove
For Data Lake Storage Gen2 storage accounts, Microsoft.Storage.BlobCreated events are triggered when clients use the CreateFile
and FlushWithClose operations. If the SSH File Transfer Protocol (SFTP) is used, Microsoft.Storage.BlobCreated events are triggered with SftpCreate and SftpCommit operations. The CreateFile or SftpCreate API alone does not indicate a commit of a file in the storage account. If the
FlushWithClose or SftpCommit message is not sent, Snowflake does not refresh the Parquet source files.
Snowflake only supports the Azure Event Grid event schema (https://learn.microsoft.com/en-us/azure/event-grid/event-schema); it doesn’t support the CloudEvents schema with Azure Event Grid (https://learn.microsoft.com/en-us/azure/event-grid/cloud-event-schema).
Iceberg tables for Snowflake from Parquet files that you manage in object storage don’t support storage versioning.
Prerequisites¶
Before you proceed, ensure you meet the following prerequisites:
A role that has the CREATE EXTERNAL VOLUME and CREATE ICEBERG TABLE privileges on a schema.
Administrative access to Microsoft Azure. If you aren’t an Azure administrator, ask your Azure administrator to complete the steps in Step 1: Configure the Event Grid subscription.
Step 1: Configure the Event Grid subscription¶
This section describes how to set up an Event Grid subscription for Azure Storage events using the Azure CLI.
Create a resource group¶
An Event Grid topic provides an endpoint where the source (that is, Azure Storage) sends events. A topic is used for a collection of related events. Event Grid topics are Azure resources, and must be placed in an Azure resource group.
Execute the following command to create a resource group:
Where:
<resource_group_name> is the name of the new resource group.
<location> is the location, or region in Snowflake terminology, of your Azure Storage account.
Enable the Event Grid resource provider¶
Execute the following command to register the Event Grid resource provider. Note that this step is only required if you have not previously used Event Grid with your Azure account:
Create a storage account for data files¶
Execute the following command to create a storage account to store your data files. This account must be either a Blob storage
(that is, a BlobStorage kind) or GPv2 (that is, a StorageV2 kind) account, because only these two account types support event messages.
Note
If you already have a Blob storage or GPv2 account, you can use that account instead.
For example, create a Blob storage account:
Where:
<resource_group_name> is the name of the resource group you created in Create a resource group.
<storage_account_name> is the name of the new storage account.
<location> is the location of your Azure Storage account.
Create a storage account for the storage queue¶
Execute the following command to create a storage account to host your storage queue. This account must be a GPv2 account, because only this kind of account supports event messages to a storage queue.
Note
If you already have a GPv2 account, you can use that account to host both your data files and your storage queue.
For example, create a GPv2 account:
Where:
<resource_group_name> is the name of the resource group you created in Create a resource group.
<storage_account_name> is the name of the new storage account.
<location> is the location of your Azure Storage account.
Create a storage queue¶
A single Azure Queue Storage queue can collect the event messages for many Event Grid subscriptions. For best performance, Snowflake recommends creating a single storage queue to accommodate all of your subscriptions related to Snowflake.
Execute the following command to create a storage queue. A storage queue stores a set of messages, in this case event messages from Event Grid:
Where:
<storage_queue_name> is the name of the new storage queue.
<storage_account_name> is the name of the storage account you created in Create a storage account for the storage queue.
Export the storage account and queue IDs for reference¶
Execute the following commands to set environment variables for the storage account and queue IDs that will be requested later in these instructions:
Linux or macOS:
Windows:
Where:
<data_storage_account_name> is the name of the storage account you created in Create a storage account for data files.
<queue_storage_account_name> is the name of the storage account you created in Create a storage account for the storage queue.
<resource_group_name> is the name of the resource group you created in Create a resource group.
<storage_queue_name> is the name of the storage queue you created in Create a storage queue.
Install the Event Grid extension¶
Execute the following command to install the Event Grid extension for Azure CLI:
Create the Event Grid subscription¶
Execute the following command to create the Event Grid subscription. Subscribing to a topic informs Event Grid which events to track:
Linux or macOS:
Windows:
Where:
storageid and queueid are the storage account and queue ID environment variables you set in Export the storage account and queue IDs for reference.
<subscription_name> is the name of the new Event Grid subscription.
Step 2: Create a notification integration¶
A notification integration is a Snowflake object that provides an interface between Snowflake and a third-party cloud message queuing service such as Azure Event Grid.
Note
A single notification integration supports a single Azure Storage queue. Referencing the same storage queue in multiple notification integrations can result in missing data in target tables because event notifications are split between notification integrations.
Retrieve the storage queue URL and tenant ID¶
Sign in to the Microsoft Azure portal.
Navigate to Storage account » Queue service » Queues. Record the URL for the queue you created in Create a storage queue for reference later. The URL has the following format:
Navigate to Azure Active Directory » Properties. Record the Tenant ID value for reference later. The directory ID, or tenant ID, is needed to grant Snowflake access to the Event Grid subscription.
Create the notification integration¶
Create a notification integration by using the CREATE NOTIFICATION INTEGRATION command.
Note
Only account administrators (users with the ACCOUNTADMIN role) or a role with the global CREATE INTEGRATION privilege can execute this SQL command.
The Azure service principal for notification integrations is different from the service principal created for storage integrations.
Where:
<integration_name> is the name of the new integration.
<queue_URL> and <directory_ID> are the queue URL and tenant ID you recorded in Retrieve the storage queue URL and tenant ID.
For example:
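A sketch of the statement, with placeholder values for the integration name, queue URL, and tenant ID:

```sql
-- All values below are placeholders.
CREATE NOTIFICATION INTEGRATION my_notification_int
  ENABLED = TRUE
  TYPE = QUEUE
  NOTIFICATION_PROVIDER = AZURE_STORAGE_QUEUE
  AZURE_STORAGE_QUEUE_PRIMARY_URI =
    'https://myqueuestorage.queue.core.windows.net/myqueue'
  AZURE_TENANT_ID = '<directory_ID>';
```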
Grant Snowflake access to the storage queue¶
Execute the DESCRIBE INTEGRATION command to retrieve the consent URL:
Where:
<integration_name> is the name of the integration you created in Create the notification integration.
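For example, using a hypothetical integration name:

```sql
DESC NOTIFICATION INTEGRATION my_notification_int;
```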
Note the values in the following columns:
AZURE_CONSENT_URL: URL to the Microsoft permissions request page.
AZURE_MULTI_TENANT_APP_NAME: Name of the Snowflake client application created for your account. In a later step in this section, you will need to grant this application the permissions necessary to obtain an access token on your allowed topic.
In a web browser, navigate to the URL in the AZURE_CONSENT_URL column. The page displays a Microsoft permissions request page.
Select Accept. This action allows the Azure service principal created for your Snowflake account to obtain an access token on any resource inside your tenant. Obtaining an access token succeeds only if you grant the service principal the appropriate permissions on the container (see the next step).
The Microsoft permissions request page redirects to the Snowflake corporate site (snowflake.com).
Sign in to the Microsoft Azure portal.
Navigate to Azure Active Directory » Enterprise applications. Verify that the Snowflake application identifier you recorded in Step 1 in this section is listed.
Important
If you delete the Snowflake application in Azure Active Directory at a later time, the notification integration stops working.
Navigate to Queues » <storage_queue_name>, where <storage_queue_name> is the name of the storage queue you created in Create a storage queue.
Select Access Control (IAM) » Add role assignment.
Search for the Snowflake service principal. This is the identity in the AZURE_MULTI_TENANT_APP_NAME property in the DESC NOTIFICATION INTEGRATION output (in Step 1). Search for the string before the underscore in the AZURE_MULTI_TENANT_APP_NAME property.
Important
It can take an hour or longer for Azure to create the Snowflake service principal requested through the Microsoft request page in this section. If the service principal is not available immediately, we recommend waiting an hour or two and then searching again.
If you delete the service principal, the notification integration stops working.
Grant the Snowflake app the following permissions:
Role: Storage Queue Data Message Processor (the minimum required role) or Storage Queue Data Contributor.
Assign access to: Azure AD user, group, or service principal.
Select: The appDisplayName value.
The Snowflake application identifier should now be listed under Storage Queue Data Message Processor or Storage Queue Data Contributor (on the same dialog).
Step 3: Create an external volume with your Azure storage queue¶
To configure an external volume, complete the instructions for Configure an external volume for Azure.
In Step 1: Create an external volume in Snowflake, specify the following additional parameter:
AZURE_STORAGE_QUEUE_PRIMARY_URI = '<queue_URL>'
Specifies the URL of the Azure Storage queue that handles all messages for your external volume location.
For example:
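A sketch of the statement, with placeholder names, URLs, and tenant ID:

```sql
-- All names and URLs below are placeholders.
CREATE EXTERNAL VOLUME my_azure_ext_vol
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-azure-location'
      STORAGE_PROVIDER = 'AZURE'
      STORAGE_BASE_URL = 'azure://myaccount.blob.core.windows.net/mycontainer/'
      AZURE_TENANT_ID = '<directory_ID>'
      AZURE_STORAGE_QUEUE_PRIMARY_URI =
        'https://myqueuestorage.queue.core.windows.net/myqueue'
    )
  );
```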
Step 4: Create a catalog integration¶
Create a catalog integration by using the CREATE CATALOG INTEGRATION command. To indicate that the catalog integration is for
Iceberg tables created from Parquet source files, set the CATALOG_SOURCE
parameter equal to OBJECT_STORE and the TABLE_FORMAT parameter equal to NONE.
Note
Snowflake does not support creating Iceberg tables from Parquet-based table definitions in the Azure environment.
The following example creates a catalog integration for Parquet files in object storage.
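A sketch of such a statement, using the parameter values described above (the integration name is a placeholder):

```sql
CREATE CATALOG INTEGRATION my_parquet_catalog_int
  CATALOG_SOURCE = OBJECT_STORE   -- tables are defined directly from files in object storage
  TABLE_FORMAT = NONE             -- source files are Parquet, not Iceberg metadata
  ENABLED = TRUE;
```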
Step 5: Create an Iceberg table¶
Create an Iceberg table by using the CREATE ICEBERG TABLE command, setting the AUTO_REFRESH parameter equal to TRUE.
Example: Create an Iceberg table from Parquet files using automatic schema inference with auto refresh¶
The following example creates an Iceberg table from Parquet files using:
Automatic schema inference without including a column definition
Auto refresh
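A minimal sketch of this example, assuming an external volume and a catalog integration created as in the previous steps (the table name, volume name, integration name, and BASE_LOCATION path are placeholders):

```sql
CREATE ICEBERG TABLE my_parquet_table
  EXTERNAL_VOLUME = 'my_parquet_volume'
  CATALOG = 'my_parquet_catalog_int'
  BASE_LOCATION = 'path/to/parquet/data/'  -- location of the Parquet files, relative to the external volume
  AUTO_REFRESH = TRUE;
```

Because no column definition is included, Snowflake infers the full table schema from the Parquet source files.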
Alternatively, you can include a column definition to provide information about specific columns.
Snowflake uses the definition to create those columns, and then automatically detects the remaining table columns.
In this scenario, you must specify INFER_SCHEMA = TRUE because you include a column definition.
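A sketch of this variant, assuming the same external volume and catalog integration as above; the column names are hypothetical, and the columns not listed in the definition are detected automatically because INFER_SCHEMA = TRUE:

```sql
CREATE ICEBERG TABLE my_parquet_table (
    order_id NUMBER(10,0),  -- hypothetical column, defined explicitly
    order_date DATE         -- hypothetical column, defined explicitly
  )
  EXTERNAL_VOLUME = 'my_parquet_volume'
  CATALOG = 'my_parquet_catalog_int'
  BASE_LOCATION = 'path/to/parquet/data/'
  INFER_SCHEMA = TRUE       -- required when a column definition is included
  AUTO_REFRESH = TRUE;
```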
Troubleshoot¶
To track the status of automatic refreshes for your Iceberg table, use the SYSTEM$ICEBERG_TABLE_AUTO_REFRESH_STATUS function.
For example:
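The function takes the table name as a string and returns the refresh status; the table name below is a placeholder:

```sql
SELECT SYSTEM$ICEBERG_TABLE_AUTO_REFRESH_STATUS('my_parquet_table');
```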
Data type mapping¶
When you define a column in a CREATE ICEBERG TABLE statement for Parquet source files, you must specify a Snowflake data type that maps to the Parquet data type used in your source files.
Note
In addition to data types that are compatible with Iceberg, the following non-Iceberg data types are also supported:
BYTE_ARRAY
INT96
For more information, see the data type mapping table below.
The following table shows how Parquet logical types map to physical types, and how the physical types map to Snowflake data types.
| Parquet logical type | Parquet physical type | Snowflake data type |
|---|---|---|
| None | BOOLEAN | BOOLEAN |
| None, INT(bitWidth=8, isSigned=true), INT(bitWidth=16, isSigned=true), INT(bitWidth=32, isSigned=true) | INT32 | INT |
| None, INT(bitWidth=64, isSigned=true) | INT64 | BIGINT |
| None | FLOAT | FLOAT |
| None | DOUBLE | FLOAT |
| DECIMAL(P,S) | INT32, INT64, FIXED_LEN_BYTE_ARRAY(N) | DECIMAL(P,S) |
| DATE | INT32 | DATE |
| TIME(isAdjustedToUTC=true, unit=MILLIS) | INT32 | TIME(3) |
| TIME(isAdjustedToUTC=true, unit=MICROS) | INT64 | TIME(6) |
| TIME(isAdjustedToUTC=true, unit=NANOS) | INT64 | TIME(9) |
| None | INT96 | TIMESTAMP_LTZ(9) |
| TIMESTAMP(isAdjustedToUTC=true, unit=MILLIS) | INT64 | TIMESTAMP_NTZ(3) |
| TIMESTAMP(isAdjustedToUTC=true, unit=MICROS) | INT64 | TIMESTAMP_NTZ(6) |
| TIMESTAMP(isAdjustedToUTC=true, unit=NANOS) | INT64 | TIMESTAMP_NTZ(9) |
| STRING | BYTE_ARRAY | VARCHAR |
| ENUM | BYTE_ARRAY | VARCHAR |
| JSON | BYTE_ARRAY | VARCHAR |
| UUID | FIXED_LEN_BYTE_ARRAY(16) | BINARY(16) |
| None | FIXED_LEN_BYTE_ARRAY(N) | BINARY(N) |
| None, BSON | BYTE_ARRAY | BINARY |
| INTERVAL | FIXED_LEN_BYTE_ARRAY(12) | BINARY(12). Snowflake does not support a corresponding data type for the Parquet INTERVAL type, and reads the data from source files as binary data. |
The following table shows how Parquet nested data types map to Snowflake data types.
| Parquet logical nested type | Snowflake data type |
|---|---|
| None (group) | OBJECT (structured) |
| LIST | ARRAY (structured) |
| MAP | MAP (structured) |
Limitations for querying Parquet files directly in your data lake¶
By default, the maximum number of Parquet files that you can use to create an Iceberg table is approximately 2 million.
To exceed this limit, contact Snowflake Support for assistance.
Parquet files that use any of the following features or data types are not supported:
Field IDs.
The DECIMAL data type with precision higher than 38.
LIST or MAP types with one-level or two-level representation.
Unsigned integer types (INT(isSigned=false)).
The FLOAT16 data type.
Snowflake does not support creating Iceberg tables from Parquet-based table definitions in the AWS Glue Data Catalog.
Generating Iceberg metadata using the SYSTEM$GET_ICEBERG_TABLE_INFORMATION function is not supported.
When auto refresh is enabled on a table, you can’t perform a manual refresh on it; to refresh the table manually, first disable auto refresh.
For the read-only Iceberg tables that you create from Parquet files:
You can’t generate Iceberg metadata for these tables.
You can’t convert these tables to Snowflake-managed Iceberg tables.
You can’t perform DML operations on these tables; they are read only.
You can’t perform table maintenance on these tables.
The following Snowflake features aren’t currently supported for read-only Iceberg tables that you create from Parquet files:
Cloning
Replication
Change tracking
Dynamic tables
Data sharing
Snowflake Native Apps