准备加载数据

This topic provides an overview of supported data file formats and data compression. Depending on your data’s structure, you might need to prepare the data before loading it.

支持的数据类型

See SQL data types reference for descriptions of the data types supported by Snowflake.

数据文件压缩

We recommend that you compress your data files when you are loading large data sets. See CREATE FILE FORMAT for the compression algorithms supported for each data type.

When loading compressed data, Snowflake will automatically determine the file and codec compression method for your data files. The COMPRESSION file format option describes how your data files are already compressed in the stage. Set the COMPRESSION option in one of the following ways:

  • As a file format option specified directly in the COPY INTO <table> statement.
  • As a file format option specified for a named file format or stage object. The named file format/stage object can then be referenced in the COPY INTO <table> statement.

支持的文件格式

支持以下文件格式:

Structured/Semi-structuredTypeNotes
StructuredDelimited (CSV, TSV, etc.)Any valid singlebyte delimiter is supported; default is comma (i.e. CSV).
Semi-structuredJSON
AvroIncludes automatic detection and processing of compressed Avro files.
ORCIncludes automatic detection and processing of compressed ORC files.
ParquetIncludes automatic detection and processing of compressed Parquet files.

Currently, Snowflake supports the schema of Parquet files produced using the Parquet writer v1. Files produced using v2 of the writer are not supported.
XML

文件格式选项指定文件中包含的数据类型,以及有关数据格式的其他相关特征。您可以指定的文件格式选项因计划加载的数据类型而异。Snowflake 提供了一整套文件格式选项默认值。

半结构化文件格式

Snowflake natively supports semi-structured data, which means semi-structured data can be loaded into relational tables without requiring the definition of a schema in advance. Snowflake supports loading semi-structured data directly into columns of type VARIANT (see Semi-structured data types for more details).

目前支持的半结构化数据格式包括 JSON、Avro、ORC、Parquet 或 XML:

  • 对于 JSON、Avro、ORC 和 Parquet 数据,每个高级完整对象都作为单独的行加载到表中。每个对象都可以包含换行符和空格,只要该对象有效。
  • 对于 XML 数据,每个高级元素都作为单独的行加载到表中。元素由同名的开始和结束标记标识。

Typically, tables used to store semi-structured data consist of a single VARIANT column. Once the data is loaded, you can query the data similar to structured data. You can also perform other tasks, such as extracting values and objects from arrays. For more information, see the FLATTEN table function.

Note

半结构化数据可以加载到具有多列的表中,但半结构化数据必须作为字段存储在结构化文件(例如 CSV 文件)中。然后,可以将数据加载到表中的指定列中。

命名文件格式

Snowflake 支持创建命名文件格式,这些文件格式是封装所有必需格式信息的数据库对象。然后,可以在可以指定单个文件格式选项的所有相同位置使用命名文件格式作为输入,从而帮助简化类似格式数据的数据加载过程。

Named file formats are optional, but are recommended when you plan to load similarly formatted data on a regular basis.

创建命名文件格式

You can create a file format using either Snowsight or SQL:

Snowsight:
  1. In the navigation menu, select Catalog » Database Explorer.
  2. 找到数据库并选择要向其添加文件格式的架构。
  3. Select Create » File Format.
  4. Complete the SQL statement and select Create File Format.
SQL:

CREATE FILE FORMAT

For descriptions of all file format options and the default values, see CREATE FILE FORMAT.

支持的复制选项

复制选项确定与错误处理、最大数据大小等相关的数据加载行为。

For descriptions of all copy options and the default values, see COPY INTO <table>.

替换默认文件格式和复制选项

您可以在以下任何位置指定所需的加载行为(即替换默认设置):

In the table definition:

不推荐。

In the named stage definition:

不推荐。

Directly in the COPY INTO TABLE statement when loading data:

Explicitly set the options separately. For more information, see COPY INTO <table>.

Note

Do not specify copy options using the CREATE STAGE, ALTER STAGE, CREATE TABLE, or ALTER TABLE commands. We recommend that you use the COPY INTO <table> command to specify copy options.

如果在多个位置指定了文件格式选项或复制选项,则加载操作将按以下优先顺序应用这些选项:

  1. COPY INTO TABLE 语句。
  2. 暂存区定义。
  3. 表定义。

Note

File format options set in multiple locations are not cumulative. Any options set in one place override all options (whether the same or different options) set lower in the order of precedence.

在多个位置设置的复制选项是可累积的。在一个位置设置的单个选项将替换按优先级顺序设置的相同选项。