Query data in staged files

Snowflake 支持使用标准 SQL 查询位于内部(即 Snowflake)暂存区或 命名 外部(Amazon S3、Google Cloud Storage 或 Microsoft Azure)暂存区的数据文件。这对于检查/查看暂存文件的内容非常有用,尤其是在加载数据之前或卸载数据之后。

In addition, by referencing metadata columns in a staged file, a staged data query can return additional information, such as filename and row numbers, about the file.

Snowflake utilizes support for staged data queries to enable transforming data during loading.

Note

This functionality is primarily for performing simple queries only, particularly when loading and/or transforming data, and is not intended to replace loading data into tables and performing queries on the tables.

Query syntax and parameters

Query staged data files using a SELECT statement with the following syntax:

SELECT [<alias>.]$<file_col_num>[:<element>] [ , [<alias>.]$<file_col_num>[:<element>] , ...  ]
  FROM { <internal_location> | <external_location> }
  [ ( FILE_FORMAT => '<namespace>.<named_file_format>', PATTERN => '<regex_pattern>' ) ]
  [ <alias> ]

For the syntax for transforming data during a load, see COPY INTO <table>.

Important

The list of objects returned for an external stage may include one or more “directory blobs”; essentially, paths that end in a forward slash character (/), e.g.:

LIST @my_gcs_stage;

+---------------------------------------+------+----------------------------------+-------------------------------+
| name                                  | size | md5                              | last_modified                 |
|---------------------------------------+------+----------------------------------+-------------------------------|
| my_gcs_stage/load/                    |  12  | 12348f18bcb35e7b6b628ca12345678c | Mon, 11 Sep 2019 16:57:43 GMT |
| my_gcs_stage/load/data_0_0_0.csv.gz   |  147 | 9765daba007a643bdff4eae10d43218y | Mon, 11 Sep 2019 18:13:07 GMT |
+---------------------------------------+------+----------------------------------+-------------------------------+

These blobs are listed when directories are created in the Google Cloud console rather than using any other tool provided by Google.

当对象列表包含目录 blob 时,引用暂存区的 SELECT 语句可能会失败。为了避免错误,当暂存区的文件列表包含目录 blob 时,我们建议使用文件模式匹配来识别要包含的文件(即 PATTERN 子句)。

Required parameters

[alias.]$file_col_num[:element] [ , [alias.]$file_col_num[:element] , ... ]

指定在内部或外部位置暂存的数据文件中的一组显式字段/列,其中:

alias

指定 FROM 子句中定义的可选“table”别名(如果有)。

file_col_num

Specifies the positional number of the field/column (in the file) that contains the data to be loaded (1 for the first field, 2 for the second field, etc.)

element

Specifies the path and element name of a repeating value (applies only to semi-structured data files).

internal_location or external_location

指定暂存数据文件的位置:

  • internal_location is the URI specifier for the location in Snowflake where files containing data are staged:

    @[namespace.]internal_stage_name[/path]Files are in the specified named internal stage.
    @[namespace.]%table_name[/path]Files are in the stage for the specified table.
    @~[/path]Files are in the stage for the current user.
    - external_location is the URI specifier for the named external stage or external location (Amazon S3, Google Cloud Storage, or Microsoft Azure) where files containing data are staged:
    widths:

    40, 60

    delim:
    @[namespace.]external_stage_name[/path]Files are in the specified named external stage.
    Where:
    - namespace is the database and/or schema in which the internal or external stage resides. It is optional if a database and schema are currently in use within the user session; otherwise, it is required.
    - The optional path parameter restricts the set of files being queried to the files under the folder prefix. If path is specified, but no file is explicitly named in the path, all data files in the path are queried.
    - The URI string for an external storage location (Amazon S3, Google Cloud Storage, or Microsoft Azure) must be enclosed in single quotes; however, you can enclose any URI string in single quotes, which allows special characters, including spaces, in location and file names. For example:
    Internal:

    '@~/path 1/file 1.csv'

    '@%my table/path 1/file 1.csv'
    '@my stage/path 1/file 1.csv'
    - Relative path modifiers such as /./ and /../ are interpreted literally, because “paths” are literal prefixes for a name. For example:
    S3:

    COPY INTO mytable FROM @mystage/./../a.csv

    In these COPY statements, the system look for a file literally named ./../a.csv in the storage location.

Optional parameters

( FILE_FORMAT => 'namespace.named_file_format' )

指定一种命名文件格式,该格式描述要查询的暂存数据文件的格式。

Note that this parameter is optional if either of the following conditions are true:

  • The files are formatted in the default file format (CSV) with the default delimiters: , (as the field delimiter) and the new line character (as the record delimiter).
  • 文件位于内部或外部暂存区,暂存区定义描述了文件格式。

如果在用户会话的当前命名空间中引用文件格式,则可以省略格式标识符两边的单引号。

Otherwise, this parameter is required. For more details, see File Formats (in this topic).

namespace optionally specifies the database and/or schema for the table, in the form of database_name.schema_name or schema_name. It is optional if a database and schema are currently in use within the user session; otherwise, it is required.

如果标识符包含空格、特殊字符或大小写混合字符,则整个字符串必须放在双引号内。放在双引号内的标识符也区分大小写。

PATTERN => 'regex_pattern'

正则表达式模式字符串用单引号括起来,可指定外部暂存区上要匹配的文件名和/或路径。

Tip

为了获得最佳性能,请尽量避免应用对大量文件进行筛选的模式。

alias

指定暂存文件的内部/外部位置的“表”别名。

File formats

To parse a staged data file, it is necessary to describe its file format. The default file format is character-delimited UTF-8 text (i.e. CSV), with the comma character (,) as the field delimiter and new line character as the record delimiter. If the source data is in another format (JSON, Avro, etc.), you must specify the corresponding file format type (and options).

要显式指定文件格式选项,请通过以下方式之一进行设置:

Querying staged data files:

作为为命名文件格式或暂存区对象指定的文件格式选项。然后可以在 SELECT 语句中引用命名的文件格式/暂存区对象。

Loading columns from staged data files:
  • As file format options specified directly in the COPY INTO <table>.
  • As file format options specified for a named file format or stage object. The named file format/stage object can then be referenced in the COPY INTO <table> statement.

Query examples

Example 1: Query columns in a CSV file

下面的示例演示如何暂存多个 CSV 数据文件(具有相同的文件格式),然后查询文件中的数据列。

此示例假定文件具有以下名称,并且位于 macOS 或 Linux 环境的根目录中:

  • /tmp/data1.csv contains two records:

    a|b
    c|d
  • /tmp/data2.csv contains two records:

    e|f
    g|h

要暂存和查询文件,请执行以下操作:

-- Create a file format.
CREATE OR REPLACE FILE FORMAT myformat TYPE = 'csv' FIELD_DELIMITER = '|';

-- Create an internal stage.
CREATE OR REPLACE STAGE mystage1;

-- Stage the data files.
PUT file:///tmp/data*.csv @mystage1;

-- Query the filename and row number metadata columns and the regular data columns in the staged file.
-- Optionally apply pattern matching to the set of files in the stage and optional path.
-- Note that the table alias is provided to make the statement easier to read and is not required.
SELECT t.$1, t.$2 FROM @mystage1 (file_format => 'myformat', pattern=>'.*data.*[.]csv.gz') t;

+----+----+
| $1 | $2 |
|----+----|
| a  | b  |
| c  | d  |
| e  | f  |
| g  | h  |
+----+----+

SELECT t.$1, t.$2 FROM @mystage1 t;

+-----+------+
| $1  | $2   |
|-----+------|
| a|b | NULL |
| c|d | NULL |
| e|f | NULL |
| g|h | NULL |
+-----+------+

Note

The file format is required in this example to correctly parse the fields in the staged files. In the second query, the file format is omitted, causing the | field delimiter to be ignored and resulting in the values returned for $1 and $2.

However, if the file format is included in the stage definition, you can omit it from the SELECT statement. See Example 3: Query elements in a JSON file.

Example 2: Call functions when querying a staged data file

Get the ASCII code for the first character of each column in the data files staged in Example 1: Query columns in a CSV file:

SELECT ascii(t.$1), ascii(t.$2) FROM @mystage1 (file_format => myformat) t;

+-------------+-------------+
| ASCII(T.$1) | ASCII(T.$2) |
|-------------+-------------|
|          97 |          98 |
|          99 |         100 |
|         101 |         102 |
|         103 |         104 |
+-------------+-------------+

Note

If the file format is included in the stage definition, you can omit it from the SELECT statement. See Example 3: Query elements in a JSON file.

Example 3: Query elements in a JSON file

此示例演示如何暂存包含以下对象的 JSON 数据文件,然后在文件中查询对象中的单个元素:

{"a": {"b": "x1","c": "y1"}},
{"a": {"b": "x2","c": "y2"}}

This example assumes the file is named /tmp/data1.json and is located in the root directory in a macOS or Linux environment.

要暂存和查询文件,请执行以下操作:

-- Create a file format
CREATE OR REPLACE FILE FORMAT my_json_format TYPE = 'json';

-- Create an internal stage
CREATE OR REPLACE STAGE mystage2 FILE_FORMAT = my_json_format;

-- Stage the data file
PUT file:///tmp/data1.json @mystage2;

-- Query the repeating a.b element in the staged file
SELECT parse_json($1):a.b FROM @mystage2/data1.json.gz;

+--------------------+
| PARSE_JSON($1):A.B |
%--------------------%
| "x1"               |
| "x2"               |
+--------------------+