Tutorial: Bulk loading from Amazon S3 using COPY¶
Introduction¶
This tutorial describes how to load data from files in an existing Amazon Simple Storage Service (Amazon S3) bucket into a table. In this tutorial, you will learn how to:
Create named file formats that describe your data files.
Create named stage objects.
Load data located in your S3 bucket into Snowflake tables.
Resolve errors in your data files.
The tutorial covers loading of both CSV and JSON data.
Prerequisites¶
The tutorial assumes the following:
You have a Snowflake account that is configured to use Amazon Web Services (AWS) and a user with a role that grants the necessary privileges to create a database, tables, and virtual warehouse objects.
You have SnowSQL installed.
Refer to the Snowflake in 20 minutes for instructions to meet these requirements.
Snowflake provides sample data files in a public Amazon S3 bucket for use in this tutorial. But before you start, you need to create a database, tables, and a virtual warehouse for this tutorial. These are the basic Snowflake objects needed for most Snowflake activities.
About the sample data files¶
Snowflake provides sample data files staged in a public S3 bucket.
Note
In regular use, you would stage your own data files using the AWS Management Console, AWS Command Line Interface, or an equivalent client application. See the Amazon Web Services (https://aws.amazon.com/console/) documentation for instructions.
The sample data files include sample contact information in the following formats:
CSV files that contain a header row and five records. The field delimiter is the pipe (
|
) character. The following example shows a header row and one record:ID|lastname|firstname|company|email|workphone|cellphone|streetaddress|city|postalcode 6|Reed|Moses|Neque Corporation|eget.lacus@facilisis.com|1-449-871-0780|1-454-964-5318|Ap #225-4351 Dolor Ave|Titagarh|62631
A single file in JSON format that contains one array and three objects. The following is an example of an array that contains one of the objects:
[ { "customer": { "address": "509 Kings Hwy, Comptche, Missouri, 4848", "phone": "+1 (999) 407-2274", "email": "blankenship.patrick@orbin.ca", "company": "ORBIN", "name": { "last": "Patrick", "first": "Blankenship" }, "_id": "5730864df388f1d653e37e6f" } }, ]
Creating the database, tables, and warehouse¶
Execute the following statements to create a database, two tables (for csv and json data), and a virtual warehouse needed for this tutorial. After you complete the tutorial, you can drop these objects.
CREATE OR REPLACE DATABASE mydatabase;
CREATE OR REPLACE TEMPORARY TABLE mycsvtable (
id INTEGER,
last_name STRING,
first_name STRING,
company STRING,
email STRING,
workphone STRING,
cellphone STRING,
streetaddress STRING,
city STRING,
postalcode STRING);
CREATE OR REPLACE TEMPORARY TABLE myjsontable (
json_data VARIANT);
CREATE OR REPLACE WAREHOUSE mywarehouse WITH
WAREHOUSE_SIZE='X-SMALL'
AUTO_SUSPEND = 120
AUTO_RESUME = TRUE
INITIALLY_SUSPENDED=TRUE;
Note the following:
The
CREATE DATABASE
statement creates a database. The database automatically includes a schema named ‘public’.The
CREATE TABLE
statements create target tables for CSV and JSON data. The tables are temporary, that is, they persist only for the duration of the user session and are not visible to other users.The
CREATE WAREHOUSE
statement creates an initially suspended warehouse. The statement also setsAUTO_RESUME = true
, which starts the warehouse automatically when you execute SQL statements that require compute resources.
Create file format objects¶
When you load data files from an S3 bucket into a table, you must describe the format of the file and specify how the data in the file should be interpreted and processed. For example, if you are loading pipe-delimited data from a CSV file, you must specify that the file uses the CSV format with pipe symbols as delimiters.
When you execute the COPY INTO <table> command, you specify this format information. You can
either specify this information as options in the command (e.g.
TYPE = CSV
, FIELD_DELIMITER = '|'
, etc.) or you can specify a
file format object that contains this format information. You can create a named file
format object using the CREATE FILE FORMAT command.
In this step, you create file format objects describing the data format of the sample CSV and JSON data provided for this tutorial.
Creating a file format object for CSV data¶
Execute the CREATE FILE FORMAT command
to create the mycsvformat
file format.
CREATE OR REPLACE FILE FORMAT mycsvformat
TYPE = 'CSV'
FIELD_DELIMITER = '|'
SKIP_HEADER = 1;
Where:
TYPE = 'CSV'
indicates the source file format type. CSV is the default file format type.FIELD_DELIMITER = '|'
indicates the ‘|’ character is a field separator. The default value is ‘,’.SKIP_HEADER = 1
indicates the source file includes one header line. The COPY command skips these header lines when loading data. The default value is 0.
Creating a file format object for JSON data¶
Execute the CREATE FILE FORMAT command to create
the myjsonformat
file format.
CREATE OR REPLACE FILE FORMAT myjsonformat TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE;
Where:
TYPE = 'JSON'
indicates the source file format type.STRIP_OUTER_ARRAY = TRUE
directs the COPY command to exclude the root brackets ([]) when loading data to the table.
Create stage objects¶
A stage specifies where data files are stored (i.e. “staged”) so that the data in the files can be loaded into a table. A named external stage is a cloud storage location managed by Snowflake. An external stage references data files stored in a S3 bucket. In this case, we are creating a stage that references the sample data files necessary to complete the tutorial.
Creating a named external stage is useful if you want multiple users or processes to upload files. If you plan to stage data files to load only by you, or to load only into a single table, then you may prefer to use your user stage or the table stage. For information, see Bulk loading from Amazon S3.
In this step, you create named stages for the different types of sample data files.
Creating a stage for CSV data files¶
Execute CREATE STAGE to create the my_csv_stage
stage:
CREATE OR REPLACE STAGE my_csv_stage
FILE_FORMAT = mycsvformat
URL = 's3://snowflake-docs';
Creating a stage for JSON data files¶
Execute CREATE STAGE to create the my_json_stage
stage:
CREATE OR REPLACE STAGE my_json_stage
FILE_FORMAT = myjsonformat
URL = 's3://snowflake-docs';
Note
In regular use, if you were creating a stage that pointed to your private data files, you would reference a storage integration created using CREATE STORAGE INTEGRATION by an account administrator (i.e. a user with the ACCOUNTADMIN role) or a role with the global CREATE INTEGRATION privilege:
CREATE OR REPLACE STAGE external_stage FILE_FORMAT = mycsvformat URL = 's3://private-bucket' STORAGE_INTEGRATION = myint;
Copy data into the target table¶
Execute COPY INTO <table> to load staged data into the target tables.
CSV¶
To load the data from the sample CSV files:
Start by loading the data from one of the files in the
/tutorials/dataloading/
prefix (folder) namedcontacts1.csv
in themycsvtable
table. Execute the following:COPY INTO mycsvtable FROM @my_csv_stage/tutorials/dataloading/contacts1.csv ON_ERROR = 'skip_file';
Where:
The
FROM
clause specifies the location of the staged data file (stage name followed by the file name).The
ON_ERROR = 'skip_file'
clause specifies what to do when the COPY command encounters errors in the files. In this case, when the command encounters a data error on any of the records in a file, it skips the file. If you do not specify an ON_ERROR clause, the default isabort_statement
, which aborts the COPY command on the first error encountered on any of the records in a file.
The COPY command returns a result showing the name of the file copied and related information:
+---------------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+ | file | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name | |---------------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------| | s3://snowflake-docs/tutorials/dataloading/contacts1.csv | LOADED | 5 | 5 | 1 | 0 | NULL | NULL | NULL | NULL | +---------------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
Load the rest of the staged files in the
mycsvtable
table.The following example uses pattern matching to load data from files that match the regular expression
.*contacts[1-5].csv
into themycsvtable
table.COPY INTO mycsvtable FROM @my_csv_stage/tutorials/dataloading/ PATTERN='.*contacts[1-5].csv' ON_ERROR = 'skip_file';
Where the
PATTERN
clause specifies that the command should load data from the filenames matching this regular expression.*contacts[1-5].csv
.The COPY command returns a result showing the name of the file copied and related information:
+---------------------------------------------------------+-------------+-------------+-------------+-------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+-----------------------+-------------------------+ | file | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name | |---------------------------------------------------------+-------------+-------------+-------------+-------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+-----------------------+-------------------------| | s3://snowflake-docs/tutorials/dataloading/contacts2.csv | LOADED | 5 | 5 | 1 | 0 | NULL | NULL | NULL | NULL | | s3://snowflake-docs/tutorials/dataloading/contacts3.csv | LOAD_FAILED | 5 | 0 | 1 | 2 | Number of columns in file (11) does not match that of the corresponding table (10), use file format option error_on_column_count_mismatch=false to ignore this error | 3 | 1 | "MYCSVTABLE"[11] | | s3://snowflake-docs/tutorials/dataloading/contacts4.csv | LOADED | 5 | 5 | 1 | 0 | NULL | NULL | NULL | NULL | | s3://snowflake-docs/tutorials/dataloading/contacts5.csv | LOADED | 6 | 6 | 1 | 0 | NULL | NULL | NULL | NULL | +---------------------------------------------------------+-------------+-------------+-------------+-------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+-----------------------+-------------------------+
Note the following highlights in the result:
The data in
contacts1.csv
is ignored because you already loaded the data successfully.The data in these files was loaded successfully:
contacts2.csv
,contacts4.csv
, andcontacts5.csv
.The data in
contacts3.csv
was skipped due to 2 data errors. The next step in this tutorial addresses how to validate and fix the errors.
JSON¶
Load the contacts.json
staged data file into the myjsontable
table.
COPY INTO myjsontable FROM @my_json_stage/tutorials/dataloading/contacts.json ON_ERROR = 'skip_file';
The COPY returns a result showing the name of the file copied and related information:
+---------------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|---------------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------|
| s3://snowflake-docs/tutorials/dataloading/contacts.json | LOADED | 3 | 3 | 1 | 0 | NULL | NULL | NULL | NULL |
+---------------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
Clean up¶
Congratulations, you have successfully completed the tutorial.
Tutorial clean up (optional)¶
Execute the following DROP <object> commands to return your system to its state before you began the tutorial:
DROP DATABASE IF EXISTS mydatabase; DROP WAREHOUSE IF EXISTS mywarehouse;
Dropping the database automatically removes all child database objects such as tables.