Stronger UTF-8 validation for external files
This behavior change has been implemented with the 7.34 release. For the most up-to-date details about behavior changes, see the Behavior Change Log.
Snowflake has stronger UTF-8 validation for external files.
- Before the change:
  - When you query external Avro, Parquet, ORC, CSV, JSON, or XML files that contain invalid UTF-8 data, the queries usually succeed.
- After the change:
  - When you query external Avro, Parquet, ORC, CSV, JSON, or XML files on a stage that contain invalid UTF-8 data, the queries fail.
  - If you use COPY INTO <table> or Snowpipe to load external files that contain invalid UTF-8 data, Snowflake proceeds according to the ON_ERROR copy option specified in the object definition.
  - When you query an external table, Snowflake omits results for records that contain invalid UTF-8 data. After encountering invalid data, Snowflake continues to scan the file (similar to ON_ERROR = CONTINUE) but doesn’t return an error message.
To avoid UTF-8 validation errors, Snowflake recommends that you specify REPLACE_INVALID_CHARACTERS = TRUE for your file format
so that any invalid UTF-8 characters will be replaced with the Unicode replacement character (�).
For Parquet files, you can also set BINARY_AS_TEXT = FALSE for your file format so that the columns
with no defined logical data type will be interpreted as binary data instead of as UTF-8 text.
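
To make the replacement behavior concrete, the following minimal Python sketch checks a byte string for invalid UTF-8 and shows what substituting the Unicode replacement character (U+FFFD) looks like. It is only an illustration that uses Python's built-in decoder, not Snowflake's implementation, and the exact number of replacement characters Snowflake emits for a given invalid sequence may differ. The function names `find_invalid_utf8` and `replace_invalid_utf8` are hypothetical helpers introduced for this example.

```python
from typing import Optional


def find_invalid_utf8(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None if the data is valid."""
    try:
        data.decode("utf-8")  # strict decoding raises on the first invalid sequence
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def replace_invalid_utf8(data: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for each invalid sequence (Python's rules)."""
    return data.decode("utf-8", errors="replace")


# 0xE9 is "é" in Latin-1, but here it is not a valid UTF-8 sequence.
sample = b"caf\xe9,Paris"

print(find_invalid_utf8(sample))     # offset of the offending byte
print(replace_invalid_utf8(sample))  # text with U+FFFD in place of the invalid byte
```

A check like this can be run on local files before they are staged to find out whether they would trigger validation errors, assuming the files are expected to be UTF-8 encoded text.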
This behavior change does not apply to existing accounts that currently load invalid UTF-8 data; it affects only new accounts. For any issues, contact Snowflake Support.
Ref: 1013 1014