SnowConvert AI: Data validation [Preview]

Data Validation uses the data-validation skill from the Snowflake AIM Migration Agent. For more details, contact snowconvert-support@snowflake.com.

The Data Validation feature of SnowConvert AI provides a fault-tolerant, scalable way to verify that data migrated into Snowflake matches the data in the original source system. It runs as a Cloud Data Validation workflow on the same infrastructure used by Data Migration, so you can migrate and validate with the same Orchestrator and Workers.

Cloud Data Validation is designed for migration scenarios where you are moving data from a system that you plan to decommission and need confidence that the migrated data is correct before cutting over. Supported source platforms for cloud validation workflows are SQL Server, Amazon Redshift, Teradata, Oracle, and PostgreSQL. Azure Synapse support is planned.

Note

SnowConvert AI cloud data migration and validation do not require the Snowflake CLI. The Orchestrator and Workers connect through the Snowflake Python connector (snowflake-connector-python). Configure named connections in ~/.snowflake/connections.toml or ~/.snowflake/config.toml. The SNOWFLAKE_CONNECTION_NAME environment variable can reference the target Snowflake connection. Running Cloud Data Validation consumes Snowflake compute (warehouse queries for L2 metrics and L3 comparisons), so Snowflake credits apply to those queries.

Note

SnowConvert AI Data Migration and Data Validation are provided under the Snowflake Conversion Software Terms — a royalty-free internal-use license. There is no per-seat or per-table fee. Snowflake credits still apply to warehouse queries run during validation.

Architecture overview

Cloud Data Validation uses the same two components as Data Migration: an Orchestrator and one or more Workers.

  • The Orchestrator connects to the Snowflake account. It requires privileges to create and operate the SNOWCONVERT_AI database, where workflow, task, and validation metadata is stored.
  • One or more Workers connect to both the source system and the Snowflake account. Workers run validation queries on both sides, export intermediate results, upload them to Snowflake, and write their outcomes to shared results tables. Workers pick up tasks created by the Orchestrator and process them in parallel.
  • Validation results (schema, metrics, and row-level outcomes) are ingested into the shared results tables via Snowpipe by default and then evaluated by the Orchestrator.

Each Worker that executes validation tasks must have the validation runtime available in the same environment the Snowflake AIM Migration Agent or scai starts for that process. Workers that only run data-migration tasks can skip that runtime; they will not pick up data_validation tasks from the queue.

Deployment options

The Orchestrator and Workers can be deployed in multiple ways:

  • Both on Snowpark Container Services (in the Snowflake account).
  • Both in the customer’s environment, including custom hardware, virtual machines, or containers.
  • Orchestrator on Snowpark Container Services and Workers in the customer’s environment, or the other way around.

The following requirements apply to the environment where the Orchestrator and Workers run:

  • Use the Snowflake AIM Migration Agent with Snowflake CoCo, or the SnowConvert AI CLI (scai) from a SnowConvert AI project. The skill and CLI install and manage the underlying runtime when needed.
  • Workers typically require an ODBC driver to connect to the source system. For Teradata, the pure-Python teradatasql driver is supported and preferred when the skill or environment provides it.
  • The Orchestrator must be able to connect to the Snowflake account using a role that has privileges to create the SNOWCONVERT_AI database and create schemas and objects within it.

Validation levels

Cloud Data Validation performs comparisons between each source table (or view) and its corresponding target in Snowflake at three increasingly detailed levels. Each level can be enabled or disabled independently in the workflow configuration.

Schema validation (L1)

Schema validation confirms that the structure of each migrated table is preserved in Snowflake. It compares the following attributes between source and target:

  • Table name
  • Column names
  • Ordinal position of each column
  • Data types
  • Character maximum length for text columns
  • Numeric precision and scale for numeric columns
  • Nullability
  • Row count

Metrics validation (L2)

Metrics validation confirms that aggregate statistics of the migrated data match the original source. Specific metrics vary by column data type, but metrics validation typically compares:

  • Row count (table level)
  • Minimum value
  • Maximum value
  • Sum
  • Average
  • Null count
  • Distinct count
  • Standard deviation
  • Variance

Row validation (L3)

Row validation performs row-level or cell-level comparison between source and target. Configure the mode with validation_configuration.row_validation_mode (see Validation configuration):

  • hybrid (default when L3 is enabled): two-phase flow—MD5-chunked row fingerprint first, then cell drilldown on partitions that fail the first phase (reduces cost versus pure cell on large tables).
  • row: MD5-chunked whole-row comparison using index_column_list alignment.
  • cell: cell-level comparison with per-column mismatch reporting.

Row validation is disabled by default and is typically applied only to the tables where it is needed because it is the most resource-intensive level.

Thresholds and tolerances

Numeric comparisons support thresholds at two levels:

  • comparison_configuration.tolerance: Relative tolerance for L2 metric comparisons (default 0.001, or 0.1%). Values within this fraction of each other are treated as matching for numeric columns.
  • max_failed_rows_number: L3 early-stop threshold. Set globally under validation_configuration or per table; per-table values take precedence (default 1000). After this many mismatched rows are collected in a partition, collection stops, but the failure count is still reported.

Example:

validation_configuration:
  schema_validation: true
  metrics_validation: true
  row_validation: true
  max_failed_rows_number: 500
comparison_configuration:
  tolerance: 0.01
tables:
  - fully_qualified_name: mydb.dbo.orders
    index_column_list:
      - order_id
    max_failed_rows_number: 100

Prerequisites

Before you use Cloud Data Validation, make sure the following are in place:

  • Python 3.11+: The Data Migration Orchestrator and Data Exchange Worker require Python 3.11 or later (tested on 3.11 and 3.12). The Snowflake AIM Migration Agent or scai installs and manages worker dependencies when you start a workflow locally.
  • SnowConvert AI project and skill: Snowflake CoCo with the Snowflake AIM Migration Agent, or the SnowConvert AI CLI (scai) with a migration project. Use the data-validation skill in the Snowflake AIM Migration Agent for guided validation, or the scai data validate … commands in this page. Workers that run validation tasks receive the validation runtime through the same install path the skill or scai uses for that worker process.
  • Snowflake access: Connections for the Orchestrator and Workers in your Snowflake config.toml or connections.toml, using a role that can create the SNOWCONVERT_AI database and its objects. You do not need the Snowflake CLI; connections use the Snowflake Python connector. The first time the Orchestrator starts, it creates that database and related resources if they don’t exist yet. On later runs, use a role that can administer SNOWCONVERT_AI and its objects; sticking with the same role you used for the initial creation is the simplest way to avoid permission issues.
  • Source connectivity: For SQL Server and Redshift sources, an ODBC driver on the machine where Workers run. For Oracle, install Oracle Instant Client and a matching ODBC driver on Workers (same [connections.source.oracle] patterns as data migration). For PostgreSQL, Workers use Npgsql (no ODBC install; same [connections.source.postgresql] and ssl_mode patterns as data migration). For Teradata, prefer the teradatasql driver when provided by your environment, or configure pyodbc with the Teradata ODBC driver. Programmatic Access Tokens (PATs) are recommended for Snowflake connections; see Connecting to Snowflake with a PAT.
  • Target data available: The tables you want to validate must already exist in Snowflake, either because you migrated them using Data Migration or by another means. For accurate validation, don’t alter the migrated data between the migration and the validation run.
  • Snowpark Container Services (optional): If you deploy the Orchestrator or Workers on Snowflake compute, your account needs SPCS. See the Snowpark Container Services overview. Running both components outside Snowflake doesn’t require SPCS.

Source-platform specifics

Workers reuse the same [connections.source.*] TOML as data migration. The tabs below call out validation-specific behavior and common source-side deployment patterns. For full connectivity examples (including IAM Redshift, Oracle, PostgreSQL, and Teradata regular / write_nos / tpt migration settings), see SnowConvert AI: Data migration.

  • Connectivity: ODBC to SQL Server from the Worker host; follow the same driver and encryption guidance as in SnowConvert AI: Data migration (SQL Server tab).
  • Partitioning: Use column_names_to_partition_by and target_partition_size_mb / target_partition_size_rows in the validation JSON to keep L2/L3 work within reasonable bounds.
  • L3 alignment: Set index_column_list (and target_index_column_list when names differ) so row and hybrid modes can align rows deterministically.

Warning

For accurate validation and to avoid false negatives, don’t alter the migrated data during the validation process.

Setup

Installation

Follow the Snowflake AIM Migration Agent setup in Snowflake CoCo so the Snowflake AIM Migration Agent, scai, and worker dependencies (including validation components when you run validation workflows) are installed for you. For manual CLI-only use, install the SnowConvert AI CLI and work from a project directory; scai data validate start and related commands resolve missing worker-side components when orchestrating local workers.

Workers that pick up only data-migration tasks do not require the validation runtime; workers that execute data-validation tasks must have validation support available in the same environment the skill or scai prepares for that worker process.

Generate and verify Worker configuration

From a SnowConvert AI project directory, generate a Worker TOML pre-filled from your source connection, then verify connectivity before starting validation:

# Generate Worker config (default path shown; use -y to overwrite without prompting):
scai data worker generate-config .scai/settings/DataExchangeWorkerConfig.toml

# Start a local Worker with that config:
scai data worker start --local .scai/settings/DataExchangeWorkerConfig.toml

# Verify source connectivity (--json for automation):
scai connection test -l teradata -s my-teradata --json
scai connection test -l sqlserver -s my-sqlserver --json
scai connection list -l teradata --json

See Worker configuration and SnowConvert AI: Data migration for the full TOML reference.

Usage

To validate migrated data using this solution, complete the following high-level steps:

  1. Start the Orchestrator.
  2. Start the Workers.
  3. Create a Cloud Data Validation Workflow.
  4. Monitor the validation workflow until completion.

A Cloud Data Validation Workflow is a job submitted to the system that describes which tables to validate, at which levels, and with which comparison rules. You can submit multiple workflows simultaneously and monitor them. The Orchestrator breaks each workflow into smaller tasks, one or more per table, and dispatches them to available Workers.

The Orchestrator that runs validation is the same process that runs Data Migration. If you have already set up and started an Orchestrator and Workers for migration, you can reuse them for validation: ensure workers are started with a configuration and environment that includes validation support (the data-validation skill or scai data validate start / create-workflow with --start-worker handles this when running locally), then create a validation workflow as described in Creating a Cloud Data Validation Workflow.

Ask the Snowflake AIM Migration Agent for validation—for example, “Run cloud data validation for my project”—or use the scai data validate … commands below. See the Snowflake AIM Migration Agent for guided prompts.

Using SCAI CLI

Cloud validation commands mirror migration under scai data validate …, sharing the same scai data worker start and scai data orchestrator branches. Deprecated names still run with a short warning:

Deprecated (still runs)Preferred command
scai data generate-cloud-validation-configscai data validate generate-config
scai data cloud-validatescai data validate create-workflow
scai data cloud-validate-statusscai data validate status
scai data cloud-list-validationsscai data validate list

Generate a YAML validation workflow file from the SCAI project (default .scai/config/data-validation-config.yaml):

scai data validate generate-config
scai data validate generate-config --where "source.schema = 'public'"
scai data validate generate-config -o my-validation.yaml --affinity my-team
scai data validate generate-config --help

Create a validation workflow (options mirror migration’s create-workflow; workflow file must be YAML (.yaml or .yml) format:

scai data validate create-workflow --config my-validation.yaml --connection my-snowflake
scai data validate create-workflow --config my-validation.yaml --connection my-snowflake --watch
scai data validate create-workflow --help

Local all-in-one project run (auto-generates config when missing unless --no-auto-config; forces local Worker + Orchestrator + watch, same as migration’s migrate start):

scai data validate start
scai data validate start --config .scai/config/data-validation-config.yaml --connection my-snowflake
scai data validate start --help

Status and listing:

scai data validate status MY_VALIDATION_WORKFLOW_xx_yy_zz --watch
scai data validate list --status running --limit 20

Use scai data orchestrator setup|stop and scai data worker start exactly as described in SnowConvert AI: Data migration.

Orchestrator metadata and environment

By default, workflow and task metadata objects are created under SNOWCONVERT_AI.DATA_MIGRATION, and data validation metadata is created under SNOWCONVERT_AI.DATA_VALIDATION. When you run a local Orchestrator through scai (for example scai data validate start with an implied orchestrator), the CLI sets the Snowflake connection from your project. Advanced deployments can override metadata location with Orchestrator environment variables (CUSTOM_SNOWFLAKE_DATABASE_FOR_METADATA, CUSTOM_SNOWFLAKE_SCHEMA_FOR_DATA_MIGRATION_METADATA, CUSTOM_SNOWFLAKE_SCHEMA_FOR_DATA_VALIDATION_METADATA); if you do, set matching snowflake_database_for_metadata and snowflake_schema_for_data_migration_metadata on each Worker (see Worker configuration).

The Orchestrator runs until you stop it. Cloud Data Validation Workflows require an active Orchestrator to complete. The Orchestrator can be safely stopped and resumed; ongoing workflows resume at that point.

Creating a Cloud Data Validation Workflow

Recommended: From your SnowConvert AI project directory, generate a workflow file and submit it with scai:

scai data validate generate-config -o .scai/config/data-validation-config.yaml
scai data validate create-workflow --config .scai/config/data-validation-config.yaml --connection my-snowflake --watch

Or use a single local run (generates config if needed, starts local Worker and Orchestrator, watches to completion):

scai data validate start --connection my-snowflake

Keep the following in mind:

  • The validation configuration specification is in Validation workflow configuration reference.
  • scai data validate create-workflow accepts a YAML (.yaml or .yml) workflow file (default .scai/config/data-validation-config.yaml from generate-config). The source_platform field must match your source: sqlserver, redshift, teradata, oracle, or postgresql.
  • Workflow names must be alphanumeric and cannot start with a digit.

New workflow rows are inserted into the WORKFLOW table in the data migration metadata schema (default SNOWCONVERT_AI.DATA_MIGRATION) with WORKFLOW_TYPE set to data-validation. Validation results and related objects are stored under the data validation metadata schema (default SNOWCONVERT_AI.DATA_VALIDATION).

To migrate data before validating, use the cloud migration flow in SnowConvert AI: Data migration (scai data migrate generate-config, create-workflow, or start).

Monitoring a Cloud Data Validation Workflow

Each workflow goes through different stages throughout its lifecycle:

  • Pending: No tasks have been created for this workflow yet.
  • Executing: Tasks have been created for this workflow and there are still tasks that haven’t reached a terminal state (COMPLETED or FAILED).
  • Completed: All tasks have reached a terminal state (COMPLETED or FAILED).

In the data validation metadata schema (default SNOWCONVERT_AI.DATA_VALIDATION), the following views can be queried to understand the status of validation workflows:

ViewDescription
TABLE_PROGRESSOne row per validated table. Summarizes overall validation status. Can be filtered by WORKFLOW_ID.
TABLE_PROGRESS_DETAILPer-table breakdown with partition-level L2/L3 status (VALID, INVALID, EXECUTION_ERROR). Can be filtered by WORKFLOW_ID.
DATA_VALIDATION_ERRORErrors encountered during validation. Can be filtered by WORKFLOW_ID.
DATA_VALIDATION_WARNINGNon-fatal warnings, for example, unsupported column types or metric exclusions. Can be filtered by WORKFLOW_ID.

In the same schema, the DATA_VALIDATION_DASHBOARD Streamlit dashboard provides a visual overview of validation progress and results. Dashboard tabs include:

TabContent
TablesAggregate per-table status with drill-down
SchemaL1 column-by-column diff
MetricsL2 comparison
RowsL3 row mismatch summary
CellL3 cell-level value diffs
Table progressPartition-level status
ErrorsExecution errors

Each tab includes a Download CSV button. The sidebar Export snapshot (CSV) bundles all sections into one file.

Console output (local runs)

When you run validation locally with scai data validate start or a foreground Orchestrator/Worker, the CLI prints real-time progress with per-table SUCCESS / FAILURE indicators as validation runs.

CSV reports (optional, local workflows)

For local validation runs, set top-level output_directory_path in the workflow YAML to write CSV reports to disk. Separate files are generated for L1 schema results, L2 metrics results (per column with source and target values), and L3 row-level mismatches. Each row includes a STATUS column (SUCCESS or FAILURE).

You can also inspect validation queries in QUERY_HISTORY using the QUERY_TAG values set by the Orchestrator and Workers. See Query tagging.

Validation outcomes are classified into three categories:

CategoryDescription
OKValues match exactly between the source database and Snowflake.
WarningThe Snowflake table has minor differences that don’t affect the data (for example, higher numeric precision).
ErrorValues don’t match between the original database and the Snowflake database.

When L3 row validation is enabled, mismatches can be classified as FAILURE, DUPLICATE_SOURCE, or DUPLICATE_TARGET. In cell or hybrid mode, reports include primary key values from index_column_list and source versus target values per column. If max_failed_rows_number is reached, collection stops for that partition, but the total failure count is still reported.

Performance and parallelism

Cloud validation scales horizontally:

  • Multiple Workers run simultaneously, each dequeuing tasks from a shared Snowflake queue (SNOWCONVERT_AI.DATA_MIGRATION task procedures by default).
  • Workers are stateless; workflow state lives in Snowflake (SNOWCONVERT_AI.DATA_VALIDATION), so jobs are fault-tolerant and resumable.
  • On each Worker, max_parallel_tasks in TOML controls thread-level parallelism (see Managing Workers).
  • Optional top-level max_threads in the workflow YAML controls how many tables are validated concurrently on a single host: "auto" (default, CPU-based) or an integer from 2 to 32.
  • For large tables, set column_names_to_partition_by and target_partition_size_mb or target_partition_size_rows so L2/L3 scans run in parallel partitions. Optional chunk_number on a table entry overrides the partition count for local validation runs.
  • L3 cell validation streams rows in chunks on large result sets to limit memory use.

Considerations and recommendations

Connecting to Snowflake with a PAT

Use Programmatic Access Tokens (PATs) for connections used by the Orchestrator and Workers. This avoids the need to constantly authenticate through the browser or with an authenticator app. You need to establish a Network Policy or temporarily bypass the requirement for a Network Policy (which you can do from Snowsight).

Running Orchestrator and Workers on SPCS

To leverage Snowflake compute for these tasks:

  1. Prepare Docker images that bundle the Orchestrator and Worker runtimes and configuration expected by your SnowConvert AI release (see Snowflake guidance for SPCS service images).
  2. Push those Docker images to an Image Repository in Snowflake.
  3. Execute the Orchestrator and/or Worker images using Snowpark Container Services.

Keep the following in mind:

  • Run them as Services, not Jobs.
  • You can run only one component (Orchestrator or Workers) in SPCS and the other on a different platform.
  • Monitor the SPCS service and suspend it when it isn’t being used.
  • Depending on the network configuration of the source system, you might need to configure an External Access Integration so that these services can connect to your source system.

Initial testing

For an early test run, use a separate validation configuration whose tables array lists only the table or small set of tables you want to validate. On each of those entries:

  • Set where_clause (and optionally target_where_clause) to an SQL-like predicate so only a subset of rows is validated, for example, a bounded primary-key range or a narrow date range in the source dialect.
  • Keep validation_configuration.row_validation disabled for the first run, and enable it later on a smaller subset of tables.
  • Use a small target_partition_size_mb or target_partition_size_rows to keep partitions tiny during the test.

After you confirm connectivity and results, create your full workflow: remove or relax where_clause, adjust partition sizes, and enable row validation only on the tables where it is needed.

Managing Workers

The time it takes to complete a validation workflow depends on many variables, but the number of Workers (and threads per Worker) has the greatest impact, as it determines how many validation tasks can be executed in parallel. Consider the following:

  • You don’t need to run two Workers on the same machine. If you want more parallelism on a single machine, increase the thread count in the Worker TOML (max_parallel_tasks under [application]) or use the options your skill/scai run exposes for local workers.
  • Network bandwidth greatly affects Worker speed and is shared between threads of a Worker.
  • Even with many Workers and threads running in parallel, the source system might not have enough resources to handle the load.
  • Keep a low Worker count to avoid overloading your source system.
  • Consider stopping some or all Workers when the source system is already under heavy load from other operations.

Query tagging

Both the Orchestrator and the Worker automatically set Snowflake’s QUERY_TAG session parameter on every query they submit. Tags are compact JSON strings containing identifiers such as the workflow ID, task ID, and component version. You can use these tags to filter and attribute validation queries in QUERY_HISTORY:

SELECT query_text, query_tag, start_time
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE TRY_PARSE_JSON(query_tag):DMVF_WORKFLOW_ID IS NOT NULL
ORDER BY start_time DESC;
Tag keyPresent onDescription
DMVF_VERSIONInfrastructure queriesComponent package version.
DMVF_WORKFLOW_IDTask-processing queriesWorkflow that originated the task.
DMVF_TASK_IDTask-processing queriesIndividual task identifier.
DMVF_ORCHESTRATOR_VERSIONOrchestrator task-processing queriesOrchestrator package version.
DMVF_WORKER_VERSIONWorker task-processing queriesWorker package version.

Configuration reference

Validation workflow file overview

Validation is driven by a single YAML or JSON workflow file (default .scai/config/data-validation-config.yaml from scai data validate generate-config). Key sections:

SectionPurpose
source_platform / target_platformSource dialect and target (defaults to Snowflake)
validation_configurationGlobal L1/L2/L3 toggles, thresholds, early stopping
comparison_configurationNumeric tolerance and optional type mapping file
database_mappings / schema_mappingsSource-to-target name maps
tablesTables to validate, with optional per-table overrides (column selection, index keys, partitioning, thresholds)
viewsSame shape as tables, for view validation
output_directory_pathOptional. Directory for CSV reports on local runs
max_threadsOptional. "auto" or integer 2–32 for concurrent table validation

Generate a starter file from your SnowConvert AI project:

scai data validate generate-config -o .scai/config/data-validation-config.yaml

Validation workflow configuration reference

The validation workflow configuration file is a YAML document (serialized to JSON in Snowflake workflow metadata). The following sections describe its structure and properties. Property names use camelCase in YAML (for example, sourcePlatform and fullyQualifiedName).

Note

Quote identifiers that require quoting in your source or target dialect, the same way you would in SQL or JSON strings.

Top-level object

PropertyTypeRequiredDescription
source_platformStringYesSource dialect identifier: sqlserver, redshift, teradata, oracle, or postgresql. On Oracle and PostgreSQL projects, scai data validate generate-config sets oracle or postgresql automatically.
target_platformStringDefaults to Snowflake.
target_databaseStringDefault target database name for tables that don’t specify one.
validation_configurationObjectGlobal validation levels and options. See Validation configuration.
comparison_configurationObjectNumeric tolerance and optional type mapping file. See Comparison configuration.
database_mappingsObjectMap of source database names to Snowflake database names.
schema_mappingsObjectMap of source schema names to Snowflake schema names.
tablesArrayYesAt least one table to validate. See Per-table and per-view entry.
viewsArrayAdditional view entries. Uses the same shape as tables.
use_snowflake_computeBooleanWhen true, enables Snowflake-side computation paths where supported. Default false.
target_partition_size_rowsIntegerDesired rows per partition. Mutually exclusive with target_partition_size_mb. Must be greater than 0. Default is 200 MB when both are omitted.
target_partition_size_mbIntegerDesired MB per partition. Mutually exclusive with target_partition_size_rows. Must be greater than 0. Default is 200 MB when both are omitted.
use_snowpipe_for_resultsBooleanWhen true (default), L2/L3 validation results are ingested into the shared results tables via Snowpipe. Set to false to fall back to per-partition COPY INTO tasks.
output_directory_pathStringOptional. Directory path for CSV reports on local validation runs (L1, L2, and L3).
max_threadsString or IntegerOptional. "auto" (default) or an integer from 2 to 32 for concurrent table validation on one host.

Validation configuration

The validation_configuration object sets global defaults for which validation levels to run. Any field set here can be overridden per table by nesting a validation_configuration object on that table entry.

When validation_configuration is omitted, the Orchestrator applies these defaults: schema validation and metrics validation are enabled; row validation is disabled; when L3 is enabled, row_validation_mode defaults to hybrid; continue_on_failure defaults to false; max_failed_rows_number defaults to 1000; exclude_metrics defaults to false; apply_metric_column_modifier defaults to true; early_stopping is auto-enabled for hybrid L3; early_stop_check_interval_minutes defaults to 5.

PropertyTypeDescription
schema_validationBooleanLevel 1: schema and column consistency checks.
metrics_validationBooleanLevel 2: statistical metrics comparison.
row_validationBooleanLevel 3: row-level or cell-level data comparison.
row_validation_modeStringFor row validation: hybrid (default), row, or cell (see Row validation (L3)).
continue_on_failureBooleanWhether to continue to the next validation level after a failure.
max_failed_rows_numberIntegerCap on failed rows reported for L3 validation per partition; also the early-stop trigger (default 1000). Must be greater than 0 when set.
exclude_metricsBooleanWhether to exclude unsupported metric columns.
apply_metric_column_modifierBooleanWhether to apply metric column modifiers.
early_stoppingBooleanWhen true, stops remaining L3 partitions once max_failed_rows_number is reached. Auto-enabled for hybrid mode.
early_stop_check_interval_minutesIntegerHow often the Orchestrator polls mismatch counts when early_stopping is enabled (default 5). Required when early_stopping is explicitly true.

Comparison configuration

The comparison_configuration object controls numeric tolerance and optional type-mapping overrides used during comparisons.

PropertyTypeDescription
toleranceNumberNumeric comparison tolerance for metrics. Must be greater than 0 when set. Default 0.001 when omitted.
type_mapping_file_pathStringOptional path to a custom type mapping file for comparisons.

Per-table and per-view entry

Each object in tables (or views) describes one database object to validate.

PropertyTypeRequiredDescription
fully_qualified_nameStringYesSource object name. Format depends on the source platform. For Teradata, two-part names (database.table) are supported.
use_column_selection_as_exclude_listBooleanDefault false.
column_selection_listString[]Columns to include or exclude (literals and/or Python regex patterns). See Column filtering with regex patterns.
target_nameStringTarget object name override.
target_databaseStringPer-table target database override.
target_schemaStringPer-table target schema override.
where_clauseStringFilter on the source side.
target_where_clauseStringFilter on the target side.
index_column_listString[]Columns used to align rows on the source.
target_index_column_listString[]Columns used to align rows on the target.
column_mappingsObjectMap of source column name to target column name.
is_case_sensitiveBooleanCase sensitivity for identifiers.
chunk_numberIntegerManual partition count override for local validation runs. Must be greater than 0 when set.
max_failed_rows_numberIntegerOverrides the global cap for this object.
exclude_metricsBooleanPer-object metrics exclusion override.
apply_metric_column_modifierBooleanPer-object modifier override.
object_typeStringTypically TABLE or VIEW.
column_names_to_partition_byString[]Columns used for range-based (NTILE) partitioning during validation. Without this, the table is processed as a single partition.
target_partition_size_rowsIntegerPer-table override for desired rows per partition. Mutually exclusive with target_partition_size_mb. Must be greater than 0.
target_partition_size_mbIntegerPer-table override for desired MB per partition. Mutually exclusive with target_partition_size_rows. Must be greater than 0.
validation_configurationObjectNested object with the same fields as global validation_configuration to override defaults for this object only.

Partitioning

When column_names_to_partition_by is set, the Orchestrator splits the table into range-based partitions. Cloud Data Validation and Data Migration share the same sizing logic:

  1. Compute a target rows-per-partition from whichever user setting is provided. The two settings are mutually exclusive:

    • target_partition_size_rows is used as-is.
    • target_partition_size_mb is converted to rows using target_mb / avg_row_mb.
    • If neither is set, Cloud Data Validation defaults to 200 MB per partition.
  2. Apply an internal cap. System-imposed maximums (not user-configurable) limit partition size to safe infrastructure bounds.

  3. Derive the partition count as ceil(row_count / effective_rows_per_partition), or 1 when the entire table fits in a single partition.

Column filtering with regex patterns

The column_selection_list field on any table or view entry supports literal column names and Python-style regex patterns (mixed in the same list). Toggle inclusion versus exclusion with use_column_selection_as_exclude_list:

use_column_selection_as_exclude_listBehavior
false (default)Include mode — only matched columns are validated
trueExclude mode — all columns except matched ones are validated

Each entry is matched against every column name:

  • Literal — plain string (for example LOAD_DATE), case-insensitive unless is_case_sensitive: true
  • Regex — entry wrapped in single quotes with r"..." inside (for example 'r".*_TS"'), evaluated with Python re.match (anchored at the start of the name; prefix with .* to match anywhere)
tables:
  - fully_qualified_name: mydb.dbo.orders
    use_column_selection_as_exclude_list: true
    column_selection_list:
      - LOAD_DATE
      - 'r".*_TS"'
      - 'r".*_AT"'

Include only specific columns (literal + regex)

tables:
  - fully_qualified_name: mydb.dbo.orders
    use_column_selection_as_exclude_list: false
    column_selection_list:
      - ORDER_ID
      - CUSTOMER_ID
      - 'r"AMOUNT.*"'

This validates only ORDER_ID, CUSTOMER_ID, and any column whose name starts with AMOUNT (for example AMOUNT_USD, AMOUNT_TAX).

Exclude internal/staging columns by prefix

tables:
  - fully_qualified_name: mydb.dbo.customers
    use_column_selection_as_exclude_list: true
    column_selection_list:
      - 'r"_ETL_.*"'
      - 'r"STAGING_.*"'

Case-sensitive regex matching

By default matching is case-insensitive. Set is_case_sensitive: true on the table entry for case-sensitive matching on both literals and regex patterns:

tables:
  - fully_qualified_name: mydb.dbo.events
    is_case_sensitive: true
    use_column_selection_as_exclude_list: true
    column_selection_list:
      - 'r".*_ts"'

With is_case_sensitive: true, only columns ending in lowercase _ts are excluded; CREATED_TS would not match.

Note

Because r"..." entries contain double quotes, wrap them in single quotes in YAML. An empty column_selection_list with include mode (use_column_selection_as_exclude_list: false) validates no columns — always provide at least one entry in include mode. The CLI forwards entries verbatim to the server; no pattern expansion occurs at the CLI layer. Column filtering applies equally to entries in the tables and views arrays.

Behavior summary:

  • The list is evaluated per column; a column matched by any entry is considered selected.
  • Literal and regex entries can be mixed freely in the same list.

Common regex examples (Python re syntax):

PatternMatches
'r".*_TS"'Columns ending in _TS
'r"AMOUNT.*"'Columns starting with AMOUNT
'r".*_ID$"'Columns ending in _ID (anchored)
'r"(CREATED|UPDATED)_AT"'Columns like CREATED_AT or UPDATED_AT
'r"COL_\d+"'Columns like COL_1, COL_2, and so on
'r".*TEMP.*"'Columns containing TEMP anywhere in the name

Note

re.match anchors the pattern to the start of the column name. Prefix with .* to match anywhere in the name (for example 'r".*TEMP.*"').

Example: SQL Server validation

The following configuration runs schema and metrics validation on two SQL Server tables, overrides defaults on the second table, and enables row-level validation there:

{
  "source_platform": "sqlserver",
  "database_mappings": {
    "SampleStoreDB": "samplestoredb"
  },
  "schema_mappings": {
    "data_migration_cloud_test": "data_migration_cloud_test"
  },
  "validation_configuration": {
    "schema_validation": true,
    "metrics_validation": true,
    "row_validation": false
  },
  "comparison_configuration": {
    "tolerance": 0.001
  },
  "tables": [
    {
      "fully_qualified_name": "SampleStoreDB.data_migration_cloud_test.store_employee",
      "target_name": "target_employee",
      "column_names_to_partition_by": ["ID"]
    },
    {
      "fully_qualified_name": "SampleStoreDB.data_migration_cloud_test.Sales_Simple",
      "column_names_to_partition_by": ["ID"],
      "index_column_list": ["ID"],
      "target_partition_size_mb": 256,
      "validation_configuration": {
        "row_validation": true,
        "row_validation_mode": "cell",
        "max_failed_rows_number": 500
      },
      "where_clause": "is_deleted = 0"
    }
  ]
}

Example: Redshift validation after UNLOAD migration

The following configuration validates Redshift tables that were migrated using the UNLOAD extraction strategy. Schema and metrics validation are enabled globally, and row validation is enabled on a single high-value table:

{
  "source_platform": "redshift",
  "target_database": "TARGET_DB",
  "schema_mappings": {
    "ecommerce_raw": "ecommerce_raw"
  },
  "validation_configuration": {
    "schema_validation": true,
    "metrics_validation": true,
    "row_validation": false
  },
  "tables": [
    {
      "fully_qualified_name": "snowconvert_demo.ecommerce_raw.customers",
      "column_names_to_partition_by": ["customer_id"]
    },
    {
      "fully_qualified_name": "snowconvert_demo.ecommerce_raw.orders",
      "column_names_to_partition_by": ["order_id"],
      "index_column_list": ["order_id"],
      "validation_configuration": {
        "row_validation": true,
        "row_validation_mode": "row"
      }
    }
  ]
}

For the corresponding migration workflow, see Redshift UNLOAD in the Data Migration page.

Example: Oracle validation

The following configuration validates Oracle tables after a cloud data migration. Schema and metrics validation are enabled globally; row validation is enabled on one table with partition sizing:

source_platform: oracle
target_database: TARGET_DB
schema_mappings:
  HR: HR
validation_configuration:
  schema_validation: true
  metrics_validation: true
  row_validation: false
comparison_configuration:
  tolerance: 0.001
tables:
  - fully_qualified_name: HR.EMPLOYEES
    column_names_to_partition_by:
      - EMPLOYEE_ID
  - fully_qualified_name: HR.DEPARTMENTS
    column_names_to_partition_by:
      - DEPARTMENT_ID
    index_column_list:
      - DEPARTMENT_ID
    validation_configuration:
      row_validation: true
      row_validation_mode: row

For Worker TOML and scai connection add-oracle, see SnowConvert AI: Data migration (Oracle tab) and Source connection configuration examples.

Example: PostgreSQL validation

The following configuration validates PostgreSQL tables after a cloud data migration. Schema and metrics validation are enabled globally; row validation is enabled on one table with partition sizing:

source_platform: postgresql
target_database: TARGET_DB
schema_mappings:
  public: PUBLIC
validation_configuration:
  schema_validation: true
  metrics_validation: true
  row_validation: false
comparison_configuration:
  tolerance: 0.001
tables:
  - fully_qualified_name: public.customers
    column_names_to_partition_by:
      - customer_id
  - fully_qualified_name: public.orders
    column_names_to_partition_by:
      - order_id
    index_column_list:
      - order_id
    validation_configuration:
      row_validation: true
      row_validation_mode: row

For Worker TOML and scai connection add-postgresql, see SnowConvert AI: Data migration (PostgreSQL tab) and Source connection configuration examples.

Example: Teradata validation

The following configuration validates Teradata tables and a view using L1/L2 globally and hybrid L3 with early stopping:

source_platform: teradata
target_platform: Snowflake
target_database: MY_DATABASE
validation_configuration:
  schema_validation: true
  metrics_validation: true
  row_validation: true
  row_validation_mode: hybrid
  continue_on_failure: true
  max_failed_rows_number: 1000
  early_stopping: true
  early_stop_check_interval_minutes: 5
comparison_configuration:
  tolerance: 0.001
tables:
  - fully_qualified_name: my_database.sales_transactions
    target_schema: PUBLIC
    target_name: SALES_TRANSACTIONS
    index_column_list:
      - TRANSACTION_ID
    column_names_to_partition_by:
      - TRANSACTION_ID
    target_partition_size_mb: 200
  - fully_qualified_name: my_database.known_problematic_table
    target_schema: PUBLIC
    target_name: KNOWN_PROBLEMATIC_TABLE
    column_names_to_partition_by:
      - ORDER_ID
    validation_configuration:
      max_failed_rows_number: 100
      early_stop_check_interval_minutes: 1
views:
  - fully_qualified_name: my_database.sales_summary_view
    target_schema: PUBLIC
    target_name: SALES_SUMMARY_VIEW
    index_column_list:
      - ID
    target_index_column_list:
      - ID
    column_names_to_partition_by:
      - ID
    target_partition_size_rows: 50000

For Teradata views, L1 schema checks are reduced (column existence and datatype only). Use L2/L3 for deeper assurance. See Teradata: Complete configuration reference (view validation limitations and type mappings).

Teradata to Snowflake type mappings

During validation comparisons, these Teradata source types map automatically to Snowflake types:

Teradata typeSnowflake type
NUMERIC, NUMBER, DECIMALNUMBER
CHAR, VARCHARVARCHAR
CLOBTEXT
BYTEINT, SMALLINT, INTEGER, BIGINTNUMBER
FLOAT, REAL, DOUBLE PRECISIONFLOAT
DATEDATE
TIMETIME
TIMESTAMPTIMESTAMP_NTZ
TIMESTAMP WITH TIME ZONETIMESTAMP_TZ
BYTE, VARBYTE, BLOBBINARY
JSON, XMLVARIANT
ST_GEOMETRYGEOGRAPHY
INTERVAL types (YEAR, DAY, HOUR, and so on)VARCHAR
PERIOD(DATE), PERIOD(TIME), PERIOD(TIMESTAMP), and TIME ZONE variantsVARCHAR

Use comparison_configuration.type_mapping_file_path to supply a custom mapping file when needed.

Worker configuration

The Worker configuration file uses TOML (https://toml.io/) format and is shared between Data Migration and Cloud Data Validation. Workers that execute validation tasks must be started through a path that includes validation support (for example the data-validation skill or scai data validate start / create-workflow with --start-worker).

SectionPropertyTypeDescription
Top levelselected_task_sourceStringRequired for cloud workflows. Use "snowflake_stored_procedure".
[task_source.snowflake_stored_procedure]connection_nameStringSnowflake profile used for task-queue stored procedures, or "@SPCS_CONNECTION" on SPCS.
[application]max_parallel_tasksIntegerMaximum parallel tasks (threads).
[application]task_fetch_intervalIntegerSeconds between idle polls for new tasks.
[application]lease_refresh_intervalIntegerOptional. Seconds between lease renewals (default 120 in code defaults).
[application]affinityStringOptional. Worker affinity for task routing.
[application]snowflake_database_for_metadataStringOptional. Task-queue database (default SNOWCONVERT_AI). Must match CUSTOM_SNOWFLAKE_DATABASE_FOR_METADATA on the Orchestrator if overridden.
[application]snowflake_schema_for_data_migration_metadataStringOptional. Task-queue schema (default DATA_MIGRATION). Must match CUSTOM_SNOWFLAKE_SCHEMA_FOR_DATA_MIGRATION_METADATA if overridden.
[application]local_results_directoryStringOptional. Export base directory (default ~/.data_exchange_agent/result_data; expanded to absolute at load).
[connections.source.*](per engine)ObjectSource connection(s).
[connections.target.snowflake_connection_name]connection_nameStringSnowflake profile for data sessions (validation queries, uploads).

When selected_task_source is "snowflake_stored_procedure", the Worker calls task-queue stored procedures using application.snowflake_database_for_metadata and application.snowflake_schema_for_data_migration_metadata. These values are independent of the Snowflake session defaults (SNOWFLAKE_DATABASE, SNOWFLAKE_SCHEMA) in the connection profile. Validation results metadata lives under SNOWCONVERT_AI.DATA_VALIDATION by default (CUSTOM_SNOWFLAKE_SCHEMA_FOR_DATA_VALIDATION_METADATA on the Orchestrator); Workers do not set that schema in TOML—they reach it through the same Snowflake connection and orchestrated tasks.

An example configuration file looks like this:

selected_task_source = "snowflake_stored_procedure"

[task_source.snowflake_stored_procedure]
connection_name = "connection_name"

[application]
max_parallel_tasks = 4
task_fetch_interval = 30
# Optional: only if the Orchestrator uses CUSTOM_SNOWFLAKE_* overrides for metadata location
# snowflake_database_for_metadata = "SNOWCONVERT_AI"
# snowflake_schema_for_data_migration_metadata = "DATA_MIGRATION"

# SQL Server connection (standard authentication)
[connections.source.sqlserver]
username = "username"
password = "password"
database = "database_name"
host = "127.0.0.1"
port = 1433

# Snowflake target connection (data sessions)
[connections.target.snowflake_connection_name]
connection_name = "connection_name"

Note

Only one source connection is needed.

Source connection configuration examples

The following examples show the supported source connection types for Cloud Data Validation.

1. SQL Server (standard authentication)

[connections.source.sqlserver]
username = "username"
password = "password"
database = "database_name"
host = "127.0.0.1"
port = 1433

The Worker automatically detects the best available ODBC driver for SQL Server, preferring newer versions (ODBC Driver 18 > 17 > 13 > 11). To manually select a driver, set odbc_driver to the exact name returned by pyodbc.drivers():

[connections.source.sqlserver]
odbc_driver = "ODBC Driver 17 for SQL Server"

The encrypt and trust_server_certificate parameters are optional. When omitted, the ODBC driver uses its default behavior:

  • ODBC Driver 17 and below: encryption is disabled by default.
  • ODBC Driver 18 and above: encryption is mandatory by default.
[connections.source.sqlserver]
username = "sa"
password = "mypassword"
database = "mydb"
host = "my-server.example.com"
port = 1433
encrypt = true
trust_server_certificate = false

For development environments or SQL Servers without encryption support, either omit the encryption parameters or set encrypt = false.

2. Amazon Redshift (IAM authentication)

[connections.source.redshift]
username = "demo-user"
database = "demo_db"
auth_method = "iam-provisioned-cluster"
cluster_id = "my-aws-cluster"
region = "us-west-2"
access_key_id = "your-access-key-id"
secret_access_key = "your-secret-access-key"

3. Amazon Redshift (standard authentication)

[connections.source.redshift]
username = "myuser"
password = "mypassword"
database = "mydatabase"
host = "my-cluster.abcdef123456.us-west-2.redshift.amazonaws.com"
port = 5439
auth_method = "standard"

4. Teradata

The Worker supports two Teradata drivers and automatically selects the best one available:

  1. teradatasql (preferred). Pure Python driver, no OS-level ODBC installation required. The Snowflake AIM Migration Agent or Teradata-enabled worker environment supplies this driver when configured.
  2. ODBC fallback. If teradatasql isn’t installed, the Worker falls back to pyodbc with the Teradata ODBC driver. Set driver_name to the exact name returned by pyodbc.drivers().

When teradatasql is available, driver_name is ignored and no ODBC driver needs to be installed on the host. Use dbc_name when your Teradata COP or TDPID alias differs from host.

[connections.source.teradata]
host = "your-teradata-host.example.com"
port = 1025
database = "tpcds"
username = "your_username"
password = "your_password"
# driver_name = "Teradata Database ODBC Driver 17.20"  # only needed for ODBC fallback
# dbc_name = "TDPID_ALIAS"  # optional; defaults to host

5. Oracle (standard authentication, basic / EZConnect)

Same shape as data migration Worker TOML. See SnowConvert AI: Data migration for a full Oracle example.

[connections.source.oracle]
oracle_connection_mode = "basic"
username = "scott"
password = "your_password"
database = "ORCL"
host = "db.example.com"
port = "1521"

6. PostgreSQL (standard authentication)

Same shape as data migration Worker TOML. See SnowConvert AI: Data migration for managed-host and local Docker examples.

[connections.source.postgresql]
auth_method = "standard"
username = "scai"
password = "your_password"
database = "analytics"
host = "db.example.com"
port = "5432"
ssl_mode = "Require"
use_copy = true

Command reference

The following table lists scai commands used for Cloud Data Validation. For worker TOML, workflow fields, and Snowflake observability, see Using SCAI CLI, Worker configuration, and Validation workflow configuration reference. Orchestrator and worker install paths are covered in the Snowflake AIM Migration Agent.

CommandPurpose
scai data validate generate-configGenerate a validation workflow file (YAML or JSON by extension) from the SnowConvert AI project.
scai data validate create-workflowCreate a cloud validation workflow; --config accepts .yaml, .yml, or .json.
scai data validate startLocal all-in-one: generate config if needed, start local Worker + Orchestrator, watch.
scai data validate statusShow or watch validation workflow status.
scai data validate listList validation workflows (--status, --limit, --connection).
scai data worker generate-config <path>Generate a Worker TOML from the project (see Generate and verify Worker configuration).
scai data worker start --local <config.toml>Start the Data Exchange Worker locally.
scai data worker start --auto-config [PATH]Emit a Worker TOML template (alternative to generate-config).
scai connection test -l <dialect> -s <profile> --jsonVerify source connectivity before starting Workers.
scai data orchestrator setup / stopDeploy or stop the SPCS Orchestrator service (shared with data migration).