Cortex Analyst semantic model specification¶

Why use semantic models?¶

Cortex Analyst allows users to query Snowflake data using natural language. However, business users often use language that is incompatible with the schema. While users specify domain-specific business terms in their questions, the underlying data is often stored using technical abbreviations. For example, “CUST” is often used for customers. This disconnect, combined with the schema’s lack of semantic context, makes it difficult for Cortex Analyst to provide accurate answers.

Semantic models map business terminology to database schemas and add contextual meaning. For example, when a user asks about “total revenue last month”, the semantic model can define “revenue” as net revenue and “last month” as the previous calendar month. This mapping helps Cortex Analyst understand the user’s intent and provide accurate answers.

Note

Semantic models are considered metadata.

Key concepts¶

Note

In this topic, database artifacts are called “physical” objects, while semantic model artifacts are called “logical” objects.

The structure and concepts of the semantic model are similar to those in database schemas, but semantic models allow you to provide more semantic information about your data.

Semantic layer concepts¶

The structure and concepts that define the logical semantic layer are similar to those that define the physical database layer. The following are the concept types for a semantic layer:

Logical table-level
Model-level
Additional context

Semantic model concepts showing semantic model at the top.

Logical table-level concepts¶

A logical table is a foundational concept of Snowflake’s semantic model. It represents either a physical database table or a view. It typically corresponds to business entities (such as customers, orders, or suppliers) or dimensions (such as location or time). Each row in a logical table typically represents a unique instance of the entity, such as a customer ID.

Logical tables contain the following kinds of columns:

Facts (quantitative data about business events)
Dimensions (who, what, where, and how)
Time (when the event occurred)
Filters associated with the logical table to allow query results to be limited to specific data subsets.

You can define metrics by using aggregations or combinations of other logical objects.

The following examples use the TPC-H schema (https://www.tpc.org/TPC_Documents_Current_Versions/pdf/TPC-H_v3.0.1.pdf), which includes the LINEITEM fact table. For a complete YAML implementation, download the snow_tpch file.

TPC-H schema showing the relationship between the LINEITEM fact table and the DIMENSIONS tables.

All of the following logical table-level examples belong to the order_lineitems logical table.

tables:
  - name: order_lineitems
    description: >
      The order line items table contains detailed information about each item within an
      order, including quantities, pricing, and dates.

    base_table:
      database: SNOWFLAKE_SAMPLE_DATA
      schema: TPCH_SF1
      table: LINEITEM
    primary_key:
      columns:
        - order_key
        - order_lineitem_number

Copy

A dimension represents categorical data that provides context to facts, such as product, customer, or location information. Dimensions typically contain descriptive text values, such as product names or customer addresses. They are used to filter, group, and label facts in analyses and reports.

dimensions:
- name: customer_location
  synonyms:
    - "customer region"
    - "customer area"
  description: Customer location, combining nation and region names.
  expr: CONCAT(nation.N_NAME, ', ', region.R_NAME)
  data_type: TEXT
  unique: false

 cortex_search_service:
    service: customer_location_search
    literal_column: <string>
    database: temp
    schema: service
  is_enum: false

Copy

A time dimension provides temporal context for analyzing facts across different periods. It enables tracking metrics over specific time intervals (dates, months, years) and supports analyses such as trend identification and period-over-period comparisons.

time_dimensions:
  - name: shipment_duration
    synonyms:
      - "shipping time"
      - "shipment time"
    description: The time it takes for items to be shipped.

    expr: DATEDIFF(day, lineitem.L_SHIPDATE, lineitem.L_RECEIPTDATE)
    data_type: NUMBER
    unique: false

Copy

Facts are measurable, quantitative data that provide context for analyses. Facts represent numeric values related to business processes, such as sales, cost, or quantity. A fact is an unaggregated, row-level concept.

# Fact columns in the logical table.
facts:
  - name: net_revenue
    synonyms:
      - "revenue after discount"
      - "net sales"
    description: Net revenue after applying discounts.

    expr: lineitem.L_EXTENDEDPRICE * (1 - lineitem.L_DISCOUNT)
    data_type: NUMBER

Copy

For semantic views, facts can be made private (accessible in the semantic view, but not by end users) by setting the access_modifier field to private_access. By default, facts are public (accessible by users).

A filter is a condition that limits query results to specific data subsets based on criteria such as time period, location, or category.

- name: north_america
  synonyms:
    - "NA"
    - "North America region"
  description: >
    Filter to restrict data to orders from North America.
  comments: Used for analysis focusing on North American customers.
  expr: nation.N_NAME IN ('Canada', 'Mexico', 'United States')

Copy

A metric is a quantifiable measure of business performance expressed as a SQL formula. You can use metrics as key performance indicators (KPIs) in reports and dashboards. You can calculate two kinds of metrics:

Regular metrics aggregate values (using functions such as SUM or AVG) over a fact column.
Derived metrics are calculated from existing metrics, using arithmetic operations such as addition or division, or aggregations of dimensions or facts.

Derived metrics are supported only in semantic views.

Define metrics at their most granular level for aggregation at higher levels. For example, define total_revenue at the line item level to allow aggregation by customer, supplier, or region.

The following example shows two regular metric calculation: a simple one and a more complex one. These are defined inside a table definition, so they are scoped to that logical table.

metrics:

    # Simple metric referencing objects from the same logical table
    - name: total_revenue
      expr: SUM(lineitem.l_extendedprice * (1 - lineitem.l_discount))

    # Complex metric referencing objects from multiple logical tables.
    # The relationships between tables have been defined below.
    - name: total_profit_margin
      description: >
                  The profit margin from orders. This metric is not additive
                  and should always be calculated directly from the base tables.
      expr: (SUM(order_lineitems.net_revenue) -
            SUM(part_suppliers.part_supplier_cost * order_lineitems.lineitem_quantity))
            / SUM(order_lineitems.net_revenue)

Copy

Tip

For semantic views, metrics can be made private (accessible in the semantic view, but not by end users) by setting the access_modifier field to private_access. By default, metrics are public (accessible by users).

Allowed references¶

Expressions for dimensions, facts, filters, or regular table-scoped metrics can reference:

Physical columns from their own base table
Logical columns within the same logical table
Logical columns from other logical tables in the semantic model

For semantic views, derived metrics can reference the following items in an expression:

Aggregations of dimensions and facts defined in any logical table in the semantic view.
Scalar expressions of metrics defined in any logical table in the semantic view.
Other derived metrics.

Note

Expressions can’t reference physical columns from other physical tables.

Model level concepts¶

Relationships connect logical tables through joins on shared keys. For example, there could be a relationship between the customers and orders tables through a join on the customer_id column. You can use joins to analyze order data with customer attributes.

relationships:

  # Relationship between orders and lineitems
  - name: order_lineitems_to_orders
    left_table: order_lineitems
    right_table: orders
    relationship_columns:
      - left_column: order_key
        right_column: order_key
    # For semantic views, do not specify
    # join_type or relationship_type.
    join_type: left_outer
    relationship_type: many_to_one

  # Relationship between lineitems and partsuppliers
  - name: order_lineitems_to_part_suppliers
    left_table: order_lineitems
    right_table: part_suppliers
    # The relationship requires equality of multiple columns from each table
    relationship_columns:
      - left_column: part_key
        right_column: part_key
      - left_column: supplier_key
        right_column: supplier_key
    # For semantic views, do not specify
    # join_type or relationship_type.
    join_type: left_outer
    relationship_type: many_to_one

Copy

A verified query repository (VQR) is a collection of questions and corresponding SQL queries that are verified to be correct. You can use the queries to help improve the accuracy of Cortex Analyst’s’ results.

You can use Custom instructions in Cortex Analyst to have greater control over Cortex Analyst’s SQL query generation. Within the custom instructions, you provide the unique context of your business to the LLM.

Cortex Analyst semantic models are specified in YAML (https://yaml.org/spec/). The model provides the necessary semantic information to answer natural language questions with high precision.

YAML format¶

YAML is balanced between human readability and precision. Business users can understand it while data engineers and analysts can define technical concepts clearly.

The general syntax of a Cortex Analyst semantic model is:

# Name and description of the semantic model.
name: <name>
description: <string>
comments: <string>

# Logical table-level concepts

# A semantic model can contain one or more logical tables.
tables:

  # A logical table on top of a base table.
  - name: <name>
    description: <string>


    # The fully qualified name of the base table.
    base_table:
      database: <database>
      schema: <schema>
      table: <base table name>

    # Dimension columns in the logical table.
    dimensions:
      - name: <name>
        synonyms: <array of strings>
        description: <string>

        expr: <SQL expression>
        data_type: <data type>
        unique: <boolean>
        cortex_search_service:
          service: <string>
          literal_column: <string>
          database: <string>
          schema: <string>
        is_enum: <boolean>


    # Time dimension columns in the logical table.
    time_dimensions:
      - name: <name>
        synonyms:  <array of strings>
        description: <string>

        expr: <SQL expression>
        data_type: <data type>
        unique: <boolean>

    # Fact columns in the logical table.
    facts:
      - name: <name>
        synonyms: <array of strings>
        description: <string>
        access_modifier: < public_access | private_access >  # Supported only for semantic views.
                                                             # Default is public_access.

        expr: <SQL expression>
        data_type: <data type>


    # Regular metrics scoped to the logical table.
    metrics:
      - name: <name>
        synonyms: <array of strings>
        description: <string>
        access_modifier: < public_access | private_access >  # Supported only for semantic views.
                                                             # Default is public_access.

        expr: <SQL expression>


    # Commonly used filters over the logical table.
    filters:
      - name: <name>
        synonyms: <array of strings>
        description: <string>

        expr: <SQL expression>

# Model-level concepts

# Relationships between logical tables
relationships:
  - name: <string>

    left_table: <table>
    right_table: <table>
    relationship_columns:
      - left_column: <column>
        right_column: <column>
      - left_column: <column>
        right_column: <column>
    # For semantic views, do not specify
    # join_type or relationship_type.
    join_type: <left_outer | inner>
    relationship_type: < one_to_one | many_to_one >

# Derived metrics scoped to the semantic view.
# Derived metrics are supported only for semantic views.
metrics:
  - name: <name>
    synonyms: <array of strings>
    description: <string>
    access_modifier: < public_access | private_access >  # Default is public_access

    expr: <SQL expression>


# Additional context concepts

#  Verified queries with example questions and queries that answer them
verified_queries:
  - name:        # A descriptive name of the query.

    question:    # The natural language question that this query answers.
    verified_at: # Optional: Time (in seconds since the UNIX epoch, January 1, 1970) when the query was verified.
    verified_by: # Optional: Name of the person who verified the query.
    use_as_onboarding_question:  # Optional: Marks this question as an onboarding question for the end user.
    sql:                         # The SQL query for answering the question

Copy

For an example specification, see the snow_tpch.yaml file.

Create a semantic view using the AI-assisted model generator¶

Use the AI-assisted generator to create a semantic model that combines semantic information from multiple sources. For more information about the AI-assisted generator, see Using the AI-assisted generator to create a semantic view.

To create the model, complete the following steps:

Sign in to Snowsight.
In the navigation menu, select AI & ML » Cortex Analyst.
In the title bar, select Create new » Create new Semantic Model.
Select a location to store the semantic model after creation.
Enter a name for the semantic model.
This name automatically populates the File name field.
For Description, specify information about the semantic model.
Optional: Enter a different file name.
Select Next.
Optional: To provide context, for SQL Queries, provide example questions and their respective SQL queries that you want to use as part of the model.

Note

The AI-assisted generator runs EXPLAIN on these queries, so these queries should return actual data. If they do not, Snowflake doesn’t guarantee proper performance.

For Select tables, provide the data source that you’re using to create the semantic model. You must provide at least one table or view. There’s no limit on the tables or views that you can specify, but we recommend not using more than 10 for the semantic model.
Select Next.
For Select columns, select the columns that you’re using to create the semantic model. You can select all the columns or specific columns. For performance reasons, we recommend not using more than 50 columns.
Select whether you want to add sample values from each column to the semantic model.

Sample values help improve the accuracy of Cortex Analyst’s results.
Select whether you want to add AI-generated descriptions for tables and columns to the semantic model.

The AI-generated descriptions are based on the column names and sample values.
Select Create and save.

You can view the progress of the model generation, including details about the steps that the model generator is taking, on the semantic model page. The process can take a few minutes.

To make further modifications, edit the model by either using Snowsight or editing the YAML file directly.

Cortex Analyst automatically generates suggestions to improve the semantic model after creation. Cortex Analyst uses the query history accessible by the role used to create the semantic model to generate both relationships and verified query suggestions. It might take some time for the suggestions to appear. After the suggestions appear, you can review them and apply them to the model as needed.

Open an existing semantic model¶

After you’ve created a semantic model, you can open it in Snowsight. To open a semantic model, do the following:

Sign in to Snowsight.
In the navigation menu, select AI & ML » Cortex Analyst.
On the Semantic models tab, select the database, schema, and stage where the semantic model was saved.
From the list of semantic models, select the semantic model that you want to open.
Select Open.

Note

If you don’t see your semantic model within your stage, try refreshing the list of models, not the page.

Tips for creating a semantic model¶

Organize your YAML file by business domain or topic
- Structure your YAML files to align with specific business domains or topics, keeping the scope focused. For instance, create separate semantic models for sales analytics and marketing analytics.
- Tailor your use cases based on the target audience, expected questions or KPIs, and required data. Well-defined use cases lead to richer semantic models and more effective data retrieval.
Think from the end-user perspective
- Identify key questions that users are likely to ask about the topic and include only the tables and columns needed to answer those questions.
- Use names and synonyms that are similar to the vocabulary used by your end users.
- Include important details in description fields that would be helpful to someone writing queries on this dataset for the first time, such as the timezone of a DATETIME column.
Capture complex calculations

Incorporate more difficult or business-specific queries into expressions.
Use wide tables instead of long tables

If you have a table with columns like “metric” and “value”, flatten the table so that each metric is a column. This approach provides the model with more semantic information on each metric.
Review auto-generated descriptions

If you’re using the semantic model generator (https://github.com/Snowflake-Labs/semantic-model-generator), it attempts to automatically generate descriptions for your tables and columns. Always review these descriptions to ensure that they are reasonable and relevant; make modifications as needed.
Start simple and expand gradually

A well-scoped semantic file ensures higher precision and accuracy of results. Start with a small number of tables and columns and expand the semantic model YAML gradually to cover more kinds of questions. Remember that YAML building is a continuous process.
Include verified queries

A verified query repository (VQR), which is a collection of plain-English questions and queries that answer them, can help improve the accuracy and trustworthiness of results.

Known limitations¶

Cortex Analyst imposes a 2 MB size limit on the semantic model or semantic view to restrict the size of API inputs.

Specification¶

This section contains detailed specifications for the key concepts described in the preceding sections.

Semantic Model¶

A semantic model represents a collection of tables. The model contains descriptions of tables, each of which contains descriptions of specifics aspects of the table. Each table described in the model maps to a physical base table in Snowflake.

It has the following fields:

Required

name

A descriptive name for this semantic model.

Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.

Optional

These fields are not required, but should be included whenever possible.

description: A description of this semantic model, including details of what kind of analysis it’s useful for.
tables: A list of logical tables in this semantic model.
relationships: A list of joins between logical tables.

Logical table¶

You can think of a logical table as a view over a physical database table or view. It has the following fields:

Required

name

A descriptive name for this table.

Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.

base_table

A fully qualified name of the underlying base table in the database.

Optional

These fields are not required, but should be included whenever possible.

synonyms: A list of other terms/phrases used to refer to this table. It must be unique across synonyms within the logical table.
description: A description of this table.
primary_key: The primary key columns for this table. Required if you’re defining relationnships.
dimensions: A list of dimension columns in this table.
time_dimensions: A list of time dimension columns in this table.
facts: A list of fact columns in this table.
metrics: A list of metrics in this table.
filters: Predefined filters on this table, if any.

Dimension¶

A dimension describes categorical values such as state, user_type, platform, etc. It has the following fields:

Required

name

A descriptive name for this dimension.

Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.

expr

The SQL expression for this dimension. This could be a reference to a physical column or a SQL expression with one or more columns from the underlying base table.

data_type

The data type of this dimension. For an overview of all data types in Snowflake see SQL data types reference. Note that VARIANT, OBJECT, GEOGRAPHY, and ARRAY are currently not supported.

Optional

These fields are not required, but should be included whenever possible.

synonyms

A list of other terms/phrases used to refer to this dimension. Must be unique across all synonyms in this semantic model.

description

A brief description about this dimension, including what data it has.

unique

A boolean value that indicates this dimension has unique values.

sample_values

Sample values of this column, if any. Add any value that is likely to be referenced in the user questions.

is_enum

A Boolean value. If True, the values in the sample_values field are taken to be the full list of possible values, and the model only chooses from those values when filtering on that column.

cortex_search_service

Specifies the Cortex Search Service to use for this dimension. It has the following fields:

service: The name of the Cortex Search Service.
literal_column: (optional) The column in the Cortex Search Service that contains the literal values.
database: (optional) The database where the Cortex Search Service is located. Defaults to base_table’s database.
schema: (optional) The schema where the Cortex Search Service is located. Defaults to base_table’s schema.

cortex_search_service replaces the cortex_search_service_name field, which could only specify the name. cortex_search_service_name has been deprecated.

Time Dimension¶

A time dimension describes time values, such as sale_date, created_at, and year. It has the following fields:

Required

name

A descriptive name for this time dimension.

Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.

expr

The SQL expression for this column. This could be a reference to a physical column or a SQL expression with one or more columns from the underlying base table.

data_type

The data type of this time dimension. For an overview of all data types in Snowflake see SQL data types reference. Note that VARIANT, OBJECT, GEOGRAPHY, and ARRAY are currently not supported.

Optional

These fields are not required, but should be included whenever possible.

synonyms: A list of other terms/phrases used to refer to this time dimension. Must be unique across all synonyms in this semantic model.
description: A brief description about this dimension, including what data it has. Provide information that’ll help someone writing queries using this table. For example, for DATETIME columns, specify the timezone of the data.
unique:: A boolean value that indicates this column has unique values.
sample_values:: Sample values of this column, if any. Add any values that are likely to be referenced in the user questions. This field is optional.

Fact¶

A fact describes numerical values, such as revenue, impressions, and salary. Facts used to be called measures in earlier releases of Cortex Analyst. Facts are backward compatible with measures. Facts have the following fields:

Required

name

A descriptive name for this fact.

Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.

expr

This SQL expression can refer to either a physical column in the same logical table’s base physical table or a logical column (fact, dimension, or time dimension) within that logical table.

data_type

The data type of this fact. For an overview of all data types in Snowflake see SQL data types reference. Note that VARIANT, OBJECT, GEOGRAPHY, and ARRAY are currently not supported.

Optional

The following fields are not required, but should be included whenever possible.

synonyms: A list of other terms/phrases used to refer to this measure. Must be unique across all synonyms in this semantic model.
description: A brief description about this measure, including what data this column has.
unique: A boolean value that indicates this column has unique values.
sample_values: Sample values of this column, if any. Add any values that are likely to be referenced in the user questions.

The following field is optional. If it is not included, it defaults to public_access.

access_modifier: Supported only for semantic views. The possible values are public_access and private_access. If this is set to private_access, the fact is accessible in the semantic view, but not by end users.

Filter¶

A filter represents a SQL expression that’s used for filtering. It has the following fields:

Required

name: A descriptive name for this filter.
expr: The SQL expression of this filter, referencing logical columns.

Optional

These fields are not required, but should be included whenever possible.

synonyms: A list of other terms/phrases used to refer to this filter. Must be unique across all synonyms in this semantic model.
description: A brief description about this filter, including details of what this filter is typically used for.

Metric¶

A metric describes quantifiable measures of business performance, such as total revenue, average order value, or customer count. Metrics can be regular metrics or derived metrics.

Regular metrics are scoped to a logical table and can reference logical columns (facts, dimensions, or time dimensions) within the same logical table, or logical columns from other logical tables in the semantic model. In the semantic model, the metric element is inside a table element. Regular metrics use aggregate and window functions to calculate overall values for the table.
Derived metrics (which are supported only for semantic views) are scoped to the semantic view and provide a way of combining other metrics. In the semantic model, the metric element is inside the semantic_model element. In the expression for the metric, you can refer only to other metrics (including other derived metrics) or aggregations of dimensions or facts.

Metrics have the following fields.

Required

name

A descriptive name for this metric. This name deterines whether it is a regular metric.

Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.

expr

The SQL expression for this column. The format of the expression depends on whether the metric is a regular metric or a derived metric.

Regular metrics, which are scoped to a table, reference a logical column (fact, dimension, or time dimension) in the same logical table or a logical column from another logical table in the semantic model and use aggregate or window functions to calculate overall values.
Derived metrics can use expressions that include:
- Scalar expressions that can reference regular metrics or other derived metrics.
- Aggregations of dimensions or facts.

Optional

The following fields are not required, but should be included whenever possible.

synonyms: A list of other terms/phrases used to refer to this metric. Each synonym must be unique across all synonyms in the semantic model.
description: A brief description of this metric, including what data this column has.
sample_values: Sample values of this column, if any. Add any values that are likely to be referenced in the user questions.

The following field is optional. If it is not included, it defaults to public_access.

access_modifier: Supported only for semantic views. The possible values are public_access or private_access. If this is set to private_access, the metric is accessible in the semantic view, but not by end users.

Base Table¶

A base table is used to represent fully qualified table names. This is the physical table that a logical table maps to. It has the following fields:

Required

database: Name of the database.
schema: Name of the schema.
table: Name of the table.

Primary Key¶

A primary key represents the columns that uniquely represent each row of the table. The primary key is required if the table is used in a relationship. It has the following fields:

Required

columns: A list of dimension columns uniquely representing the table.

Relationships¶

Defines join relationships between logical tables. To ensure proper join functionality, primary keys must be defined for the tables involved in relationships. It has the following fields:

Required

name

A unique identifier for the relationship.

left_table

Logical table name as defined earlier in your YAML file. For many-to-one relationships, the left table should be the many side of the relationship for optimal performance.

right_table

Logical table name as defined earlier in your YAML file. For many-to-one relationships, the right table must be the one side of the relationship for optimal performance.

relationship_columns

A list of equal columns from each of the left table and right table representing the join path.

join_type

Either left_outer or inner.

Do not specify this for semantic views.

relationship_type

Either many_to_one or one_to_one.

Do not specify this for semantic views. Instead, use unique or primary keys in the logical tables to indicate that the relationship should be many-to-one or one-to-one.

Verified Queries¶

See Cortex Analyst Verified Query Repository for information on the purpose and structure of this section of the YAML file.

Custom instructions¶

For information about custom instructions, see Custom instructions in Cortex Analyst.