Cortex LLM REST API

Snowflake Cortex LLM Functions provide natural language processing features powered by a variety of large language models (LLMs) using SQL or Python. For more information, see Large Language Model (LLM) Functions (Snowflake Cortex).

You can use the Snowflake Cortex LLM REST API to invoke inference with the LLM of your choice. You can make requests using any programming language that can make HTTP POST requests, bringing state-of-the-art AI capabilities to your applications. Using this API doesn’t require a warehouse.

The Cortex LLM REST API streams generated tokens back to the user as server-sent events (https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events).

Cost considerations

Snowflake Cortex LLM REST API requests incur compute costs based on the number of tokens processed. Refer to the Snowflake Service Consumption Table for each function’s cost in credits per million tokens. A token is the smallest unit of text processed by Snowflake Cortex LLM functions, approximately equal to four characters of text. The equivalence of raw input or output text to tokens can vary by model.

The COMPLETE function generates new text given an input prompt. Both input and output tokens incur compute cost. If you use COMPLETE to provide a conversational or chat user experience, all previous prompts and responses are processed to generate each new response, with corresponding costs.
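As a rough illustration of the pricing model described above, you can estimate a request’s cost from character counts. The credit rate below is a placeholder; look up each model’s actual rate in the Snowflake Service Consumption Table.

```python
def estimate_cost(prompt: str, expected_output_chars: int,
                  credits_per_million_tokens: float) -> float:
    """Rough cost estimate using the ~4 characters per token approximation.

    credits_per_million_tokens is a placeholder; the actual per-model rate
    is listed in the Snowflake Service Consumption Table.
    """
    tokens = (len(prompt) + expected_output_chars) / 4
    return tokens / 1_000_000 * credits_per_million_tokens

# Example: 2,000 prompt characters plus 2,000 expected output characters
# at a hypothetical rate of 0.5 credits per million tokens.
print(estimate_cost("x" * 2000, 2000, 0.5))  # prints 0.0005
```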

Model availability

The following models are available in the Cortex LLM REST API through the COMPLETE function. Availability varies by region; the supported regions are AWS US West 2 (Oregon), AWS US East 1 (N. Virginia), AWS Europe Central 1 (Frankfurt), AWS Europe West 1 (Ireland), AWS AP Southeast 2 (Sydney), AWS AP Northeast 1 (Tokyo), Azure East US 2 (Virginia), Azure West Europe (Netherlands), and AWS cross-region inference.

  • COMPLETE (claude-3-5-sonnet)

  • COMPLETE (llama3.1-8b)

  • COMPLETE (llama3.1-70b)

  • COMPLETE (llama3.1-405b)

  • COMPLETE (llama3.2-1b)

  • COMPLETE (llama3.2-3b)

  • COMPLETE (deepseek-r1)

  • COMPLETE (mistral-7b)

  • COMPLETE (mistral-large)

  • COMPLETE (mistral-large2)

  • COMPLETE (snowflake-llama-3.3-70b)

You can also use any fine-tuned model in any supported region.

Usage quotas

To ensure high performance standards for all Snowflake customers, Snowflake Cortex LLM REST API requests are subject to usage quotas. Requests exceeding the quotas might be throttled, and Snowflake might occasionally adjust these quotas. The quotas in the following table are applied per account, independently for each model.

Function (Model)              Tokens Processed     Requests per     Max output
                              per Minute (TPM)     Minute (RPM)     (tokens)
COMPLETE (deepseek-r1)        100,000              50               4,096
COMPLETE (llama3.1-8b)        400,000              200              4,096
COMPLETE (llama3.1-70b)       200,000              100              4,096
COMPLETE (llama3.1-405b)      100,000              50               4,096
COMPLETE (mistral-7b)         400,000              200              4,096
COMPLETE (mistral-large2)     200,000              100              4,096

COMPLETE endpoint

The /api/v2/cortex/inference:complete endpoint executes the SQL COMPLETE function. It takes the form:

POST https://<account_identifier>.snowflakecomputing.cn/api/v2/cortex/inference:complete

where account_identifier is the account identifier you use to access Snowsight.

Note

Currently, only the COMPLETE function is supported. Additional functions may be supported in a future version of the Cortex LLM REST API.

Setting up authentication

Authenticating to the Cortex LLM REST API uses key-pair authentication. This requires creating an RSA key pair and assigning its public key to a user, which must be done using the SECURITYADMIN role (or a role to which SECURITYADMIN has been granted, such as ACCOUNTADMIN). For step-by-step instructions, see Configuring key-pair authentication.

Tip

Consider creating a dedicated user for Cortex LLM REST API requests.

To make API requests, use the private key to create a JSON Web token (https://jwt.io/) (JWT) and pass it in the headers of the request.

Setting up authorization

Once you have created a key pair and assigned its public key to a user, that user’s default role needs to have the snowflake.cortex_user database role, which contains the privileges to use the LLM functions. In most cases, users already have this privilege, because it is granted to the PUBLIC role automatically, and all roles inherit PUBLIC.

If your Snowflake administrator prefers to opt in individual users, they might have revoked snowflake.cortex_user from PUBLIC. In that case, grant the role to each user who should be able to use the Cortex LLM REST API, as follows.

GRANT DATABASE ROLE snowflake.cortex_user TO ROLE MY_ROLE;
GRANT ROLE MY_ROLE TO USER MY_USER;

Important

REST API requests use the user’s default role, so that role must have the necessary privileges. You can change a user’s default role with ALTER USER … SET DEFAULT_ROLE.

ALTER USER MY_USER SET DEFAULT_ROLE = MY_ROLE;

Submitting requests

You make a request to the Cortex LLM REST API by POSTing to the API’s REST endpoint. The Authorization header must contain a JSON Web token generated using your private key, which you can create with SnowSQL via the following command. The generated JWT expires after one hour.

snowsql -a <account_identifier> -u <user> --private-key-path <path>/rsa_key.p8 --generate-jwt

The body of the request is a JSON object that specifies the model, the prompt or conversation history, and options. See the following API Reference for details.
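As a sketch, the request can be assembled with only the Python standard library. The account identifier, JWT, and prompt below are placeholders; the headers match those listed in the API Reference.

```python
import json
import urllib.request

def build_complete_request(account_identifier: str, jwt: str,
                           model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for the COMPLETE endpoint."""
    url = (f"https://{account_identifier}.snowflakecomputing.cn"
           "/api/v2/cortex/inference:complete")
    body = json.dumps({
        "model": model,
        # A single user message; role may be omitted and defaults to "user".
        "messages": [{"content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "X-Snowflake-Authorization-Token-Type": "KEYPAIR_JWT",
            "Authorization": f"Bearer {jwt}",
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
    )

req = build_complete_request("myorg-myaccount", "<jwt>", "mistral-7b", "Hello")
# Send with: resp = urllib.request.urlopen(req)  # then read SSE lines from resp
```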

API Reference

POST /api/v2/cortex/inference:complete

Completes a prompt or conversation using the specified large language model. The body of the request is a JSON object containing the arguments.

This endpoint corresponds to the COMPLETE SQL function.

Required headers

X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT

Defines the type of authorization token.

Authorization: Bearer jwt.

Authorization for the request. jwt is a valid JSON Web token.

Content-Type: application/json

Specifies that the body of the request is in JSON format.

Accept: application/json, text/event-stream

Specifies that the response will either contain JSON (error case) or server-sent events.

Required JSON arguments

Argument

Type

Description

model

string

The identifier of the model to use (see Choosing a model). For possible values, see Model availability.

Alternatively, you may use the fully qualified name of any fine-tuned model in the format database.schema.model.

Note

claude-3-5-sonnet is not available with COMPLETE Structured Outputs.

messages

array

The prompt or conversation history to be used to generate a completion. An array of objects representing a conversation in chronological order. Each object must contain a content key and may also contain a role key.

  • content: A string containing a system message, a prompt from the user, or a previous response from the model.

  • role: A string indicating the role of the message, one of 'system', 'user', or 'assistant'.

See the COMPLETE roles table for a more detailed description of these roles.

For prompts consisting of a single user message, role may be omitted; it is then assumed to be user.
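For example, a multi-turn conversation and a minimal single-prompt messages array might look like this (the message text is illustrative):

```python
# A conversation history in chronological order, using the roles
# described above. The content strings are illustrative.
messages = [
    {"role": "system", "content": "You are a helpful SQL assistant."},
    {"role": "user", "content": "What is a lateral join?"},
    {"role": "assistant",
     "content": "A lateral join lets a subquery reference columns "
                "from preceding tables in the FROM clause."},
    {"role": "user", "content": "Show me an example."},
]

# For a single user message, role may be omitted entirely:
single = [{"content": "What is a lateral join?"}]
```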

Optional JSON arguments

Argument

Type

Default

Description

top_p

number

1.0

A value from 0 to 1 (inclusive) that controls the diversity of the language model by restricting the set of possible tokens that the model outputs.

temperature

number

0.0

A value from 0 to 1 (inclusive) that controls the randomness of the output of the language model by influencing which possible token is chosen at each step.

max_tokens

integer

4096

A value between 1 and 4096 (inclusive) that controls the maximum number of tokens to output. Output is truncated after this number of tokens.

Tools configuration

The tools field contains an array of available tools. Each entry is an object with a tool_spec key, whose fields are shown in the following table:

Tool specifications

Field

Type

Description

tool_spec.type

string

The type of tool. A combination of type and name is a unique identifier.

tool_spec.name

string

The name of the tool. A combination of type and name is a unique identifier.

tool_spec.description

string

A description of the tool being considered for tool use.

tool_spec.input_schema

object

The JSON schema for the tool input.

tool_spec.input_schema.type

string

The type of the input schema object.

tool_spec.input_schema.properties

object

The definitions for each input parameter.

tool_spec.input_schema.required

array

The list of required input parameters.
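Putting these fields together, a hypothetical get_weather tool could be declared as follows (the tool name and parameters are illustrative):

```python
# A "tools" array entry assembled from the tool_spec fields above.
# The get_weather tool and its parameters are illustrative.
weather_tool = {
    "tool_spec": {
        "type": "generic",
        "name": "get_weather",  # type + name together uniquely identify the tool
        "description": "Get the current weather for a location.",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                }
            },
            "required": ["location"],
        },
    }
}
```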

The tool_choice field configures the tool selection behavior. It has the following fields:

Tool choice

Field

Type

Description

type

string

The manner in which the tools are selected.

Valid values:

  • "auto": The model decides whether or not to call the tools that you’ve provided.

  • "required": Use one or more tools.

  • "tool": Use the tools that you’ve specified.

name

array

The names of the tools to use. Only valid when type is "tool".

tool_use represents a model’s request to use a specific tool. It contains the tool identifier and input parameters for the execution. It has the following fields:

Tool use

Field

Type

Description

type

string

Identifies this as a tool use request.

tool_use

object

Container for tool use request details.

Each tool_use object contains the following keys:

Tool use request keys

Field

Type

Description

Required

tool_use_id

string

Unique identifier for this tool use request.

Yes

name

string

The name of the tool being used.

Yes

input

object

The input parameters passed to the tool, conforming to the tool’s input_schema.

Yes

Tool results

Represents the results of a tool execution. Contains both the input parameters and output results from the tool execution.

Tool results

Field

Type

Description

type

string

Identifies this as a tool result.

tool_results

object

Container for tool execution result details.

Each result can contain the following keys:

Tool results keys

Field

Type

Description

Required

tool_use_id

string

Unique identifier linking this result to its corresponding tool use request.

Yes

name

string

The name of the tool that was run. It must match the tool name from the tools array.

Yes

content

array

Array of content elements produced by the tool execution.

Yes

id

string

Optional identifier for the tool execution instance.

No

status

string

Status indicators for the tool execution.

No

results

object

Additional result data in an arbitrary structure; additional properties are allowed.

Yes
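Assembled from the keys above, a follow-up message that returns a tool’s output to the model might look like this (IDs and values are illustrative):

```python
# A follow-up user message that returns a tool's output to the model,
# using the tool-result keys above. IDs and values are illustrative.
tool_result_message = {
    "role": "user",
    "content_list": [
        {
            "type": "tool_results",
            "tool_results": {
                # Must match the tool_use_id from the model's request.
                "tool_use_id": "tooluse_example123",
                "name": "get_weather",
                "content": [
                    {"type": "text", "text": "\"temperature\": \"69 fahrenheit\""}
                ],
            },
        }
    ],
}
```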

Output

Tokens are sent as they are generated using server-sent events (SSEs). Each SSE event uses the message type and contains a JSON object with the following structure.

Key

Value type

Description

'id'

string

Unique ID of the request, the same value for all events sent in response to the request.

'created'

number

UNIX timestamp (seconds since January 1, 1970 00:00:00 UTC) when the response was generated.

'model'

string

Identifier of the model.

'choices'

array

The model’s responses. Each response is an object containing a 'delta' key whose value is an object, whose 'content' key contains the new tokens generated by the model. Currently, only one response is provided.

Status codes

The Snowflake Cortex LLM REST API uses the following HTTP status codes to indicate successful completion or various error conditions.

200 OK

Request completed successfully. The body of the response contains the output of the model.

400 invalid options object

The optional arguments have invalid values.

400 unknown model model_name

The specified model does not exist.

400 schema validation failed

Errors related to incorrect response schema structure. Correct the schema and try again.

400 max tokens of count exceeded

The request exceeded the maximum number of tokens supported by the model (see Model restrictions).

400 all requests were throttled by remote service

The request has been throttled due to a high level of usage. Try again later.

402 budget exceeded

The model consumption budget was exceeded.

403 Not Authorized

Account not enabled for REST API, or the default role for the calling user does not have the snowflake.cortex_user database role.

429 too many requests

The request was rejected because the usage quota has been exceeded. Please try your request later.

503 inference timed out

The request took too long.
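Since 429 responses indicate the usage quota has been exceeded, clients typically retry with backoff. The following is a minimal sketch; send_request is a hypothetical stand-in for your actual HTTP call:

```python
import time

def call_with_backoff(send_request, max_retries: int = 5, base_delay: float = 1.0):
    """Retry a request on HTTP 429 (quota exceeded) with exponential backoff.

    send_request is a hypothetical stand-in for your actual HTTP call;
    it should return a (status_code, body) pair.
    """
    for attempt in range(max_retries):
        status, body = send_request()
        if status != 429:
            return status, body
        time.sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...
    raise RuntimeError("usage quota still exceeded after retries")
```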

Basic example

The following example uses curl to make a COMPLETE request. Replace jwt, prompt, and account_identifier with the appropriate values in this command.

curl -X POST \
    -H 'X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT' \
    -H "Authorization: Bearer <jwt>" \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json, text/event-stream' \
    -d '{
    "model": "mistral-large",
    "messages": [
        {
            "content": "<prompt>"
        }
    ],
    "top_p": 0,
    "temperature": 0
    }' \
https://<account_identifier>.snowflakecomputing.cn/api/v2/cortex/inference:complete

Output

data: {
data:  "id": "65c5e2ac-529b-461e-8a8c-f80655e6bd3f",
data:  "created": 1723493954,
data:  "model": "mistral-7b",
data:  "choices": [
data:    {
data:      "delta": {
data:        "content": "Cor"
data:        }
data:      }
data:     ],
data:  "usage": {
data:    "prompt_tokens": 57,
data:    "completion_tokens": 1,
data:    "total_tokens": 58
data:  }
data: }

data: {
data:  "id": "65c5e2ac-529b-461e-8a8c-f80655e6bd3f",
data:  "created": 1723493954,
data:  "model": "mistral-7b",
data:  "choices": [
data:    {
data:      "delta": {
data:        "content": "tex"
data:        }
data:      }
data:     ],
data:  "usage": {
data:    "prompt_tokens": 57,
data:    "completion_tokens": 2,
data:    "total_tokens": 59
data:  }
data: }
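A client reassembles the response by concatenating each event’s delta content ("Cor" + "tex" above). The following minimal parser handles SSE payloads that span multiple data: lines, as in this output:

```python
import json

def parse_sse(lines):
    """Yield parsed JSON objects from server-sent event lines.

    Per the SSE format, an event's payload may span several "data:" lines;
    a blank line terminates the event.
    """
    buf = []
    for line in lines:
        if line.startswith("data:"):
            buf.append(line[len("data:"):].strip())
        elif not line.strip() and buf:
            yield json.loads("\n".join(buf))
            buf = []
    if buf:  # final event with no trailing blank line
        yield json.loads("\n".join(buf))

def accumulate(events):
    """Concatenate the new tokens from each event's first choice."""
    return "".join(e["choices"][0]["delta"].get("content", "") for e in events)
```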

Tool calling with chain of thought example

The following example uses curl to make COMPLETE requests in a chain of thought process. In this case, the tool is used to get the weather information for San Francisco, CA.

Request

curl -X POST \
    -H 'X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT' \
    -H "Authorization: Bearer <jwt>" \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json, text/event-stream' \
    -d '{
    "model": "claude-3-5-sonnet",
    "messages": [
      {
        "role": "user",
        "content": "What is the weather like in San Francisco?"
      }
    ],
    "tools": [
      {
        "tool_spec": {
          "type": "generic",
          "name": "get_weather",
          "input_schema": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA"
              }
            },
            "required": [
              "location"
            ]
          }
        }
      }
    ],
    "max_tokens": 4096,
    "top_p": 1,
    "stream": true
    }' \
https://<account_identifier>.snowflakecomputing.cn/api/v2/cortex/inference:complete

Response

data: {"id":"78fe5630-95b1-4960-ac2b-7eb85536b08e","model":"claude-3-5-sonnet","choices":[{"delta":{"content":"I","content_list":[{"text":"I","type":"text"}]}}],"usage":{}}

data: {"id":"78fe5630-95b1-4960-ac2b-7eb85536b08e","model":"claude-3-5-sonnet","choices":[{"delta":{"content":"'ll","content_list":[{"text":"'ll","type":"text"}]}}],"usage":{}}

data: {"id":"78fe5630-95b1-4960-ac2b-7eb85536b08e","model":"claude-3-5-sonnet","choices":[{"delta":{"content":" help you","content_list":[{"text":" help you","type":"text"}]}}],"usage":{}}

data: {"id":"78fe5630-95b1-4960-ac2b-7eb85536b08e","model":"claude-3-5-sonnet","choices":[{"delta":{"content":" check the","content_list":[{"text":" check the","type":"text"}]}}],"usage":{}}

data: {"id":"78fe5630-95b1-4960-ac2b-7eb85536b08e","model":"claude-3-5-sonnet","choices":[{"delta":{"content":" weather in","content_list":[{"text":" weather in","type":"text"}]}}],"usage":{}}

data: {"id":"78fe5630-95b1-4960-ac2b-7eb85536b08e","model":"claude-3-5-sonnet","choices":[{"delta":{"content":" San Francisco.","content_list":[{"text":" San Francisco.","type":"text"}]}}],"usage":{}}

data: {"id":"78fe5630-95b1-4960-ac2b-7eb85536b08e","model":"claude-3-5-sonnet","choices":[{"delta":{"content_list":[{"name":"get_weather","tool_use_id":"tooluse_Iwuh-FEeTC-Iefsxu2ueKQ"}]}}],"usage":{}}

data: {"id":"78fe5630-95b1-4960-ac2b-7eb85536b08e","model":"claude-3-5-sonnet","choices":[{"delta":{"content_list":[{"input":"{\"location\""}]}}],"usage":{}}

data: {"id":"78fe5630-95b1-4960-ac2b-7eb85536b08e","model":"claude-3-5-sonnet","choices":[{"delta":{"content_list":[{"input":": \"San"}]}}],"usage":{}}

data: {"id":"78fe5630-95b1-4960-ac2b-7eb85536b08e","model":"claude-3-5-sonnet","choices":[{"delta":{"content_list":[{"input":" Francisco"}]}}],"usage":{}}

data: {"id":"78fe5630-95b1-4960-ac2b-7eb85536b08e","model":"claude-3-5-sonnet","choices":[{"delta":{"content_list":[{"input":", CA\"}"}]}}],"usage":{}}

data: {"id":"78fe5630-95b1-4960-ac2b-7eb85536b08e","model":"claude-3-5-sonnet","choices":[{"delta":{}}],"usage":{"prompt_tokens":390,"completion_tokens":53,"total_tokens":443}}

The client executes the get_weather tool locally and returns the result to the model in a follow-up request.

Follow-up Request

curl -X POST \
    -H 'X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT' \
    -H "Authorization: Bearer <jwt>" \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json, text/event-stream' \
    -d '{
    "model": "claude-3-5-sonnet",
    "messages": [
      {
        "role": "user",
        "content": "What is the weather like in San Francisco?"
      },
      {
        "role": "assistant",
        "content": "I'll help you check the weather in San Francisco.",
        "content_list": [
          {
            "type": "tool_use",
            "tool_use": {
              "tool_use_id": "tooluse_Iwuh-FEeTC-Iefsxu2ueKQ",
              "name": "get_weather",
              "input": {
                "location": "San Francisco, CA"
              }
            }
          }
        ]
      },
      {
        "role": "user",
        "content": "What is the weather like in San Francisco?",
        "content_list": [
          {
            "type": "tool_results",
            "tool_results": {
              "tool_use_id": "tooluse_Iwuh-FEeTC-Iefsxu2ueKQ",
              "name": "get_weather",
              "content": [
                {
                  "type": "text",
                  "text": "\"temperature\": \"69 fahrenheit\""
                }
              ]
            }
          }
        ]
      }
    ],
    "tools": [
      {
        "tool_spec": {
          "type": "generic",
          "name": "get_weather",
          "input_schema": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA"
              }
            },
            "required": [
              "location"
            ]
          }
        }
      }
    ],
    "max_tokens": 4096,
    "top_p": 1,
    "stream": true
    }' \
https://<account_identifier>.snowflakecomputing.cn/api/v2/cortex/inference:complete

Final Response

data: {"id":"07ffa851-4c47-4cea-9b7e-017a4cddc21d","model":"claude-3-5-sonnet","choices":[{"delta":{"content":"\n\nBase","content_list":[{"type":"text","text":"\n\nBase"}]}}],"usage":{}}

data: {"id":"07ffa851-4c47-4cea-9b7e-017a4cddc21d","model":"claude-3-5-sonnet","choices":[{"delta":{"content":"d on the weather data,","content_list":[{"type":"text","text":"d on the weather data,"}]}}],"usage":{}}

data: {"id":"07ffa851-4c47-4cea-9b7e-017a4cddc21d","model":"claude-3-5-sonnet","choices":[{"delta":{"content":" it's currently 69 ","content_list":[{"type":"text","text":" it's currently 69 "}]}}],"usage":{}}

data: {"id":"07ffa851-4c47-4cea-9b7e-017a4cddc21d","model":"claude-3-5-sonnet","choices":[{"delta":{"content":"degrees Fahrenheit in San Francisco,","content_list":[{"type":"text","text":"degrees Fahrenheit in San Francisco,"}]}}],"usage":{}}

data: {"id":"07ffa851-4c47-4cea-9b7e-017a4cddc21d","model":"claude-3-5-sonnet","choices":[{"delta":{"content":" CA.","content_list":[{"type":"text","text":" CA."}]}}],"usage":{}}

data: {"id":"07ffa851-4c47-4cea-9b7e-017a4cddc21d","model":"claude-3-5-sonnet","choices":[{"delta":{}}],"usage":{"prompt_tokens":466,"completion_tokens":26,"total_tokens":492}}

Python API

To install the Python API, use:

pip install snowflake-ml-python

The Python API is included in the snowflake-ml-python package starting with version 1.6.1.

Example

To use the Python API, first create a Snowflake session (see Creating a Session for Snowpark Python). Then call the Complete API. The REST back end is used only when stream=True is specified.

from snowflake.snowpark import Session
from snowflake.cortex import Complete

session = Session.builder.configs(...).create()

stream = Complete(
  "mistral-7b",
  "What are unique features of the Snowflake SQL dialect?",
  session=session,
  stream=True)

for update in stream:
  print(update)

Note

The streaming mode of the Python API currently doesn’t work in stored procedures and in Snowsight.
