Cortex REST API¶
The Cortex REST API gives you access to leading frontier models from Anthropic, OpenAI, Meta, Mistral, and more through your preferred endpoint or SDK. All inference runs within the Snowflake perimeter, so your data remains secure and within your governance boundary. The quickstart below shows how to get started.
Choose your API¶
Cortex REST API supports two industry-standard API specifications. Pick the one that best fits your stack:
| | Chat Completions API | Messages API |
|---|---|---|
| Compatibility | OpenAI Chat Completions API (https://platform.openai.com/docs/api-reference/chat/create) | Anthropic Messages API (https://docs.anthropic.com/en/api/messages) |
| Endpoint | `POST /api/v2/cortex/v1/chat/completions` | `POST /api/v2/cortex/v1/messages` |
| Supported models | All models (OpenAI, Claude, Llama, Mistral, DeepSeek, Snowflake) | Claude models only |
| SDK support | OpenAI Python and JavaScript SDKs | Anthropic Python SDK |
| Best for | Most use cases; multi-model flexibility | Existing Anthropic integrations; Anthropic API parity |
Both APIs share the same authentication, model catalog, and rate limits. The only difference is the request/response format and which models each endpoint supports. For pricing, see the Snowflake Service Consumption Table.
Quickstart¶
Prerequisites¶
Before you begin, you need:
Chat Completions quickstart¶
The Chat Completions API follows the OpenAI specification. You can use the OpenAI SDK directly.
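As a minimal sketch that uses only the Python standard library rather than the SDK, you can call the endpoint directly. The path and headers follow the Chat Completions API reference later in this document; `<account-identifier>` and `<SNOWFLAKE_PAT>` are placeholders you must replace before sending:

```python
import json
import urllib.request

# Placeholders: substitute your account identifier and PAT before running.
BASE_URL = "https://<account-identifier>.snowflakecomputing.com/api/v2/cortex/v1"
TOKEN = "<SNOWFLAKE_PAT>"

def build_chat_request(model: str, messages: list) -> urllib.request.Request:
    """Build a POST request for the Chat Completions endpoint."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )

req = build_chat_request(
    "claude-sonnet-4-5",
    [{"role": "user", "content": "What is Snowflake Cortex?"}],
)
# Send with urllib.request.urlopen(req) once real credentials are in place.
```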
In the preceding examples, replace the following:
<account-identifier>: Your Snowflake account identifier.
<SNOWFLAKE_PAT>: Your Snowflake programmatic access token (PAT).
model: The model name. See Model availability for supported models.
Messages API quickstart¶
The Messages API follows the Anthropic specification and supports Claude models only.
The Anthropic SDK sends credentials via x-api-key by default, but Snowflake expects a Bearer token.
Use an httpx client to set the correct authorization header.
Like Python, override the default auth header with a Bearer token via defaultHeaders.
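If you prefer not to configure an SDK auth override, a standard-library sketch of the raw request shows the required headers. The path and the `anthropic-version` header follow the Messages API reference later in this document; the account identifier and PAT are placeholders:

```python
import json
import urllib.request

BASE_URL = "https://<account-identifier>.snowflakecomputing.com/api/v2/cortex/v1"

def build_messages_request(model: str, max_tokens: int, messages: list) -> urllib.request.Request:
    """Build a POST request for the Messages endpoint with Bearer auth (not x-api-key)."""
    body = json.dumps(
        {"model": model, "max_tokens": max_tokens, "messages": messages}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/messages",
        data=body,
        headers={
            "Authorization": "Bearer <SNOWFLAKE_PAT>",  # Snowflake expects Bearer auth
            "Content-Type": "application/json",
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )

req = build_messages_request(
    "claude-sonnet-4-5", 1024, [{"role": "user", "content": "Hello"}]
)
```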
In the preceding examples, replace the following:
<account-identifier>: Your Snowflake account identifier.
<SNOWFLAKE_PAT>: Your Snowflake programmatic access token (PAT).
model: The Claude model name. See Model availability for supported models.
Set up authentication¶
To authenticate to the Cortex REST API, use one of the methods described in Authenticating Snowflake REST APIs with Snowflake.
Set the Authorization header to include your token (for example, a JSON web token (JWT), OAuth token, or
programmatic access token).
Tip
Consider creating a dedicated user for Cortex REST API requests.
Model availability¶
The following tables show the models available in the Cortex REST API for each region:
Cross-cloud and cross-region deployments: availability varies by model across the cross-cloud (any region) deployment and the AWS Global, AWS US, AWS EU, AWS APJ, Azure Global, Azure US, and Azure EU cross-region deployments.
Single-region deployments (US): availability varies by model across AWS US West 2 (Oregon), AWS US East 1 (N. Virginia), and Azure East US 2 (Virginia).
Single-region deployments (EU): availability varies by model across AWS Europe Central 1 (Frankfurt), AWS Europe West 1 (Ireland), and Azure West Europe (Netherlands).
Single-region deployments (APJ): availability varies by model across AWS AP Southeast 2 (Sydney) and AWS AP Northeast 1 (Tokyo).
An asterisk (*) indicates a preview model. Preview features are not suitable for production workloads.
You can also use any fine-tuned model in any supported region.
Features¶
Streaming¶
Both APIs support streaming responses using server-sent events (https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events).
Chat Completions streaming¶
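Set `"stream": true` in the request and consume the response line by line. A minimal sketch of parsing the stream: each `data:` line carries a JSON chunk in the OpenAI streaming format, terminated by a `[DONE]` sentinel. The `fake_stream` lines below are illustrative stand-ins for real response data:

```python
import json

def iter_sse_chunks(lines):
    """Parse 'data: {...}' server-sent-event lines into chunk dicts."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # OpenAI-style stream terminator
            break
        yield json.loads(payload)

fake_stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(c["choices"][0]["delta"]["content"] for c in iter_sse_chunks(fake_stream))
```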
Messages API streaming¶
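The Messages API streams typed events instead of bare chunks. A sketch that collects text deltas, with event shapes following the Anthropic streaming specification and sample lines that are illustrative only:

```python
import json

def iter_text_deltas(lines):
    """Yield text fragments from Anthropic-style SSE 'data:' lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip 'event:' lines and blanks
        event = json.loads(line[len("data:"):].strip())
        if (event.get("type") == "content_block_delta"
                and event.get("delta", {}).get("type") == "text_delta"):
            yield event["delta"]["text"]

sample = [
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "Hi"}}',
    "event: message_stop",
    'data: {"type": "message_stop"}',
]
text = "".join(iter_text_deltas(sample))
```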
Tool calling¶
Tool calling lets the model invoke external functions during a conversation. The flow works in steps:
You send a request with a list of available tools.
The model decides to call one or more tools and returns the tool name and arguments.
You execute the tool on your end.
You send the tool result back, and the model generates a final response.
Tool calling is supported for OpenAI and Claude models.
Chat Completions tool calling¶
Step 1 — Send the request with tools:
The model responds with a tool_calls array:
Step 2 — Execute the tool and send the result back:
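The two steps above can be sketched end to end. The `get_weather` function and its schema are hypothetical illustrations, and the `tool_call` dict simulates the shape of the model's `tool_calls` response:

```python
import json

# Hypothetical tool definition in the Chat Completions format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real lookup

# Simulated entry from the model's tool_calls array.
tool_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_weather", "arguments": json.dumps({"city": "Oslo"})},
}

# Step 2: execute locally, then return the result as a 'tool' role message.
args = json.loads(tool_call["function"]["arguments"])
result = get_weather(**args)
followup = {"role": "tool", "tool_call_id": tool_call["id"], "content": result}
```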
Messages API tool calling¶
Step 1 — Send the request with tools:
The model responds with a tool_use content block:
Step 2 — Execute the tool and send the result back:
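The same flow in the Anthropic format can be sketched as follows. The `get_weather` tool is hypothetical, and `tool_use` simulates the content block returned by the model:

```python
# Hypothetical tool definition in the Anthropic format (name/description/input_schema).
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real lookup

# Simulated tool_use content block from the model's response.
tool_use = {"type": "tool_use", "id": "toolu_1", "name": "get_weather",
            "input": {"city": "Oslo"}}

result = get_weather(**tool_use["input"])

# Step 2: return the result as a tool_result block inside a user message.
followup = {
    "role": "user",
    "content": [
        {"type": "tool_result", "tool_use_id": tool_use["id"], "content": result}
    ],
}
```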
Structured output¶
You can request structured JSON output that conforms to a specific schema. This is supported for OpenAI and Claude
models through the Chat Completions API. For the Messages API, use the tool_use pattern to enforce structured output.
Chat Completions structured output¶
Use the response_format field with a JSON schema to constrain the model's output.
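A sketch of a request body using `response_format`; the `location` schema is a hypothetical example:

```python
request = {
    "model": "claude-sonnet-4-5",
    "messages": [
        {"role": "user", "content": "Extract the city and country from: Oslo, Norway."}
    ],
    "response_format": {
        "type": "json_schema",  # the only type Claude models accept
        "json_schema": {
            "name": "location",
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
            },
        },
    },
}
```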
Note
Claude models support only json_schema as the response format type. OpenAI models support additional
response format types as documented in the OpenAI API reference (https://platform.openai.com/docs/api-reference/chat/create).
Messages API structured output¶
The Messages API does not have a response_format field. Instead, define a tool with your desired output schema
and instruct the model to use it. The model's tool_use response will contain structured JSON matching your schema.
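A sketch of that pattern: define a tool whose `input_schema` is your desired output schema, then force its use with `tool_choice`. The `record_location` tool is hypothetical:

```python
request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "tools": [{
        "name": "record_location",
        "description": "Record the extracted location.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "country": {"type": "string"},
            },
            "required": ["city", "country"],
        },
    }],
    # Force the model to answer via the tool so its input matches the schema.
    "tool_choice": {"type": "tool", "name": "record_location"},
    "messages": [
        {"role": "user", "content": "Extract the city and country from: Oslo, Norway."}
    ],
}
```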
Image input¶
You can include images in your requests for models that support vision. Images must be provided as base64-encoded strings. Images are limited to 20 per conversation with a 20 MiB max request size.
Image input is supported for:
Claude models (claude-3-7-sonnet and newer)
OpenAI models (openai-gpt-4.1, openai-gpt-5, openai-gpt-5-chat, openai-gpt-5-mini, openai-gpt-5-nano)
Chat Completions image input¶
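Provide the image as a base64 data URL inside an `image_url` content part. The bytes below are a tiny stand-in; in practice you would read and encode your actual image file:

```python
import base64

fake_png = b"\x89PNG\r\n\x1a\n"  # stand-in bytes; read your real file instead
b64 = base64.b64encode(fake_png).decode("ascii")

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image."},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        },
    ],
}
```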
Messages API image input¶
The Messages API uses a different image format — a source block with type, media_type, and data fields
instead of a data URL.
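A sketch of the same image expressed as a Messages API content block, using the source format described above:

```python
import base64

fake_png = b"\x89PNG\r\n\x1a\n"  # stand-in bytes; read your real file instead
image_block = {
    "type": "image",
    "source": {
        "type": "base64",
        "media_type": "image/png",
        "data": base64.b64encode(fake_png).decode("ascii"),
    },
}

message = {
    "role": "user",
    "content": [image_block, {"type": "text", "text": "Describe this image."}],
}
```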
Prompt caching¶
Prompt caching lets you reuse previously processed context (such as large system prompts, documents, or conversation history) across requests, reducing latency and cost.
OpenAI models: Caching is implicit. Prompts with 1,024+ tokens are automatically cached; no request changes are needed.
Claude models: Caching is explicit. Add cache_control breakpoints to the content blocks you want cached. Only the ephemeral cache type is supported, with a 5-minute TTL and a maximum of 4 cache breakpoints per request.
Chat Completions prompt caching¶
For Claude models via Chat Completions, add cache_control to content blocks. OpenAI models are cached
automatically and do not require this field.
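A sketch of a Claude request with a cached system block; the long instruction text is a stand-in for your own reusable prompt:

```python
request = {
    "model": "claude-sonnet-4-5",
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a contracts analyst. <long reusable instructions>",
                    # Cache breakpoint; ignored by OpenAI models (implicit caching).
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize clause 4."},
    ],
}
```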
Messages API prompt caching¶
Use cache_control on system or user content blocks. Only the ephemeral cache type is supported,
with a 5-minute TTL. A maximum of 4 cache breakpoints can be set per request.
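A sketch with a cached top-level system block; the system prompt text is a stand-in:

```python
request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "<large reusable system prompt>",
            "cache_control": {"type": "ephemeral"},  # 5-minute TTL
        }
    ],
    "messages": [{"role": "user", "content": "Summarize the policy."}],
}
```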
Note
Anthropic prompt caching has a 5-minute TTL. Cached content not accessed within 5 minutes is evicted.
OpenAI prompt caching is implicit and managed automatically — no cache_control fields needed.
Thinking and reasoning¶
Chat Completions thinking¶
For Claude models, use the reasoning object. For OpenAI reasoning models, use the reasoning_effort field
(values: minimal, low, medium, high).
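Sketches of both variants. The reasoning object for Claude follows the OpenRouter format referenced under Learn more, and the token budget shown is an illustrative value:

```python
# OpenAI reasoning model: a single effort knob.
openai_request = {
    "model": "openai-gpt-5",
    "messages": [{"role": "user", "content": "Plan a schema migration."}],
    "reasoning_effort": "medium",  # minimal | low | medium | high
}

# Claude model: an OpenRouter-style reasoning object with a token budget.
claude_request = {
    "model": "claude-sonnet-4-5",
    "messages": [{"role": "user", "content": "Plan a schema migration."}],
    "reasoning": {"max_tokens": 2000},
}
```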
Messages API thinking¶
Some Claude models support adaptive thinking, where the model adjusts how much reasoning it applies based on task complexity. The following models support adaptive thinking:
claude-opus-4-6
For the Messages API, use the thinking parameter with type: "adaptive" to enable adaptive thinking. The output_config.effort parameter provides some high-level control over the thinking depth, and accepts the following values:
| Effort level | Behavior |
|---|---|
| max | Always thinks, with no constraints on thinking depth. Claude Opus 4.6 only. |
| high | Always thinks. Provides deep reasoning on complex tasks. |
| medium | Moderate thinking. May skip thinking for very simple queries. |
| low | Minimizes thinking. Skips thinking for simple tasks where speed matters most. |
The following examples demonstrate how to make a Messages API call with adaptive thinking enabled:
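As a sketch of such a call (request body only; the effort value is illustrative):

```python
request = {
    "model": "claude-opus-4-6",
    "max_tokens": 2048,
    "thinking": {"type": "adaptive"},      # let the model decide when to think
    "output_config": {"effort": "high"},   # high-level control of thinking depth
    "messages": [
        {"role": "user", "content": "Design a rollout plan for a schema change."}
    ],
}
```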
The response includes thinking blocks with summarized thinking and thinking signatures. Pass these blocks back in multi-turn conversations to maintain reasoning context:
For a full description of the Messages API support for Adaptive Thinking, see Claude API Docs -- Adaptive thinking (https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking).
Beta features (Messages API)¶
The Messages API supports Anthropic beta features via the anthropic-beta header. Pass one or more beta header
values as a comma-separated string.
Supported beta features include: token-efficient tools, interleaved thinking, output tokens up to 128K, developer mode for raw thinking on Claude 4+ models, the 1 million token context window, context management, the effort parameter for thinking, the tool search tool, and tool use examples. Each is enabled by its corresponding anthropic-beta header value.
The following example enables the 1 million token context window with claude-sonnet-4-6:
You can combine multiple beta features by passing a comma-separated string:
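As a sketch, the headers for both cases look like the following. The `anthropic-beta` values shown (`context-1m-2025-08-07`, `token-efficient-tools-2025-02-19`) are the values published in Anthropic's documentation and are assumptions here:

```python
base_headers = {
    "Authorization": "Bearer <SNOWFLAKE_PAT>",
    "Content-Type": "application/json",
    "anthropic-version": "2023-06-01",
}

# Enable the 1M-token context window.
one_beta = dict(base_headers, **{"anthropic-beta": "context-1m-2025-08-07"})

# Combine multiple beta features with a comma-separated string.
two_betas = dict(
    base_headers,
    **{"anthropic-beta": "context-1m-2025-08-07,token-efficient-tools-2025-02-19"},
)
```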
Chat Completions API reference¶
POST /api/v2/cortex/v1/chat/completions¶
Generates a chat completion using the specified model. The request and response format follows the OpenAI Chat Completions API specification (https://platform.openai.com/docs/api-reference/chat/create).
Required headers¶
Authorization: Bearer token
Authorization for the request. token is a JSON Web Token (JWT), OAuth token, or programmatic access token. For details, see Authenticating Snowflake REST APIs with Snowflake.
Content-Type: application/json
Specifies that the body of the request is in JSON format.
Optional headers¶
X-Snowflake-Authorization-Token-Type: type
Defines the type of the authorization token. If you omit the X-Snowflake-Authorization-Token-Type header, Snowflake determines the token type by inspecting the token. Even though this header is optional, you can choose to specify it. You can set the header to one of the following values:
Accept: application/json, text/event-stream
Specifies that the response will contain JSON (for errors) or server-sent events.
Required JSON fields¶
| Field | Type | Description |
|---|---|---|
| `model` | string | The model to use (see Model availability). You may also use the fully-qualified name of any fine-tuned model. |
| `messages` | array | An array of message objects representing the conversation. Each message must have a `role` and a `content` field. |
Commonly used optional JSON fields¶
| Field | Type | Default | Description |
|---|---|---|---|
| `max_completion_tokens` | integer | 4096 | Maximum tokens in the response. The theoretical maximum is 131,072; each model has its own output limit. |
| `temperature` | number | Varies by model | Controls randomness. Values from 0 to 2. |
| `top_p` | number | 1.0 | Controls diversity via nucleus sampling. |
| `stream` | boolean | false | Whether to stream back partial progress as server-sent events. |
| `tools` | array | null | A list of tools the model may call. |
| `tool_choice` | string or object | | Controls how the model selects tools. |
| `response_format` | object | null | Constrains the output format. Use `json_schema` with a JSON schema. |
| `reasoning_effort` | string | null | For OpenAI reasoning models. Values: `minimal`, `low`, `medium`, `high`. |
| `reasoning` | object | null | For Claude models. Uses the OpenRouter `reasoning` format. |
See the detailed compatibility chart for the full list of supported fields per model family.
Status codes¶
- 200
OK Request completed successfully.
- 400
invalid options object An optional argument has an invalid value.
- 400
unknown model model_name The specified model does not exist.
- 400
schema validation failed The response schema structure is incorrect.
- 400
max tokens of count exceeded The request exceeded the maximum number of tokens supported by the model.
- 400
all requests were throttled by remote service The request has been throttled. Try again later.
- 402
budget exceeded The model consumption budget was exceeded.
- 403
Not Authorized The account is not enabled for the REST API, or the calling user's default role does not have the snowflake.cortex_user database role.
- 429
too many requests The usage quota has been exceeded. Try again later.
- 503
inference timed out The request took too long.
Limitations¶
If unset, max_completion_tokens defaults to 4096. Each model has its own output token limit.
Tool calling is supported for OpenAI and Claude models only.
Audio is not supported.
Image understanding is supported for OpenAI and Claude models only. Images are limited to 20 per conversation, with a 20 MiB maximum request size.
Only Claude models support ephemeral cache control points for prompt caching. OpenAI models support implicit caching.
Only Claude models support returning reasoning details in the response.
max_tokens is deprecated. Use max_completion_tokens instead.
Error messages are generated by Snowflake, not by the model provider.
Detailed compatibility chart¶
The following tables summarize which request and response fields are supported when using the Chat Completions API with different Snowflake-hosted model families.
The chart compares each request field, response field, and request header across three model families: OpenAI models, Claude models, and other models. Every entry is marked Supported, Ignored, or Error (the request is rejected). Key points:
Core request fields (model, messages, stream, temperature, top_p, tools) and the standard response fields are supported for all three families.
max_completion_tokens is supported for every family, with a 4096 default and a 131,072 maximum. max_tokens is deprecated and returns an error for every family.
cache_control is supported only for Claude models (ephemeral only) and is ignored elsewhere. Usage reporting includes cache reads for OpenAI models, and both cache reads and writes for Claude models.
The reasoning field uses the OpenRouter format and applies only to Claude models. Many OpenAI-specific request fields are ignored by Claude models and other model families.
Message roles other than user, assistant, and system are accepted only by OpenAI models.
Beyond the basic request headers, provider-specific headers are not supported.
Learn more¶
For additional usage examples, see the OpenAI Chat Completions API reference (https://platform.openai.com/docs/guides/completions/) or the OpenAI Cookbook (https://cookbook.openai.com/).
In addition to providing compatibility with the Chat Completions API, Snowflake supports OpenRouter-compatible features for Claude models. These features are exposed as extra fields on the request:
For prompt caching, use the cache_control field. See the OpenRouter prompt caching documentation (https://openrouter.ai/docs/features/prompt-caching).
For reasoning tokens, use the reasoning field. See the OpenRouter reasoning documentation (https://openrouter.ai/docs/use-cases/reasoning-tokens).
Messages API reference¶
POST /api/v2/cortex/v1/messages¶
Generates a response using a Claude model. The request and response format follows the Anthropic Messages API specification (https://docs.anthropic.com/en/api/messages).
Note
The Messages API supports Claude models only. For other models, use the Chat Completions API.
Required headers¶
Authorization: Bearer token
Authorization for the request. token is a JSON Web Token (JWT), OAuth token, or programmatic access token. For details, see Authenticating Snowflake REST APIs with Snowflake.
Content-Type: application/json
Specifies that the body of the request is in JSON format.
anthropic-version: 2023-06-01
Required Anthropic API version header.
Optional headers¶
X-Snowflake-Authorization-Token-Type: type
Defines the type of the authorization token. If you omit the X-Snowflake-Authorization-Token-Type header, Snowflake determines the token type by inspecting the token. Even though this header is optional, you can choose to specify it. You can set the header to one of the following values:
anthropic-beta: feature
Enables beta features. Only Bedrock-compatible beta headers are supported.
Required JSON fields¶
| Field | Type | Description |
|---|---|---|
| `model` | string | The Claude model to use (see Model availability). |
| `max_tokens` | integer | The maximum number of tokens to generate. |
| `messages` | array | An array of message objects. Each message has a `role` and `content`. |
Supported features¶
The Messages API supports the standard Anthropic Messages API feature set for Claude models, including:
Text generation and multi-turn conversations
Streaming ("stream": true)
System messages (via the top-level system field)
Tool calling (Anthropic format with name, description, and input_schema)
Image input (base64 source blocks)
Prompt caching (cache_control on content blocks)
Extended thinking (the thinking parameter with budget_tokens)
For full request and response schema details, see the Anthropic Messages API documentation (https://docs.anthropic.com/en/api/messages).
Limitations¶
Claude models only. OpenAI, Llama, Mistral, and other models are not available through this endpoint.
No flex processing or priority tier. The service_tier field is not supported.
Bedrock beta headers only. Only Bedrock-compatible anthropic-beta header values are supported.
Error messages are generated by Snowflake, not by Anthropic.
Status codes¶
- 200
OK Request completed successfully.
- 400
invalid_request_error The request body is malformed or contains invalid values.
- 400
unknown model model_name The specified model does not exist or is not a Claude model.
- 402
budget exceeded The model consumption budget was exceeded.
- 403
Not Authorized Account not enabled for the REST API, or the default role does not have the snowflake.cortex_user database role.
- 429
too many requests The usage quota has been exceeded. Try again later.
- 503
inference timed out The request took too long.
Rate limits¶
To ensure high performance for all Snowflake customers, Cortex REST API requests are subject to rate limits. Requests exceeding the limits may receive an HTTP 429 response. Snowflake may occasionally adjust these limits.
The default limits in the following tables are applied per account and independently for each model. Ensure your application handles 429 responses gracefully by retrying with exponential backoff (https://platform.openai.com/docs/guides/rate-limits#retrying-with-exponential-backoff).
If you need to increase the limits, contact Snowflake Support.
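A minimal retry sketch for the backoff advice above: `send` is any callable returning a `(status, body)` pair, and the demo stub simulates two 429 responses before a success:

```python
import random
import time

def with_backoff(send, max_retries=5, base_delay=1.0):
    """Call send(); on HTTP 429, sleep exponentially (with jitter) and retry."""
    for attempt in range(max_retries):
        status, body = send()
        if status != 429:
            return status, body
        time.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)
    return status, body

# Demo stub: a fake transport that returns 429 twice, then 200.
responses = iter([(429, ""), (429, ""), (200, "ok")])
status, body = with_backoff(lambda: next(responses), base_delay=0.01)
```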
| Model | Tokens per minute (TPM) | Requests per minute (RPM) | Max output (tokens) |
|---|---|---|---|
| claude-3-5-sonnet | 300,000 | 30 | 16,384 |
| claude-3-7-sonnet | 300,000 | 30 | 16,384 |
| claude-sonnet-4-5 | 600,000 | 60 | 16,384 |
| claude-haiku-4-5 | 600,000 | 60 | 16,384 |
| claude-4-sonnet | 300,000 | 30 | 16,384 |
| claude-4-opus | 75,000 | 75 | 16,384 |
| deepseek-r1 | 100,000 | 100 | 16,384 |
| llama3.1-8b | 400,000 | 400 | 16,384 |
| llama3.1-70b | 200,000 | 200 | 16,384 |
| llama3.1-405b | 100,000 | 100 | 16,384 |
| mistral-7b | 400,000 | 400 | 16,384 |
| mistral-large2 | 200,000 | 200 | 16,384 |
| openai-gpt-4.1 | 300,000 | 30 | 16,384 |
| openai-gpt-5 | 300,000 | 30 | 16,384 |
| openai-gpt-5-chat | 300,000 | 30 | 16,384 |
| openai-gpt-5-mini | 1,000,000 | 1,000 | 16,384 |
| openai-gpt-5-nano | 5,000,000 | 5,000 | 16,384 |
Increased rate limits with cross-region inference¶
If you set up cross-region inference in your Snowflake Account, the rate limits are higher for the following models:
| Model | Tokens per minute (TPM) | Requests per minute (RPM) | Max output (tokens) |
|---|---|---|---|
| claude-3-7-sonnet | 600,000 | 60 | 16,384 |
| claude-haiku-4-5 | 600,000 | 60 | 16,384 |
| claude-sonnet-4-5 | 600,000 | 60 | 16,384 |
| claude-4-sonnet | 1,200,000 | 200 | 16,384 |
| claude-4-opus | 150,000 | 50 | 16,384 |
| llama3.1-8b | 800,000 | 400 | 16,384 |
| llama3.1-70b | 400,000 | 200 | 16,384 |
| llama3.1-405b | 200,000 | 100 | 16,384 |
Troubleshooting rate limit events¶
Exceeding either the TPM or the RPM limit results in a 429 response code. If your REST API usage is below the requests-per-minute limit but you still receive 429 responses, check your token usage rate.
Cortex REST API implements rate limits using the Sliding Window Counter (https://blog.cloudflare.com/counting-things-a-lot-of-different-things/#sliding-windows-to-the-rescue) pattern. The counters are stored in a highly-available Redis cluster only accessible by Snowflake Cortex within Snowflake's private network.
The sliding-window counter assumes that client traffic to the API in the previous time window is uniformly distributed. When traffic is spiky, this assumption can overestimate the request rate, but it recovers quickly because the window is short. Contact Snowflake Support if you are affected by this overestimation and want to increase the limits.
Known issues¶
Session token expiration¶
We recommend authenticating with one of the three methods defined in Authenticating Snowflake REST APIs with Snowflake. However, if you choose to authenticate with a Snowflake session token, you must handle token refresh to ensure uninterrupted API access.
Session tokens expire periodically. If a request is executed with an expired session token, the REST API returns a 200 OK response that includes error code 390112. When this occurs, the operation is not performed.
To handle this behavior, your application should:
Check each API response for error code 390112, even when the HTTP status code is 200 OK.
When error code 390112 is detected, refresh the session token and retry the request.
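A sketch of that check; the `code` field name in the body follows common Snowflake response envelopes and is an assumption here:

```python
import json

EXPIRED_SESSION_CODE = "390112"

def needs_token_refresh(status: int, body: str) -> bool:
    """Return True for the expired-session case: HTTP 200 OK carrying code 390112."""
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return str(payload.get("code")) == EXPIRED_SESSION_CODE

# Expired session token: 200 OK, but the operation was not performed.
expired = needs_token_refresh(200, '{"code": "390112", "message": "session expired"}')
healthy = needs_token_refresh(200, '{"choices": []}')
```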
Note
This behavior only affects applications using Snowflake session tokens. If you authenticate using key pair authentication, OAuth, or programmatic access tokens (PATs), you do not need to implement this error handling.
Cost considerations¶
Snowflake Cortex REST API requests incur compute costs based on the number of tokens processed. Refer to the Snowflake Service Consumption Table for each model's cost in dollars per million tokens.
A token is the smallest unit of text processed by Snowflake Cortex LLM functions, approximately equal to four characters of text. The equivalence of raw input or output text to tokens can vary by model.
Both input and output tokens incur compute cost. If you use the API to provide a conversational or chat user experience, all previous prompts and responses are processed to generate each new response, with corresponding costs.
