Extracting information from documents with AI_EXTRACT¶
AI_EXTRACT is a Cortex AI Function that lets you extract structured information, such as entities, lists, and tables, from text or document files, by asking questions in natural language or by describing information to be extracted. It can be used with other functions to create custom document processing pipelines for a variety of use cases (see Cortex AI Functions: Documents).
AI_EXTRACT can process documents of various formats (in 29 languages) and extract information from both text-heavy paragraphs and content in a graphical form, such as logos, handwritten text (for example, signatures), tables, or checkmarks. AI_EXTRACT can extract information in the following structured formats:
Entity: Ask questions in natural language or describe the information to be extracted (such as city, street, or ZIP code).
List (or array): Provide a JSON schema to extract an array or list of information present in the document, such as the names of all account holders in a bank statement or a list of all addresses in a document.
Table: Provide a JSON schema to extract tabular data present in the document by specifying the table title and a list of columns that should be extracted.
AI_EXTRACT scales automatically with your workload by processing multiple documents simultaneously. Documents can be processed directly from object storage to avoid unnecessary data movement.
Note
AI_EXTRACT is currently incompatible with custom network policies.
Extraction quality¶
AI_EXTRACT uses arctic-extract, a proprietary vision-based large language model (LLM) that delivers high extraction accuracy.
The following tables present the model’s scores on standard benchmarks, with the scores of other popular models for comparison:
Visual question answering (VQA)¶
| Offering | DocVQA score |
|---|---|
| Human evaluation | 0.9811 |
| Snowflake Arctic-Extract | 0.9433 |
| Azure OpenAI GPT-o3 | 0.9339 |
| Google Gemini 2.5-Pro | 0.9316 |
| Anthropic Claude 4 Sonnet | 0.9119 |
| Azure Document Intelligence + GPT-o3 | 0.8853 |
| Google Document AI + Gemini | 0.8497 |
| AWS Textract | 0.8313 |
Text-only question answering (SQuAD v2)¶
| Offering | ANLS | Exact match |
|---|---|---|
| Snowflake Arctic-Extract | 81.18 | 78.74 |
| Anthropic Claude 4 Sonnet | 80.54 | 77.10 |
| Meta LLaMA 3.1 405B | 80.37 | 76.56 |
| Meta LLaMA 4 Scout | 74.30 | 70.70 |
| OpenAI GPT 4.1 | 70.71 | 66.81 |
| Meta LLaMA 3.1 8B | 59.13 | 54.48 |
Examples¶
These examples use the following image as the input document. The document is stored on a stage.
Extracting an entity¶
This example extracts the seller name and the offer expiration date from the Sale Agreement.
SELECT AI_EXTRACT(
  file => TO_FILE('@db.schema.stage','document.pdf'),
  responseFormat => [['seller_name', 'What is the seller name?'], ['offer_expiration_date', 'What is the offer expiration date?']]
);
Result:
{
  "error": null,
  "response": {
    "offer_expiration_date": "12/12/2023",
    "seller_name": "Paul Doyle"
  }
}
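The result above is a JSON object with an error field and a response field. As a minimal Python sketch of consuming such a result on the client side, the hypothetical helper parse_extract_result below raises when error is set and otherwise returns the extracted fields. The shape of a non-null error value is not shown in this topic, so the sketch treats it as opaque.

```python
import json


def parse_extract_result(raw: str) -> dict:
    """Return the 'response' object of an AI_EXTRACT result, or raise on error.

    parse_extract_result is a hypothetical helper name, not part of any
    Snowflake library. The non-null form of 'error' is an assumption.
    """
    result = json.loads(raw)
    if result.get("error") is not None:
        raise RuntimeError(f"AI_EXTRACT reported an error: {result['error']}")
    return result["response"]


sample = '{"error": null, "response": {"seller_name": "Paul Doyle"}}'
fields = parse_extract_result(sample)
print(fields["seller_name"])  # prints: Paul Doyle
```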
Extracting checkbox information¶
This example extracts information about items that are not included, based on the checkboxes marked in the document.
SELECT AI_EXTRACT(
  file => TO_FILE('@db.schema.stage','document.pdf'),
  responseFormat => [['flat_items', 'What items are not included with the flat?'], ['default', 'What Default is selected?']]
);
Result:
{
  "error": null,
  "response": {
    "default": "Forfeiture of Earnest Money",
    "flat_items": "dryer, security system, satellite dish, wood stove, fireplace insert, hot tub, attached speaker(s), generator, other"
  }
}
Extracting signature status¶
This example extracts information about whether the agreement has been signed.
SELECT AI_EXTRACT(
  file => TO_FILE('@db.schema.stage','document.pdf'),
  responseFormat => [['signature', 'Is this document signed?']]
);
Result:
{
  "error": null,
  "response": {
    "signature": "no"
  }
}
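Entity answers such as the checkbox and signature results above come back as plain strings. The following Python sketch shows one way to normalize them after the fact. Both helper names are hypothetical, and the sketch assumes answers shaped like the samples in this topic (a comma-separated string, and a yes/no string); actual answers may be phrased differently.

```python
from typing import List, Optional


def split_items(answer: str) -> List[str]:
    """Split a comma-separated answer into trimmed, non-empty items."""
    return [item.strip() for item in answer.split(",") if item.strip()]


def yes_no_to_bool(answer: str) -> Optional[bool]:
    """Map 'yes'/'no' (any case) to a bool; return None for anything else."""
    normalized = answer.strip().lower()
    if normalized in {"yes", "no"}:
        return normalized == "yes"
    return None


print(split_items("dryer, security system, satellite dish"))
print(yes_no_to_bool("no"))
```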
Extracting a list of entities¶
This example extracts a list of buyer names.
SELECT AI_EXTRACT(
  file => TO_FILE('@db.schema.files', 'report.pdf'),
  responseFormat => {
    'schema': {
      'type': 'object',
      'properties': {
        'buyer_list': {
          'description': 'What are the buyer names?',
          'type': 'array'
        }
      }
    }
  }
);
Result:
{
  "error": null,
  "response": {
    "buyer_list": [
      "John Davis",
      "Jane Davis"
    ]
  }
}
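When many list fields are needed, the JSON schema used in this example can be generated rather than written by hand. The following Python sketch builds the same structure from a mapping of field names to descriptions. The helper name build_list_schema is hypothetical, and how the resulting JSON is passed to AI_EXTRACT (for example, through a client library) is outside the scope of the sketch.

```python
import json
from typing import Dict


def build_list_schema(fields: Dict[str, str]) -> dict:
    """Build a responseFormat-style schema where every field is an array.

    'fields' maps each field name to its description or question,
    mirroring the 'buyer_list' entry in the example above.
    """
    properties = {
        name: {"description": description, "type": "array"}
        for name, description in fields.items()
    }
    return {"schema": {"type": "object", "properties": properties}}


schema = build_list_schema({"buyer_list": "What are the buyer names?"})
print(json.dumps(schema, indent=2))
```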
Extracting table information¶
This example extracts tabular data from the following document.
SELECT AI_EXTRACT(
  file => TO_FILE('@db.schema.files', 'report.pdf'),
  responseFormat => {
    'schema': {
      'type': 'object',
      'properties': {
        'income_table': {
          'description': 'Table 2: Granger Causality Tests - P-values',
          'type': 'object',
          'column_ordering': ['description', 'countries','lags','z','z_approx'],
          'properties': {
            'description': {
              'description': 'Description',
              'type': 'array'
            },
            'countries': {
              'description': 'Countries',
              'type': 'array'
            },
            'lags': {
              'description': 'Lags',
              'type': 'array'
            },
            'z': {
              'description': 'Z',
              'type': 'array'
            },
            'z_approx': {
              'description': 'Z approx.',
              'type': 'array'
            }
          }
        }
      }
    }
  }
);
Result:
{
  "error": null,
  "response": {
    "income_table": {
      "countries": [
        "33","80","29","84","34"
      ],
      "description": [
        "Commodity exporters",
        "Non-commodity exporters",
        "AE",
        "EMDE",
        "Large or market-dominant countries"
      ],
      "lags": [
        "2","1","1","1","1"
      ],
      "z": [
        "0.11","0.08","0.89","0.12","0.07"
      ],
      "z_approx": [
        "0.25","0.19","0.95","0.25","0.14"
      ]
    }
  }
}
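The table result above is column-oriented: each column is returned as its own array. If row-oriented records are easier to work with downstream, the columns can be zipped together. The following Python sketch does that for the sample result; the helper name table_to_rows is hypothetical, and the sketch assumes that all column arrays have the same length, as they do in the sample.

```python
from typing import Dict, List


def table_to_rows(table: Dict[str, List[str]], column_ordering: List[str]) -> List[Dict[str, str]]:
    """Turn a column-oriented table (name -> list of cells) into row dicts."""
    columns = [table[name] for name in column_ordering]
    if len({len(column) for column in columns}) > 1:
        raise ValueError("Columns have different lengths; cannot build rows.")
    return [dict(zip(column_ordering, cells)) for cells in zip(*columns)]


# Sample data copied from the result shown above.
income_table = {
    "countries": ["33", "80", "29", "84", "34"],
    "description": [
        "Commodity exporters",
        "Non-commodity exporters",
        "AE",
        "EMDE",
        "Large or market-dominant countries",
    ],
    "lags": ["2", "1", "1", "1", "1"],
    "z": ["0.11", "0.08", "0.89", "0.12", "0.07"],
    "z_approx": ["0.25", "0.19", "0.95", "0.25", "0.14"],
}
column_ordering = ["description", "countries", "lags", "z", "z_approx"]

rows = table_to_rows(income_table, column_ordering)
print(rows[0])
```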
Input requirements¶
AI_EXTRACT is optimized for both digital-born and scanned documents. The following table lists the limitations and requirements of input documents:
| Maximum file size | 100 MB |
|---|---|
| Maximum pages per document | 125 |
| Maximum questions | |
| Supported file type | PDF, PPT, PPTX, EML, DOC, DOCX, HTM, HTML, TEXT, MD, TXT, BMP, JPEG, JPG, PNG, TIFF, TIF, WEBP |
| Stage encryption | Server-side encryption |
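The limits in the table above can be checked before a file is uploaded to a stage. The following Python sketch is one hypothetical pre-flight check; the function name check_input is illustrative, the page count is supplied by the caller because counting pages depends on the file format, and the sketch assumes that 100 MB means 100 × 1024 × 1024 bytes, which this topic does not state.

```python
from pathlib import Path
from typing import List, Optional

# Limits taken from the input requirements table.
MAX_FILE_BYTES = 100 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
MAX_PAGES = 125
SUPPORTED_EXTENSIONS = {
    "pdf", "ppt", "pptx", "eml", "doc", "docx", "htm", "html", "text",
    "md", "txt", "bmp", "jpeg", "jpg", "png", "tiff", "tif", "webp",
}


def check_input(filename: str, size_bytes: int, page_count: Optional[int] = None) -> List[str]:
    """Return a list of problems found; an empty list means no limit is exceeded."""
    problems = []
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"Unsupported file type: '{extension or filename}'")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"File is larger than {MAX_FILE_BYTES} bytes")
    if page_count is not None and page_count > MAX_PAGES:
        problems.append(f"Document has more than {MAX_PAGES} pages")
    return problems


print(check_input("document.pdf", size_bytes=2_000_000, page_count=12))  # prints: []
```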
Access control requirements¶
To use the AI_EXTRACT function, a user with the ACCOUNTADMIN role must grant the SNOWFLAKE.CORTEX_USER database role to the user who will call the function. See the Cortex LLM privileges topic for details.
Cost considerations¶
The Cortex AI_EXTRACT function incurs compute costs based on the number of pages per document, input prompt tokens, and output tokens processed.
For paged file formats (PDF, DOCX, TIF, TIFF), each page is counted as 970 tokens.
For image file formats (JPEG, JPG, PNG), each individual image file is billed as a page and counted as 970 tokens.
Snowflake recommends executing queries that call the Cortex AI_EXTRACT function in a smaller warehouse (no larger than MEDIUM). Larger warehouses do not increase performance.
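As a rough illustration of the per-page token counts above, the following Python sketch estimates the page-based portion of a workload. It covers only the 970 tokens counted per page or image; input prompt tokens and output tokens are also billed, but their sizes are not specified in this topic, so they are left out. The function name is hypothetical.

```python
from typing import Iterable

TOKENS_PER_PAGE = 970  # per page of a paged document, or per image file


def estimate_page_tokens(page_counts: Iterable[int]) -> int:
    """Sum the page-based tokens for a batch of files.

    Each entry is the number of pages in one file; pass 1 for an image file.
    Prompt and output tokens are NOT included in this estimate.
    """
    return sum(pages * TOKENS_PER_PAGE for pages in page_counts)


# One 12-page PDF plus three PNG images: (12 + 1 + 1 + 1) * 970
print(estimate_page_tokens([12, 1, 1, 1]))  # prints: 14550
```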
Supported languages¶
AI_EXTRACT supports the following languages:
Arabic
Bengali
Burmese
Cebuano
Chinese
Czech
Dutch
English
French
German
Hebrew
Hindi
Indonesian
Italian
Japanese
Khmer
Korean
Lao
Malay
Persian
Polish
Portuguese
Russian
Spanish
Tagalog
Thai
Turkish
Urdu
Vietnamese
Regional availability¶
Support for AI_EXTRACT is available to accounts in the following Snowflake regions:
| AWS | Azure |
|---|---|
| US West 2 | East US 2 |
| US East 1 | West US 2 |
| US CA Central 1 | South Central US |
| Europe Central 1 | North Europe |
| Europe West 1 | West Europe |
| SA East 1 | Central India |
| AP Northeast 1 | Japan East |
| AP Southeast 2 | Southeast Asia |
| | Australia East |
AI_EXTRACT has cross-region support. For information on enabling Cortex AI cross-region support, see Cross-region inference.
Error conditions¶
Snowflake Cortex AI_EXTRACT can produce the following error messages:
| Message | Explanation |
|---|---|
| | A system error occurred. Wait and try again. If the error persists, contact Snowflake support. |
| | The file was not found. |
| | The file was not found. |
| | The current user does not have sufficient privileges to access the file. |
| | The document is not in a supported format. |
| | The document is not stored in a stage with server-side encryption. |
| | No parameters were provided. |
| | No response format was provided. |
| | The response format is not valid JSON. |
| | The response format contains one or more duplicate feature names. |
| | The number of questions exceeds the allowed limit. |
| | The document exceeds the 125-page limit. |
| | Image input or a converted document page is larger than the supported dimensions. |
| | Page is larger than the supported dimensions. |
| | The document is larger than 100 MB. |
Legal notices¶
The data classification of inputs and outputs is as set forth in the following table.
| Input data classification | Output data classification | Designation |
|---|---|---|
| Usage Data | Customer Data | Generally available functions are Covered AI Features. Preview functions are Preview AI Features. [1] |
For additional information, refer to Snowflake AI and ML.