Extracting information from documents with AI_EXTRACT

AI_EXTRACT is a Cortex AI Function that lets you extract structured information, such as entities, lists, and tables, from text or document files, by asking questions in natural language or by describing information to be extracted. It can be used with other functions to create custom document processing pipelines for a variety of use cases (see Cortex AI Functions: Documents).

AI_EXTRACT can process documents in various formats (and in 29 languages) and extract information from both text-heavy paragraphs and graphical content, such as logos, handwritten text (for example, signatures), tables, or checkmarks. AI_EXTRACT can extract information in the following structured formats:

  • Entity: Ask questions in natural language or describe the information to be extracted (such as city, street, or ZIP code).

  • List (or array): Provide a JSON schema to extract an array or list of information present in the document, such as the names of all account holders in a bank statement or a list of all addresses in a document.

  • Table: Provide a JSON schema to extract tabular data present in the document by specifying the table title and a list of columns that should be extracted.

AI_EXTRACT scales automatically with your workload by processing multiple documents simultaneously. Documents can be processed directly from object storage to avoid unnecessary data movement.

Note

AI_EXTRACT is currently incompatible with custom network policies.

Extraction quality

AI_EXTRACT uses arctic-extract, a proprietary vision-based large language model (LLM) that delivers high extraction accuracy. The following tables present the model’s scores on standard benchmarks, with the scores of other popular models for comparison:

Visual question answering (VQA)

| Offering | DocVQA score |
| --- | --- |
| Human evaluation | 0.9811 |
| Snowflake Arctic-Extract | 0.9433 |
| Azure OpenAI GPT-o3 | 0.9339 |
| Google Gemini 2.5-Pro | 0.9316 |
| Anthropic Claude 4 Sonnet | 0.9119 |
| Azure Document Intelligence + GPT-o3 | 0.8853 |
| Google Document AI + Gemini | 0.8497 |
| AWS Textract | 0.8313 |

Text-only question answering (SQuAD v2)

| Offering | ANLS | Exact match |
| --- | --- | --- |
| Snowflake Arctic-Extract | 81.18 | 78.74 |
| Anthropic Claude 4 Sonnet | 80.54 | 77.10 |
| Meta LLaMA 3.1 405B | 80.37 | 76.56 |
| Meta LLaMA 4 Scout | 74.30 | 70.70 |
| OpenAI GPT 4.1 | 70.71 | 66.81 |
| Meta LLaMA 3.1 8B | 59.13 | 54.48 |

Examples

These examples use the following image as the input document. The document is stored on a stage.

Condominium Purchase and Sale Agreement

Extracting an entity

This example extracts the seller name and the offer expiration date from the sale agreement.

SELECT AI_EXTRACT(
  file => TO_FILE('@db.schema.stage','document.pdf'),
  responseFormat => [['seller_name', 'What is the seller name?'], ['offer_expiration_date', 'What is the offer expiration date?']]
);

Result:

{
    "error": null,
    "response": {
        "offer_expiration_date": "12/12/2023",
        "seller_name": "Paul Doyle"
    }
}
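The result is a JSON object with an `error` field and a `response` object keyed by the labels given in `responseFormat`. The following is a minimal Python sketch of client-side handling, assuming the result has already been fetched as a JSON string. The helper name `unwrap_extract_result` is hypothetical, and the shape of a non-null `error` value is an assumption (only `null` is shown above).

```python
import json


def unwrap_extract_result(raw: str) -> dict:
    """Return the `response` object of an AI_EXTRACT result, or raise if `error` is set."""
    result = json.loads(raw)
    # Assumption: a failed call reports its message in the top-level `error` field.
    if result.get("error") is not None:
        raise RuntimeError(f"AI_EXTRACT failed: {result['error']}")
    return result.get("response") or {}


# Sample payload shaped like the result shown above.
sample = '{"error": null, "response": {"seller_name": "Paul Doyle"}}'
fields = unwrap_extract_result(sample)
print(fields["seller_name"])
```

Checking `error` before reading `response` keeps a failed extraction from being mistaken for an empty answer.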

Extracting checkbox information

This example extracts information about items that are not included, based on the checkboxes marked in the document.

SELECT AI_EXTRACT(
  file => TO_FILE('@db.schema.stage','document.pdf'),
  responseFormat => [['flat_items', 'What items are not included with the flat?'], ['default', 'What Default is selected?']]
);

Result:

{
    "error": null,
    "response": {
        "default": "Forfeiture of Earnest Money",
        "flat_items": "dryer, security system, satellite dish, wood stove, fireplace insert, hot tub, attached speaker(s), generator, other"
    }
}

Extracting signature status

This example extracts information about whether the agreement has been signed.

SELECT AI_EXTRACT(
    file => TO_FILE('@db.schema.stage','document.pdf'),
    responseFormat => [['signature', 'Is this document signed?']]
);

Result:

{
    "error": null,
    "response": {
        "signature": "no"
    }
}

Extracting a list of entities

This example extracts a list of buyer names.

SELECT AI_EXTRACT(
    file => TO_FILE('@db.schema.files', 'report.pdf'),
    responseFormat => {
        'schema': {
            'type': 'object',
            'properties': {
                'buyer_list': {
                    'description': 'What are the buyer names?',
                    'type': 'array'
                }
            }
        }
    }
);

Result:

{
    "error": null,
    "response": {
        "buyer_list": [
            "John Davis",
            "Jane Davis"
        ]
    }
}
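The list schema above and the table schema in the next example follow the same nested shape, which is easy to mistype by hand. The following Python sketch builds both shapes as dictionaries so they can be printed as JSON and compared against the literal used in the SQL call. The function names `list_schema` and `table_schema` are hypothetical helpers, not part of Snowflake, and the sketch does not cover how the schema is passed to AI_EXTRACT.

```python
import json


def list_schema(name: str, question: str) -> dict:
    """Schema asking for a list of values, shaped like the buyer_list example."""
    return {
        "schema": {
            "type": "object",
            "properties": {name: {"description": question, "type": "array"}},
        }
    }


def table_schema(name: str, title: str, columns: dict) -> dict:
    """Schema asking for a table; `columns` maps column keys to their header text."""
    return {
        "schema": {
            "type": "object",
            "properties": {
                name: {
                    "description": title,
                    "type": "object",
                    # Column order follows the insertion order of `columns`.
                    "column_ordering": list(columns),
                    "properties": {
                        key: {"description": header, "type": "array"}
                        for key, header in columns.items()
                    },
                }
            },
        }
    }


buyers = list_schema("buyer_list", "What are the buyer names?")
print(json.dumps(buyers, indent=2))
```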

Extracting table information

This example extracts tabular data from the following document.

Granger Causality Tests - P-values

SELECT AI_EXTRACT(
    file => TO_FILE('@db.schema.files', 'report.pdf'),
    responseFormat => {
        'schema': {
            'type': 'object',
            'properties': {
                'income_table': {
                    'description': 'Table 2: Granger Causality Tests - P-values',
                    'type': 'object',
                    'column_ordering': ['description', 'countries', 'lags', 'z', 'z_approx'],
                    'properties': {
                        'description': {
                            'description': 'Description',
                            'type': 'array'
                        },
                        'countries': {
                            'description': 'Countries',
                            'type': 'array'
                        },
                        'lags': {
                            'description': 'Lags',
                            'type': 'array'
                        },
                        'z': {
                            'description': 'Z',
                            'type': 'array'
                        },
                        'z_approx': {
                            'description': 'Z approx.',
                            'type': 'array'
                        }
                    }
                }
            }
        }
    }
);

Result:

{
    "error": null,
    "response": {
        "income_table": {
            "countries": [
                "33","80","29","84","34"
            ],
            "description": [
                "Commodity exporters",
                "Non-commodity exporters",
                "AE",
                "EMDE",
                "Large or market-dominant countries"
            ],
            "lags": [
                "2","1","1","1","1"
            ],
            "z": [
                "0.11","0.08","0.89","0.12","0.07"
            ],
            "z_approx": [
                "0.25","0.19","0.95","0.25","0.14"
            ]
        }
    }
}
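The table comes back column by column: each key holds an array with one value per row. Downstream code often wants one record per row instead. The following Python sketch shows that conversion; `columns_to_rows` is a hypothetical helper name, and the sample data is the first two rows of the response above.

```python
def columns_to_rows(table: dict, column_order: list) -> list:
    """Turn {"col": [v1, v2, ...]} into [{"col": v1, ...}, {"col": v2, ...}]."""
    lengths = {len(table[col]) for col in column_order}
    if len(lengths) > 1:
        raise ValueError("columns have different lengths; cannot align rows")
    row_count = lengths.pop() if lengths else 0
    return [
        {col: table[col][i] for col in column_order}
        for i in range(row_count)
    ]


# First two rows of the income_table response shown above.
income_table = {
    "countries": ["33", "80"],
    "description": ["Commodity exporters", "Non-commodity exporters"],
    "lags": ["2", "1"],
    "z": ["0.11", "0.08"],
    "z_approx": ["0.25", "0.19"],
}
order = ["description", "countries", "lags", "z", "z_approx"]
rows = columns_to_rows(income_table, order)
print(rows[0])
```

The length check guards against a partially extracted column silently shifting values into the wrong row.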

Input requirements

AI_EXTRACT is optimized for both digital-born and scanned documents. The following table lists the limitations and requirements of input documents:

| Requirement | Limit |
| --- | --- |
| Maximum file size | 100 MB |
| Maximum pages per document | 125 |
| Maximum questions | 100 questions for entity (single or list) extraction; 10 questions for table extraction |
| Supported file types | PDF, PPT, PPTX, DOC, DOCX, EML, HTM, HTML, TEXT, MD, TXT, BMP, JPEG, JPG, PNG, TIFF, TIF, WEBP |
| Stage encryption | Server-side encryption |
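Some of these limits can be checked before a file is uploaded. The following Python sketch is a hypothetical client-side preflight check (the name `preflight` and its parameters are illustrative only). It covers file type, file size, and question counts. It does not check page count or stage encryption, and it checks the two question limits separately because this section does not say how entity and table questions combine.

```python
from pathlib import Path

# Limits copied from the requirements listed above.
MAX_FILE_BYTES = 100 * 1024 * 1024  # 100 MB (104857600 bytes)
MAX_ENTITY_QUESTIONS = 100
MAX_TABLE_QUESTIONS = 10
SUPPORTED_EXTENSIONS = {
    "pdf", "ppt", "pptx", "doc", "docx", "eml", "htm", "html", "text",
    "md", "txt", "bmp", "jpeg", "jpg", "png", "tiff", "tif", "webp",
}


def preflight(filename: str, size_bytes: int,
              entity_questions: int = 0, table_questions: int = 0) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    ext = Path(filename).suffix.lstrip(".").lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_FILE_BYTES}")
    if entity_questions > MAX_ENTITY_QUESTIONS:
        problems.append("more than 100 entity questions")
    if table_questions > MAX_TABLE_QUESTIONS:
        problems.append("more than 10 table questions")
    return problems


print(preflight("document.pdf", 2_000_000, entity_questions=2))
print(preflight("archive.zip", 200 * 1024 * 1024))
```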

Access control requirements

To use the AI_EXTRACT function, a user with the ACCOUNTADMIN role must grant the SNOWFLAKE.CORTEX_USER database role to the user who will call the function. See the Cortex LLM privileges topic for details.

Cost considerations

The Cortex AI_EXTRACT function incurs compute costs based on the number of pages per document, input prompt tokens, and output tokens processed.

  • For paged file formats (PDF, DOCX, TIF, TIFF), each page is counted as 970 tokens.

  • For image file formats (JPEG, JPG, PNG), each individual image file is billed as a page and counted as 970 tokens.
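As a worked example of the page-based part of this calculation, the following Python sketch sums the page tokens for a small batch. It covers only the per-page figure stated above. Input prompt tokens and output tokens are billed in addition and are not estimated here. The helper name `page_tokens` is hypothetical.

```python
TOKENS_PER_PAGE = 970  # per-page figure stated above


def page_tokens(page_counts: list) -> int:
    """Tokens attributed to pages for a batch; an image file counts as one page."""
    return sum(page_counts) * TOKENS_PER_PAGE


# A 12-page PDF, a 3-page DOCX, and two single-image PNG files.
batch = [12, 3, 1, 1]
print(page_tokens(batch))  # 17 pages * 970 = 16490
```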

Snowflake recommends executing queries that call the Cortex AI_EXTRACT function in a smaller warehouse (no larger than MEDIUM). Larger warehouses do not increase performance.

Supported languages

AI_EXTRACT supports the following languages:

  • Arabic

  • Bengali

  • Burmese

  • Cebuano

  • Chinese

  • Czech

  • Dutch

  • English

  • French

  • German

  • Hebrew

  • Hindi

  • Indonesian

  • Italian

  • Japanese

  • Khmer

  • Korean

  • Lao

  • Malay

  • Persian

  • Polish

  • Portuguese

  • Russian

  • Spanish

  • Tagalog

  • Thai

  • Turkish

  • Urdu

  • Vietnamese

Regional availability

Support for AI_EXTRACT is available to accounts in the following Snowflake regions:

| AWS | Azure |
| --- | --- |
| US West 2 | East US 2 |
| US East 1 | West US 2 |
| US CA Central 1 | South Central US |
| Europe Central 1 | North Europe |
| Europe West 1 | West Europe |
| SA East 1 | Central India |
| AP Northeast 1 | Japan East |
| AP Southeast 2 | Southeast Asia |
| | Australia East |

AI_EXTRACT has cross-region support. For information on enabling Cortex AI cross-region support, see Cross-region inference.

Error conditions

Snowflake Cortex AI_EXTRACT can produce the following error messages:

| Message | Explanation |
| --- | --- |
| Internal error. | A system error occurred. Wait and try again. If the error persists, contact Snowflake support. |
| Not found. | The file was not found. |
| Provided file cannot be found. | The file was not found. |
| Provided file cannot be accessed. | The current user does not have sufficient privileges to access the file. |
| The provided file format {fil_extension} isn't supported. | The document is not in a supported format. |
| The provided file isn't in the expected format or is client-side encrypted or is corrupted. | The document is not stored in a stage with server-side encryption. |
| Empty request. | No parameters were provided. |
| Missing or empty response format. | No response format was provided. |
| Invalid response format. | The response format is not valid JSON. |
| Duplicate feature name found: {feature_name}. | The response format contains one or more duplicate feature names. |
| Too many questions: {number} complex and {number} simple = {number} total, complex question weight {number}. | The number of questions exceeds the allowed limit. |
| Maximum number of 125 pages exceeded. The document has {actual_pages} pages. | The document exceeds the 125-page limit. |
| Page size in pixels exceeds 10000x10000. The page size is {actual_px} pixels. | Image input or a converted document page is larger than the supported dimensions. |
| Page size in inches exceeds 50x50 (3600x3600 pt). The page size is {actual_in} inches ({actual_pt} pt). | Page is larger than the supported dimensions. |
| Maximum file size of 104857600 bytes exceeded. The file size is {actual_size} bytes. | The document is larger than 100 MB. |
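Of these messages, only "Internal error." is described as worth retrying. The others point to a problem with the file or the request that a retry will not fix. The following Python sketch applies that distinction. It assumes the message arrives as text in the result's `error` field, which this section does not confirm (it may instead surface as a query error). `extract_with_retry` and the `run_extract` callable are hypothetical names standing in for whatever code issues the query.

```python
import time


def extract_with_retry(run_extract, attempts: int = 3, base_delay: float = 1.0) -> dict:
    """Call `run_extract()` and retry only when it reports "Internal error."."""
    for attempt in range(1, attempts + 1):
        result = run_extract()
        error = result.get("error")
        if error is None:
            return result["response"]
        # Assumption: the error text matches the messages listed above.
        retryable = "Internal error" in str(error)
        if not retryable or attempt == attempts:
            raise RuntimeError(f"AI_EXTRACT failed: {error}")
        time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


# Stand-in for a real query: fails once, then succeeds.
replies = iter([
    {"error": "Internal error.", "response": None},
    {"error": None, "response": {"signature": "no"}},
])
print(extract_with_retry(lambda: next(replies), base_delay=0.0))
```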
