Categories:

String & binary functions (Large Language Model)

PARSE_DOCUMENT (SNOWFLAKE.CORTEX)

Returns the content extracted from a document on a Snowflake stage as an OBJECT that contains JSON-encoded objects as strings. This function supports two types of extraction: Optical Character Recognition (OCR) and layout. To learn more, see Cortex Parse Document overview.

Syntax

SNOWFLAKE.CORTEX.PARSE_DOCUMENT( '@<stage>', '<path>' [ , { 'mode': '<mode>' } ] )

Arguments

Required:

stage

Name of the Snowflake stage.

path

Relative path to the document on the Snowflake stage.

Optional:

mode

Specifies the extraction mode. The function returns an OBJECT in which the value of the key content holds the extracted data as a JSON-encoded string. The data is either formatted or plain text, depending on the mode specified in the call:

  • If mode is LAYOUT, the data is markdown with structural content including tables.

  • If mode is OCR, the data is the text content.

Default: 'OCR'
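
As a rough illustration of how the optional mode argument changes the call, the following Python sketch assembles the SQL text for both forms. The helper name build_parse_document_sql and the column alias parsed are hypothetical and not part of Snowflake. The sketch assumes that only the two documented mode values are accepted, and it rejects quotes and backslashes instead of escaping them.

```python
from typing import Optional

VALID_MODES = {"OCR", "LAYOUT"}  # the two modes described in this reference


def _literal(value: str) -> str:
    """Wrap a value in single quotes, refusing characters that would need escaping."""
    if "'" in value or "\\" in value:
        raise ValueError("quotes and backslashes are not supported in this sketch")
    return "'" + value + "'"


def build_parse_document_sql(stage: str, path: str, mode: Optional[str] = None) -> str:
    """Return SQL text for a PARSE_DOCUMENT call (hypothetical helper)."""
    args = [_literal("@" + stage), _literal(path)]
    if mode is not None:
        mode = mode.upper()
        if mode not in VALID_MODES:
            raise ValueError("mode must be one of " + ", ".join(sorted(VALID_MODES)))
        # The optional third argument is an object with a single 'mode' key.
        args.append("{'mode': " + _literal(mode) + "}")
    return "SELECT SNOWFLAKE.CORTEX.PARSE_DOCUMENT(" + ", ".join(args) + ") AS parsed"


# Without a mode, the call relies on the default ('OCR').
print(build_parse_document_sql("PARSE_DOCUMENT.DEMO.documents", "document_1.pdf"))
# With an explicit mode.
print(build_parse_document_sql("PARSE_DOCUMENT.DEMO.documents", "document_1.pdf", "LAYOUT"))
```

Building SQL by string concatenation is only reasonable here because the inputs are validated; production code would normally prefer whatever parameter binding the client library offers.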

Returns

An OBJECT data type that contains the extracted data as a JSON-encoded string. The content depends on the mode used in the call:

  • LAYOUT mode: JSON with key “content” containing markdown with tables extracted from the document.

  • OCR mode: JSON with key “content” containing the text content from the document.
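
On the client side, reading the returned value might look like the following Python sketch. It assumes the client hands the OBJECT back either as JSON text or as an already decoded dictionary, which depends on the client library in use. The function name extract_content and the sample string are illustrative only.

```python
import json


def extract_content(parsed):
    """Return the 'content' string from a PARSE_DOCUMENT result (hypothetical helper)."""
    # Some clients return OBJECT values as JSON text; decode that case first.
    if isinstance(parsed, str):
        parsed = json.loads(parsed)
    if not isinstance(parsed, dict) or "content" not in parsed:
        raise ValueError("expected an object with a 'content' key")
    return parsed["content"]


# Illustrative value, not real function output.
sample = '{"content": "# Title\\n\\nSome text."}'
print(extract_content(sample))
```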

Examples

OCR mode

SELECT TO_VARCHAR(
    SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
        '@PARSE_DOCUMENT.DEMO.documents',
        'document_1.pdf',
        {'mode': 'OCR'}):content
    ) AS OCR;

Output:

{
    "content": "content of the document"
}

LAYOUT mode

This example parses a document with a table shown in the following screenshot:

[Screenshot: example PDF content with a table]

SELECT TO_VARCHAR(
    SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
        '@PARSE_DOCUMENT.DEMO.documents',
        'document_1.pdf',
        {'mode': 'LAYOUT'}):content
    ) AS LAYOUT;

Output:

{
  "content": "# This is PARSE DOCUMENT example
     Example table:
     |Header|Second header|Third Header|
     |:---:|:---:|:---:|
     |First row header|Data in first row|Data in first row|
     |Second row header|Data in second row.|Data in second row.|

     Some more text."
 }
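
Because LAYOUT mode returns tables as Markdown, downstream code often needs to turn them back into rows. The following Python sketch does that for the first pipe table in the content string. It assumes the second table line is the alignment row (such as |:---:|) and that cells contain no escaped pipe characters. The function name parse_markdown_table is hypothetical, and the sample text mirrors the output shown above.

```python
from typing import Dict, List


def parse_markdown_table(markdown: str) -> List[Dict[str, str]]:
    """Return the data rows of the first pipe table in `markdown` as dicts."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("|") and line.endswith("|"):
            # Drop the outer pipes, then split the remaining text into cells.
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
        elif rows:
            break  # the first table has ended
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is assumed to be the alignment row
    return [dict(zip(header, row)) for row in body]


content = """# This is PARSE DOCUMENT example
Example table:
|Header|Second header|Third Header|
|:---:|:---:|:---:|
|First row header|Data in first row|Data in first row|
|Second row header|Data in second row.|Data in second row.|

Some more text."""

for row in parse_markdown_table(content):
    print(row)
```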