Categories: String & binary functions (AI Functions)

AI_PARSE_DOCUMENT

Note

AI_PARSE_DOCUMENT is the updated version of PARSE_DOCUMENT (SNOWFLAKE.CORTEX). For the latest functionality, use AI_PARSE_DOCUMENT.

Returns the extracted content from a document on a Snowflake stage as a JSON-formatted string. This function supports two types of extraction: Optical Character Recognition (OCR) and layout. For more information, see Parsing documents with AI_PARSE_DOCUMENT.

Syntax

AI_PARSE_DOCUMENT( <file_object>, [ <options> ] )

Arguments

Required:

file_object: A FILE object that specifies the document to parse, stored in a Snowflake stage. For information about creating file objects, see TO_FILE.

Optional:

options

An OBJECT value that contains options for parsing documents. The supported keys are listed below. All keys are optional.

'extract_images': If set to TRUE, the function extracts images embedded in the document. Requires LAYOUT mode.
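'mode': Specifies the parsing mode. The supported modes are:
- 'OCR': The function extracts text only. This is the default mode.
- 'LAYOUT': The function extracts layout as well as text, including structural content such as tables.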
'page_split': If set to TRUE, the function splits the document into pages and processes each page separately. This feature supports only PDF, PowerPoint (.pptx), and Word (.docx) documents. Documents in other formats return an error. The default is FALSE.

Tip

To process long documents that exceed the token limit of AI_PARSE_DOCUMENT, set this option to TRUE.
'page_filter': An array that specifies ranges of pages of a multi-page document to process. Each range is an object with start and end fields that specify the first (inclusive) and last (exclusive) page in the range. Page indexes start at 0. For example, {'start': 0, 'end': 1} specifies the first page of the document.

Note

Specifying page_filter implies page_split. If you specify page ranges, it is not necessary to also set page_split.
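
The zero-based, half-open range semantics of 'page_filter' can be illustrated with a short Python sketch. The helper name `pages_selected` is hypothetical and not part of any Snowflake API, and its validation rules are an assumption made for illustration. It only expands a filter into the page indexes that the ranges describe.

```python
def pages_selected(page_filter):
    """Expand page_filter ranges into a sorted list of zero-based page indexes.

    Each range is {'start': s, 'end': e}: s is inclusive, e is exclusive.
    """
    selected = set()
    for page_range in page_filter:
        start, end = page_range["start"], page_range["end"]
        # Assumption for this sketch: reject negative or empty ranges.
        if start < 0 or end <= start:
            raise ValueError(f"invalid page range: {page_range}")
        selected.update(range(start, end))
    return sorted(selected)


# Options as they might be passed to the function (an OBJECT in SQL).
# page_split is omitted because page_filter already implies it.
options = {
    "mode": "LAYOUT",
    "page_filter": [{"start": 0, "end": 1}, {"start": 4, "end": 6}],
}

print(pages_selected(options["page_filter"]))  # [0, 4, 5]
```

The first range selects only the first page, and the second selects the fifth and sixth pages, because the end index is excluded.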

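Returns

A JSON object (as a string) that contains the extracted data and associated metadata. The options argument determines the structure of the returned object.

Tip

To use the output in SQL, convert it to an OBJECT value using the PARSE_JSON function.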

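If the 'page_split' option is set, the output has the following structure:

"pages": An array of JSON objects, each containing text extracted from the document. If the document has only one page, the output still contains a "pages" array (with a single object). Each page has the following fields:

"content": Plain text (in OCR mode) or Markdown-formatted text (in LAYOUT mode).

"index": The index of the page in the file, starting at 0. Page numbers and numbering formats specified within the document are ignored.

"errorInformation": Contains error information if the document could not be parsed.

"metadata": Contains metadata about the document, such as the page count.

Note

When parsing succeeds, the "pages" and "metadata" fields appear in the output. "errorInformation" appears only when parsing fails.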
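
As a rough client-side sketch, the page-split structure can be consumed in Python once the JSON string has been fetched. The function name `join_pages`, the sample payload, and the `pageCount` metadata key are assumptions made for illustration. The sketch also assumes that "errorInformation" is a top-level field, as the note above suggests.

```python
import json


def join_pages(raw_output, separator="\n\n"):
    """Return the document text from a page_split result, in page order."""
    result = json.loads(raw_output)
    if "errorInformation" in result:
        raise RuntimeError(f"parsing failed: {result['errorInformation']}")
    # Sort defensively by the zero-based page index before joining.
    pages = sorted(result["pages"], key=lambda page: page["index"])
    return separator.join(page["content"] for page in pages)


# Made-up sample output; the "pageCount" key name is an assumption.
sample = json.dumps({
    "pages": [
        {"index": 1, "content": "Second page"},
        {"index": 0, "content": "First page"},
    ],
    "metadata": {"pageCount": 2},
})

print(join_pages(sample))
```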

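If 'page_split' is FALSE or is not present, the output has the following structure:

"content": Plain text (in OCR mode) or Markdown-formatted text (in LAYOUT mode).

"errorInformation": Contains error information if the document could not be parsed.

"metadata": Contains metadata about the document, such as the page count.

Note

When parsing succeeds, the "content" and "metadata" fields appear in the output. "errorInformation" appears only when parsing fails.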
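
The success and failure cases of the single-document structure can be separated with a small Python sketch like the following. The name `parse_result` and both sample payloads are hypothetical. The shape of "errorInformation" is not specified here, so the sketch treats it as an opaque value.

```python
import json


def parse_result(raw_output):
    """Split a non-page_split result into (content, metadata, error)."""
    result = json.loads(raw_output)
    error = result.get("errorInformation")
    if error is not None:
        # On failure, "content" is not expected to be present.
        return None, result.get("metadata"), error
    return result["content"], result["metadata"], None


# Made-up payloads for illustration only.
ok = json.dumps({"content": "# Title\n\nBody text", "metadata": {}})
failed = json.dumps({"errorInformation": "made-up error details"})

content, metadata, error = parse_result(ok)
print(error is None)  # True

content, metadata, error = parse_result(failed)
print(content is None)  # True
```

Returning a tuple instead of raising keeps the sketch usable when processing many documents in one pass, where a single failed document should not stop the rest.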

If the "extract_images" option is set to TRUE, the output includes an additional field:

"images": An array of JSON objects, each representing an extracted image. Each image object has the following fields:

"id": A unique identifier for the image.

"top_left_x", "top_left_y", "bottom_right_x", "bottom_right_y": The coordinates of the bounding box of the image on the page.

"image_base64": The extracted image data encoded as a base64 string.

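One possible way to turn the "images" array into raw bytes on the client side is sketched below in Python. The function name `decode_images` and the sample data are hypothetical. The sketch assumes that the object passed in is whichever JSON object carries the "images" field, and it tolerates an optional `data:<mime>;base64,` prefix, which is an assumption rather than documented behavior.

```python
import base64


def decode_images(parsed):
    """Map each image id to its decoded bytes."""
    decoded = {}
    for image in parsed.get("images", []):
        data = image["image_base64"]
        # Assumption: strip a data-URI prefix if one is present.
        if data.startswith("data:") and "," in data:
            data = data.split(",", 1)[1]
        decoded[image["id"]] = base64.b64decode(data)
    return decoded


# Made-up sample; the bytes are a placeholder, not a real image.
sample = {
    "images": [
        {
            "id": "img-0",
            "top_left_x": 10, "top_left_y": 20,
            "bottom_right_x": 110, "bottom_right_y": 220,
            "image_base64": base64.b64encode(b"fake image bytes").decode("ascii"),
        }
    ]
}

images = decode_images(sample)
print(sorted(images))  # ['img-0']
```
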
Examples

For examples, see AI_PARSE_DOCUMENT examples.

Limitations

Snowflake Cortex functions do not support dynamic tables.