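AI_COMPLETE with documents

The Cortex AI_COMPLETE function is a general-purpose AI function that can understand data stored in PDF, Microsoft Word, and other document file formats. You can use AI_COMPLETE for a variety of document data extraction tasks, such as:

  • Answering questions using data from charts.

  • Finding relationships between charts and document text.

  • Summarizing document content with respect to a specific question.

  • Extracting entities from documents.

The advantage of AI_COMPLETE over the other document processing AI functions is that you choose the model, so you can use the model best suited to a particular document processing task.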

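Processing documents with AI_COMPLETE

The AI_COMPLETE function processes document files stored in internal Snowflake stages or in external stages. A completion prompt can reference a single document or multiple documents. For example, you can check the accuracy of a marketing-material translation by passing both the original and the translated document to the function and prompting the model to assess the translation quality.

When you call the function, you must specify the model to use and the prompt. The prompt should include instructions and a FILE object reference for each document to be processed. For sample prompts and completions, see Examples; for the function call syntax, see AI_COMPLETE (prompt object).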
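As a rough sketch of the multi-document case described above, the query below passes an original and a translated document in a single prompt. The file names `brochure_en.pdf` and `brochure_de.pdf` and the prompt wording are hypothetical; the `{0}` and `{1}` placeholders are filled, in order, by the arguments that follow the template string.

```sql
-- Hypothetical file names; replace them with documents that exist on your stage.
SELECT AI_COMPLETE(
  MODEL => 'claude-4-sonnet',
  PROMPT => PROMPT(
    'Document {0} is the original and document {1} is its German translation. Assess the translation quality and list any mistranslated passages.',
    TO_FILE('@docs', 'brochure_en.pdf'),  -- fills {0}
    TO_FILE('@docs', 'brochure_de.pdf')   -- fills {1}
  )
);
```

Each referenced file counts toward the model's documents-per-prompt limit, so a comparison like this one uses two of the available slots.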

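Input requirements

AI_COMPLETE is optimized for both digital-born and scanned documents. The following table lists the limits and requirements for input documents:

| Requirement | Value |
|---|---|
| Supported file types | All models: .txt, .md, .pdf. Claude models: .txt, .md, .pdf, .doc, .docx, .xls, .xlsx, .csv, .xhtml |
| Stage encryption | Server-side encryption |
| Data type | FILE object |

Note

Processing files from a stage with AI_COMPLETE is currently incompatible with custom network policies.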
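A minimal sketch of a stage that satisfies the encryption requirement above, assuming an internal stage named `docs` (the name used in the examples on this page). The directory table setting is optional and is included only because it makes it easy to list the staged files; `report.pdf` is a hypothetical file name.

```sql
-- Internal stage with server-side encryption (SNOWFLAKE_SSE).
-- The stage name "docs" is an assumption taken from the examples on this page.
CREATE STAGE IF NOT EXISTS docs
  DIRECTORY = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

-- Build a FILE object reference to a staged document (hypothetical file name).
SELECT TO_FILE('@docs', 'report.pdf');
```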

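Examples

The following examples show how to use AI_COMPLETE to process documents for three common use cases: chart question answering, contextual document summarization, and technical report exploration.

Chart question answering example

[Image: title page of the "Breaking the gridlock" EU report]

The following example uses Anthropic's Claude Opus 4 model to analyze the data shown in charts, in the context of the document hdr2023-24snapshoten.pdf stored in the @docs stage.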

SELECT AI_COMPLETE(
  MODEL => 'claude-4-opus',
  PROMPT => PROMPT('Compare the distributions of HDI in each group: low HDI group, medium HDI group, high HDI group and very high HDI group visualized in {0}', TO_FILE('@docs', 'hdr2023-24snapshoten.pdf'))
);

Response:

Looking at the document, I can see Figure S.2 on page 6 which shows the recovery of HDI values since the 2020-2021
decline across different HDI groups. The visualization shows:

**Low HDI group**:
- 49% recovered
- 51% did not recover

**Medium HDI group**:
- The document doesn't provide specific recovery percentages for this group in the figure

**High HDI group**:
- The document doesn't provide specific recovery percentages for this group in the figure

**Very high HDI group**:
- 100% recovered (all OECD countries)

The document also provides additional insights about HDI distributions:

1. **Inequality trends** (Figure S.3, page 7): The inequality between very high HDI and low HDI countries has been
   increasing since 2020, reversing a long-term declining trend. The difference in HDI values between these groups
   increased from 0.38 in 2017 to 0.39 in 2020 and is projected to continue rising.

2. **Overall recovery patterns**: The text notes that "every Organisation for Economic Co-operation and Development
   country is projected to have recovered, but only about half of the Least Developed Countries are projected to have
   done so."

3. **Global HDI trajectory**: While the global HDI value is projected to reach a record high in 2023, it remains below
   its pre-2019 trend, suggesting a permanent shift in the development trajectory.

The document emphasizes the highly unequal nature of recovery across HDI groups, with countries in higher HDI
categories showing much stronger recovery rates than those in lower HDI categories.

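Contextual document summarization example

[Image: first page of EU Regulation 1008/2008]

The following example uses Anthropic's Claude Sonnet 4 model to summarize a legal text with a complex layout. The document CELEX_32008R1008_EN_TXT.pdf is stored in the @docs stage; the prompt narrows the scope of the summary.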

SELECT AI_COMPLETE(
  MODEL => 'claude-4-sonnet',
  PROMPT => PROMPT('summarize the changes from the perspective of private aviation {0}', TO_FILE('@docs', 'CELEX_32008R1008_EN_TXT.pdf'))
);

Response:

From a private aviation perspective, this EU Regulation 1008/2008 introduces several significant changes that affect
private operators and business aviation:

## Key Changes Affecting Private Aviation:

### **Operating License Requirements**
- **Stricter financial oversight**: Private operators must demonstrate they can meet financial obligations for 24
  months from start of operations, with enhanced monitoring during the first two years
- **Simplified requirements for smaller operators**: Aircraft under 10 tonnes MTOM and/or less than 20 seats need
  only demonstrate €100,000 net capital (unless operating scheduled services or exceeding €3 million turnover)

### **Aircraft Registration and Leasing**
- **Flexible registration**: Private operators can register aircraft either in their home Member State or anywhere within the Community

- **Wet lease restrictions**: Stricter limitations on wet leasing aircraft from third countries, requiring prior
  approval and justification based on:
  - Exceptional needs (up to 7 months, renewable once)
  - Seasonal capacity requirements
  - Operational difficulties when EU-registered aircraft unavailable
- **Safety equivalence**: All third-country leased aircraft must meet equivalent safety standards to EU requirements.

### **Enhanced Supervision**

- **Unified oversight**: The same Member State authority now oversees both the Air Operator Certificate (AOC) and
  operating license, improving efficiency for operators with bases in multiple countries
- **Regular assessments**: Mandatory financial reviews, particularly after two years of operation and when potential
  problems are suspected

### **Insurance Requirements**
- **Extended coverage**: Insurance requirements now explicitly include mail liability coverage in addition to
  passengers, cargo, and third parties

### **Operational Flexibility**
- **Code-sharing freedom**: Private operators can more freely enter into code-share arrangements on intra-Community
  routes and routes to third countries
- **Pricing freedom**: Complete freedom to set fares and rates for intra-Community services

### **Administrative Streamlining**
- **Consolidated regulation**: The three separate regulations are now combined into one comprehensive framework,
  simplifying compliance
- **Reduced bureaucracy**: Member States cannot require documents already provided to licensing authorities

These changes generally **liberalize** private aviation operations within the EU while **strengthening** financial
and safety oversight, creating a more integrated and competitive market for private operators.

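Technical report exploration example

[Image: first page of the 75 mm gun specification booklet]

The following example uses the Gemini 3 Pro model to analyze casualty data shown in the figures of a technical report. The document 75mm-M3-spec-booklet-MK-VI.pdf is stored in the @docs stage.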

SELECT AI_COMPLETE(
  MODEL => 'gemini-3-pro',
  PROMPT => PROMPT('explain findings from figures 69-73 of {0}', TO_FILE('@docs', '75mm-M3-spec-booklet-MK-VI.pdf'))
);

Response:

Based on the provided document, specifically **page 4**, here is an explanation of the findings from Figures 69
through 73. These figures illustrate the fragmentation patterns and effectiveness of the **75-mm Shell, H.E., M48**
when fired from an M3 Gun. They visualize how dangerous the shell is to personnel (casualties) and equipment
(perforation of mild steel) at different burst heights and orientations.

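Supported models and limits

Every model available in Snowflake Cortex has a limit on the total number of input and output tokens, known as the model's context window. The context window size is measured in tokens. Inputs that exceed the context window limit result in an error.

For text models, a token generally represents about four characters of text, so the word count corresponding to a limit is somewhat lower than the context window expressed in tokens. For image models, the number of tokens per document depends on the architecture of the vision model. The tokens in the prompt itself (for example, "Summarize this document:") also count toward the model's context window.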
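The four-characters-per-token rule of thumb can be turned into a quick back-of-the-envelope check. The sketch below assumes a hypothetical 600,000-character plain-text document and a 200,000-token context window; it ignores prompt and output tokens and does not apply to pages that are processed as images, so treat the result as a rough indication only.

```sql
-- Rough estimate only: assumes about 4 characters per token for plain text.
SELECT
  600000                        AS character_count,  -- hypothetical document length
  CEIL(600000 / 4)              AS approx_tokens,    -- 150000
  200000                        AS context_window,   -- tokens
  CEIL(600000 / 4) <= 200000    AS likely_fits;      -- TRUE, before prompt and output tokens
```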

| Model | Context window (tokens) | File types | File size | Documents per prompt |
|---|---|---|---|---|
| gemini-3-pro | 1,000,000 | .pdf, .txt, .md | 10 MB, up to 900 pages | 20 |
| claude-3-7-sonnet | 200,000 | .txt, .md, .pdf, .doc, .docx, .xls, .xlsx, .csv, .xhtml | 4.5 MB | 5 |
| claude-4-sonnet | 200,000 | .txt, .md, .pdf, .doc, .docx, .xls, .xlsx, .csv, .xhtml | 4.5 MB | 5 |
| claude-4-opus | 200,000 | .txt, .md, .pdf, .doc, .docx, .xls, .xlsx, .csv, .xhtml | 4.5 MB | 5 |
| claude-haiku-4-5 | 200,000 | .txt, .md, .pdf, .doc, .docx, .xls, .xlsx, .csv, .xhtml | 4.5 MB | 5 |
| claude-sonnet-4-5 | 200,000 | .txt, .md, .pdf, .doc, .docx, .xls, .xlsx, .csv, .xhtml | 4.5 MB | 5 |
| claude-opus-4-5 | 200,000 | .txt, .md, .pdf, .doc, .docx, .xls, .xlsx, .csv, .xhtml | 4.5 MB | 5 |
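One way to check staged files against the size limits above before calling AI_COMPLETE is to query the stage's directory table. This is a sketch under two assumptions: the `@docs` stage has a directory table enabled and refreshed, and 1 MB is taken as 1024 × 1024 bytes (the limits above do not say which definition applies). The output column names are illustrative.

```sql
-- List staged files with their extension and whether they fit each size limit.
-- Assumes a directory table is enabled on @docs; 1 MB = 1024 * 1024 bytes is an assumption.
SELECT
  RELATIVE_PATH,
  LOWER(SPLIT_PART(RELATIVE_PATH, '.', -1)) AS file_extension,
  SIZE                                      AS size_bytes,
  SIZE <= 4.5 * 1024 * 1024                 AS fits_claude_limit,   -- 4.5 MB
  SIZE <= 10 * 1024 * 1024                  AS fits_gemini_limit    -- 10 MB
FROM DIRECTORY(@docs)
ORDER BY SIZE DESC;
```

The query covers file size and extension only; the 900-page limit for gemini-3-pro cannot be checked from directory metadata.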

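Access control requirements

To use the AI_COMPLETE function, a user with the ACCOUNTADMIN role must grant the SNOWFLAKE.CORTEX_USER database role to the users who will call the function. For details, see the Cortex LLM privileges topic.

Users must also have READ access to the stage and the files being processed.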
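The grants described above might look like the following sketch. The role name `cortex_user_role` and the user name `some_user` are hypothetical, and the stage is assumed to be an internal stage named `docs` in the current database and schema.

```sql
USE ROLE ACCOUNTADMIN;

-- Hypothetical role and user names.
CREATE ROLE IF NOT EXISTS cortex_user_role;
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE cortex_user_role;
GRANT ROLE cortex_user_role TO USER some_user;

-- READ access to the stage that holds the documents (internal stage assumed).
-- The role also needs USAGE on the database and schema that contain the stage.
GRANT READ ON STAGE docs TO ROLE cortex_user_role;
```

Routing the database role through an account role keeps the grants in one place when several users need the same access.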

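Cost considerations

Cost is determined by the total number of tokens processed, not by file size. When a document is uploaded, its text content is extracted and converted to tokens; visual page fragments (images) are also converted to tokens. Billing is based on the sum of input tokens (the text and images the model reads) and output tokens (the text the model generates).

Actual token counts vary with the model's underlying architecture and with the composition and structure of the document. Content such as dense tables, spreadsheets, structured data, code, repeated headers and footers, or OCR-derived text can increase the token count. Conversely, image-heavy or slide-based documents with little extractable text can yield lower token counts.

Note

The AI_COUNT_TOKENS function does not currently support document inputs for multimodal models.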

Choosing a model

The MMLongBench-Doc benchmark (https://proceedings.neurips.cc/paper_files/paper/2024/hash/ae0e43289bffea0c1fa34633fc608e92-Abstract-Datasets_and_Benchmarks_Track.html) evaluates a model's multimodal and long-context understanding, including retrieval of information across pages.

| Model | MMLongBench-Doc score |
|---|---|
| claude-3-5-sonnet | 46.8% |
| claude-3-7-sonnet | 52.8% |
| claude-4-sonnet | 50.2% |
| claude-4-opus | 53.0% |
| claude-haiku-4-5 | 48.9% |
| claude-sonnet-4-5 | 61.4% |
| claude-opus-4-5 | 63.8% |
| claude-sonnet-4-6 | 62.3% |
| gemini-3-pro | 60.5% |

Regional availability

See Availability.

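Error conditions

Snowflake Cortex AI_COMPLETE can produce the following error messages:

| Message | Explanation |
|---|---|
| Request failed for external function _COMPLETE_WITH_PROMPT_HISTORY_LLM$V1 with remote service error: 400 '"invalid request parameters: unsupported document content type: application/vnd.ms-excel" | The selected file is of an unsupported type (in this case, a Microsoft Excel file). Only Claude models support Excel files. |
| Request failed for external function _COMPLETE_WITH_PROMPT_HISTORY_LLM$V1 with remote service error: 400 '"invalid request parameters: File data exceeds the limit of 10.00 MB for file prefix/file.pdf" | The file size exceeds the limit (10 MB in this example). |
| Remote file '@docs/file.pdf' was not found. There are several potential causes. The file might not exist. The required credentials may be missing or invalid. If you are running a copy command, please make sure files are not deleted when they are being loaded or files are not being loaded into two different tables concurrently with auto purge option. | The file name may be incorrect (file names are case-sensitive), or the file may have been deleted. |
| Error in secure object | May indicate that the stage does not exist. Check the stage name and make sure the stage exists and is accessible. Be sure to put an at sign (@) at the beginning of the stage name. Make sure the stage uses server-side encryption. |
| Request failed for external function COMPLETE$V6 with remote service error: 400 '"model "model_name" does not support given modality" | The model specified in the request does not support the document or text modality. |
| Request failed for external function _COMPLETE_WITH_PROMPT with remote service error: 500 '"internal error" | A problem occurred while processing the request on the server side. The file may be corrupted or truncated. |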