Fine-tuning arctic-extract models

You can fine-tune arctic-extract models using the Snowflake Cortex Fine-tuning function and Snowflake Datasets. The fine-tuned model can then be used for inference with the AI_EXTRACT function.

Syntax

For the specific syntax, usage notes, and examples, see:

FINETUNE (‘CREATE’) (SNOWFLAKE.CORTEX)

Creates a fine-tuning job.

Syntax

SNOWFLAKE.CORTEX.FINETUNE(
  'CREATE',
  '@<database>.<schema>.<model_name>',
  'arctic-extract',
  '<training_dataset>'
  [
    , '<validation_dataset>'
    [, '<options>' ]
  ]
)

Required parameters

'CREATE'

Specifies that you want to create a fine-tuning job.

'training_dataset'

The Dataset object to use for training. For more information, see Dataset requirements.

Optional parameters

'validation_dataset'

The Dataset object to use for validation. For more information, see Dataset requirements.

'options'

Optional. A string representation of a JSON object that sets training hyperparameters for the job. You can specify max_epochs with an integer from 2 through 10 (inclusive) to control how many epochs the job runs. If you omit options, the number of epochs is determined automatically by the system.

For the JSON format of the options argument, see FINETUNE (‘CREATE’) (SNOWFLAKE.CORTEX).
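As a minimal sketch of preparing the options string on the client side, the following Python helper serializes max_epochs to JSON and rejects values outside the documented range of 2 through 10. The helper name build_finetune_options is hypothetical and is not part of any Snowflake API.

```python
import json
from typing import Optional


def build_finetune_options(max_epochs: Optional[int] = None) -> Optional[str]:
    """Return the JSON string for the options argument, or None to omit it.

    Hypothetical helper: it only mirrors the documented rule that
    max_epochs must be an integer from 2 through 10 (inclusive).
    """
    if max_epochs is None:
        # Omitting options lets the system choose the number of epochs.
        return None
    if isinstance(max_epochs, bool) or not isinstance(max_epochs, int):
        raise TypeError("max_epochs must be an integer")
    if not 2 <= max_epochs <= 10:
        raise ValueError("max_epochs must be from 2 through 10 (inclusive)")
    return json.dumps({"max_epochs": max_epochs})


if __name__ == "__main__":
    print(build_finetune_options(3))  # {"max_epochs": 3}
```

Validating the value before submitting the job gives an immediate, local error instead of a failed call.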

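Access control requirements

| Privilege | Object | Notes |
| --- | --- | --- |
| USAGE or OWNERSHIP | DATABASE | The database that stores the Dataset object. |
| USAGE or OWNERSHIP | SCHEMA | The schema that stores the Dataset object. |
| READ or OWNERSHIP | STAGE | The internal or external named stage that stores the document files. For more information, see Snowflake Cortex AI Functions (including LLM functions). |
| USAGE or OWNERSHIP | SCHEMA | The schema that stores the fine-tuned model. |
| CREATE MODEL | SCHEMA | The schema that stores the fine-tuned model. |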

Additionally, to use the FINETUNE function, the ACCOUNTADMIN role must grant the SNOWFLAKE.CORTEX_USER database role to the user who will call the function. For details, see the LLM Functions required privileges topic.

Examples

SELECT SNOWFLAKE.CORTEX.FINETUNE(
  'CREATE',
  '@database.schema.model_name',
  'arctic-extract',
  'snow://dataset/training_ds/versions/2',
  'snow://dataset/validation_ds/versions/4'
);

The following example adds the options argument to set max_epochs:

SELECT SNOWFLAKE.CORTEX.FINETUNE(
  'CREATE',
  '@database.schema.model_name',
  'arctic-extract',
  'snow://dataset/training_ds/versions/2',
  'snow://dataset/validation_ds/versions/4',
  '{"max_epochs": 3}'
);

FINETUNE (‘DESCRIBE’) (SNOWFLAKE.CORTEX)

Describes the properties of a fine-tuning job.

For syntax and parameters, see FINETUNE (‘DESCRIBE’) (SNOWFLAKE.CORTEX).

The following is example output for a successful job that fine-tunes an arctic-extract model:

{
  "base_model":"arctic-extract",
  "created_on":1717004388348,
  "finished_on":1717004691577,
  "id":"ft_6556e15c-8f12-4d94-8cb0-87e6f2fd2299",
  "model":"mydb.myschema.my_tuned_model",
  "progress":1.0,
  "status":"SUCCESS",
  "training_data":"snow://dataset/training_ds/versions/2",
  "trained_tokens":2670734,
  "training_result":{"validation_loss":1.0138969421386719,"training_loss":0.6477728401547047},
  "validation_data":"snow://dataset/validation_ds/versions/4"
}
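As a rough sketch of how a client might consume this output, the following Python snippet parses a trimmed copy of the example and derives a short summary. The function name summarize_job is hypothetical, and treating created_on and finished_on as epoch milliseconds is an assumption based on the magnitude of the sample values.

```python
import json

# Sample values copied from the example output (trimmed to the fields used here).
describe_output = """
{
  "base_model": "arctic-extract",
  "created_on": 1717004388348,
  "finished_on": 1717004691577,
  "progress": 1.0,
  "status": "SUCCESS",
  "training_result": {"validation_loss": 1.0138969421386719, "training_loss": 0.6477728401547047}
}
"""


def summarize_job(raw: str) -> dict:
    """Parse DESCRIBE output and return status, progress, duration, and losses."""
    job = json.loads(raw)
    summary = {"status": job["status"], "progress": job.get("progress")}
    # Assumption: created_on and finished_on are epoch milliseconds.
    if "created_on" in job and "finished_on" in job:
        summary["duration_seconds"] = (job["finished_on"] - job["created_on"]) / 1000
    # Copy training_loss and validation_loss when the job reports them.
    summary.update(job.get("training_result", {}))
    return summary


if __name__ == "__main__":
    print(summarize_job(describe_output))
```

Under that assumption, the sample job ran for about 303 seconds.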

Dataset requirements

The Dataset used for training and validation must contain the following columns:

File:

A string containing the file path to the document for extraction. The path can reference a file on an internal stage or a named external stage (for example, Amazon S3, Google Cloud Storage, or Microsoft Azure). For example: @db.schema.stage/file.pdf

Prompt:

A JSON value that specifies key and question pairs for extraction in one of the formats supported by the responseFormat argument of the AI_EXTRACT function.

For more information, see AI_EXTRACT.

Response:

A JSON object containing key and response pairs.

Note

Column names are case-insensitive and can be in any order in the Dataset; however, all required columns (File, Prompt, and Response) must be present for the Dataset to be valid. Additional columns in the Dataset are ignored.

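When preparing the Dataset, keep the following in mind:

  • The schema of the fine-tuned model is the unique set of all questions in the Dataset.
  • The answers in the Response column should match the questions in the Prompt column: each key in the Prompt column needs a matching key in the Response column.
  • You don't need to specify the same set of questions for every document.
  • To improve model accuracy, add prompt and response rows for every question, even when the model's default response is correct. Doing so confirms that the default answer is accurate.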

For more information about Datasets, see Snowflake Datasets.
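The key-matching rule can be checked before you load rows into a Dataset. The following Python sketch compares the keys of a Prompt value with the keys of its Response value. The function names are hypothetical, and reading the top-level properties of a JSON schema prompt as the expected keys is an assumption, not documented behavior.

```python
import json


def prompt_keys(prompt: str) -> set:
    """Collect the question keys from a Prompt value."""
    value = json.loads(prompt)
    if isinstance(value, dict):
        if "schema" in value:
            # Assumption: for the JSON schema format, the top-level
            # properties are the keys expected in the Response.
            return set(value["schema"].get("properties", {}))
        # Simple object format: {"key": "question", ...}
        return set(value)
    # Otherwise expect a list of [key, question] pairs.
    return {pair[0] for pair in value}


def check_row(prompt: str, response: str) -> list:
    """Return a list of key mismatches for one Dataset row (empty if none)."""
    problems = []
    expected = prompt_keys(prompt)
    actual = set(json.loads(response))
    if expected - actual:
        problems.append(f"missing response keys: {sorted(expected - actual)}")
    if actual - expected:
        problems.append(f"unexpected response keys: {sorted(actual - expected)}")
    return problems


if __name__ == "__main__":
    prompt = '[["invoice_number", "What is the invoice number?"], ["vendor", "What is the vendor name?"]]'
    print(check_row(prompt, '{"invoice_number": "543433434"}'))
```

A row whose Response omits a key that the Prompt asks about is reported as a problem, so you can fill in the missing answer before training.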

Dataset example

| File | Prompt | Response |
| --- | --- | --- |
| file1.pdf | {"date": "What is the date?", "total": "What is the total amount?"} | {"date": "2024-06-30", "total": "82.50"} |
| file2.pdf | [["invoice_number", "What is the invoice number?"], ["vendor", "What is the vendor name?"]] | {"invoice_number": "543433434", "vendor": "Example Corp"} |
| file3.pdf | JSON schema, shown after this table | JSON object, shown after this table |

Prompt for file3.pdf:

{
  "schema":
  {
    "type": "object",
    "properties": {
      "deductions": {
        "description": "Deductions",
        "type": "object",
        "properties": {
          "deductions_name": {
            "type": "array"
          },
          "current": {
            "type": "array"
          }
        }
      }
    }
  }
}

Response for file3.pdf:

{
  "deductions": {
    "deductions_name": [
      "Federal Tax",
      "Wyoming State Tax",
      "SDI",
      "Soc Sec / OASDI",
      "Health Insurance Tax",
      "None"
    ],
    "current": [
      "82.50",
      "64.08",
      "None",
      "13.32",
      "91.74",
      "21.46"
    ]
  }
}

Note

When you create the Dataset, set the response to None if the document does not contain an answer to the question.

Usage notes

  • Snowflake recommends using at least 20 documents for fine-tuning.

  • In the training Dataset, at most 100 unique questions are supported for entity extraction, and at most 10 unique questions are supported for table extraction.

  • Training and validation documents can reside on an internal stage or a named external stage. For access requirements and setup when you use cloud storage, see Snowflake Cortex AI Functions (including LLM functions).

  • Client-side encrypted stages are not supported. For more information, see AI_EXTRACT.

  • Fine-tuning arctic-extract models is currently incompatible with custom network policies.

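  • Supported document file formats include:

    • PDF
    • PNG
    • JPG, JPEG
    • TIFF, TIF
  • The maximum number of pages per document is:

    • 64 pages for AWS US West 2 (Oregon) and AWS Europe Central 1 (Frankfurt)
    • 125 pages for AWS US East 1 (N. Virginia) and Azure East US 2 (Virginia)
  • The maximum number of unique document files in the Dataset is 1,000. You can reference the same document file multiple times.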

  • The number of questions and documents in a fine-tuning job is limited: the number of questions multiplied by the total number of pages across all document files in the Dataset must be less than or equal to 50,000.

For example, some valid combinations are:

| Number of questions | Pages per document | Number of document file references |
| --- | --- | --- |
| 10 | 1 | 5,000 |
| 100 | 1 | 500 |
| 10 | 10 | 500 |
| 25 | 10 | 200 |
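The limits in these usage notes can be checked before you submit a job. The following Python sketch tests a planned Dataset against the supported file formats, the 1,000 unique file limit, and the 50,000 limit on questions multiplied by pages. The function name and input shape are hypothetical, and counting pages once per document file reference (rather than once per unique file) is an assumption drawn from the combinations listed above.

```python
import os

MAX_QUESTION_PAGE_PRODUCT = 50_000
MAX_UNIQUE_DOCUMENTS = 1_000
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".tif"}


def check_job_limits(num_questions: int, references: list) -> list:
    """Return a list of limit violations (empty if none).

    references holds one (file_path, page_count) tuple per document
    file reference in the Dataset; the same file may appear repeatedly.
    """
    problems = []
    unique_files = {path for path, _ in references}
    if len(unique_files) > MAX_UNIQUE_DOCUMENTS:
        problems.append(f"too many unique document files: {len(unique_files)}")
    for path in sorted(unique_files):
        extension = os.path.splitext(path)[1].lower()
        if extension not in SUPPORTED_EXTENSIONS:
            problems.append(f"unsupported file format: {path}")
    # Assumption: pages are counted once per reference, not per unique file.
    total_pages = sum(pages for _, pages in references)
    product = num_questions * total_pages
    if product > MAX_QUESTION_PAGE_PRODUCT:
        problems.append(f"questions x pages is {product}, above {MAX_QUESTION_PAGE_PRODUCT}")
    return problems


if __name__ == "__main__":
    # 10 questions, 10 pages per document, 500 references: exactly at the limit.
    print(check_job_limits(10, [("doc.pdf", 10)] * 500))  # []
```

The per-document page limits (64 or 125, depending on region) are not checked here; a region-aware check would need the account's region as an extra input.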

Create a fine-tuning job

To create a fine-tuning job, you must create a Dataset object that contains the training data. The following example shows how to create a Dataset object and use the Dataset to create a fine-tuning job for an arctic-extract model.

  1. Create a table that contains the training data:

    CREATE OR REPLACE TABLE my_data_table (f FILE, p VARCHAR, r VARCHAR);
  2. Populate the table with the training data:

    INSERT INTO my_data_table (f, p, r)
    SELECT TO_FILE('@db.schema.stage', '1.pdf'), '{"net": "What is the net value?"}', '{"net": "3,762.56"}';
  3. Create the Dataset object:

    CREATE OR REPLACE DATASET my_dataset;
  4. Create a new version of the Dataset that adds the training data, using the FL_GET_STAGE and the FL_GET_RELATIVE_PATH functions to get the file paths:

    ALTER DATASET my_dataset
    ADD VERSION 'v1' FROM (
      SELECT FL_GET_STAGE(f) || '/' || FL_GET_RELATIVE_PATH(f) AS "file",
        p AS "prompt",
        r AS "response"
      FROM my_data_table
    );
  5. Create the fine-tuning job:

    SELECT SNOWFLAKE.CORTEX.FINETUNE(
      'CREATE',
      'my_tuned_model',
      'arctic-extract',
      'snow://dataset/db.schema.my_dataset/versions/v1'
    );
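If you generate the final statement from a script, the following Python sketch assembles the Dataset version URI and the FINETUNE call text in the shape used in step 5. All function names are hypothetical. The quoting helper only doubles single quotes and does not handle backslashes, so treat it as an illustration; bind parameters from your connector are generally preferable to building SQL text by hand.

```python
from typing import Optional


def dataset_version_uri(database: str, schema: str, dataset: str, version: str) -> str:
    """Build a URI in the form used in step 5: snow://dataset/<name>/versions/<version>."""
    return f"snow://dataset/{database}.{schema}.{dataset}/versions/{version}"


def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling single quotes (illustrative only)."""
    return "'" + value.replace("'", "''") + "'"


def finetune_create_sql(
    model_name: str,
    training_uri: str,
    validation_uri: Optional[str] = None,
    options: Optional[str] = None,
) -> str:
    """Return the text of a FINETUNE('CREATE', ...) statement for arctic-extract."""
    args = ["'CREATE'", sql_literal(model_name), "'arctic-extract'", sql_literal(training_uri)]
    if validation_uri is not None:
        args.append(sql_literal(validation_uri))
        if options is not None:
            args.append(sql_literal(options))
    elif options is not None:
        # In the documented syntax, options follows the validation dataset.
        raise ValueError("options requires a validation dataset")
    return "SELECT SNOWFLAKE.CORTEX.FINETUNE(" + ", ".join(args) + ");"


if __name__ == "__main__":
    uri = dataset_version_uri("db", "schema", "my_dataset", "v1")
    print(finetune_create_sql("my_tuned_model", uri))
```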

Use your fine-tuned arctic-extract model for inference

To use the fine-tuned arctic-extract model for inference, ensure you have the following privileges on the model object:

  • OWNERSHIP
  • USAGE
  • READ

To use the fine-tuned arctic-extract model for inference with the AI_EXTRACT function, specify the model using the model parameter as shown in the following example:

SELECT AI_EXTRACT(
  model => 'db.schema.my_tuned_model',
  file => TO_FILE('@db.schema.files','document.pdf')
);

You can override the questions used for fine-tuning by using the responseFormat parameter, as shown in the following example:

SELECT AI_EXTRACT(
  model => 'db.schema.my_tuned_model',
  file => TO_FILE('@db.schema.files','document.pdf'),
  responseFormat => [['name', 'What is the first name of the employee?'], ['city', 'Where does the employee live?']]
);

For more information, see AI_EXTRACT.

Tip

You can copy your fine-tuned arctic-extract model between databases and/or schemas within an account or between accounts. For more information, see Copy arctic-extract models between databases, schemas, and accounts.