Tutorial: Create a document processing pipeline with Document AI¶
Introduction¶
With Document AI, you can process documents of various formats, and extract information from both text-heavy paragraphs and the images that contain text, such as logos, handwritten text (signatures), or checkmarks. This tutorial introduces you to Document AI by setting up the required objects and privileges, and creating a Document AI model build to use in a processing pipeline.
The tutorial uses the Snowsight web interface. For the steps that require using SQL, you can use any Snowflake client that supports executing SQL.
What you will learn¶
In this tutorial, you will learn how to:
Set up the objects and privileges required to work with Document AI.
Prepare a Document AI model build using the Document AI user interface in Snowsight to extract data from unstructured documents.
Create a pipeline for continuous processing of new documents in a stage, using a Document AI model build together with streams and tasks.
Prerequisites¶
The following prerequisites are required to complete this tutorial:
You must connect as a user who has the ACCOUNTADMIN role. This role is used to create the custom role for this tutorial and to grant that role the required privileges.
You must have a Snowflake account in one of the commercial regions supported for Document AI. For more information about supported regions, see Document AI availability.
You must have a warehouse ready to use with Document AI. For more information about selecting a warehouse, see Determining optimal warehouse size for Document AI.
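A quick look at the current session can help confirm these prerequisites before you start. The following is a minimal sketch using standard context functions; it only reports values, so you still need to compare the region against the Document AI availability list yourself.

```sql
-- Show the region, role, and warehouse of the current session.
SELECT CURRENT_REGION()    AS region,
       CURRENT_ROLE()      AS active_role,
       CURRENT_WAREHOUSE() AS active_warehouse;

-- List the warehouses visible to the current role.
SHOW WAREHOUSES;
```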
Set up the required objects and privileges¶
In this section, you will:
Create a database and schema to contain the Document AI model build.
Create a custom role to prepare a Document AI model build and a document processing pipeline.
Grant the required privileges to the custom role.
Create a database, schema, and custom role¶
To create a database, schema, and a role to work with Document AI, do the following:
Create a database and schema in which to create a Document AI model build:
CREATE DATABASE doc_ai_db;
CREATE SCHEMA doc_ai_db.doc_ai_schema;
Create the custom role doc_ai_role to prepare the Document AI model build and to create processing pipelines:
USE ROLE ACCOUNTADMIN;
CREATE ROLE doc_ai_role;
Grant the required privileges¶
To grant the privileges required to work with Document AI, do the following:
Grant the SNOWFLAKE.DOCUMENT_INTELLIGENCE_CREATOR database role to the doc_ai_role role:
GRANT DATABASE ROLE SNOWFLAKE.DOCUMENT_INTELLIGENCE_CREATOR TO ROLE doc_ai_role;
Grant warehouse usage and operating privileges to the doc_ai_role role:
GRANT USAGE, OPERATE ON WAREHOUSE <your_warehouse> TO ROLE doc_ai_role;
Grant the privileges to use the database and schema you created to the doc_ai_role role:
GRANT USAGE ON DATABASE doc_ai_db TO ROLE doc_ai_role;
GRANT USAGE ON SCHEMA doc_ai_db.doc_ai_schema TO ROLE doc_ai_role;
Grant the create stage privilege on the schema to the doc_ai_role role to store the documents for extraction:
GRANT CREATE STAGE ON SCHEMA doc_ai_db.doc_ai_schema TO ROLE doc_ai_role;
Grant the privilege to create model builds (instances of the DOCUMENT_INTELLIGENCE class) to the doc_ai_role role:
GRANT CREATE SNOWFLAKE.ML.DOCUMENT_INTELLIGENCE ON SCHEMA doc_ai_db.doc_ai_schema TO ROLE doc_ai_role;
Grant the privileges required to create a processing pipeline using streams and tasks to the doc_ai_role role:
GRANT CREATE STREAM, CREATE TABLE, CREATE TASK, CREATE VIEW ON SCHEMA doc_ai_db.doc_ai_schema TO ROLE doc_ai_role;
GRANT EXECUTE TASK ON ACCOUNT TO ROLE doc_ai_role;
Grant the doc_ai_role role to the tutorial user for use in the next steps of the tutorial:
GRANT ROLE doc_ai_role TO USER <your_user_name>;
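To check that the grants took effect, you can list them. This is a sketch rather than a required step; it assumes you are still using the ACCOUNTADMIN role and that `<your_user_name>` is the same placeholder used in the grant statement.

```sql
USE ROLE ACCOUNTADMIN;

-- List every privilege and database role granted to the custom role.
SHOW GRANTS TO ROLE doc_ai_role;

-- Confirm that the custom role was granted to the tutorial user.
SHOW GRANTS TO USER <your_user_name>;
```

The first result should include the privileges on the warehouse, database, and schema granted in the steps above.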
What you learned in this section¶
In this section you learned how to:
Create the database and schema to contain a Document AI model build.
Create the doc_ai_role custom role.
Grant required privileges to the doc_ai_role role, and grant that role to the tutorial user.
Prepare a Document AI model build¶
In this section, you will prepare a Document AI model build by creating the model build and uploading documents to test the model.
A Document AI model build represents a single type of document. For this tutorial, you will create a model build for extracting information from inspection reviews. The Document AI model build includes the model, the data values to be extracted, and the documents uploaded to test the model.
Create a Document AI model build¶
To create a Document AI model build, do the following:
Sign in to Snowsight.
In the navigation menu, select AI & ML » Document AI.
Select a warehouse.
Select + Build.
In the dialog that appears, enter inspection_reviews as a name for your model build, and select the location (doc_ai_db database and doc_ai_schema schema).
Select Create.
Upload documents to the Document AI model build¶
To upload documents to the newly created Document AI model build, do the following:
To obtain the documents required for the tutorial to test the model build, download the zip file to your local file system.
Unzip the content, which includes PDF documents.
In the inspection_reviews model build, select the Build Details tab.
Select Upload documents.
Select Browse or drag the documents you downloaded.
Select Upload.
What you learned in this section¶
In this section you learned how to:
Create a Document AI model build.
Upload documents to test the Document AI model build.
Define data values and review the results¶
In this section, you will define data values by asking the Document AI model questions in natural language. You will then review the answers that the model provides.
Data values are the information you want to extract from documents. A value consists of a value name and a question asked in natural language.
To define values for the Document AI model build:
In the inspection_reviews model build, select the Build Details tab.
Select Define values.
In the Documents review view, select + Value.
For each document, enter the following pairs of value names and questions:
inspection_date: What is the inspection date?
inspection_grade: What is the grade?
inspector: Who performed the inspection?
list_of_units: What are all the units?
For each document and data value, review the answers that the model provides:
If the answer is correct, select the checkmark.
If the answer is incorrect, enter the correct value manually.
What you learned in this section¶
In this section you learned how to:
Define data values to extract by asking the model questions in natural language.
Review results by confirming or correcting the answers that the model provided.
Publish a Document AI model build¶
In this section, you will publish the Document AI model build so that it can be used for extraction in processing pipelines. Publishing makes the latest version of the model build available for use in production.
To publish the model build, do the following:
In the inspection_reviews model build, select the Build Details tab.
Under Model accuracy, select Publish version.
In the dialog that appears, select Publish to confirm.
Note
If the model accuracy and the results are not satisfactory, you can optionally fine-tune the model to improve it. Fine-tuning is not a part of this tutorial. For more information about evaluating and training the model, see Evaluate a Document AI model.
What you learned in this section¶
In this section you published the Document AI model build to use the model build for extraction in processing pipelines.
Create a document processing pipeline¶
In this section, you will create a processing pipeline using the already prepared Document AI model build, streams, and tasks. The pipeline will extract information from new inspection documents stored in an internal stage.
To create a processing pipeline:
Set up the pipeline using streams and tasks.
Upload new documents to an internal stage.
View the extracted information.
Set up the processing pipeline¶
Create an internal my_pdf_stage stage to store the documents:
CREATE OR REPLACE STAGE my_pdf_stage
  DIRECTORY = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');
Create a my_pdf_stream stream on the my_pdf_stage stage:
CREATE STREAM my_pdf_stream ON STAGE my_pdf_stage;
Refresh the metadata of the directory table that will store the staged document files:
ALTER STAGE my_pdf_stage REFRESH;
Specify the database and schema:
USE DATABASE doc_ai_db; USE SCHEMA doc_ai_schema;
Create a pdf_reviews table to store the information about the documents (such as file_name) and the data to be extracted from the PDF documents:
CREATE OR REPLACE TABLE pdf_reviews (
  file_name VARCHAR,
  file_size VARIANT,
  last_modified VARCHAR,
  snowflake_file_url VARCHAR,
  json_content VARCHAR
);
The json_content column will include the extracted information in JSON format.
Create a load_new_file_data task to process new documents in the stage:
CREATE OR REPLACE TASK load_new_file_data
  WAREHOUSE = <your_warehouse>
  SCHEDULE = '1 minute'
  COMMENT = 'Process new files in the stage and insert data into the pdf_reviews table.'
  WHEN SYSTEM$STREAM_HAS_DATA('my_pdf_stream')
AS
INSERT INTO pdf_reviews (
  SELECT
    RELATIVE_PATH AS file_name,
    size AS file_size,
    last_modified,
    file_url AS snowflake_file_url,
    inspection_reviews!PREDICT(GET_PRESIGNED_URL('@my_pdf_stage', RELATIVE_PATH), 1) AS json_content
  FROM my_pdf_stream
  WHERE METADATA$ACTION = 'INSERT'
);
Note that newly created tasks are automatically suspended.
Start the newly created task:
ALTER TASK load_new_file_data RESUME;
Note
Document AI does not support serverless tasks.
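Once the task is resumed, it can be useful to confirm that it is running and to see whether the stream has pending files. The following is a hedged sketch, assuming the object names created above and that your current database is doc_ai_db; the column selection from TASK_HISTORY is only one reasonable choice.

```sql
-- Returns TRUE when the stream holds change records the task has not yet consumed.
SELECT SYSTEM$STREAM_HAS_DATA('my_pdf_stream');

-- Show the task definition and its current state (started or suspended).
SHOW TASKS LIKE 'load_new_file_data' IN SCHEMA doc_ai_db.doc_ai_schema;

-- Recent runs of the task, newest first, including any error message.
SELECT name, state, scheduled_time, completed_time, error_message
FROM TABLE(doc_ai_db.INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'LOAD_NEW_FILE_DATA'))
ORDER BY scheduled_time DESC
LIMIT 10;
```

Runs scheduled while the stream is empty are skipped because of the WHEN condition. When you finish the tutorial, running ALTER TASK load_new_file_data SUSPEND; stops the schedule.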
Upload new documents to an internal stage¶
To obtain the documents required for the tutorial, download the zip file to your local file system.
Unzip the content, which includes PDF files.
In Snowsight, select Data » Databases.
Select the doc_ai_db database, the doc_ai_schema schema, and the my_pdf_stage stage.
Select + Files.
In the Upload Your Files dialog that appears, select the files you just downloaded.
Select Upload.
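If you prefer SQL to the Snowsight upload dialog, the same files can be staged from a client that supports the PUT command, such as SnowSQL. This is a sketch under assumptions: the local path is a hypothetical placeholder for wherever you unzipped the documents, and the explicit refresh is included because a directory table on an internal stage may not register new files until it is refreshed.

```sql
-- Upload the PDFs without compression so they stay readable as PDF files.
-- The local path is a placeholder; replace it with your own directory.
PUT file:///tmp/inspection_docs/*.pdf @doc_ai_db.doc_ai_schema.my_pdf_stage AUTO_COMPRESS = FALSE;

-- Refresh the directory table so the new files are registered and visible to the stream.
ALTER STAGE doc_ai_db.doc_ai_schema.my_pdf_stage REFRESH;

-- Confirm that the files are listed.
SELECT relative_path, size, last_modified
FROM DIRECTORY(@doc_ai_db.doc_ai_schema.my_pdf_stage);
```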
View the extracted information¶
After uploading the documents to the stage, view the information extracted from new documents:
SELECT * FROM pdf_reviews;
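Because json_content is stored as VARCHAR, individual fields are not directly addressable until the text is parsed. The sketch below assumes the JSON shape implied by the queries in this tutorial (each value name maps to an array of objects with score and value keys) and reads only the first answer for one value.

```sql
-- Parse the stored JSON text once, then read fields from the parsed document.
WITH parsed AS (
  SELECT file_name, PARSE_JSON(json_content) AS doc
  FROM pdf_reviews
)
SELECT
  file_name,
  doc:inspection_grade[0]:value::STRING AS inspection_grade,       -- first answer only
  doc:inspection_grade[0]:score::FLOAT  AS inspection_grade_score  -- model confidence
FROM parsed;
```

Rows appear only after the task has run at least once with new files in the stream, so an empty result shortly after uploading is expected.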
Create a pdf_reviews_2 table to analyze the extracted information in separate columns:
CREATE OR REPLACE TABLE doc_ai_db.doc_ai_schema.pdf_reviews_2 AS (
  WITH temp AS (
    SELECT
      RELATIVE_PATH AS file_name,
      size AS file_size,
      last_modified,
      file_url AS snowflake_file_url,
      inspection_reviews!PREDICT(get_presigned_url('@my_pdf_stage', RELATIVE_PATH), 1) AS json_content
    FROM directory(@my_pdf_stage)
  )
  SELECT
    file_name,
    file_size,
    last_modified,
    snowflake_file_url,
    json_content:__documentMetadata.ocrScore::FLOAT AS ocrScore,
    f.value:score::FLOAT AS inspection_date_score,
    f.value:value::STRING AS inspection_date_value,
    g.value:score::FLOAT AS inspection_grade_score,
    g.value:value::STRING AS inspection_grade_value,
    i.value:score::FLOAT AS inspector_score,
    i.value:value::STRING AS inspector_value,
    ARRAY_TO_STRING(ARRAY_AGG(j.value:value::STRING), ', ') AS list_of_units
  FROM temp,
    LATERAL FLATTEN(INPUT => json_content:inspection_date) f,
    LATERAL FLATTEN(INPUT => json_content:inspection_grade) g,
    LATERAL FLATTEN(INPUT => json_content:inspector) i,
    LATERAL FLATTEN(INPUT => json_content:list_of_units) j
  GROUP BY ALL
);
View the output:
SELECT * FROM pdf_reviews_2;
The query that creates the table uses the FLATTEN function to parse the json_content JSON into separate columns for easier viewing.
The table contains the data values (such as inspection_grade_value and inspection_date_value) that were defined when the model build was prepared for inspection documents, and the corresponding confidence scores (inspection_grade_score and inspection_date_score).
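One way to use the confidence scores is to flag documents that may need a manual check. The following is an illustrative sketch only: the 0.5 threshold is an arbitrary assumption, not a value recommended by this tutorial.

```sql
-- List documents where the date or grade answer has a low confidence score.
-- 0.5 is an arbitrary example threshold; choose one that suits your documents.
SELECT
  file_name,
  inspection_date_value,
  inspection_date_score,
  inspection_grade_value,
  inspection_grade_score
FROM doc_ai_db.doc_ai_schema.pdf_reviews_2
WHERE inspection_date_score < 0.5
   OR inspection_grade_score < 0.5
ORDER BY file_name;
```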
What you learned in this section¶
In this section you learned how to:
Create an internal stage to store new documents.
Create a stream and a task required to prepare a processing pipeline.
Upload documents to an internal stage.
View the extracted information in a table.
Learn more¶
Congratulations! You have successfully completed this tutorial. You are now ready to start working with Document AI on your own use cases.
Along the way, you learned how to:
Set up the required objects and privileges to work with Document AI.
Create a Document AI model build.
Upload documents to the Document AI model build to test the model.
Define the data values to extract by asking the model questions in natural language.
Review the results by confirming or correcting the answers that the model provides.
Publish the Document AI model build to use the model build for extraction in processing pipelines.
Prepare a document processing pipeline by creating a stream and a task and using the Document AI model build to extract information from new documents.
Additional resources¶
Continue learning using the following resources: