PerformOCR 2025.3.28.13-SNAPSHOT

BUNDLE

com.snowflake.openflow.runtime | runtime-ocr-nar

DESCRIPTION

Uses the Openflow Tesseract OCR Service to extract text from a PDF or image, optionally providing metadata including the bounding box, page number, and confidence level of the OCR.

TAGS

extract, image, jpeg, jpg, ocr, openflow, pdf, png, tesseract, text

INPUT REQUIREMENT

REQUIRED

Supports Sensitive Dynamic Properties

false

PROPERTIES

Property

Description

Confidence Threshold

The minimum confidence level required for a text block to be included in the output. Text blocks with a confidence level below this value will be excluded.
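
The filtering rule can be sketched as follows. This is an illustration only, not the processor's implementation: the `TextBlock` type, the `filter_by_confidence` helper, and the 0-100 confidence scale are all assumptions made for the example, since this reference does not state the scale.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    # Hypothetical shape; only the confidence matters for this sketch.
    text: str
    confidence: float  # assumed 0-100 scale; not stated in this reference


def filter_by_confidence(blocks, threshold):
    """Keep blocks whose confidence is at or above the threshold.

    Blocks strictly below the threshold are dropped, matching the
    'below this value will be excluded' wording of the property.
    """
    return [b for b in blocks if b.confidence >= threshold]


blocks = [
    TextBlock("Invoice", 96.0),
    TextBlock("l1l|", 41.5),
    TextBlock("Total", 80.0),
]
kept = filter_by_confidence(blocks, threshold=80.0)
print([b.text for b in kept])
```

Note that a block whose confidence equals the threshold is kept in this sketch, because the description excludes only values below it.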

Extract PDF Text

If true, the processor attempts to extract text directly from PDF files rather than performing OCR. This can be more efficient and provide better results in many cases. If the PDF contains no extractable text, OCR is performed regardless of this setting.
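
The fallback behavior described for this property might look like the sketch below. The function name, its parameters, and the choice to treat whitespace-only text as "no text available" are assumptions for illustration; only the two method names, `PdfExtraction` and `OCR`, come from this reference.

```python
def extract_text(pdf_text_layer, run_ocr, extract_pdf_text=True):
    """Return (text, method) following the fallback described above.

    pdf_text_layer: text read directly from the PDF, or None/"" if the
        PDF has no embedded text (hypothetical input).
    run_ocr: zero-argument callable standing in for the OCR Service
        call (hypothetical).
    """
    # Assumption: whitespace-only text counts as "no text available".
    if extract_pdf_text and pdf_text_layer and pdf_text_layer.strip():
        return pdf_text_layer, "PdfExtraction"
    # Direct extraction is disabled, or the PDF has no embedded text.
    return run_ocr(), "OCR"


def fake_ocr():
    # Stand-in for a real OCR call.
    return "text recognized by OCR"


print(extract_text("embedded text", fake_ocr))
print(extract_text("", fake_ocr))
print(extract_text("embedded text", fake_ocr, extract_pdf_text=False))
```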

MIME Type

The MIME Type of the input FlowFile. This is used to determine the format of the input data.

OCR Service

An OCR Service for reading files to output text.

Record Writer

Specifies the Controller Service to use for writing the results. If not specified, the results will be written to the FlowFile as plain text. If the Record Writer is specified, each text block will be output as an individual Record. In this case, the Record will contain not only the text that was found but also the bounding box in the image or PDF where the text was found, as well as the page number and the confidence level of the OCR. Each Record will have the following fields: text, x, y, height, width, pageNumber, and confidence.
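
As a rough picture of one such Record, the sketch below models the seven fields and prints a single record as JSON. The field names are taken from the description above; the field types, the sample values, and the use of JSON are assumptions, since the actual output format depends on the configured Record Writer and the units of the bounding box are not stated here.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class OcrRecord:
    # Field names come from the property description; types are assumed.
    text: str
    x: int
    y: int
    height: int
    width: int
    pageNumber: int
    confidence: float


# Sample values are invented purely for illustration.
record = OcrRecord(
    text="Invoice", x=72, y=54, height=18, width=120,
    pageNumber=1, confidence=96.0,
)
print(json.dumps(asdict(record)))
```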

RELATIONSHIPS

NAME

DESCRIPTION

failure

If the text of a FlowFile cannot be extracted for any reason, the input FlowFile will be routed to this relationship.

comms.failure

If the processor is unable to communicate with the Tesseract OCR Service, the input FlowFile will be routed to this relationship.

success

The text extracted from the input PDF or image is routed to this relationship.
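
The distinction between the three relationships can be summarized with the sketch below. It is a simplified model rather than the processor's code: the `OcrServiceUnavailable` exception and the `route` helper are hypothetical names introduced for the example.

```python
class OcrServiceUnavailable(Exception):
    """Hypothetical error: the OCR Service could not be reached."""


def route(extract):
    """Return the relationship name for one FlowFile.

    extract: zero-argument callable that performs the extraction
        (hypothetical stand-in for the processor's work).
    """
    try:
        extract()
    except OcrServiceUnavailable:
        # Communication problem with the OCR Service.
        return "comms.failure"
    except Exception:
        # Any other reason the text could not be extracted.
        return "failure"
    return "success"


def ok():
    return "some text"


def service_down():
    raise OcrServiceUnavailable("connection refused")


def bad_input():
    raise ValueError("unreadable file")


print(route(ok), route(service_down), route(bad_input))
```

Keeping `comms.failure` separate means a flow could, for example, retry FlowFiles after a service outage without also retrying files that cannot be read at all.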

WRITES ATTRIBUTES

NAME

DESCRIPTION

mime.type

The MIME Type of the FlowFile.

text.extraction.method

The method used to extract the text from the FlowFile. This will be either ‘PdfExtraction’ or ‘OCR’.
