MergeDocumentElements 2025.3.28.13-SNAPSHOT

BUNDLE

com.snowflake.openflow.runtime | runtime-document-layout-nar

DESCRIPTION

Given a FlowFile that contains a full Document and one or more FlowFiles that contain additional data to merge into the Document, this Processor will merge the additional data into the Document. This can be used, for instance, when a table or image has been extracted from a Document and analyzed with a deep learning model in order to glean insights. The derived information can then be merged back into the original Document using this Processor. For each FlowFile that does not contain the full Document, the Processor does one of two things. If the ‘Content Metadata Key’ property is set, it adds the contents of the FlowFile to the metadata of the Container that the FlowFile belongs to. Otherwise, it creates a Processing Element Representation whose ‘data’ element is the contents of the FlowFile.
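The two merge modes described above can be illustrated with a minimal Python sketch. The dictionary layout used here (`containers`, `id`, `metadata`, `representations`) and the function name `merge_fragment` are hypothetical stand-ins chosen for illustration; they are not the Processor's actual Document schema or API. Attaching the new representation to the Container is likewise an assumption of the sketch.

```python
import copy


def merge_fragment(document, container_id, content, content_metadata_key=None):
    """Return a copy of `document` with one fragment's content merged in.

    The dict layout (containers / metadata / representations) is a
    hypothetical stand-in, not the Processor's real Document schema.
    """
    merged = copy.deepcopy(document)
    # Find the Container that the fragment belongs to.
    container = next(c for c in merged["containers"] if c["id"] == container_id)
    if content_metadata_key:
        # 'Content Metadata Key' is set: store the content in the Container's metadata.
        container.setdefault("metadata", {})[content_metadata_key] = content
    else:
        # Property not set: add a representation whose 'data' is the content.
        container.setdefault("representations", []).append({"data": content})
    return merged


document = {"containers": [{"id": "table-1", "metadata": {}, "representations": []}]}

# With a metadata key, the content lands under that key in the metadata.
as_metadata = merge_fragment(document, "table-1", "Quarterly totals by region", "summary")

# Without a metadata key, the content becomes a new representation.
as_representation = merge_fragment(document, "table-1", "Quarterly totals by region")
```

The sketch copies the input rather than mutating it, so both modes can be compared against the same starting document.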

TAGS

assemble, combine, document, element, fragment, join, merge, openflow, rag, retrieval augmented generation, unstructured

INPUT REQUIREMENT

REQUIRED

Supports Sensitive Dynamic Properties

false

PROPERTIES

Property

Description

Character Set

The Character Set of all FlowFiles’ contents. All FlowFiles that are included must have the same Character Set. If any FlowFile has binary content, the FlowFile’s contents must first be Base64 encoded. In this case, it is recommended to include a metadata entry named ‘encoding’ with a value of ‘base64’.

Content Metadata Key

The key to use for the metadata entry that will contain the content of the FlowFile. If this property is set, the content of each of the FlowFiles will be placed into the Document Container’s metadata with the specified key. If not specified, the content of the FlowFile will be added as a Processing Element Representation in the document.

FlowFile Inclusion Filter

An Expression Language Expression that can be evaluated against each incoming FlowFile. If the result of the expression is true, the FlowFile will be included in the bin; otherwise, it will be ignored. When a FlowFile is split up and later merged, we must wait for all segments of the original FlowFile to arrive in order to merge them together. This property allows you to specify a filter that can be used to exclude some FlowFiles from the merged document, while still routing the FlowFile to the Processor in order to ensure that all segments of the FlowFile arrive.

Maximum number of Bins

Specifies the maximum number of bins that can be held in memory at any one time.

Timeout

The amount of time to wait for all document fragments to arrive before merging the documents.
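As a rough illustration of the ‘Character Set’ guidance for binary content, the following Python sketch Base64-encodes raw bytes and builds the recommended ‘encoding’ metadata entry. The function name `prepare_binary_content` and the way the metadata is returned are assumptions made for this example; how the metadata entry is actually attached depends on the upstream flow.

```python
import base64


def prepare_binary_content(raw: bytes, charset: str = "utf-8"):
    """Base64-encode binary content so it can be carried as text.

    Returns the encoded content (as bytes in `charset`) and the suggested
    metadata entry. The function name and return shape are hypothetical.
    """
    # Base64 output uses only ASCII characters.
    encoded_text = base64.b64encode(raw).decode("ascii")
    # Encode the text in the Character Set shared by all FlowFiles.
    content = encoded_text.encode(charset)
    # Recommended marker so downstream consumers know to decode the content.
    metadata = {"encoding": "base64"}
    return content, metadata


# First four bytes of a PNG file, used here as sample binary content.
png_header = bytes([0x89, 0x50, 0x4E, 0x47])
content, metadata = prepare_binary_content(png_header)

# A consumer that sees encoding == 'base64' can recover the original bytes.
restored = base64.b64decode(content)
```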

RELATIONSHIPS

NAME

DESCRIPTION

failure

If unable to merge the document elements, the original document fragments are routed to this relationship.

partial

If only some of the document fragments arrive within the timeout period, those that have arrived are merged and routed to this relationship.

merged

The merged document is routed to this relationship when all document fragments have been merged together.

WRITES ATTRIBUTES

NAME

DESCRIPTION

eviction.reason

The reason that the bin was evicted, i.e., why the Processor determined it was time to merge the document and fragments together. The value is ‘MAX_ENTRIES_THRESHOLD_REACHED’ if all of the document elements were received, ‘TIMEOUT’ if the timeout period was reached before all document elements arrived, or ‘BIN_MANAGER_FULL’ if the bin was merged because the number of bins reached the maximum allowed by the ‘Maximum number of Bins’ property.

eviction.explanation

A more user-friendly explanation as to why the bin was evicted.

mime.type

The MIME type will be set to application/json.
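The interaction between the inclusion filter, the timeout, and the ‘eviction.reason’ values can be modeled with a small Python sketch. This is a simplified mental model, not the Processor's implementation: the `Bin` class, the `eviction_reason` function, the `element.type` attribute, and the idea that the expected segment count is known up front are all assumptions. Whether the bin manager is full is passed in as a flag, because the rule for choosing which bin to evict is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Bin:
    expected_count: int   # segments expected in total (assumed known up front)
    created_at: float     # creation time in seconds
    arrived: int = 0      # every segment that reached the Processor
    included: list = field(default_factory=list)  # segments that passed the filter

    def offer(self, attributes, content, include):
        # Filtered-out segments still count toward completeness,
        # but their content is left out of the merged document.
        self.arrived += 1
        if include(attributes):
            self.included.append(content)


def eviction_reason(bin_, now, timeout_seconds, bin_manager_full):
    """Return the eviction reason for a bin, or None if it should keep waiting."""
    if bin_.arrived >= bin_.expected_count:
        return "MAX_ENTRIES_THRESHOLD_REACHED"
    if now - bin_.created_at >= timeout_seconds:
        return "TIMEOUT"
    if bin_manager_full:
        return "BIN_MANAGER_FULL"
    return None


def include_non_images(attributes):
    # Hypothetical filter: leave image segments out of the merged document.
    return attributes.get("element.type") != "image"


doc_bin = Bin(expected_count=3, created_at=0.0)
doc_bin.offer({"element.type": "document"}, "doc", include_non_images)
doc_bin.offer({"element.type": "image"}, "img", include_non_images)

# Two of three segments have arrived: keep waiting, or time out later.
reason_waiting = eviction_reason(doc_bin, now=5.0, timeout_seconds=60.0, bin_manager_full=False)
reason_timeout = eviction_reason(doc_bin, now=61.0, timeout_seconds=60.0, bin_manager_full=False)

# The last segment arrives, so the bin is complete.
doc_bin.offer({"element.type": "table"}, "tbl", include_non_images)
reason_complete = eviction_reason(doc_bin, now=10.0, timeout_seconds=60.0, bin_manager_full=False)
```

In this model the image segment is excluded from the merged output yet still counts as arrived, which is why routing filtered FlowFiles to the Processor matters for completeness.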

Use Cases Involving Other Components

Parse a PDF into a Document object, using OpenAI to summarize any Table that is found in the document.

SEE ALSO
