Categories: String & binary functions (AI Functions)
SPLIT_TEXT_MARKDOWN_HEADER (SNOWFLAKE.CORTEX)
The SPLIT_TEXT_MARKDOWN_HEADER function splits a Markdown-formatted document into structured text chunks based on header levels. The function returns an array of objects, where each object contains the text chunk and the associated headers under which that chunk falls.
This function is useful for preserving document structure when chunking content for embedding, retrieval-augmented generation (RAG), or search indexing.
The function first segments the input text using the specified Markdown headers, and then recursively
splits each segment using default plain text separators (e.g., ["\n\n", "\n", " ", ""]) to produce chunks
of the desired size.
Syntax
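Judging from the arguments this page describes, a call takes roughly the following shape. This is a template rather than a runnable statement: the angle-bracket placeholders stand for argument values, and the square brackets mark the optional overlap argument.

```sql
-- Call template (placeholders, not literal values)
SNOWFLAKE.CORTEX.SPLIT_TEXT_MARKDOWN_HEADER(
  '<text_to_split>',
  <headers_to_split_on>,
  <chunk_size>
  [, <overlap>]
)
```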
Arguments
Required:
'text_to_split': A Markdown-formatted string to be split.
'headers_to_split_on': A key-value map in which the keys are Markdown header syntax (e.g., #, ##) and the values are metadata field names (e.g., header_1, header_2) used to label the chunks. For example, the map {'#': 'header_1', '##': 'header_2'} splits the document on # and ## headers. In the output, the header_1 and header_2 fields contain the corresponding header text values.
chunk_size: An integer specifying the maximum number of characters in each chunk. The value must be greater than zero.
Optional:
overlap: An integer specifying the number of characters to overlap between consecutive chunks. Defaults to 0 if not provided.
Overlap is useful for maintaining context across chunks, which can improve performance in embedding and retrieval tasks.
Returns
Returns an array of objects. Each object has the following structure:
chunk: A string containing the extracted text.
headers: A dictionary containing the Markdown header values under which the chunk is nested. Keys match those provided in the headers_to_split_on map.
Examples
Simple usage
The following example splits a Markdown string on both # and ## headers, produces chunks of up to 12 characters, and applies a 5-character overlap between chunks.
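A minimal sketch of such a call might look like this. The input string is a hypothetical sample, and building the header map with OBJECT_CONSTRUCT is an assumption about how the key-value map is supplied.

```sql
-- Hypothetical sample input; \n inside the string literal is a newline escape.
SELECT SNOWFLAKE.CORTEX.SPLIT_TEXT_MARKDOWN_HEADER(
  '# HEADER 1\nthis is text in header 1\n## HEADER 2\nthis is a subheading',
  OBJECT_CONSTRUCT('#', 'header_1', '##', 'header_2'),  -- split on # and ## headers
  12,  -- chunk_size: at most 12 characters per chunk
  5    -- overlap: 5 characters shared between consecutive chunks
) AS chunks;
```

The result is a single array value; each element should carry a chunk string and a headers object keyed by header_1 and header_2.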
Example with Markdown formatting and flattening of results into rows
The following example creates a table, markdown_docs, containing a short Markdown document in each row, then
calls the SPLIT_TEXT_MARKDOWN_HEADER function to segment each document on the Markdown headers # and ##. The function
then splits each segment into chunks of up to 20 characters, with an overlap of 5 characters between chunks.
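One way to write this is sketched here. The column name doc, the sample documents, and the output aliases are hypothetical; LATERAL FLATTEN is used on the assumption that the returned array should become one row per chunk.

```sql
-- Hypothetical sample table: one Markdown document per row.
CREATE OR REPLACE TABLE markdown_docs (doc STRING);

INSERT INTO markdown_docs VALUES
  ('# Title A\nIntro text for document A.\n## Section A1\nDetails about section A1.'),
  ('# Title B\nIntro text for document B.\n## Section B1\nDetails about section B1.');

-- Split each document, then flatten the returned array into one row per chunk.
SELECT
  c.value:chunk::VARCHAR  AS chunk,    -- the chunk text
  c.value:headers::OBJECT AS headers   -- headers the chunk is nested under
FROM markdown_docs,
  LATERAL FLATTEN(
    input => SNOWFLAKE.CORTEX.SPLIT_TEXT_MARKDOWN_HEADER(
      doc,
      OBJECT_CONSTRUCT('#', 'header_1', '##', 'header_2'),
      20,  -- chunk_size
      5    -- overlap
    )
  ) c;
```

Each output row then pairs one chunk with the headers object it falls under, which is a convenient shape for loading into an embedding or search pipeline.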