Using Snowpark XML RowTag Reader¶
Specifying .option("rowTag", "<rowtag>") in session.read.option("rowTag", "<rowtag>").xml() activates the Snowpark XML RowTag Reader. Instead of loading the entire document as a single object, this mode splits the file based on the specified rowTag, loads each matching element as a separate row, and splits each row into multiple columns into a Snowpark DataFrame. It is especially useful for processing only selective elements in XML files, or ingesting large XML files in a scalable, Snowpark-native way.
Example¶
Take the sample XML as an example:
<library>
<book id="1">
<title>The Art of Snowflake</title>
<author>Jane Doe</author>
<price>29.99</price>
<reviews>
<review>
<user>tech_guru_87</user>
<rating>5</rating>
<comment>Very insightful and practical.</comment>
</review>
<review>
<user>datawizard</user>
<rating>4</rating>
<comment>Great read for data engineers.</comment>
</review>
</reviews>
<editions>
<edition year="2023" format="Hardcover"/>
<edition year="2024" format="eBook"/>
</editions>
</book>
<book id="2">
<title>XML for Data Engineers</title>
<author>John Smith</author>
<price>35.50</price>
<reviews>
<review>
<user>xml_master</user>
<rating>5</rating>
<comment>Perfect for mastering XML parsing.</comment>
</review>
</reviews>
<editions>
<edition year="2022" format="Paperback"/>
</editions>
</book>
</library>
Snowpark script¶
df = session.read.option("rowTag", "book").xml("@mystage/books.xml")
This loads each <book> element from the XML file into its own row, with child elements (for example, <title> and <author>) automatically extracted as columns of type VARIANT.
Output¶
|
|
|
|
|
|
|---|---|---|---|---|---|
“2” |
“John Smith” |
|
“35.50” |
|
“XML for Data Engineers” |
“1” |
“Jane Doe” |
|
“29.99” |
|
“The Art of Snowflake” |
Each XML element identified by
rowTagbecomes one row.Each sub-element within that tag becomes a column, stored as a
VARIANT. Nested elements are captured as nestedVARIANTdata.The resulting DataFrame is flattened and columnized, and behaves like any other Snowpark DataFrame.
Getting started¶
Install the Snowpark Python package:
pip install snowflake-snowpark-python
Upload your XML file to a Snowflake stage:
PUT file:///path/to/books.xml @mystage;
Use Snowpark to read the XML file:
df = session.read.option("rowTag", "book").xml("@mystage/books.xml")
Use DataFrame methods to transform or save:
df.select(col("`title`"), col("`author`")).show() df.write.save_as_table("books_table")
Supported options¶
rowTag(Required): The name of the XML element to extract as a row.rowValidationXSDPath(Optional): Stage path to an XSD used to validate each rowTag fragment during load.mode(Optional): Default behavior loads without validation. WhenrowValidationXSDPathis set:PERMISSIVE- Quarantines invalid rows in_corrupt_record; loads the rest.FAILFAST- Stops at the first invalid row and raises an error.
For more information about XML options, see snowflake.snowpark.DataFrameReader.xml.
Validating XML using XSD¶
You can validate each rowTag fragment against an XSD during load. Set the XSD path and choose a validation mode:
df = (
session.read
.option("rowTag", "book")
.option("rowValidationXSDPath", "@mystage/schema.xsd") # validates each row element
.option("mode", "PERMISSIVE") # or "FAILFAST"
.xml("@mystage/books.xml")
)
PERMISSIVE: Invalid rows are quarantined in a special _corrupt_record column; valid rows load normally. To persist the result, write the DataFrame to a table with df.write.save_as_table("<table_name>"). The table will include all parsed columns plus an extra _corrupt_record column: it is NULL for valid rows and contains the full XML records for invalid rows (with the other columns showing NULL).
+-------------------+
| _corrupt_record |
+-------------------+
| <book id="1"> ... |
| <book id="2"> ... |
+-------------------+
FAILFAST: The read stops at the first offending row and returns an error.
Limitations¶
Snowpark XML RowTag Reader has the following limitations:
Doesn’t infer schema, and the output columns are all of type
VARIANT.Only supports files stored in Snowflake stages. Local files are not supported.
Is available only in the Snowpark Python library.