Using the Snowpark XML RowTag Reader¶
You can activate the Snowpark XML RowTag Reader by specifying .option("rowTag", "<rowtag>") in session.read.option("rowTag", "<rowtag>").xml(). Instead of loading the entire document as a single object, this mode splits the file based on the specified rowTag, loads each matching element as a separate row, and splits each row into multiple columns in a Snowpark DataFrame. The Reader is especially useful for processing only selective elements in XML files or ingesting large XML files in a scalable, Snowpark-native way.
示例¶
This sample XML is an example:
<library>
<book id="1">
<title>The Art of Snowflake</title>
<author>Jane Doe</author>
<price>29.99</price>
<reviews>
<review>
<user>tech_guru_87</user>
<rating>5</rating>
<comment>Very insightful and practical.</comment>
</review>
<review>
<user>datawizard</user>
<rating>4</rating>
<comment>Great read for data engineers.</comment>
</review>
</reviews>
<editions>
<edition year="2023" format="Hardcover"/>
<edition year="2024" format="eBook"/>
</editions>
</book>
<book id="2">
<title>XML for Data Engineers</title>
<author>John Smith</author>
<price>35.50</price>
<reviews>
<review>
<user>xml_master</user>
<rating>5</rating>
<comment>Perfect for mastering XML parsing.</comment>
</review>
</reviews>
<editions>
<edition year="2022" format="Paperback"/>
</editions>
</book>
</library>
Snowpark 脚本¶
df = session.read.option("rowTag", "book").xml("@mystage/books.xml")
这会将每个 <book> 元素从 XML 文件加载到自己的行中,其中子元素(例如,<title> 和 <author>)会自动提取为 VARIANT 类型的列。
输出¶
|
|
|
|
|
|
|---|---|---|---|---|---|
"2" |
"John Smith" |
|
"35.50" |
|
"XML for Data Engineers" |
"1" |
"Jane Doe" |
|
"29.99" |
|
"The Art of Snowflake" |
由
rowTag标识的每个 XML 元素都变成一行。该标签中的每个子元素都成为一列,存储为
VARIANT。嵌套元素被捕获为嵌套VARIANT数据。The resulting DataFrame is flattened and columnized and behaves like any other Snowpark DataFrame.
开始使用¶
安装 Snowpark Python 包:
pip install snowflake-snowpark-python
将 XML 文件上传到 Snowflake 暂存区:
PUT file:///path/to/books.xml @mystage;
使用 Snowpark 读取 XML 文件:
df = session.read.option("rowTag", "book").xml("@mystage/books.xml")
使用 DataFrame 方法进行转换或保存:
df.select(col("`title`"), col("`author`")).show() df.write.save_as_table("books_table")
支持的选项¶
:code:`rowTag`(必填):要提取为行的 XML 元素的名称。
:code:`rowValidationXSDPath`(可选):用于在加载过程中验证每个 rowTag 片段的 XSD 的暂存区路径。
mode`(可选):默认行为加载时无需验证。当设置 :code:`rowValidationXSDPath时:PERMISSIVE: Quarantines invalid rows in_corrupt_record; loads the rest.FAILFAST: Stops at the first invalid row and raises an error.
有关 XML 选项的更多信息,请参阅 snowflake.snowpark.DataFrameReader.xml。
Validate XML using XSD¶
To validate each
rowTagfragment against an XSD during load, set the XSD path and choose a validation mode:df = ( session.read .option("rowTag", "book") .option("rowValidationXSDPath", "@mystage/schema.xsd") # validates each row element .option("mode", "PERMISSIVE") # or "FAILFAST" .xml("@mystage/books.xml") )
PERMISSIVE: Invalid rows are quarantined in a special _corrupt_record column; valid rows load normally.
To persist the result, write the DataFrame to a table with
df.write.save_as_table("<table_name>"). The table will include all parsed columns plus an extra_corrupt_recordcolumn: it isNULLfor valid rows and contains the full XML records for invalid rows (with the other columns showingNULL).+-------------------+ | _corrupt_record | | <book id="1"> ... | | <book id="2"> ... | +-------------------+
FAILFAST:选择使用 时默认使用的角色和仓库。读取操作会在第一个有问题的行处停止并返回错误。
限制¶
Snowpark XML RowTag Reader 具有以下限制:
不推断架构,并且输出列都是
VARIANT类型。Only supports files stored in Snowflake stages; local files are not supported.
仅在 Snowpark Python 库中可用。