Snowpark Python: Eliminate repeated subqueries in Snowpark-generated queries (Preview)

Attention

This behavior change is in the 2025_04 bundle.

For the current status of the bundle, refer to Bundle History.

Repeated subquery elimination identifies identical sub-DataFrames within a query plan and employs Common Table Expressions (CTEs) to construct the final query. Almost half of the queries with compilation times exceeding one second contain at least one redundant subquery. The benefit of this optimization scales with the quantity and complexity of the identified duplicate subqueries.

  • Diagnostic steps:

    • SQL compilation error on a Snowpark data pipeline that was running with no errors previously.

    • Wrong results generated due to bugs in SQL generation.

  • Mitigation:

    • Downgrading to an older version of Snowpark client mitigates the issue.

    • Unsetting parameter PYTHON_SNOWPARK_USE_CTE_OPTIMIZATION_VERSION mitigates the issue.

      To be unaffected by this change, keep Snowpark-based workflows pinned to a Snowpark Python version that is lower than 1.31.1. For example, if you are using a Python stored procedure, set PACKAGES=('snowflake-snowpark-python==1.30.0') when creating the stored procedure. In the case of a Snowflake Notebook or Python worksheet, switch to a Snowpark Python version lower than 1.31.1.

To demonstrate the difference between old and new behavior, consider the following DataFrame transformations in Snowpark Python:

df = session.table("test_table")
df1 = df.with_column("a", F.col("A") + 1).filter(df.a > 1)
df1 = df1.union_all(df1)

print(df1.queries["queries"][0])
Copy
Before the change:

Because the union_all above is using the same DataFrame df1 twice, the generated SQL queries will repeat the underlying subquery twice:

( SELECT * FROM ( SELECT "B", "C", ( "A" + 1 ) AS "A" FROM test_table )
  WHERE ( "A" > 1 ) )
UNION ALL
( SELECT * FROM ( SELECT "B", "C", ( "A" + 1 ) AS "A" FROM test_table )
  WHERE ( "A" > 1 ) )
Copy
After the change:

The optimization will detect that df1 is being used twice, will replace the subquery with a CTE expression, and then use that to build the query:

WITH SNOWPARK_TEMP_CTE_7G3ZFVJYBK AS
  ( SELECT * FROM ( SELECT "B", "C", ( "A" + 1 ) AS "A" FROM test_table )
      WHERE ( "A" > 1 ) ) ( SELECT * FROM ( SNOWPARK_TEMP_CTE_7G3ZFVJYBK ) )
  UNION ALL
  ( SELECT * FROM ( SNOWPARK_TEMP_CTE_7G3ZFVJYBK ) )
Copy

Ref: 1995

Language: English