DataFrame support for Snowpark Connect for Spark¶
Snowpark Connect for Spark provides compatibility with the PySpark 3.5.3 Spark Connect DataFrame API (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html), allowing you to run Spark workloads on Snowflake. This page details which APIs are supported and their compatibility levels. The DataFrame API is shared across PySpark, Java, and Scala clients.
Compatibility level definitions¶
- **Full compatibility**: APIs with full compatibility behave identically to native PySpark. You can use these APIs with confidence that results will match exactly.
- **High compatibility**: APIs with high compatibility work correctly but might have minor differences:
  - Error message formatting might differ.
  - Output display format might vary (such as decimal precision or column name casing).
  - Edge cases might produce slightly different results.
- **Partial compatibility**: APIs with partial compatibility are functional but have notable limitations:
  - Only a subset of functionality might be available.
  - Behavior might differ from PySpark in specific scenarios.
  - Additional configuration might be required.
  - Performance characteristics might differ.
- **Unsupported**: APIs that are not currently implemented or cannot be supported on Snowflake.
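The "output display format might vary" caveat for high-compatibility APIs can matter in practice. As a hedged illustration (the data and the casing difference below are hypothetical, not Snowpark Connect's documented behavior), code that consumes `toPandas()` output can normalize column names rather than relying on exact casing:

```python
import pandas as pd

# Hypothetical result of df.toPandas(): a backend may surface
# column names in a different casing than native PySpark would.
pdf = pd.DataFrame({"ORDER_ID": [1, 2], "AMOUNT": [10.5, 20.0]})

# Normalizing the casing keeps downstream code robust to this difference.
pdf.columns = [c.lower() for c in pdf.columns]
print(list(pdf.columns))
```

The same defensive pattern applies to any of the minor differences listed above: assert on values and normalized names, not on display formatting.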
Full compatibility APIs¶
| Group | Method | Description |
|---|---|---|
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | cache() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.cache.html) | Persists the DataFrame with the default storage level (`MEMORY_AND_DISK_DESER`). |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | coalesce(numPartitions) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.coalesce.html) | Returns a new DataFrame that has exactly `numPartitions` partitions. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | collect() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.collect.html) | Returns all the records as a list of `Row`. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | count() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.count.html) | Returns the number of rows in this DataFrame. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | crossJoin(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.crossJoin.html) | Returns the Cartesian product with another DataFrame. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | dropDuplicates([subset]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.dropDuplicates.html) | Returns a new DataFrame with duplicate rows removed, optionally considering only a subset of columns. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | drop_duplicates([subset]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.drop_duplicates.html) | Alias for `dropDuplicates()`. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | dropna([how, thresh, subset]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.dropna.html) | Returns a new DataFrame omitting rows with null values. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | fillna(value[, subset]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.fillna.html) | Replaces null values with the specified value. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | first() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.first.html) | Returns the first row as a `Row`. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | head([n]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.head.html) | Returns the first `n` rows. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | isEmpty() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.isEmpty.html) | Returns `True` if this DataFrame is empty. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | join(other[, on, how]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.join.html) | Joins with another DataFrame, using the given join expression. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | limit(num) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.limit.html) | Limits the result count to the number specified. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | melt(ids, values, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.melt.html) | Unpivots a DataFrame from wide format to long format. Alias for `unpivot()`. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | offset(num) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.offset.html) | Returns a new DataFrame by skipping the first `num` rows. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | persist([storageLevel]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.persist.html) | Sets the storage level to persist the contents of the DataFrame across operations. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | repartitionByRange(numPartitions, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.repartitionByRange.html) | Returns a new DataFrame partitioned by the given partitioning expressions into `numPartitions` partitions. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | replace(to_replace[, value, subset]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.replace.html) | Returns a new DataFrame replacing a value with another value. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | select(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.select.html) | Projects a set of expressions and returns a new DataFrame. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | show([n, truncate, vertical]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.show.html) | Prints the first `n` rows to the console. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | tail(num) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.tail.html) | Returns the last `num` rows. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | take(num) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.take.html) | Returns the first `num` rows. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | toDF(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.toDF.html) | Returns a new DataFrame with the specified new column names. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | toLocalIterator([prefetchPartitions]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.toLocalIterator.html) | Returns an iterator that contains all of the rows in this DataFrame. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | toPandas() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.toPandas.html) | Returns the contents of this DataFrame as a pandas `DataFrame`. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | unionAll(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.unionAll.html) | Returns a new DataFrame containing the union of rows in this and another DataFrame. Alias for `union()`. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | unpersist([blocking]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.unpersist.html) | Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | unpivot(ids, values, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.unpivot.html) | Unpivots a DataFrame from wide format to long format, optionally leaving identifier columns set. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | where(condition) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.where.html) | Alias for `filter()`. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | withColumnsRenamed(colsMap) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumnsRenamed.html) | Returns a new DataFrame by renaming multiple columns. This is a no-op if the schema doesn’t contain the given column names. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | asc() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.asc.html) | Returns a sort expression based on ascending order of the column. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | between(lowerBound, upperBound) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.between.html) | Checks if values of this expression are between the given bounds. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | contains(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.contains.html) | Contains the other element. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | desc() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.desc.html) | Returns a sort expression based on descending order of the column. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | eqNullSafe(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.eqNullSafe.html) | Equality test that is safe for null values. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | getItem(key) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.getItem.html) | Gets an item at position `ordinal` out of a list, or gets an item by `key` out of a dict. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | isNull() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.isNull.html) | Returns `True` if the current expression is null. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | isin(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.isin.html) | Returns a boolean `Column` that evaluates to `True` if the value of this expression is contained in the evaluated values of the arguments. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | like(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.like.html) | SQL LIKE expression. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | otherwise(value) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.otherwise.html) | Evaluates a list of conditions and returns one of multiple possible result expressions. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | startswith(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.startswith.html) | String starts with. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | substr(startPos, length) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.substr.html) | Returns a `Column` which is a substring of the column. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | when(condition, value) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.when.html) | Evaluates a list of conditions and returns one of multiple possible result expressions. |
| SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) | range(start[, end, step, numPartitions]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.range.html) | Creates a DataFrame with a single column named `id`, containing elements in a range. |
| SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) | sql(sqlQuery, args, **kwargs) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.sql.html) | Returns a DataFrame representing the result of the given query. |
| SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) | table(tableName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.table.html) | Returns the specified table as a DataFrame. |
| GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) | agg(*exprs) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.agg.html) | Computes aggregates and returns the result as a DataFrame. |
| GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) | mean(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.mean.html) | Computes average values for each numeric column for each group. |
| GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) | pivot(pivot_col[, values]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.pivot.html) | Pivots a column of the current DataFrame and performs the specified aggregation. |
| DataFrameReader (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) | table(tableName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.table.html) | Returns the specified table as a DataFrame. |
| DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) | mode(saveMode) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.mode.html) | Specifies the behavior when data or table already exists. |
| DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) | saveAsTable(name[, format, mode, …]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.saveAsTable.html) | Saves the content of the DataFrame as the specified table. |
| DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) | text(path[, compression, lineSep]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.text.html) | Saves the content of the DataFrame in a text file at the specified path. |
| DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) | replace() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.replace.html) | Replaces data in the existing table. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | cacheTable(tableName[, storageLevel]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.cacheTable.html) | Caches the specified table in-memory. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | clearCache() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.clearCache.html) | Removes all cached tables from the in-memory cache. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | dropGlobalTempView(viewName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.dropGlobalTempView.html) | Drops the global temporary view with the given view name. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | dropTempView(viewName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.dropTempView.html) | Drops the local temporary view with the given view name. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | isCached(tableName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.isCached.html) | Returns `True` if the table is currently cached in-memory. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | refreshByPath(path) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.refreshByPath.html) | Invalidates and refreshes all the cached data for any DataFrame that contains the given data source path. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | refreshTable(tableName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.refreshTable.html) | Invalidates and refreshes all the cached data and metadata of the given table. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | uncacheTable(tableName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.uncacheTable.html) | Removes the specified table from the in-memory cache. |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | partitionBy(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.partitionBy.html) | Creates a `WindowSpec` with the partitioning defined. |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | orderBy(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.orderBy.html) | Creates a `WindowSpec` with the ordering defined. |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | rangeBetween(start, end) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.rangeBetween.html) | Creates a `WindowSpec` with the frame boundaries defined, from `start` (inclusive) to `end` (inclusive). |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | rowsBetween(start, end) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.rowsBetween.html) | Creates a `WindowSpec` with the frame boundaries defined, from `start` (inclusive) to `end` (inclusive). |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | unboundedPreceding (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.unboundedPreceding.html) | Value representing the first row in the partition, for use in frame boundary definition. |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | unboundedFollowing (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.unboundedFollowing.html) | Value representing the last row in the partition, for use in frame boundary definition. |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | currentRow (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.currentRow.html) | Value representing the current row, for use in frame boundary definition. |
| WindowSpec (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | partitionBy(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.WindowSpec.partitionBy.html) | Defines the partitioning columns in a `WindowSpec`. |
| WindowSpec (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | orderBy(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.WindowSpec.orderBy.html) | Defines the ordering columns in a `WindowSpec`. |
| WindowSpec (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | rangeBetween(start, end) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.WindowSpec.rangeBetween.html) | Defines the frame boundaries, from `start` (inclusive) to `end` (inclusive). |
| WindowSpec (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | rowsBetween(start, end) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.WindowSpec.rowsBetween.html) | Defines the frame boundaries, from `start` (inclusive) to `end` (inclusive). |
High compatibility APIs¶
APIs with high compatibility work correctly but might have minor differences in error messages, output format, or edge cases.
Group |
Method |
Description |
|---|---|---|
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
agg(*exprs) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.agg.html) |
Aggregates on the entire DataFrame without groups (shorthand for
|
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
colRegex(colName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.colRegex.html) |
Selects column based on the column name specified as a regex and returns it as a
|
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
corr(col1, col2[, method]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.corr.html) |
Calculates the correlation of two columns as a |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
cov(col1, col2) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.cov.html) |
Calculates the sample covariance for the given columns as a |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
crosstab(col1, col2) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.crosstab.html) |
Computes a pair-wise frequency table of the given columns. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
cube(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.cube.html) |
Creates a multi-dimensional cube for the current DataFrame using the specified columns, for running aggregations. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
describe(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.describe.html) |
Computes basic statistics for numeric and string columns. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
distinct() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.distinct.html) |
Returns a new DataFrame containing the distinct rows in this DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
drop(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.drop.html) |
Returns a new DataFrame without the specified columns. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
exceptAll(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.exceptAll.html) |
Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
groupBy(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.groupBy.html) |
Groups the DataFrame using the specified columns, returning a |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
groupby(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.groupBy.html) |
Alias for |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
intersect(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.intersect.html) |
Returns a new DataFrame containing rows only in both this DataFrame and another DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
intersectAll(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.intersectAll.html) |
Returns a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
isLocal() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.isLocal.html) |
Returns |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
mapInPandas(func, schema) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.mapInPandas.html) |
Maps an iterator of batches in the current DataFrame using a Python native function
that takes and outputs a |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
orderBy(*cols, **kwargs) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.orderBy.html) |
Returns a new DataFrame sorted by the specified columns. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
rollup(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.rollup.html) |
Creates a multi-dimensional rollup for the current DataFrame using the specified columns. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
sort(*cols, **kwargs) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sort.html) |
Returns a new DataFrame sorted by the specified columns. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
union(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.union.html) |
Returns a new DataFrame containing the union of rows in this and another DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
unionByName(other[, allowMissingColumns]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.unionByName.html) |
Returns a new DataFrame containing the union of rows, resolving columns by name. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
withColumn(colName, col) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumn.html) |
Returns a new DataFrame by adding a column or replacing the existing column that has the same name. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
alias(*alias, **kwargs) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.alias.html) |
Returns this column aliased with a new name or names. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
asc_nulls_first() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.asc_nulls_first.html) |
Returns a sort expression based on ascending order with null values returned before non-null values. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
asc_nulls_last() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.asc_nulls_last.html) |
Returns a sort expression based on ascending order with null values returned after non-null values. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
astype(dataType) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.astype.html) |
Casts the column into the specified type. Alias for |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
bitwiseAND(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.bitwiseAND.html) |
Computes bitwise AND of this expression with another expression. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
bitwiseOR(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.bitwiseOR.html) |
Computes bitwise OR of this expression with another expression. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
bitwiseXOR(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.bitwiseXOR.html) |
Computes bitwise XOR of this expression with another expression. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
cast(dataType) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.cast.html) |
Casts the column into the specified type. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
desc_nulls_first() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.desc_nulls_first.html) |
Returns a sort expression based on descending order with null values returned before non-null values. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
desc_nulls_last() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.desc_nulls_last.html) |
Returns a sort expression based on descending order with null values returned after non-null values. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
endswith(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.endswith.html) |
String ends with. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
isNotNull() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.isNotNull.html) |
Returns True if the current expression is NOT null. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
createDataFrame(data[, schema, …]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.createDataFrame.html) |
Creates a DataFrame from an RDD, a list, a pandas.DataFrame, or a numpy.ndarray. |
DataFrameReader (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
csv(path[, schema, sep, …]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.csv.html) |
Loads a CSV file and returns the result as a DataFrame. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
currentCatalog() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.currentCatalog.html) |
Returns the current default catalog. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
listCatalogs([pattern]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.listCatalogs.html) |
Returns a list of catalogs available. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
listColumns(tableName[, dbName]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.listColumns.html) |
Returns a list of columns for the given table/view. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
recoverPartitions(tableName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.recoverPartitions.html) |
Recovers all the partitions of the given table. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
setCurrentCatalog(catalogName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.setCurrentCatalog.html) |
Sets the current default catalog. |
Note
- DataFrame orderBy/sort: Column ordering is inferred from the last DataFrame in the chain.
- DataFrame union/unionByName: Type widening behavior might differ slightly.
- DataFrame describe: Statistical output format might vary.
- Column cast: Some invalid casts return NULL in Spark but error in Snowpark.
- Column alias: Struct field display format might differ.
- SparkSession createDataFrame: Schema inference might produce different types (such as NUMBER(38,0) vs LONG).
- Catalog listColumns: Column names are uppercase, types are Snowflake-specific. Error messages might differ in format.
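Because inferred and cataloged type names can differ between Snowpark Connect and native Spark (for example, NUMBER(38,0) versus LONG, and uppercase column names from listColumns), schema comparisons written against native Spark output may need normalization first. A minimal pure-Python sketch, assuming a small hand-maintained mapping; only the NUMBER(38,0)/LONG pair comes from the note above, and the other entries are illustrative assumptions:

```python
# Illustrative mapping; only NUMBER(38,0) <-> long is taken from the note
# above -- the remaining entries are assumptions for this sketch.
SNOWFLAKE_TO_SPARK = {
    "NUMBER(38,0)": "long",
    "VARCHAR(16777216)": "string",
    "FLOAT": "double",
    "BOOLEAN": "boolean",
}

def normalize_type(type_name: str) -> str:
    """Map a Snowflake type name to its closest Spark equivalent,
    lowercasing anything not in the mapping."""
    return SNOWFLAKE_TO_SPARK.get(type_name.upper(), type_name.lower())

def normalize_column(col_name: str) -> str:
    """listColumns returns uppercase names; fold case before comparing."""
    return col_name.lower()

print(normalize_type("NUMBER(38,0)"))  # -> long
print(normalize_column("ORDER_ID"))    # -> order_id
```

Comparing normalized names rather than raw metadata keeps schema-validation code portable between the two backends.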
Partial compatibility APIs¶
APIs with partial compatibility are functional but have notable limitations. Behavior might differ from PySpark in specific scenarios.
Group |
Method |
Description |
|---|---|---|
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
alias(alias) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.alias.html) |
Returns a new DataFrame with an alias set. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
approxQuantile(col, probabilities, relativeError) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.approxQuantile.html) |
Calculates the approximate quantiles of numerical columns of a DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
createGlobalTempView(name) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.createGlobalTempView.html) |
Creates a global temporary view with this DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
createOrReplaceGlobalTempView(name) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.createOrReplaceGlobalTempView.html) |
Creates or replaces a global temporary view using this DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
createOrReplaceTempView(name) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.createOrReplaceTempView.html) |
Creates or replaces a local temporary view with this DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
createTempView(name) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.createTempView.html) |
Creates a local temporary view with this DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
explain([extended, mode]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.explain.html) |
Prints the (logical and physical) plans to the console for debugging purposes. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
filter(condition) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.filter.html) |
Filters rows using the given condition. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
freqItems(cols[, support]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.freqItems.html) |
Finds all items which have a frequency greater than or equal to a fraction of the total number of rows. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
hint(name, *parameters) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.hint.html) |
Specifies some hint on the current DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
inputFiles() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.inputFiles.html) |
Returns a best-effort snapshot of the files that compose this DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
printSchema([level]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.printSchema.html) |
Prints out the schema in tree format. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
randomSplit(weights[, seed]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.randomSplit.html) |
Randomly splits this DataFrame into separate DataFrames with the given weights. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
repartition(numPartitions, *cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.repartition.html) |
Returns a new DataFrame partitioned by the given partitioning expressions. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
sameSemantics(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sameSemantics.html) |
Returns True when the logical query plans inside both DataFrames are equal and therefore return the same results. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
sample([withReplacement, …]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sample.html) |
Returns a sampled subset of this DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
sampleBy(col, fractions[, seed]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sampleBy.html) |
Returns a stratified sample without replacement based on the fraction given on each stratum. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
selectExpr(*expr) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.selectExpr.html) |
Projects a set of SQL expressions and returns a new DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
semanticHash() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.semanticHash.html) |
Returns a hash code of the logical query plan against this DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
sortWithinPartitions(*cols, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sortWithinPartitions.html) |
Returns a new DataFrame with each partition sorted by the specified columns. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
subtract(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.subtract.html) |
Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
summary(*statistics) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.summary.html) |
Computes specified statistics for numeric and string columns. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
transform(func, *args, **kwargs) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.transform.html) |
Returns a new DataFrame by applying a chain of custom transformations. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
withColumns(*colsMap) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumns.html) |
Returns a new DataFrame by adding multiple columns or replacing existing columns that have the same names. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
withMetadata(columnName, metadata) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withMetadata.html) |
Returns a new DataFrame by updating an existing column with metadata. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
dropFields(*fieldNames) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.dropFields.html) |
Returns a new Column with fields dropped from a StructType by name. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
ilike(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.ilike.html) |
SQL ILIKE expression (case-insensitive LIKE). |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
over(window) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.over.html) |
Defines a windowing column. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
rlike(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.rlike.html) |
SQL RLIKE expression (regex match). |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
withField(fieldName, col) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.withField.html) |
Returns a new Column by adding or replacing a field in a StructType by name. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
addArtifact(*path) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.addArtifact.html) |
Adds an artifact to the session. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
addArtifacts(*path) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.addArtifacts.html) |
Adds artifacts to the session. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
addTag(tag) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.addTag.html) |
Adds a tag to be assigned to all operations started by this thread in this session. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
clearTags() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.clearTags.html) |
Clears the current thread’s operation tags. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
getTags() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.getTags.html) |
Returns the operation tags that are currently set to be assigned to all operations started by this thread. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
interruptAll() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.interruptAll.html) |
Interrupts all operations of this session currently running on the connected server. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
interruptOperation(op_id) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.interruptOperation.html) |
Interrupts an operation of this session with the given operation ID. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
interruptTag(tag) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.interruptTag.html) |
Interrupts all operations of this session with the given operation tag. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
removeTag(tag) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.removeTag.html) |
Removes a tag previously added to be assigned to all operations started by this thread in this session. |
GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) |
apply(udf) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.apply.html) |
Maps each group of the current DataFrame using a pandas UDF. |
GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) |
avg(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.avg.html) |
Computes average values for each numeric column for each group. |
GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) |
sum(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.sum.html) |
Computes the sum for each numeric column for each group. |
DataFrameReader (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
json(path[, schema, …]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.json.html) |
Loads JSON files and returns the result as a DataFrame. |
DataFrameReader (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
load([path, format, schema, …]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.load.html) |
Loads data from a data source and returns it as a DataFrame. |
DataFrameReader (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
parquet(*paths, **options) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.parquet.html) |
Loads Parquet files, returning the result as a DataFrame. |
DataFrameReader (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
jdbc(url, table[, column, …]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.jdbc.html) |
Constructs a DataFrame representing the database table accessible via JDBC URL. |
DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
csv(path, mode, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.csv.html) |
Saves the content of the DataFrame in CSV format at the specified path. |
DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
json(path, mode, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.json.html) |
Saves the content of the DataFrame in JSON format at the specified path. |
DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
options(**options) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.options.html) |
Adds output options for the underlying data source. |
DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
parquet(path, mode, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.parquet.html) |
Saves the content of the DataFrame in Parquet format at the specified path. |
DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
append() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.append.html) |
Appends the contents of the DataFrame to the output table. |
DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
create() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.create.html) |
Creates a new table from the contents of the DataFrame. |
DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
createOrReplace() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.createOrReplace.html) |
Creates a new table or replaces an existing table with the contents of the DataFrame. |
DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
option(key, value) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.option.html) |
Adds a write option. |
DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
options(**options) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.options.html) |
Adds write options. |
DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
partitionedBy(col, *cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.partitionedBy.html) |
Specifies a partitioning column. |
DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
tableProperty(property, value) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.tableProperty.html) |
Adds a table property. |
DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
using(provider) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.using.html) |
Specifies a provider for the underlying output data source. |
Note
- DataFrame explain: Query plan format differs from Spark.
- DataFrame repartition: Partition count might not be exact.
- DataFrame sample: Random sampling implementation differs.
- DataFrame createTempView: View lifecycle might differ.
- Column over: Window frame specifications might have subtle differences.
- Column rlike: Regex syntax follows Snowflake conventions.
- SparkSession: Tags are mapped to Snowflake query tags. Interrupt operations use Snowflake query IDs instead of operation IDs.
- DataFrameReader: File paths use Snowflake stages or cloud storage (S3, GCS, Azure). Schema inference might differ from native Spark. Some format-specific options might not be supported.
- DataFrameWriter: Writes go to Snowflake stages or cloud storage. Partitioning behavior might differ.
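Because rlike follows Snowflake regex conventions, patterns that rely on constructs specific to Python or Java regex engines, such as lookaround, may behave differently than in native Spark. One defensive option is a client-side pre-check before submitting a query. This sketch is a heuristic assumption, not an exhaustive list of unsupported syntax:

```python
import re

# Constructs assumed (heuristically) to behave differently or be
# unsupported under Snowflake regex conventions; extend as needed.
_SUSPECT_CONSTRUCTS = [
    r"\(\?=",   # lookahead
    r"\(\?!",   # negative lookahead
    r"\(\?<=",  # lookbehind
    r"\(\?<!",  # negative lookbehind
]

def has_nonportable_regex(pattern: str) -> bool:
    """Return True if `pattern` contains a construct that may not be
    portable to a Column.rlike() call under Snowpark Connect."""
    return any(re.search(tok, pattern) for tok in _SUSPECT_CONSTRUCTS)

print(has_nonportable_regex("^[0-9]{4}-[0-9]{2}$"))  # -> False
print(has_nonportable_regex("foo(?=bar)"))           # -> True
```

Flagged patterns can usually be rewritten with explicit alternation or a second filter condition before being passed to rlike.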
Unsupported APIs¶
The following APIs are not currently supported in Snowpark Connect for Spark.
Group |
Method |
Description |
|---|---|---|
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
dropDuplicatesWithinWatermark([subset]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.dropDuplicatesWithinWatermark.html) |
Returns a new DataFrame with duplicate rows removed within watermark. Streaming only. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
observe(observation, *exprs) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.observe.html) |
Defines (named) metrics to observe on the DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
pandas_api([index_col]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.pandas_api.html) |
Converts the existing DataFrame into a pandas-on-Spark DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
registerTempTable(name) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.registerTempTable.html) |
Registers this DataFrame as a temporary table. Deprecated since 2.0; use createOrReplaceTempView instead. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
to_pandas_on_spark([index_col]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.pandas_api.html) |
Converts the existing DataFrame into a pandas-on-Spark DataFrame. Alias for pandas_api. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
withWatermark(eventTime, delayThreshold) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withWatermark.html) |
Defines an event time watermark for this DataFrame. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
copyFromLocalToFs(local_path, dest_path) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.copyFromLocalToFs.html) |
Copies a local file to a remote filesystem. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
stop() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.stop.html) |
Stops the underlying SparkContext. |
GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) |
applyInPandasWithState(func, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.applyInPandasWithState.html) |
Applies a function to each group of data using pandas with state. |
GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) |
cogroup(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.cogroup.html) |
Cogroups this group with another group. |
DataFrameReader (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
orc(path, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.orc.html) |
Loads ORC files, returning the result as a DataFrame. |
DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
bucketBy(numBuckets, col, *cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.bucketBy.html) |
Buckets the output by the given columns. |
DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
insertInto(tableName[, overwrite]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.insertInto.html) |
Inserts the content of the DataFrame to the specified table. |
DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
jdbc(url, table[, mode, properties]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.jdbc.html) |
Saves the content of the DataFrame to an external database table via JDBC. |
DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
orc(path, mode, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.orc.html) |
Saves the content of the DataFrame in ORC format at the specified path. |
DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
sortBy(col, *cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.sortBy.html) |
Specifies sorting columns for each output partition. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
createExternalTable(tableName, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.createExternalTable.html) |
Creates a table based on the dataset in a data source. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
createTable(tableName, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.createTable.html) |
Creates a table based on the dataset in a data source. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
functionExists(functionName[, dbName]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.functionExists.html) |
Checks if the function with the specified name exists. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
getFunction(functionName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.getFunction.html) |
Gets the function with the specified name. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
listFunctions([dbName]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.listFunctions.html) |
Returns a list of functions registered in the specified database. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
registerFunction(name, f, returnType) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.registerFunction.html) |
Registers a Python function as a UDF. |