DataFrame support for Snowpark Connect for Spark¶
Snowpark Connect for Spark provides compatibility with the PySpark 3.5.3 Spark Connect DataFrame API (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html), allowing you to run Spark workloads on Snowflake. This page details which APIs are supported and their compatibility levels. The DataFrame API is shared across PySpark, Java, and Scala clients.
Compatibility level definitions¶
- **Full compatibility**: APIs with full compatibility behave identically to native PySpark. You can use these APIs with confidence that results will match exactly.
- **High compatibility**: APIs with high compatibility work correctly but might have minor differences:
  - Error message formatting might differ.
  - Output display format might vary (such as decimal precision or column name casing).
  - Edge cases might produce slightly different results.
- **Partial compatibility**: APIs with partial compatibility are functional but have notable limitations:
  - Only a subset of functionality might be available.
  - Behavior might differ from PySpark in specific scenarios.
  - Additional configuration might be required.
  - Performance characteristics might differ.
- **Unsupported**: APIs that are not currently implemented or cannot be supported on Snowflake.
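The "output display format might vary" caveat for high-compatibility APIs can matter in practice. As a hedged illustration (the data and the casing difference below are hypothetical, not Snowpark Connect's documented behavior), code that consumes `toPandas()` output can normalize column names rather than relying on exact casing:

```python
import pandas as pd

# Hypothetical result of df.toPandas(): a backend may surface
# column names in a different casing than native PySpark would.
pdf = pd.DataFrame({"ORDER_ID": [1, 2], "AMOUNT": [10.5, 20.0]})

# Normalizing the casing keeps downstream code robust to this difference.
pdf.columns = [c.lower() for c in pdf.columns]
print(list(pdf.columns))
```

The same defensive pattern applies to any of the minor differences listed above: assert on values and normalized names, not on display formatting.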
Full compatibility APIs¶
| Group | Method | Description |
|---|---|---|
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | cache() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.cache.html) | Persists the DataFrame with the default storage level (`MEMORY_AND_DISK_DESER`). |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | coalesce(numPartitions) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.coalesce.html) | Returns a new DataFrame that has exactly `numPartitions` partitions. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | collect() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.collect.html) | Returns all the records as a list of `Row`. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | count() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.count.html) | Returns the number of rows in this DataFrame. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | crossJoin(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.crossJoin.html) | Returns the Cartesian product with another DataFrame. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | dropDuplicates([subset]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.dropDuplicates.html) | Returns a new DataFrame with duplicate rows removed, optionally considering only a subset of columns. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | drop_duplicates([subset]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.drop_duplicates.html) | Alias for `dropDuplicates()`. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | dropna([how, thresh, subset]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.dropna.html) | Returns a new DataFrame omitting rows with null values. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | fillna(value[, subset]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.fillna.html) | Replaces null values with the specified value. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | first() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.first.html) | Returns the first row as a `Row`. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | head([n]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.head.html) | Returns the first `n` rows. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | isEmpty() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.isEmpty.html) | Returns `True` if this DataFrame is empty. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | join(other[, on, how]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.join.html) | Joins with another DataFrame, using the given join expression. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | limit(num) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.limit.html) | Limits the result count to the number specified. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | melt(ids, values, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.melt.html) | Unpivots a DataFrame from wide format to long format. Alias for `unpivot()`. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | offset(num) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.offset.html) | Returns a new DataFrame by skipping the first `num` rows. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | persist([storageLevel]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.persist.html) | Sets the storage level to persist the contents of the DataFrame across operations. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | repartitionByRange(numPartitions, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.repartitionByRange.html) | Returns a new DataFrame partitioned by the given partitioning expressions into `numPartitions` partitions. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | replace(to_replace[, value, subset]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.replace.html) | Returns a new DataFrame replacing a value with another value. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | select(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.select.html) | Projects a set of expressions and returns a new DataFrame. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | show([n, truncate, vertical]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.show.html) | Prints the first `n` rows to the console. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | tail(num) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.tail.html) | Returns the last `num` rows. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | take(num) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.take.html) | Returns the first `num` rows. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | toDF(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.toDF.html) | Returns a new DataFrame with the specified new column names. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | toLocalIterator([prefetchPartitions]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.toLocalIterator.html) | Returns an iterator that contains all of the rows in this DataFrame. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | toPandas() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.toPandas.html) | Returns the contents of this DataFrame as a pandas `DataFrame`. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | unionAll(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.unionAll.html) | Returns a new DataFrame containing the union of rows in this and another DataFrame. Alias for `union()`. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | unpersist([blocking]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.unpersist.html) | Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | unpivot(ids, values, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.unpivot.html) | Unpivots a DataFrame from wide format to long format, optionally leaving identifier columns set. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | where(condition) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.where.html) | Alias for `filter()`. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | withColumnsRenamed(colsMap) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumnsRenamed.html) | Returns a new DataFrame by renaming multiple columns. This is a no-op if the schema doesn’t contain the given column names. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | asc() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.asc.html) | Returns a sort expression based on ascending order of the column. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | between(lowerBound, upperBound) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.between.html) | Checks if values of this expression are between the given bounds. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | contains(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.contains.html) | Contains the other element. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | desc() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.desc.html) | Returns a sort expression based on descending order of the column. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | eqNullSafe(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.eqNullSafe.html) | Equality test that is safe for null values. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | getItem(key) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.getItem.html) | Gets an item at position `ordinal` out of a list, or gets an item by `key` out of a dict. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | isNull() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.isNull.html) | Returns `True` if the current expression is null. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | isin(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.isin.html) | Returns a boolean `Column` that evaluates to `True` if the value of this expression is contained in the evaluated values of the arguments. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | like(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.like.html) | SQL LIKE expression. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | otherwise(value) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.otherwise.html) | Evaluates a list of conditions and returns one of multiple possible result expressions. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | startswith(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.startswith.html) | String starts with. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | substr(startPos, length) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.substr.html) | Returns a `Column` which is a substring of the column. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | when(condition, value) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.when.html) | Evaluates a list of conditions and returns one of multiple possible result expressions. |
| SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) | range(start[, end, step, numPartitions]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.range.html) | Creates a DataFrame with a single column named `id`, containing elements in a range. |
| SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) | sql(sqlQuery, args, **kwargs) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.sql.html) | Returns a DataFrame representing the result of the given query. |
| SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) | table(tableName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.table.html) | Returns the specified table as a DataFrame. |
| GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) | agg(*exprs) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.agg.html) | Computes aggregates and returns the result as a DataFrame. |
| GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) | mean(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.mean.html) | Computes average values for each numeric column for each group. |
| GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) | pivot(pivot_col[, values]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.pivot.html) | Pivots a column of the current DataFrame and performs the specified aggregation. |
| DataFrameReader (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) | table(tableName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.table.html) | Returns the specified table as a DataFrame. |
| DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) | mode(saveMode) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.mode.html) | Specifies the behavior when data or table already exists. |
| DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) | saveAsTable(name[, format, mode, …]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.saveAsTable.html) | Saves the content of the DataFrame as the specified table. |
| DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) | text(path[, compression, lineSep]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.text.html) | Saves the content of the DataFrame in a text file at the specified path. |
| DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) | replace() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.replace.html) | Replaces data in the existing table. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | cacheTable(tableName[, storageLevel]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.cacheTable.html) | Caches the specified table in-memory. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | clearCache() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.clearCache.html) | Removes all cached tables from the in-memory cache. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | dropGlobalTempView(viewName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.dropGlobalTempView.html) | Drops the global temporary view with the given view name. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | dropTempView(viewName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.dropTempView.html) | Drops the local temporary view with the given view name. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | isCached(tableName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.isCached.html) | Returns `True` if the table is currently cached in-memory. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | refreshByPath(path) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.refreshByPath.html) | Invalidates and refreshes all the cached data for any DataFrame that contains the given data source path. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | refreshTable(tableName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.refreshTable.html) | Invalidates and refreshes all the cached data and metadata of the given table. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | uncacheTable(tableName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.uncacheTable.html) | Removes the specified table from the in-memory cache. |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | partitionBy(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.partitionBy.html) | Creates a `WindowSpec` with the partitioning defined. |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | orderBy(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.orderBy.html) | Creates a `WindowSpec` with the ordering defined. |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | rangeBetween(start, end) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.rangeBetween.html) | Creates a `WindowSpec` with the frame boundaries defined, from `start` (inclusive) to `end` (inclusive). |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | rowsBetween(start, end) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.rowsBetween.html) | Creates a `WindowSpec` with the frame boundaries defined, from `start` (inclusive) to `end` (inclusive). |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | unboundedPreceding (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.unboundedPreceding.html) | Value representing the first row in the partition, for use in frame boundary definition. |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | unboundedFollowing (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.unboundedFollowing.html) | Value representing the last row in the partition, for use in frame boundary definition. |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | currentRow (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.currentRow.html) | Value representing the current row, for use in frame boundary definition. |
| WindowSpec (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | partitionBy(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.WindowSpec.partitionBy.html) | Defines the partitioning columns in a `WindowSpec`. |
| WindowSpec (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | orderBy(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.WindowSpec.orderBy.html) | Defines the ordering columns in a `WindowSpec`. |
| WindowSpec (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | rangeBetween(start, end) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.WindowSpec.rangeBetween.html) | Defines the frame boundaries, from `start` (inclusive) to `end` (inclusive). |
| WindowSpec (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | rowsBetween(start, end) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.WindowSpec.rowsBetween.html) | Defines the frame boundaries, from `start` (inclusive) to `end` (inclusive). |
High compatibility APIs¶
APIs with high compatibility work correctly but might have minor differences in error messages, output format, or edge cases.
Group |
Method |
Description |
|---|---|---|
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
agg(*exprs) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.agg.html) |
Aggregates on the entire DataFrame without groups (shorthand for
|
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
colRegex(colName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.colRegex.html) |
Selects column based on the column name specified as a regex and returns it as a
|
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
corr(col1, col2[, method]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.corr.html) |
Calculates the correlation of two columns as a |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
cov(col1, col2) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.cov.html) |
Calculates the sample covariance for the given columns as a |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
crosstab(col1, col2) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.crosstab.html) |
Computes a pair-wise frequency table of the given columns. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
cube(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.cube.html) |
Creates a multi-dimensional cube for the current DataFrame using the specified columns, for running aggregations. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
describe(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.describe.html) |
Computes basic statistics for numeric and string columns. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
distinct() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.distinct.html) |
Returns a new DataFrame containing the distinct rows in this DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
drop(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.drop.html) |
Returns a new DataFrame without the specified columns. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
exceptAll(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.exceptAll.html) |
Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
groupBy(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.groupBy.html) |
Groups the DataFrame using the specified columns, returning a |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
groupby(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.groupBy.html) |
Alias for |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
intersect(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.intersect.html) |
Returns a new DataFrame containing rows only in both this DataFrame and another DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
intersectAll(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.intersectAll.html) |
Returns a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
isLocal() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.isLocal.html) |
Returns |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
mapInPandas(func, schema) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.mapInPandas.html) |
Maps an iterator of batches in the current DataFrame using a Python native function
that takes and outputs a |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
orderBy(*cols, **kwargs) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.orderBy.html) |
Returns a new DataFrame sorted by the specified columns. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
rollup(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.rollup.html) |
Creates a multi-dimensional rollup for the current DataFrame using the specified columns. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
sort(*cols, **kwargs) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sort.html) |
Returns a new DataFrame sorted by the specified columns. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
union(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.union.html) |
Returns a new DataFrame containing the union of rows in this and another DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
unionByName(other[, allowMissingColumns]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.unionByName.html) |
Returns a new DataFrame containing the union of rows, resolving columns by name. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
withColumn(colName, col) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumn.html) |
Returns a new DataFrame by adding a column or replacing the existing column that has the same name. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
alias(*alias, **kwargs) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.alias.html) |
Returns this column aliased with a new name or names. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
asc_nulls_first() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.asc_nulls_first.html) |
Returns a sort expression based on ascending order with null values returned before non-null values. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
asc_nulls_last() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.asc_nulls_last.html) |
Returns a sort expression based on ascending order with null values returned after non-null values. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
astype(dataType) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.astype.html) |
Casts the column into the specified type. Alias for |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
bitwiseAND(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.bitwiseAND.html) |
Computes bitwise AND of this expression with another expression. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
bitwiseOR(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.bitwiseOR.html) |
Computes bitwise OR of this expression with another expression. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
bitwiseXOR(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.bitwiseXOR.html) |
Computes bitwise XOR of this expression with another expression. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
cast(dataType) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.cast.html) |
Casts the column into the specified type. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
desc_nulls_first() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.desc_nulls_first.html) |
Returns a sort expression based on descending order with null values returned before non-null values. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
desc_nulls_last() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.desc_nulls_last.html) |
Returns a sort expression based on descending order with null values returned after non-null values. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
endswith(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.endswith.html) |
String ends with. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
isNotNull() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.isNotNull.html) |
Returns True if the current expression is NOT null. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
createDataFrame(data[, schema, …]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.createDataFrame.html) |
Creates a DataFrame from an RDD, a list, a pandas.DataFrame, or a numpy.ndarray. |
DataFrameReader (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
csv(path[, schema, sep, …]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.csv.html) |
Loads a CSV file and returns the result as a DataFrame. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
currentCatalog() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.currentCatalog.html) |
Returns the current default catalog. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
listCatalogs([pattern]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.listCatalogs.html) |
Returns a list of catalogs available. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
listColumns(tableName[, dbName]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.listColumns.html) |
Returns a list of columns for the given table/view. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
recoverPartitions(tableName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.recoverPartitions.html) |
Recovers all the partitions of the given table. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
setCurrentCatalog(catalogName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.setCurrentCatalog.html) |
Sets the current default catalog. |
Note
- DataFrame orderBy/sort: Column ordering is inferred from the last DataFrame in the chain.
- DataFrame union/unionByName: Type widening behavior might differ slightly.
- DataFrame describe: Statistical output format might vary.
- Column cast: Some invalid casts return NULL in Spark but error in Snowpark.
- Column alias: Struct field display format might differ.
- SparkSession createDataFrame: Schema inference might produce different types (such as NUMBER(38,0) vs LONG).
- Catalog listColumns: Column names are uppercase, types are Snowflake-specific. Error messages might differ in format.
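Because inferred and cataloged type names can differ between Snowpark Connect and native Spark (for example, NUMBER(38,0) versus LONG, and uppercase column names from listColumns), schema comparisons written against native Spark output may need normalization first. A minimal pure-Python sketch, assuming a small hand-maintained mapping; only the NUMBER(38,0)/LONG pair comes from the note above, and the other entries are illustrative assumptions:

```python
# Illustrative mapping; only NUMBER(38,0) <-> long is taken from the note
# above -- the remaining entries are assumptions for this sketch.
SNOWFLAKE_TO_SPARK = {
    "NUMBER(38,0)": "long",
    "VARCHAR(16777216)": "string",
    "FLOAT": "double",
    "BOOLEAN": "boolean",
}

def normalize_type(type_name: str) -> str:
    """Map a Snowflake type name to its closest Spark equivalent,
    lowercasing anything not in the mapping."""
    return SNOWFLAKE_TO_SPARK.get(type_name.upper(), type_name.lower())

def normalize_column(col_name: str) -> str:
    """listColumns returns uppercase names; fold case before comparing."""
    return col_name.lower()

print(normalize_type("NUMBER(38,0)"))  # -> long
print(normalize_column("ORDER_ID"))    # -> order_id
```

Comparing normalized names rather than raw metadata keeps schema-validation code portable between the two backends.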
Partial compatibility APIs¶
APIs with partial compatibility are functional but have notable limitations. Behavior might differ from PySpark in specific scenarios.
Group |
Method |
Description |
|---|---|---|
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
alias(alias) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.alias.html) |
Returns a new DataFrame with an alias set. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
approxQuantile(col, probabilities, relativeError) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.approxQuantile.html) |
Calculates the approximate quantiles of numerical columns of a DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
createGlobalTempView(name) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.createGlobalTempView.html) |
Creates a global temporary view with this DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
createOrReplaceGlobalTempView(name) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.createOrReplaceGlobalTempView.html) |
Creates or replaces a global temporary view using this DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
createOrReplaceTempView(name) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.createOrReplaceTempView.html) |
Creates or replaces a local temporary view with this DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
createTempView(name) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.createTempView.html) |
Creates a local temporary view with this DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
explain([extended, mode]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.explain.html) |
Prints the (logical and physical) plans to the console for debugging purposes. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
filter(condition) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.filter.html) |
Filters rows using the given condition. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
freqItems(cols[, support]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.freqItems.html) |
Finds all items which have a frequency greater than or equal to a fraction of the total number of rows. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
hint(name, *parameters) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.hint.html) |
Specifies some hint on the current DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
inputFiles() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.inputFiles.html) |
Returns a best-effort snapshot of the files that compose this DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
printSchema([level]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.printSchema.html) |
Prints out the schema in tree format. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
randomSplit(weights[, seed]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.randomSplit.html) |
Randomly splits this DataFrame into separate DataFrames with the given weights. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
repartition(numPartitions, *cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.repartition.html) |
Returns a new DataFrame partitioned by the given partitioning expressions. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
sameSemantics(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sameSemantics.html) |
Returns True when the logical query plans inside both DataFrames are equal and therefore return the same results. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
sample([withReplacement, …]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sample.html) |
Returns a sampled subset of this DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
sampleBy(col, fractions[, seed]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sampleBy.html) |
Returns a stratified sample without replacement based on the fraction given on each stratum. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
selectExpr(*expr) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.selectExpr.html) |
Projects a set of SQL expressions and returns a new DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
semanticHash() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.semanticHash.html) |
Returns a hash code of the logical query plan against this DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
sortWithinPartitions(*cols, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sortWithinPartitions.html) |
Returns a new DataFrame with each partition sorted by the specified columns. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
subtract(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.subtract.html) |
Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
summary(*statistics) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.summary.html) |
Computes specified statistics for numeric and string columns. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
transform(func, *args, **kwargs) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.transform.html) |
Returns a new DataFrame by applying a chain of custom transformations. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
withColumns(*colsMap) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumns.html) |
Returns a new DataFrame by adding multiple columns or replacing existing columns that have the same names. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
withMetadata(columnName, metadata) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withMetadata.html) |
Returns a new DataFrame by updating an existing column with metadata. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
dropFields(*fieldNames) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.dropFields.html) |
Returns a new Column with fields dropped from a StructType by name. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
ilike(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.ilike.html) |
SQL ILIKE expression (case-insensitive LIKE). |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
over(window) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.over.html) |
Defines a windowing column. |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
rlike(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.rlike.html) |
SQL RLIKE expression (regex match). |
Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) |
withField(fieldName, col) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.withField.html) |
Returns a new Column by adding or replacing a field in a StructType by name. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
addArtifact(*path) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.addArtifact.html) |
Adds an artifact to the session. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
addArtifacts(*path) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.addArtifacts.html) |
Adds artifacts to the session. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
addTag(tag) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.addTag.html) |
Adds a tag to be assigned to all operations started by this thread in this session. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
clearTags() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.clearTags.html) |
Clears the current thread’s operation tags. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
getTags() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.getTags.html) |
Returns the operation tags that are currently set to be assigned to all operations started by this thread. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
interruptAll() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.interruptAll.html) |
Interrupts all operations of this session currently running on the connected server. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
interruptOperation(op_id) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.interruptOperation.html) |
Interrupts an operation of this session with the given operation ID. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
interruptTag(tag) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.interruptTag.html) |
Interrupts all operations of this session with the given operation tag. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
removeTag(tag) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.removeTag.html) |
Removes a tag previously added to be assigned to all operations started by this thread in this session. |
GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) |
apply(udf) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.apply.html) |
Maps each group of the current DataFrame using a pandas UDF. |
GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) |
avg(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.avg.html) |
Computes average values for each numeric column for each group. |
GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) |
sum(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.sum.html) |
Computes the sum for each numeric column for each group. |
DataFrameReader (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
json(path[, schema, …]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.json.html) |
Loads JSON files and returns the result as a DataFrame. |
DataFrameReader (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
load([path, format, schema, …]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.load.html) |
Loads data from a data source and returns it as a DataFrame. |
DataFrameReader (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
parquet(*paths, **options) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.parquet.html) |
Loads Parquet files, returning the result as a DataFrame. |
DataFrameReader (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
jdbc(url, table[, column, …]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.jdbc.html) |
Constructs a DataFrame representing the database table accessible via JDBC URL. |
DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
csv(path, mode, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.csv.html) |
Saves the content of the DataFrame in CSV format at the specified path. |
DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
json(path, mode, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.json.html) |
Saves the content of the DataFrame in JSON format at the specified path. |
DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
options(**options) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.options.html) |
Adds output options for the underlying data source. |
DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
parquet(path, mode, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.parquet.html) |
Saves the content of the DataFrame in Parquet format at the specified path. |
DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
append() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.append.html) |
Appends the contents of the DataFrame to the output table. |
DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
create() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.create.html) |
Creates a new table from the contents of the DataFrame. |
DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
createOrReplace() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.createOrReplace.html) |
Creates a new table or replaces an existing table with the contents of the DataFrame. |
DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
option(key, value) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.option.html) |
Adds a write option. |
DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
options(**options) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.options.html) |
Adds write options. |
DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
partitionedBy(col, *cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.partitionedBy.html) |
Specifies a partitioning column. |
DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
tableProperty(property, value) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.tableProperty.html) |
Adds a table property. |
DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
using(provider) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.using.html) |
Specifies a provider for the underlying output data source. |
Note
- DataFrame explain: Query plan format differs from Spark.
- DataFrame repartition: Partition count might not be exact.
- DataFrame sample: Random sampling implementation differs.
- DataFrame createTempView: View lifecycle might differ.
- Column over: Window frame specifications might have subtle differences.
- Column rlike: Regex syntax follows Snowflake conventions.
- SparkSession: Tags are mapped to Snowflake query tags. Interrupt operations use Snowflake query IDs instead of operation IDs.
- DataFrameReader: File paths use Snowflake stages or cloud storage (S3, GCS, Azure). Schema inference might differ from native Spark. Some format-specific options might not be supported.
- DataFrameWriter: Writes go to Snowflake stages or cloud storage. Partitioning behavior might differ.
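Because rlike follows Snowflake regex conventions, patterns that rely on constructs specific to Python or Java regex engines, such as lookaround, may behave differently than in native Spark. One defensive option is a client-side pre-check before submitting a query. This sketch is a heuristic assumption, not an exhaustive list of unsupported syntax:

```python
import re

# Constructs assumed (heuristically) to behave differently or be
# unsupported under Snowflake regex conventions; extend as needed.
_SUSPECT_CONSTRUCTS = [
    r"\(\?=",   # lookahead
    r"\(\?!",   # negative lookahead
    r"\(\?<=",  # lookbehind
    r"\(\?<!",  # negative lookbehind
]

def has_nonportable_regex(pattern: str) -> bool:
    """Return True if `pattern` contains a construct that may not be
    portable to a Column.rlike() call under Snowpark Connect."""
    return any(re.search(tok, pattern) for tok in _SUSPECT_CONSTRUCTS)

print(has_nonportable_regex("^[0-9]{4}-[0-9]{2}$"))  # -> False
print(has_nonportable_regex("foo(?=bar)"))           # -> True
```

Flagged patterns can usually be rewritten with explicit alternation or a second filter condition before being passed to rlike.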
Unsupported APIs¶
The following APIs are not currently supported in Snowpark Connect for Spark.
Group |
Method |
Description |
|---|---|---|
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
dropDuplicatesWithinWatermark([subset]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.dropDuplicatesWithinWatermark.html) |
Returns a new DataFrame with duplicate rows removed within watermark. Streaming only. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
observe(observation, *exprs) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.observe.html) |
Defines (named) metrics to observe on the DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
pandas_api([index_col]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.pandas_api.html) |
Converts the existing DataFrame into a pandas-on-Spark DataFrame. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
registerTempTable(name) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.registerTempTable.html) |
Registers this DataFrame as a temporary table. Deprecated since 2.0; use createOrReplaceTempView instead. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
to_pandas_on_spark([index_col]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.pandas_api.html) |
Converts the existing DataFrame into a pandas-on-Spark DataFrame. Alias for pandas_api. |
DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) |
withWatermark(eventTime, delayThreshold) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withWatermark.html) |
Defines an event time watermark for this DataFrame. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
copyFromLocalToFs(local_path, dest_path) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.copyFromLocalToFs.html) |
Copies a local file to a remote filesystem. |
SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) |
stop() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.stop.html) |
Stops the underlying SparkContext. |
GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) |
applyInPandasWithState(func, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.applyInPandasWithState.html) |
Applies a function to each group of data using pandas with state. |
GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) |
cogroup(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.cogroup.html) |
Cogroups this group with another group. |
DataFrameReader (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
orc(path, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.orc.html) |
Loads ORC files, returning the result as a DataFrame. |
DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
bucketBy(numBuckets, col, *cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.bucketBy.html) |
Buckets the output by the given columns. |
DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
insertInto(tableName[, overwrite]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.insertInto.html) |
Inserts the content of the DataFrame to the specified table. |
DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
jdbc(url, table[, mode, properties]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.jdbc.html) |
Saves the content of the DataFrame to an external database table via JDBC. |
DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
orc(path, mode, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.orc.html) |
Saves the content of the DataFrame in ORC format at the specified path. |
DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) |
sortBy(col, *cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.sortBy.html) |
Specifies sorting columns for each output partition. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
createExternalTable(tableName, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.createExternalTable.html) |
Creates a table based on the dataset in a data source. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
createTable(tableName, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.createTable.html) |
Creates a table based on the dataset in a data source. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
functionExists(functionName[, dbName]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.functionExists.html) |
Checks if the function with the specified name exists. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
getFunction(functionName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.getFunction.html) |
Gets the function with the specified name. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
listFunctions([dbName]) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.listFunctions.html) |
Returns a list of functions registered in the specified database. |
Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) |
registerFunction(name, f, returnType) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.registerFunction.html) |
Registers a Python function as a UDF. |