| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | cache() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.cache.html) | Persists the DataFrame with the default storage level (MEMORY_AND_DISK). |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | coalesce(numPartitions) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.coalesce.html) | Returns a new DataFrame that has exactly numPartitions partitions. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | collect() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.collect.html) | Returns all the records as a list of Row. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | count() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.count.html) | Returns the number of rows in this DataFrame. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | crossJoin(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.crossJoin.html) | Returns the Cartesian product with another DataFrame. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | dropDuplicates([subset]) | Returns a new DataFrame with duplicate rows removed, optionally considering only a subset of columns. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | drop_duplicates([subset]) | Alias for dropDuplicates. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | dropna([how, thresh, subset]) | Returns a new DataFrame omitting rows with null values. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | fillna(value[, subset]) | Replaces null values with the specified value. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | first() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.first.html) | Returns the first row as a Row. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | head([n]) | Returns the first n rows. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | isEmpty() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.isEmpty.html) | Returns True if this DataFrame is empty. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | join(other[, on, how]) | Joins with another DataFrame, using the given join expression. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | limit(num) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.limit.html) | Limits the result count to the number specified. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | melt(ids, values, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.melt.html) | Unpivots a DataFrame from wide format to long format. Alias for unpivot. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | offset(num) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.offset.html) | Returns a new DataFrame by skipping the first n rows. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | persist([storageLevel]) | Sets the storage level to persist the contents of the DataFrame across operations. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | repartitionByRange(numPartitions, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.repartitionByRange.html) | Returns a new DataFrame partitioned by the given partitioning expressions into numPartitions using range partitioning. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | replace(to_replace[, value, subset]) | Returns a new DataFrame replacing a value with another value. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | select(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.select.html) | Projects a set of expressions and returns a new DataFrame. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | show([n, truncate, vertical]) | Prints the first n rows to the console. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | tail(num) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.tail.html) | Returns the last num rows as a list of Row. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | take(num) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.take.html) | Returns the first num rows as a list of Row. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | toDF(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.toDF.html) | Returns a new DataFrame with new column names. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | toLocalIterator([prefetchPartitions]) | Returns an iterator that contains all of the rows in this DataFrame. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | toPandas() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.toPandas.html) | Returns the contents of this DataFrame as a Pandas pandas.DataFrame. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | unionAll(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.unionAll.html) | Returns a new DataFrame containing the union of rows in this and another DataFrame. Alias for union. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | unpersist([blocking]) | Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | unpivot(ids, values, …) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.unpivot.html) | Unpivots a DataFrame from wide format to long format, optionally leaving identifier columns set. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | where(condition) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.where.html) | Alias for filter. |
| DataFrame (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) | withColumnsRenamed(colsMap) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumnsRenamed.html) | Returns a new DataFrame by renaming multiple columns. This is a no-op if the schema doesn’t contain the given column names. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | asc() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.asc.html) | Returns a sort expression based on ascending order of the column. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | between(lowerBound, upperBound) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.between.html) | Checks whether values of this expression are between lowerBound and upperBound, inclusive. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | contains(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.contains.html) | Returns a boolean Column indicating whether the value contains the other element, based on a string match. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | desc() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.desc.html) | Returns a sort expression based on descending order of the column. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | eqNullSafe(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.eqNullSafe.html) | Equality test that is safe for null values. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | getItem(key) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.getItem.html) | Gets an item at position key out of a list or dict. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | isNull() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.isNull.html) | Returns True if the current expression is null. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | isin(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.isin.html) | Returns a boolean Column based on a match against the given values. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | like(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.like.html) | SQL LIKE expression. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | otherwise(value) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.otherwise.html) | Defines the fallback value returned when none of the preceding when() conditions match; without it, unmatched rows evaluate to null. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | startswith(other) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.startswith.html) | Returns a boolean Column indicating whether the string starts with the given prefix. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | substr(startPos, length) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.substr.html) | Returns a Column which is a substring of the column. |
| Column (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html) | when(condition, value) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.when.html) | Evaluates a list of conditions and returns one of multiple possible result expressions. |
| SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) | range(start[, end, step, numPartitions]) | Creates a DataFrame with a single column named id. |
| SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) | sql(sqlQuery, args, **kwargs) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.sql.html) | Returns a DataFrame representing the result of the given query. |
| SparkSession (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html) | table(tableName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.table.html) | Returns the specified table as a DataFrame. |
| GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) | agg(*exprs) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.agg.html) | Computes aggregates and returns the result as a DataFrame. |
| GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) | mean(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.GroupedData.mean.html) | Computes average values for each numeric column for each group. |
| GroupedData (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/grouping.html) | pivot(pivot_col[, values]) | Pivots a column of the current DataFrame and performs the specified aggregation. |
| DataFrameReader (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) | table(tableName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.table.html) | Returns the specified table as a DataFrame. |
| DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) | mode(saveMode) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.mode.html) | Specifies the behavior when data or table already exist. |
| DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) | saveAsTable(name[, format, mode, …]) | Saves the content of the DataFrame as the specified table. |
| DataFrameWriter (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) | text(path[, compression, lineSep]) | Saves the content of the DataFrame in a text file at the specified path. |
| DataFrameWriterV2 (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/io.html) | replace() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriterV2.replace.html) | Replaces data in the existing table. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | cacheTable(tableName[, storageLevel]) | Caches the specified table in-memory. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | clearCache() (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.clearCache.html) | Removes all cached tables from the in-memory cache. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | dropGlobalTempView(viewName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.dropGlobalTempView.html) | Drops the global temporary view with the given view name. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | dropTempView(viewName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.dropTempView.html) | Drops the local temporary view with the given view name. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | isCached(tableName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.isCached.html) | Returns True if the table is currently cached in-memory. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | refreshByPath(path) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.refreshByPath.html) | Invalidates and refreshes all the cached data for any DataFrame that contains the given data source path. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | refreshTable(tableName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.refreshTable.html) | Invalidates and refreshes all the cached data and metadata of the given table. |
| Catalog (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/catalog.html) | uncacheTable(tableName) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.uncacheTable.html) | Removes the specified table from the in-memory cache. |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | partitionBy(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.partitionBy.html) | Creates a WindowSpec with the partitioning defined. |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | orderBy(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.orderBy.html) | Creates a WindowSpec with the ordering defined. |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | rangeBetween(start, end) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.rangeBetween.html) | Creates a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive). |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | rowsBetween(start, end) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.rowsBetween.html) | Creates a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive). |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | unboundedPreceding | Value representing the first row in the partition, for use in frame boundary definition. |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | unboundedFollowing | Value representing the last row in the partition, for use in frame boundary definition. |
| Window (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | currentRow | Value representing the current row, for use in frame boundary definition. |
| WindowSpec (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | partitionBy(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.WindowSpec.partitionBy.html) | Defines the partitioning columns in a WindowSpec. |
| WindowSpec (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | orderBy(*cols) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.WindowSpec.orderBy.html) | Defines the ordering columns in a WindowSpec. |
| WindowSpec (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | rangeBetween(start, end) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.WindowSpec.rangeBetween.html) | Defines the frame boundaries, from start (inclusive) to end (inclusive). |
| WindowSpec (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/window.html) | rowsBetween(start, end) (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.WindowSpec.rowsBetween.html) | Defines the frame boundaries, from start (inclusive) to end (inclusive). |