Dataset support for Snowpark Connect for Spark (Java/Scala)
class Dataset[T] extends Serializable
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations.
Operations available on Datasets are divided into transformations and actions. Transformations
produce new Datasets, and actions trigger computation and return results. Example
transformations include map, filter, select, and groupBy. Example
actions include count, show, and collect.
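For example, the following minimal sketch (assuming an existing `spark` session connected through Snowpark Connect, and a hypothetical `employees` table) chains lazy transformations and then triggers computation with actions:

```scala
import org.apache.spark.sql.functions.col

// "employees" is a hypothetical table name used only for illustration.
val ds = spark.read.table("employees")

// Transformations are lazy; each call returns a new Dataset without running a query.
val highEarners = ds
  .filter(col("salary") > 100000)
  .select("name", "department", "salary")

// Actions trigger computation and return results to the client.
highEarners.show()              // displays the first 20 rows
val total = highEarners.count() // returns the row count as a Long
```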
Snowpark Connect for Spark supports the Spark 3.5 Dataset API (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html) for Java and Scala. Because both languages share a single JVM client, support does not differ significantly between Java and Scala. All supported and unsupported APIs are described in this topic.
For detailed DataFrame API support, see DataFrame support for Snowpark Connect for Spark.
Methods
The following table lists all Dataset methods and their support status in Snowpark Connect for Spark.
| Method | Description |
|---|---|
| agg(Column, Column*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#agg(expr:org.apache.spark.sql.Column,exprs:org.apache.spark.sql.Column*):org.apache.spark.sql.DataFrame) | Aggregates on the entire Dataset without groups. Shorthand for `ds.groupBy().agg(...)`. |
| agg(Map[String, String]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#agg(exprs:Map[String,String]):org.apache.spark.sql.DataFrame) | (Scala-specific) Aggregates on the entire Dataset without groups, using a map of column names to aggregate functions. |
| agg(java.util.Map[String, String]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#agg(exprs:java.util.Map[String,String]):org.apache.spark.sql.DataFrame) | (Java-specific) Aggregates on the entire Dataset without groups, using a map of column names to aggregate functions. |
| agg((String, String), (String, String)*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#agg(aggExpr:(String,String),aggExprs:(String,String)*):org.apache.spark.sql.DataFrame) | (Scala-specific) Aggregates on the entire Dataset without groups, using pairs of column names and aggregate function names. |
| alias(String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#alias(alias:String):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset with an alias set. Same as `as`. |
| alias(Symbol) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#alias(alias:Symbol):org.apache.spark.sql.Dataset[T]) | (Scala-specific) Returns a new Dataset with an alias set. |
| apply(String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#apply(colName:String):org.apache.spark.sql.Column) | Selects column based on the column name and returns it as a `Column`. |
| as(String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#as(alias:String):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset with an alias set. |
| as(Symbol) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#as(alias:Symbol):org.apache.spark.sql.Dataset[T]) | (Scala-specific) Returns a new Dataset with an alias set. |
| as[U](Encoder[U]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#as[U](implicitevidence$2:org.apache.spark.sql.Encoder[U]):org.apache.spark.sql.Dataset[U]) | Returns a new Dataset where each record has been mapped on to the specified type `U`. |
| cache() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#cache():Dataset.this.type) | Persists this Dataset with the default storage level (`MEMORY_AND_DISK`). |
| coalesce(Int) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#coalesce(numPartitions:Int):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset that has exactly `numPartitions` partitions. |
| col(String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#col(colName:String):org.apache.spark.sql.Column) | Selects column based on the column name and returns it as a `Column`. |
| colRegex(String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#colRegex(colName:String):org.apache.spark.sql.Column) | Selects column based on the column name specified as a regex and returns it as a `Column`. |
| collect() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#collect():Array[T]) | Returns an `Array` that contains all rows in this Dataset. |
| collectAsList() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#collectAsList():java.util.List[T]) | Returns a `java.util.List` that contains all rows in this Dataset. |
| count() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#count():Long) | Returns the number of rows in the Dataset as a `Long`. |
| createGlobalTempView(String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#createGlobalTempView(viewName:String):Unit) | Creates a global temporary view using the given name. |
| createOrReplaceGlobalTempView(String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#createOrReplaceGlobalTempView(viewName:String):Unit) | Creates or replaces a global temporary view using the given name. |
| createOrReplaceTempView(String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#createOrReplaceTempView(viewName:String):Unit) | Creates or replaces a local temporary view using the given name. |
| createTempView(String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#createTempView(viewName:String):Unit) | Creates a local temporary view using the given name. |
| crossJoin(Dataset[_]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#crossJoin(right:org.apache.spark.sql.Dataset[_]):org.apache.spark.sql.DataFrame) | Explicit cartesian join with another `DataFrame`. |
| cube(String, String*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#cube(col1:String,cols:String*):org.apache.spark.sql.RelationalGroupedDataset) | Creates a multi-dimensional cube for the current Dataset using column names for running aggregations. |
| cube(Column*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#cube(cols:org.apache.spark.sql.Column*):org.apache.spark.sql.RelationalGroupedDataset) | Creates a multi-dimensional cube for the current Dataset using `Column`s for running aggregations. |
| describe(String*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#describe(cols:String*):org.apache.spark.sql.DataFrame) | Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. |
| distinct() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#distinct():org.apache.spark.sql.Dataset[T]) | Returns a new Dataset that contains only the unique rows. This is an alias for `dropDuplicates`. |
| drop(String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#drop(colName:String):org.apache.spark.sql.DataFrame) | Returns a new Dataset with a column dropped by name. This is a no-op if the schema doesn’t contain the column name. |
| drop(String*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#drop(colNames:String*):org.apache.spark.sql.DataFrame) | Returns a new Dataset with multiple columns dropped by name. |
| drop(Column) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#drop(col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame) | Returns a new Dataset with a column dropped. Accepts a `Column` rather than a name. |
| drop(Column, Column*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#drop(col:org.apache.spark.sql.Column,cols:org.apache.spark.sql.Column*):org.apache.spark.sql.DataFrame) | Returns a new Dataset with multiple columns dropped using `Column`s. |
| dropDuplicates() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#dropDuplicates():org.apache.spark.sql.Dataset[T]) | Returns a new Dataset with duplicate rows removed. |
| dropDuplicates(Seq[String]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#dropDuplicates(colNames:Seq[String]):org.apache.spark.sql.Dataset[T]) | (Scala-specific) Returns a new Dataset with duplicate rows removed, considering only the subset of columns. |
| dropDuplicates(Array[String]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#dropDuplicates(colNames:Array[String]):org.apache.spark.sql.Dataset[T]) | (Java-specific) Returns a new Dataset with duplicate rows removed, considering only the subset of columns. |
| dropDuplicates(String, String*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#dropDuplicates(col1:String,cols:String*):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset with duplicate rows removed, considering only the subset of columns. |
| except(Dataset[T]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#except(other:org.apache.spark.sql.Dataset[T]):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset containing rows in this Dataset but not in another Dataset. Equivalent to `EXCEPT DISTINCT` in SQL. |
| exceptAll(Dataset[T]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#exceptAll(other:org.apache.spark.sql.Dataset[T]):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset containing rows in this Dataset but not in another Dataset while preserving duplicates. Equivalent to `EXCEPT ALL` in SQL. |
| explain() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#explain():Unit) | Prints the physical plan to the console for debugging purposes. |
| explain(Boolean) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#explain(extended:Boolean):Unit) | Prints the plans (logical and physical) to the console for debugging purposes. |
| explain(String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#explain(mode:String):Unit) | Prints the plans with a format specified by a given explain mode (`simple`, `extended`, `codegen`, `cost`, or `formatted`). |
| filter(Column) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#filter(condition:org.apache.spark.sql.Column):org.apache.spark.sql.Dataset[T]) | Filters rows using the given `Column` condition. |
| filter(String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#filter(conditionExpr:String):org.apache.spark.sql.Dataset[T]) | Filters rows using the given SQL expression string. |
| filter(FilterFunction[T]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#filter(func:org.apache.spark.api.java.function.FilterFunction[T]):org.apache.spark.sql.Dataset[T]) | (Java-specific) Returns a new Dataset that only contains elements where `func` returns `true`. |
| filter(T => Boolean) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#filter(func:T=%3EBoolean):org.apache.spark.sql.Dataset[T]) | (Scala-specific) Returns a new Dataset that only contains elements where `func` returns `true`. |
| first() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#first():T) | Returns the first row. Alias for `head()`. |
| flatMap[U](FlatMapFunction[T, U], Encoder[U]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#flatMap[U](f:org.apache.spark.api.java.function.FlatMapFunction[T,U],encoder:org.apache.spark.sql.Encoder[U]):org.apache.spark.sql.Dataset[U]) | (Java-specific) Returns a new Dataset by first applying a function to all elements and then flattening the results. |
| flatMap[U](T => TraversableOnce[U])(Encoder[U]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#flatMap[U](func:T=%3ETraversableOnce[U])(implicitevidence$8:org.apache.spark.sql.Encoder[U]):org.apache.spark.sql.Dataset[U]) | (Scala-specific) Returns a new Dataset by first applying a function to all elements and then flattening the results. |
| foreach(ForeachFunction[T]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#foreach(func:org.apache.spark.api.java.function.ForeachFunction[T]):Unit) | (Java-specific) Runs `func` on each element of this Dataset. |
| foreach(T => Unit) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#foreach(f:T=%3EUnit):Unit) | (Scala-specific) Applies a function to all rows. |
| foreachPartition(ForeachPartitionFunction[T]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#foreachPartition(func:org.apache.spark.api.java.function.ForeachPartitionFunction[T]):Unit) | (Java-specific) Runs `func` on each partition of this Dataset. |
| foreachPartition(Iterator[T] => Unit) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#foreachPartition(f:Iterator[T]=%3EUnit):Unit) | (Scala-specific) Applies a function to each partition of this Dataset. |
| groupBy(Column*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#groupBy(cols:org.apache.spark.sql.Column*):org.apache.spark.sql.RelationalGroupedDataset) | Groups the Dataset using the specified `Column`s for running aggregations. |
| groupBy(String, String*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#groupBy(col1:String,cols:String*):org.apache.spark.sql.RelationalGroupedDataset) | Groups the Dataset using the specified column names for running aggregations. |
| groupByKey[K](MapFunction[T, K], Encoder[K]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#groupByKey[K](func:org.apache.spark.api.java.function.MapFunction[T,K],encoder:org.apache.spark.sql.Encoder[K]):org.apache.spark.sql.KeyValueGroupedDataset[K,T]) | (Java-specific) Returns a `KeyValueGroupedDataset` where the data is grouped by the given key `func`. |
| groupByKey[K](T => K)(Encoder[K]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#groupByKey[K](func:T=%3EK)(implicitevidence$3:org.apache.spark.sql.Encoder[K]):org.apache.spark.sql.KeyValueGroupedDataset[K,T]) | (Scala-specific) Returns a `KeyValueGroupedDataset` where the data is grouped by the given key `func`. |
| head() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#head():T) | Returns the first row. |
| head(Int) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#head(n:Int):Array[T]) | Returns the first `n` rows. |
| hint(String, Any*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#hint(name:String,parameters:Any*):org.apache.spark.sql.Dataset[T]) | Specifies some hint on the current Dataset (for example, broadcast hint for joins). |
| intersect(Dataset[T]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#intersect(other:org.apache.spark.sql.Dataset[T]):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset containing rows only in both this Dataset and another Dataset. Equivalent to `INTERSECT` in SQL. |
| intersectAll(Dataset[T]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#intersectAll(other:org.apache.spark.sql.Dataset[T]):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset containing rows only in both Datasets while preserving duplicates. Equivalent to `INTERSECT ALL` in SQL. |
| join(Dataset[_]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#join(right:org.apache.spark.sql.Dataset[_]):org.apache.spark.sql.DataFrame) | Joins with another `DataFrame`. |
| join(Dataset[_], String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#join(right:org.apache.spark.sql.Dataset[_],usingColumn:String):org.apache.spark.sql.DataFrame) | Inner equi-join with another `DataFrame` using the given column. |
| join(Dataset[_], Seq[String]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#join(right:org.apache.spark.sql.Dataset[_],usingColumns:Seq[String]):org.apache.spark.sql.DataFrame) | (Scala-specific) Inner equi-join with another `DataFrame` using the given columns. |
| join(Dataset[_], Array[String]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#join(right:org.apache.spark.sql.Dataset[_],usingColumns:Array[String]):org.apache.spark.sql.DataFrame) | (Java-specific) Inner equi-join with another `DataFrame` using the given columns. |
| join(Dataset[_], String, String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#join(right:org.apache.spark.sql.Dataset[_],usingColumn:String,joinType:String):org.apache.spark.sql.DataFrame) | Equi-join with another `DataFrame` using the given column and join type. |
| join(Dataset[_], Seq[String], String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#join(right:org.apache.spark.sql.Dataset[_],usingColumns:Seq[String],joinType:String):org.apache.spark.sql.DataFrame) | (Scala-specific) Equi-join with another `DataFrame` using the given columns and join type. |
| join(Dataset[_], Array[String], String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#join(right:org.apache.spark.sql.Dataset[_],usingColumns:Array[String],joinType:String):org.apache.spark.sql.DataFrame) | (Java-specific) Equi-join with another `DataFrame` using the given columns and join type. |
| join(Dataset[_], Column) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#join(right:org.apache.spark.sql.Dataset[_],joinExprs:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame) | Inner join with another `DataFrame` using the given join expression. |
| join(Dataset[_], Column, String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#join(right:org.apache.spark.sql.Dataset[_],joinExprs:org.apache.spark.sql.Column,joinType:String):org.apache.spark.sql.DataFrame) | Joins with another `DataFrame` using the given join expression and join type. |
| joinWith[U](Dataset[U], Column) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#joinWith[U](other:org.apache.spark.sql.Dataset[U],condition:org.apache.spark.sql.Column):org.apache.spark.sql.Dataset[(T,U)]) | Inner equi-join to join this Dataset, returning a `Tuple2` for each pair where the condition evaluates to `true`. |
| joinWith[U](Dataset[U], Column, String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#joinWith[U](other:org.apache.spark.sql.Dataset[U],condition:org.apache.spark.sql.Column,joinType:String):org.apache.spark.sql.Dataset[(T,U)]) | Joins this Dataset, returning a `Tuple2` for each pair where the condition evaluates to `true`. |
| limit(Int) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#limit(n:Int):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset by taking the first `n` rows. |
| map[U](MapFunction[T, U], Encoder[U]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#map[U](func:org.apache.spark.api.java.function.MapFunction[T,U],encoder:org.apache.spark.sql.Encoder[U]):org.apache.spark.sql.Dataset[U]) | (Java-specific) Returns a new Dataset that contains the result of applying `func` to each element. |
| map[U](T => U)(Encoder[U]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#map[U](func:T=%3EU)(implicitevidence$6:org.apache.spark.sql.Encoder[U]):org.apache.spark.sql.Dataset[U]) | (Scala-specific) Returns a new Dataset that contains the result of applying `func` to each element. |
| mapPartitions[U](MapPartitionsFunction[T, U], Encoder[U]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#mapPartitions[U](f:org.apache.spark.api.java.function.MapPartitionsFunction[T,U],encoder:org.apache.spark.sql.Encoder[U]):org.apache.spark.sql.Dataset[U]) | (Java-specific) Returns a new Dataset that contains the result of applying `func` to each partition. |
| mapPartitions[U](Iterator[T] => Iterator[U])(Encoder[U]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#mapPartitions[U](func:Iterator[T]=%3EIterator[U])(implicitevidence$7:org.apache.spark.sql.Encoder[U]):org.apache.spark.sql.Dataset[U]) | (Scala-specific) Returns a new Dataset that contains the result of applying `func` to each partition. |
| melt(Array[Column], Array[Column], String, String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#melt(ids:Array[org.apache.spark.sql.Column],values:Array[org.apache.spark.sql.Column],variableColumnName:String,valueColumnName:String):org.apache.spark.sql.DataFrame) | Unpivots a `DataFrame` from wide format to long format. |
| melt(Array[Column], String, String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#melt(ids:Array[org.apache.spark.sql.Column],variableColumnName:String,valueColumnName:String):org.apache.spark.sql.DataFrame) | Unpivots a `DataFrame` from wide format to long format, using all columns not in `ids` as values. |
| metadataColumn(String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#metadataColumn(colName:String):org.apache.spark.sql.Column) | Selects a metadata column based on its logical column name and returns it as a `Column`. |
| observe(String, Column, Column*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#observe(name:String,expr:org.apache.spark.sql.Column,exprs:org.apache.spark.sql.Column*):org.apache.spark.sql.Dataset[T]) | Defines named metrics to observe on the Dataset. |
| observe(Observation, Column, Column*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#observe(observation:org.apache.spark.sql.Observation,expr:org.apache.spark.sql.Column,exprs:org.apache.spark.sql.Column*):org.apache.spark.sql.Dataset[T]) | Observes named metrics through an `Observation` instance. |
| offset(Int) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#offset(n:Int):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset by skipping the first `n` rows. |
| orderBy(Column*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#orderBy(sortExprs:org.apache.spark.sql.Column*):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset sorted by the given `Column` expressions. |
| orderBy(String, String*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#orderBy(sortCol:String,sortCols:String*):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset sorted by the given column names. |
| persist() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#persist():Dataset.this.type) | Persists this Dataset with the default storage level (`MEMORY_AND_DISK`). |
| persist(StorageLevel) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#persist(newLevel:org.apache.spark.storage.StorageLevel):Dataset.this.type) | Persists this Dataset with the given `StorageLevel`. |
| printSchema() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#printSchema():Unit) | Prints the schema to the console in a nice tree format. |
| printSchema(Int) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#printSchema(level:Int):Unit) | Prints the schema up to the given level to the console in a nice tree format. |
| repartition(Int) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#repartition(numPartitions:Int):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset that has exactly `numPartitions` partitions. |
| repartition(Int, Column*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#repartition(numPartitions:Int,partitionExprs:org.apache.spark.sql.Column*):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset hash-partitioned by the given partitioning expressions into `numPartitions` partitions. |
| repartition(Column*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#repartition(partitionExprs:org.apache.spark.sql.Column*):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset hash-partitioned by the given partitioning expressions, using `spark.sql.shuffle.partitions` as the number of partitions. |
| repartitionByRange(Int, Column*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#repartitionByRange(numPartitions:Int,partitionExprs:org.apache.spark.sql.Column*):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset range-partitioned by the given partitioning expressions into `numPartitions` partitions. |
| repartitionByRange(Column*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#repartitionByRange(partitionExprs:org.apache.spark.sql.Column*):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset range-partitioned by the given partitioning expressions, using `spark.sql.shuffle.partitions` as the number of partitions. |
| rollup(String, String*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#rollup(col1:String,cols:String*):org.apache.spark.sql.RelationalGroupedDataset) | Creates a multi-dimensional rollup for the current Dataset using column names for running aggregations. |
| rollup(Column*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#rollup(cols:org.apache.spark.sql.Column*):org.apache.spark.sql.RelationalGroupedDataset) | Creates a multi-dimensional rollup for the current Dataset using `Column`s for running aggregations. |
| sameSemantics(Dataset[T]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#sameSemantics(other:org.apache.spark.sql.Dataset[T]):Boolean) | Returns `true` when the logical query plans inside both Datasets are equal and therefore return the same results. |
| sample(Double) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#sample(fraction:Double):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset by sampling a fraction of rows (without replacement). |
| sample(Double, Long) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#sample(fraction:Double,seed:Long):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset by sampling a fraction of rows (without replacement), using a user-supplied seed. |
| sample(Boolean, Double) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#sample(withReplacement:Boolean,fraction:Double):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset by sampling a fraction of rows, using a random seed. |
| sample(Boolean, Double, Long) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#sample(withReplacement:Boolean,fraction:Double,seed:Long):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset by sampling a fraction of rows, using a user-supplied seed. |
| select(Column*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#select(cols:org.apache.spark.sql.Column*):org.apache.spark.sql.DataFrame) | Selects a set of column-based expressions. |
| select(String, String*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#select(col:String,cols:String*):org.apache.spark.sql.DataFrame) | Selects a set of columns by name. |
| select[U1](TypedColumn[T, U1]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#select[U1](c1:org.apache.spark.sql.TypedColumn[T,U1]):org.apache.spark.sql.Dataset[U1]) | Returns a new Dataset by computing the given `Column` expression for each element. |
| selectExpr(String*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#selectExpr(exprs:String*):org.apache.spark.sql.DataFrame) | Selects a set of SQL expressions. This is a variant of `select` that accepts SQL expressions. |
| semanticHash() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#semanticHash():Int) | Returns a `hashCode` of the logical query plan against this Dataset. |
| show() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#show():Unit) | Displays the top 20 rows of the Dataset in a tabular form. |
| show(Int) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#show(numRows:Int):Unit) | Displays the Dataset in a tabular form, showing only the first `numRows` rows. |
| show(Boolean) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#show(truncate:Boolean):Unit) | Displays the top 20 rows with truncation control. |
| show(Int, Boolean) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#show(numRows:Int,truncate:Boolean):Unit) | Displays the Dataset in a tabular form with truncation control. |
| show(Int, Int) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#show(numRows:Int,truncate:Int):Unit) | Displays the Dataset in a tabular form with truncation to a specific character count. |
| show(Int, Int, Boolean) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#show(numRows:Int,truncate:Int,vertical:Boolean):Unit) | Displays the Dataset in a tabular form with truncation and vertical display options. |
| sort(Column*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#sort(sortExprs:org.apache.spark.sql.Column*):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset sorted by the given `Column` expressions. |
| sort(String, String*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#sort(sortCol:String,sortCols:String*):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset sorted by the specified column names, all in ascending order. |
| summary(String*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#summary(statistics:String*):org.apache.spark.sql.DataFrame) | Computes specified statistics for numeric and string columns. Available statistics include count, mean, stddev, min, max, arbitrary percentiles, count_distinct, and approx_count_distinct. |
| tail(Int) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#tail(n:Int):Array[T]) | Returns the last `n` rows in the Dataset. |
| take(Int) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#take(n:Int):Array[T]) | Returns the first `n` rows in the Dataset. |
| takeAsList(Int) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#takeAsList(n:Int):java.util.List[T]) | Returns the first `n` rows in the Dataset as a `java.util.List`. |
| to(StructType) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#to(schema:org.apache.spark.sql.types.StructType):org.apache.spark.sql.DataFrame) | Returns a new `DataFrame` where each row is reconciled to match the specified schema. |
| toDF() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#toDF():org.apache.spark.sql.DataFrame) | Converts this strongly typed collection of data to a generic `DataFrame`. |
| toDF(String*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#toDF(colNames:String*):org.apache.spark.sql.DataFrame) | Converts this strongly typed collection of data to a generic `DataFrame` with columns renamed. |
| transform[U](Dataset[T] => Dataset[U]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#transform[U](t:org.apache.spark.sql.Dataset[T]=%3Eorg.apache.spark.sql.Dataset[U]):org.apache.spark.sql.Dataset[U]) | Concise syntax for chaining custom transformations. |
| union(Dataset[T]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#union(other:org.apache.spark.sql.Dataset[T]):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset containing the union of rows in this Dataset and another Dataset. Equivalent to `UNION ALL` in SQL. |
| unionAll(Dataset[T]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#unionAll(other:org.apache.spark.sql.Dataset[T]):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset containing the union of rows in this Dataset and another Dataset. This is an alias for `union`. |
| unionByName(Dataset[T]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#unionByName(other:org.apache.spark.sql.Dataset[T]):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset containing the union of rows in this Dataset and another Dataset. Resolves columns by name (not by position). |
| unionByName(Dataset[T], Boolean) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#unionByName(other:org.apache.spark.sql.Dataset[T],allowMissingColumns:Boolean):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset containing the union of rows, with support for missing columns. Missing columns are filled with `null`. |
| unpersist() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#unpersist():Dataset.this.type) | Marks the Dataset as non-persistent and removes all blocks for it from memory and disk. |
| unpersist(Boolean) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#unpersist(blocking:Boolean):Dataset.this.type) | Marks the Dataset as non-persistent, optionally blocking until all blocks are deleted. |
| unpivot(Array[Column], Array[Column], String, String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#unpivot(ids:Array[org.apache.spark.sql.Column],values:Array[org.apache.spark.sql.Column],variableColumnName:String,valueColumnName:String):org.apache.spark.sql.DataFrame) | Unpivots a `DataFrame` from wide format to long format. |
| unpivot(Array[Column], String, String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#unpivot(ids:Array[org.apache.spark.sql.Column],variableColumnName:String,valueColumnName:String):org.apache.spark.sql.DataFrame) | Unpivots a `DataFrame` from wide format to long format, using all columns not in `ids` as values. |
| where(Column) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#where(condition:org.apache.spark.sql.Column):org.apache.spark.sql.Dataset[T]) | Filters rows using the given `Column` condition. |
| where(String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#where(conditionExpr:String):org.apache.spark.sql.Dataset[T]) | Filters rows using the given SQL expression string. |
| withColumn(String, Column) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#withColumn(colName:String,col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame) | Returns a new Dataset by adding a column or replacing the existing column that has the same name. |
| withColumnRenamed(String, String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#withColumnRenamed(existingName:String,newName:String):org.apache.spark.sql.DataFrame) | Returns a new Dataset with a column renamed. This is a no-op if the schema doesn’t contain the existing name. |
| withColumns(Map[String, Column]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#withColumns(colsMap:Map[String,org.apache.spark.sql.Column]):org.apache.spark.sql.DataFrame) | (Scala-specific) Returns a new Dataset by adding columns or replacing existing columns that have the same names. |
| withColumns(java.util.Map[String, Column]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#withColumns(colsMap:java.util.Map[String,org.apache.spark.sql.Column]):org.apache.spark.sql.DataFrame) | (Java-specific) Returns a new Dataset by adding columns or replacing existing columns that have the same names. |
| withColumnsRenamed(Map[String, String]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#withColumnsRenamed(colsMap:Map[String,String]):org.apache.spark.sql.DataFrame) | (Scala-specific) Returns a new Dataset with columns renamed. This is a no-op if the schema doesn’t contain the existing name. |
| withColumnsRenamed(java.util.Map[String, String]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#withColumnsRenamed(colsMap:java.util.Map[String,String]):org.apache.spark.sql.DataFrame) | (Java-specific) Returns a new Dataset with columns renamed. This is a no-op if the schema doesn’t contain the existing name. |
| withMetadata(String, Metadata) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#withMetadata(columnName:String,metadata:org.apache.spark.sql.types.Metadata):org.apache.spark.sql.DataFrame) | Returns a new Dataset by updating an existing column with metadata. |
| writeTo(String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#writeTo(table:String):org.apache.spark.sql.DataFrameWriterV2[T]) | Creates a write configuration builder for v2 sources. |
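As a quick illustration of several supported methods used together, here is a minimal sketch; the `spark` session and the `orders` and `customers` table names are assumptions for illustration, not part of the support matrix:

```scala
import org.apache.spark.sql.functions.{col, sum}

// "orders" and "customers" are hypothetical table names used only for illustration.
val orders    = spark.read.table("orders")
val customers = spark.read.table("customers")

val totals = orders
  .dropDuplicates("order_id")                   // dropDuplicates(String, String*)
  .join(customers, Seq("customer_id"), "inner") // (Scala-specific) equi-join with a join type
  .groupBy("customer_id")                       // groupBy(String, String*)
  .agg(sum(col("amount")).as("total_amount"))   // agg(Column, Column*)
  .orderBy(col("total_amount").desc)            // orderBy(Column*)

totals.show(10) // action: display the first 10 rows
```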
Attributes
| Attribute | Description |
|---|---|
| columns (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#columns:Array[String]) | Returns all column names as an `Array[String]`. |
| dtypes (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#dtypes:Array[(String,String)]) | Returns all column names and their data types as an `Array` of `(String, String)` pairs. |
| encoder (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#encoder:org.apache.spark.sql.Encoder[T]) | The `Encoder` for type `T`. |
| inputFiles (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#inputFiles:Array[String]) | Returns a best-effort snapshot of the files that compose this Dataset as an `Array[String]`. |
| isLocal (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#isLocal:Boolean) | Returns `true` if the `collect` and `take` methods can be run locally (without any Spark executors). |
| isStreaming (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#isStreaming:Boolean) | Returns `true` if this Dataset contains one or more sources that continuously return data as it arrives. |
| na (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#na:org.apache.spark.sql.DataFrameNaFunctions) | Returns a `DataFrameNaFunctions` for working with missing data. |
| schema (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#schema:org.apache.spark.sql.types.StructType) | Returns the schema of this Dataset as a `StructType`. |
| sparkSession (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#sparkSession:org.apache.spark.sql.SparkSession) | The `SparkSession` this Dataset belongs to. |
| stat (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#stat:org.apache.spark.sql.DataFrameStatFunctions) | Returns a `DataFrameStatFunctions` for working with statistic functions. |
| storageLevel (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#storageLevel:org.apache.spark.storage.StorageLevel) | Gets the Dataset’s current `StorageLevel`, or `StorageLevel.NONE` if not persisted. |
| write (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#write:org.apache.spark.sql.DataFrameWriter[T]) | Interface for saving the content of the non-streaming Dataset out into external storage. Returns a `DataFrameWriter`. |
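A short sketch of reading these attributes, again assuming a `spark` session and a hypothetical `orders` table:

```scala
// "orders" is a hypothetical table name used only for illustration.
val df = spark.read.table("orders")

println(df.schema.treeString)      // schema as a StructType, rendered as a tree
println(df.columns.mkString(", ")) // all column names as an Array[String]
df.dtypes.foreach { case (name, dataType) => println(s"$name: $dataType") }

// na returns a DataFrameNaFunctions entry point for handling missing data.
val cleaned = df.na.fill(0L, Seq("amount"))
```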
Unsupported APIs
The following Dataset APIs are not currently supported in Snowpark Connect for Spark.
| Method | Description |
|---|---|
| checkpoint() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#checkpoint():org.apache.spark.sql.Dataset[T]) | Returns a checkpointed version of this Dataset. |
| checkpoint(Boolean) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#checkpoint(eager:Boolean):org.apache.spark.sql.Dataset[T]) | Returns a checkpointed version of this Dataset, optionally eager. |
| dropDuplicatesWithinWatermark() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#dropDuplicatesWithinWatermark():org.apache.spark.sql.Dataset[T]) | Returns a new Dataset with duplicate rows removed within the watermark. Streaming only. |
| dropDuplicatesWithinWatermark(Seq[String]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#dropDuplicatesWithinWatermark(colNames:Seq[String]):org.apache.spark.sql.Dataset[T]) | (Scala-specific) Returns a new Dataset with duplicate rows removed within the watermark, considering only a subset of columns. Streaming only. |
| dropDuplicatesWithinWatermark(Array[String]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#dropDuplicatesWithinWatermark(colNames:Array[String]):org.apache.spark.sql.Dataset[T]) | (Java-specific) Returns a new Dataset with duplicate rows removed within the watermark, considering only a subset of columns. Streaming only. |
| dropDuplicatesWithinWatermark(String, String*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#dropDuplicatesWithinWatermark(col1:String,cols:String*):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset with duplicate rows removed within the watermark, considering only a subset of columns. Streaming only. |
| isEmpty (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#isEmpty:Boolean) | Returns `true` if the Dataset is empty. |
| javaRDD (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#javaRDD:org.apache.spark.api.java.JavaRDD[T]) | Returns the content of the Dataset as a `JavaRDD` of `T`s. |
| localCheckpoint() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#localCheckpoint():org.apache.spark.sql.Dataset[T]) | Locally checkpoints a Dataset and returns the new Dataset. |
| localCheckpoint(Boolean) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#localCheckpoint(eager:Boolean):org.apache.spark.sql.Dataset[T]) | Locally checkpoints a Dataset, optionally eager. |
| randomSplit(Array[Double]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#randomSplit(weights:Array[Double]):Array[org.apache.spark.sql.Dataset[T]]) | Randomly splits this Dataset with the provided weights. |
| randomSplit(Array[Double], Long) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#randomSplit(weights:Array[Double],seed:Long):Array[org.apache.spark.sql.Dataset[T]]) | Randomly splits this Dataset with the provided weights and seed. |
| randomSplitAsList(Array[Double], Long) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#randomSplitAsList(weights:Array[Double],seed:Long):java.util.List[org.apache.spark.sql.Dataset[T]]) | Returns a `java.util.List` of randomly split Datasets with the provided weights and seed. |
| queryExecution (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#queryExecution:org.apache.spark.sql.execution.QueryExecution) | The `QueryExecution` behind this Dataset. |
| rdd (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#rdd:org.apache.spark.rdd.RDD[T]) | Represents the content of the Dataset as an `RDD` of `T`. |
| reduce(ReduceFunction[T]) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#reduce(func:org.apache.spark.api.java.function.ReduceFunction[T]):T) | (Java-specific) Reduces the elements of this Dataset using the specified binary function. |
| reduce((T, T) => T) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#reduce(func:(T,T)=%3ET):T) | (Scala-specific) Reduces the elements of this Dataset using the specified binary function. |
| sortWithinPartitions(Column*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#sortWithinPartitions(sortExprs:org.apache.spark.sql.Column*):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset with each partition sorted by the given `Column` expressions. |
| sortWithinPartitions(String, String*) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#sortWithinPartitions(sortCol:String,sortCols:String*):org.apache.spark.sql.Dataset[T]) | Returns a new Dataset with each partition sorted by the given column names. |
| sqlContext (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#sqlContext:org.apache.spark.sql.SQLContext) | The legacy `SQLContext` this Dataset belongs to. |
| toJSON (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#toJSON:org.apache.spark.sql.Dataset[String]) | Returns the content of the Dataset as a `Dataset` of JSON strings. |
| toJavaRDD (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#toJavaRDD:org.apache.spark.api.java.JavaRDD[T]) | Returns the content of the Dataset as a `JavaRDD` of `T`s. |
| toLocalIterator() (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#toLocalIterator():java.util.Iterator[T]) | Returns a `java.util.Iterator` that contains all rows in this Dataset. |
| withWatermark(String, String) (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#withWatermark(eventTime:String,delayThreshold:String):org.apache.spark.sql.Dataset[T]) | Defines an event time watermark for this Dataset. |
| writeStream (https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/Dataset.html#writeStream:org.apache.spark.sql.streaming.DataStreamWriter[T]) | Interface for saving the content of a streaming Dataset out into external storage. Returns a `DataStreamWriter`. |