
Dataset API — Untyped Transformations

Untyped transformations are part of the Dataset API for transforming a Dataset into a DataFrame, a Column, a RelationalGroupedDataset, a DataFrameNaFunctions or a DataFrameStatFunctions (and hence untyped, since the result is no longer a strongly-typed Dataset[T]).

NOTE: Untyped transformations are the methods in the Dataset Scala class that are grouped under the untypedrel group name (i.e. @group untypedrel).

[[methods]]
.Dataset API's Untyped Transformations
[cols="1,2",options="header",width="100%"]
|===
| Transformation
| Description

| <<agg, agg>> a|

[source, scala]
----
agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
agg(expr: Column, exprs: Column*): DataFrame
agg(exprs: Map[String, String]): DataFrame
----

| <<apply, apply>> a| Selects a column based on the column name (i.e. maps a Dataset onto a Column)

[source, scala]
----
apply(colName: String): Column
----

| <<col, col>> a| Selects a column based on the column name (i.e. maps a Dataset onto a Column)

[source, scala]
----
col(colName: String): Column
----

| <<colRegex, colRegex>> a| Selects a column based on the column name specified as a regex (i.e. maps a Dataset onto a Column)

[source, scala]
----
colRegex(colName: String): Column
----

| <<crossJoin, crossJoin>> a|

[source, scala]
----
crossJoin(right: Dataset[_]): DataFrame
----

| <<cube, cube>> a|

[source, scala]
----
cube(cols: Column*): RelationalGroupedDataset
cube(col1: String, cols: String*): RelationalGroupedDataset
----

| <<drop, drop>> a|

[source, scala]
----
drop(colName: String): DataFrame
drop(colNames: String*): DataFrame
drop(col: Column): DataFrame
----

| <<groupBy, groupBy>> a|

[source, scala]
----
groupBy(cols: Column*): RelationalGroupedDataset
groupBy(col1: String, cols: String*): RelationalGroupedDataset
----

| <<join, join>> a|

[source, scala]
----
join(right: Dataset[_]): DataFrame
join(right: Dataset[_], usingColumn: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], joinExprs: Column): DataFrame
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
----

| <<na, na>> a|

[source, scala]
----
na: DataFrameNaFunctions
----

| <<rollup, rollup>> a|

[source, scala]
----
rollup(cols: Column*): RelationalGroupedDataset
rollup(col1: String, cols: String*): RelationalGroupedDataset
----

| <<select, select>> a|

[source, scala]
----
select(cols: Column*): DataFrame
select(col: String, cols: String*): DataFrame
----

| <<selectExpr, selectExpr>> a|

[source, scala]
----
selectExpr(exprs: String*): DataFrame
----

| <<stat, stat>> a|

[source, scala]
----
stat: DataFrameStatFunctions
----

| <<withColumn, withColumn>> a|

[source, scala]
----
withColumn(colName: String, col: Column): DataFrame
----

| <<withColumnRenamed, withColumnRenamed>> a|

[source, scala]
----
withColumnRenamed(existingName: String, newName: String): DataFrame
----

|===

=== [[agg]] agg Untyped Transformation

[source, scala]
----
agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
agg(expr: Column, exprs: Column*): DataFrame
agg(exprs: Map[String, String]): DataFrame
----

agg aggregates over the entire Dataset (without grouping), i.e. it is a shorthand for groupBy().agg(...).
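A minimal sketch of agg over a whole Dataset. The sample data and the local SparkSession are made up for illustration; it assumes the spark-sql dependency is on the classpath.

```scala
// Aggregate over the entire Dataset (no groupBy) with made-up sample data.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("agg-demo").getOrCreate()
import spark.implicits._

val sales = Seq(("books", 10), ("toys", 7), ("books", 5)).toDF("category", "amount")

// Column-based and Map-based variants of agg give the same result here
val total  = sales.agg(sum($"amount").as("total")).first.getLong(0)
val viaMap = sales.agg(Map("amount" -> "sum")).first.getLong(0)

spark.stop()
```

Both variants aggregate all rows into a single-row DataFrame.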

=== [[apply]] apply Untyped Transformation

[source, scala]
----
apply(colName: String): Column
----

apply selects a column based on the column name (i.e. maps a Dataset onto a Column).
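Since apply is Scala's special method name, it powers the ds("columnName") syntax. A minimal sketch with made-up data, assuming a local SparkSession:

```scala
// people("name") is sugar for people.apply("name"), which returns a Column.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("apply-demo").getOrCreate()
import spark.implicits._

val people = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

val names = people.select(people("name")).as[String].collect().sorted

spark.stop()
```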

=== [[col]] col Untyped Transformation

[source, scala]
----
col(colName: String): Column
----

col selects a column based on the column name (i.e. maps a Dataset onto a Column).

Internally, col branches off per the input column name.

If the column name is * (a star), col simply creates a Column with a ResolvedStar expression (with the output of the analyzed logical plan of the QueryExecution).

Otherwise, col uses the <<colRegex, colRegex>> untyped transformation when the spark.sql.parser.quotedRegexColumnNames configuration property is enabled.

When the column name is not * and the spark.sql.parser.quotedRegexColumnNames configuration property is disabled, col creates a Column with the column name resolved (as a NamedExpression).
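A minimal sketch of both branches of col, the star and a regular column name (sample data made up, local SparkSession assumed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("col-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "v")

val ids      = df.select(df.col("id")).as[Int].collect().sorted
val starCols = df.select(df.col("*")).columns  // the star expands to all columns

spark.stop()
```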

=== [[colRegex]] colRegex Untyped Transformation

[source, scala]
----
colRegex(colName: String): Column
----

colRegex selects a column based on the column name specified as a regex (i.e. maps a Dataset onto a Column).

NOTE: colRegex is used in <<col, col>> when the spark.sql.parser.quotedRegexColumnNames configuration property is enabled (and the column name is not *).

Internally, colRegex matches the input column name against regular expressions (in this order):

. For quoted column names without a qualifier, colRegex simply creates a Column with an UnresolvedRegex (with no table)

. For quoted column names with a qualifier, colRegex simply creates a Column with an UnresolvedRegex (with the table specified)

. For other column names, colRegex (behaves like <<col, col>> and) creates a Column with the column name resolved (as a NamedExpression)
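A minimal sketch of colRegex with a back-quoted regex, which selects every column whose name matches (sample data made up, local SparkSession assumed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("colRegex-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, 2, 3)).toDF("a1", "a2", "b1")

// The back-quoted regex `a.*` matches columns a1 and a2 but not b1
val matched = df.select(df.colRegex("`a.*`")).columns

spark.stop()
```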

=== [[crossJoin]] crossJoin Untyped Transformation

[source, scala]
----
crossJoin(right: Dataset[_]): DataFrame
----

crossJoin joins this Dataset with the right Dataset using a cartesian (cross) product, i.e. pairs every row of this Dataset with every row of right.
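A minimal sketch of a cartesian product via crossJoin (sample data made up, local SparkSession assumed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("crossJoin-demo").getOrCreate()
import spark.implicits._

val sizes  = Seq("S", "M").toDF("size")
val colors = Seq("red", "blue").toDF("color")

// Every size paired with every color: 2 x 2 = 4 rows
val combos = sizes.crossJoin(colors).count()

spark.stop()
```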

=== [[cube]] cube Untyped Transformation

[source, scala]
----
cube(cols: Column*): RelationalGroupedDataset
cube(col1: String, cols: String*): RelationalGroupedDataset
----

cube creates a multi-dimensional cube over the specified columns, so aggregations can be computed for all possible combinations of the grouping columns (including per-column subtotals and the grand total).
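A minimal sketch of cube (sample data made up, local SparkSession assumed). Over two grouping columns, cube produces the grouping sets (k1, k2), (k1), (k2) and ():

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("cube-demo").getOrCreate()
import spark.implicits._

val df = Seq(("a", "x", 1), ("a", "y", 2), ("b", "x", 3)).toDF("k1", "k2", "v")

// 3 (k1, k2) combinations + 2 k1 subtotals + 2 k2 subtotals + 1 grand total = 8 rows
val cubed = df.cube($"k1", $"k2").agg(sum($"v")).count()

spark.stop()
```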

=== [[drop]] Dropping One or More Columns -- drop Untyped Transformation

[source, scala]
----
drop(colName: String): DataFrame
drop(colNames: String*): DataFrame
drop(col: Column): DataFrame
----

drop returns a new Dataset with the given column(s) removed. It is a no-op for column names that do not exist in the schema.
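A minimal sketch of drop, including the no-op behaviour for an unknown column name (sample data made up, local SparkSession assumed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("drop-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "a", true)).toDF("id", "name", "flag")

// Dropping an unknown column name is a silent no-op
val slim = df.drop("flag", "no_such_column").columns

spark.stop()
```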

=== [[groupBy]] groupBy Untyped Transformation

[source, scala]
----
groupBy(cols: Column*): RelationalGroupedDataset
groupBy(col1: String, cols: String*): RelationalGroupedDataset
----

groupBy groups the rows by the specified columns and returns a RelationalGroupedDataset to run aggregation functions on the groups.
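A minimal sketch of groupBy followed by agg (sample data made up, local SparkSession assumed):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("groupBy-demo").getOrCreate()
import spark.implicits._

val sales = Seq(("books", 10), ("toys", 7), ("books", 5)).toDF("category", "amount")

// groupBy gives a RelationalGroupedDataset; agg turns it back into a DataFrame
val totals = sales.groupBy($"category").agg(sum($"amount").as("total"))
val booksTotal = totals.where($"category" === "books").first.getLong(1)

spark.stop()
```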

=== [[join]] join Untyped Transformation

[source, scala]
----
join(right: Dataset[_]): DataFrame
join(right: Dataset[_], usingColumn: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], joinExprs: Column): DataFrame
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
----

join joins this Dataset with the right Dataset, either on the given column name(s) (usingColumn, usingColumns) or on an arbitrary join expression (joinExprs), with an optional joinType (inner by default).
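A minimal sketch of the usingColumn and joinType variants (sample data made up, local SparkSession assumed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-demo").getOrCreate()
import spark.implicits._

val people = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
val cities = Seq((1, "Oslo")).toDF("id", "city")

val inner = people.join(cities, "id").count()                    // inner join keeps matches only
val left  = people.join(cities, Seq("id"), "left_outer").count() // left outer keeps all people

spark.stop()
```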

=== [[na]] na Untyped Transformation

[source, scala]
----
na: DataFrameNaFunctions
----

na simply creates a DataFrameNaFunctions to work with missing data.
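A minimal sketch of DataFrameNaFunctions via na (sample data made up, local SparkSession assumed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("na-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, null)).toDF("id", "name")

val kept   = df.na.drop().count()                                      // keep rows without nulls
val filled = df.na.fill("unknown").where($"name" === "unknown").count() // replace nulls

spark.stop()
```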

=== [[rollup]] rollup Untyped Transformation

[source, scala]
----
rollup(cols: Column*): RelationalGroupedDataset
rollup(col1: String, cols: String*): RelationalGroupedDataset
----

rollup creates a multi-dimensional rollup over the specified columns, computing hierarchical subtotals from left to right (a subset of the combinations that <<cube, cube>> computes).
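A minimal sketch of rollup over the same kind of data as the cube example (sample data made up, local SparkSession assumed). Over two grouping columns, rollup produces the grouping sets (k1, k2), (k1) and ():

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("rollup-demo").getOrCreate()
import spark.implicits._

val df = Seq(("a", "x", 1), ("a", "y", 2), ("b", "x", 3)).toDF("k1", "k2", "v")

// 3 (k1, k2) combinations + 2 k1 subtotals + 1 grand total = 6 rows (no k2-only subtotals)
val rolled = df.rollup($"k1", $"k2").agg(sum($"v")).count()

spark.stop()
```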

=== [[select]] select Untyped Transformation

[source, scala]
----
select(cols: Column*): DataFrame
select(col: String, cols: String*): DataFrame
----

select projects a set of columns (i.e. a relational projection) and returns the result as a new DataFrame.
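A minimal sketch of the Column-based and name-based variants of select (sample data made up, local SparkSession assumed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("select-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, 10), (2, 20)).toDF("id", "v")

// Column-based variant: project id and a derived column
val doubled = df.select($"id", ($"v" * 2).as("doubled"))
val firstDoubled = doubled.orderBy($"id").first.getInt(1)

// Name-based variant: project (and reorder) by column name
val cols = df.select("v", "id").columns

spark.stop()
```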

=== [[selectExpr]] Projecting Columns using SQL Statements -- selectExpr Untyped Transformation

[source, scala]
----
selectExpr(exprs: String*): DataFrame
----

selectExpr is like select, but accepts SQL expressions (as strings) instead of Column objects.

[source, scala]
----
val ds = spark.range(5)

scala> ds.selectExpr("rand() as random").show
16/04/14 23:16:06 INFO HiveSqlParser: Parsing command: rand() as random
+-------------------+
|             random|
+-------------------+
|  0.887675894185651|
|0.36766085091074086|
| 0.2700020856675186|
| 0.1489033635529543|
| 0.5862990791950973|
+-------------------+
----


Internally, selectExpr executes select with every expression in exprs mapped to a spark-sql-Column.md[Column] (using spark-sql-SparkSqlParser.md[SparkSqlParser.parseExpression]).

[source, scala]
----
scala> ds.select(expr("rand() as random")).show
+------------------+
|            random|
+------------------+
|0.5514319279894851|
|0.2876221510433741|
|0.4599999092045741|
|0.5708558868374893|
|0.6223314406247136|
+------------------+
----


=== [[stat]] stat Untyped Transformation

[source, scala]
----
stat: DataFrameStatFunctions
----

stat simply creates a DataFrameStatFunctions to work with statistical functions.
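A minimal sketch of DataFrameStatFunctions via stat, using approxQuantile as an example (sample data made up, local SparkSession assumed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("stat-demo").getOrCreate()
import spark.implicits._

val df = Seq(1.0, 2.0, 3.0, 4.0, 5.0).toDF("x")

// approxQuantile with relativeError 0.0 computes the exact quantile
val Array(median) = df.stat.approxQuantile("x", Array(0.5), 0.0)

spark.stop()
```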

=== [[withColumn]] withColumn Untyped Transformation

[source, scala]
----
withColumn(colName: String, col: Column): DataFrame
----

withColumn returns a new Dataset with a column of the given name added (or replaced, when a column with that name already exists) based on the given Column expression.
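A minimal sketch of withColumn adding a derived column (sample data made up, local SparkSession assumed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("withColumn-demo").getOrCreate()
import spark.implicits._

val df = Seq(("pen", 2, 3.0)).toDF("item", "qty", "price")

// Adds a new "total" column; reusing an existing name would replace that column
val priced = df.withColumn("total", $"qty" * $"price")
val total = priced.first.getDouble(3)

spark.stop()
```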

=== [[withColumnRenamed]] withColumnRenamed Untyped Transformation

[source, scala]
----
withColumnRenamed(existingName: String, newName: String): DataFrame
----

withColumnRenamed returns a new Dataset with the existingName column renamed to newName (or the same Dataset when no column has that name).
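A minimal sketch of withColumnRenamed (sample data made up, local SparkSession assumed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("rename-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "alice")).toDF("id", "name")

// Renaming an existing column changes the schema; a missing name would be a no-op
val renamed = df.withColumnRenamed("name", "full_name").columns

spark.stop()
```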


Last update: 2020-11-16