Dataset API — Untyped Transformations

Untyped transformations are part of the Dataset API for transforming a Dataset into a DataFrame, a Column, a RelationalGroupedDataset, a DataFrameNaFunctions or a DataFrameStatFunctions, i.e. into a type that no longer carries the compile-time row type of the Dataset (and is hence untyped).

Note

Untyped transformations are the methods of the Dataset Scala class that are grouped under the untypedrel group name (i.e. @group untypedrel).

[[methods]]
.Dataset API's Untyped Transformations
[cols="1,2",options="header",width="100%"]
|===
| Transformation
| Description

| <<agg, agg>> a|

[source, scala]

agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
agg(expr: Column, exprs: Column*): DataFrame
agg(exprs: Map[String, String]): DataFrame


| <<apply, apply>> a| Selects a column based on the column name (i.e. maps a Dataset onto a Column)

[source, scala]

apply(colName: String): Column

| <<col, col>> a| Selects a column based on the column name (i.e. maps a Dataset onto a Column)

[source, scala]

col(colName: String): Column

| <<colRegex, colRegex>> a|

[source, scala]

colRegex(colName: String): Column

Selects a column based on the column name specified as a regex (i.e. maps a Dataset onto a Column)

| <<crossJoin, crossJoin>> a|

[source, scala]

crossJoin(right: Dataset[_]): DataFrame

| <<cube, cube>> a|

[source, scala]

cube(cols: Column*): RelationalGroupedDataset
cube(col1: String, cols: String*): RelationalGroupedDataset


| <<drop, drop>> a|

[source, scala]

drop(colName: String): DataFrame
drop(colNames: String*): DataFrame
drop(col: Column): DataFrame


| <<groupBy, groupBy>> a|

[source, scala]

groupBy(cols: Column*): RelationalGroupedDataset
groupBy(col1: String, cols: String*): RelationalGroupedDataset


| <<join, join>> a|

[source, scala]

join(right: Dataset[_]): DataFrame
join(right: Dataset[_], usingColumn: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], joinExprs: Column): DataFrame
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame


| <<na, na>> a|

[source, scala]

na: DataFrameNaFunctions

| <<rollup, rollup>> a|

[source, scala]

rollup(cols: Column*): RelationalGroupedDataset
rollup(col1: String, cols: String*): RelationalGroupedDataset


| <<select, select>> a|

[source, scala]

select(cols: Column*): DataFrame
select(col: String, cols: String*): DataFrame


| <<selectExpr, selectExpr>> a|

[source, scala]

selectExpr(exprs: String*): DataFrame

| <<stat, stat>> a|

[source, scala]

stat: DataFrameStatFunctions

| <<withColumn, withColumn>> a|

[source, scala]

withColumn(colName: String, col: Column): DataFrame

| <<withColumnRenamed, withColumnRenamed>> a|

[source, scala]

withColumnRenamed(existingName: String, newName: String): DataFrame

|===

=== [[agg]] agg Untyped Transformation

[source, scala]

agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
agg(expr: Column, exprs: Column*): DataFrame
agg(exprs: Map[String, String]): DataFrame


agg...FIXME
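
For illustration, a minimal sketch of the three variants, assuming a SparkSession `spark` is in scope (as in spark-shell); the sales dataset is made up:

```scala
// Minimal sketch; assumes a SparkSession `spark` is in scope (spark-shell style)
import spark.implicits._
import org.apache.spark.sql.functions.sum

val sales = Seq(("apple", 3), ("banana", 2), ("apple", 5)).toDF("fruit", "quantity")

// All three variants aggregate over the entire Dataset as a single group
sales.agg(sum("quantity"))           // Column-based variant
sales.agg("quantity" -> "max")       // (String, String) pair variant
sales.agg(Map("quantity" -> "sum"))  // Map variant
```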

=== [[apply]] apply Untyped Transformation

[source, scala]

apply(colName: String): Column

apply selects a column based on the column name (i.e. maps a Dataset onto a Column).

=== [[col]] col Untyped Transformation

[source, scala]

col(colName: String): Column

col selects a column based on the column name (i.e. maps a Dataset onto a Column).

Internally, col branches off per the input column name.

If the column name is * (a star), col simply creates a Column with a ResolvedStar expression (with the output schema attributes of the analyzed logical plan of the QueryExecution).

Otherwise, col uses colRegex untyped transformation when spark.sql.parser.quotedRegexColumnNames configuration property is enabled.

When the column name is not * and the spark.sql.parser.quotedRegexColumnNames configuration property is disabled, col creates a Column with the column name resolved (as a NamedExpression).
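
The branching above can be sketched with a tiny example, assuming a SparkSession `spark` is in scope; the dataset is made up:

```scala
// Minimal sketch; assumes a SparkSession `spark` is in scope (spark-shell style)
import spark.implicits._
val nums = Seq((1, "one"), (2, "two")).toDF("id", "name")

val c = nums.col("name")   // a regular name, resolved as a NamedExpression
nums.select(c)             // equivalent to nums.select("name")

nums.col("*")              // a star: a Column backed by a ResolvedStar expression
```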

=== [[colRegex]] colRegex Untyped Transformation

[source, scala]

colRegex(colName: String): Column

colRegex selects a column based on the column name specified as a regex (i.e. maps a Dataset onto a Column).

Note

colRegex is used in col when spark.sql.parser.quotedRegexColumnNames configuration property is enabled (and the column name is not *).

Internally, colRegex matches the input column name against different regular expressions (in the following order):

  1. For quoted column names without a qualifier, colRegex simply creates a Column with an UnresolvedRegex (with no table)

  2. For quoted column names with a qualifier, colRegex simply creates a Column with an UnresolvedRegex (with the table specified)

  3. For any other column name, colRegex (behaves like col and) creates a Column with the column name resolved (as a NamedExpression)
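
A small sketch of the quoting behaviour, assuming a SparkSession `spark` is in scope; the dataset is made up:

```scala
// Minimal sketch; assumes a SparkSession `spark` is in scope (spark-shell style)
import spark.implicits._
val df = Seq((1, "a", true)).toDF("col1", "col2", "flag")

// A backtick-quoted name is treated as a regular expression (UnresolvedRegex)
df.select(df.colRegex("`col[0-9]`"))  // selects col1 and col2

// A plain column name behaves like col
df.select(df.colRegex("flag"))
```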

=== [[crossJoin]] crossJoin Untyped Transformation

[source, scala]

crossJoin(right: Dataset[_]): DataFrame

crossJoin...FIXME
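
For illustration, a minimal sketch of the Cartesian product crossJoin computes, assuming a SparkSession `spark` is in scope; the datasets are made up:

```scala
// Minimal sketch; assumes a SparkSession `spark` is in scope (spark-shell style)
import spark.implicits._
val sizes  = Seq("S", "M").toDF("size")
val colors = Seq("red", "blue").toDF("color")

// Cartesian product: every row of sizes paired with every row of colors
val combos = sizes.crossJoin(colors)  // 2 x 2 = 4 rows
```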

=== [[cube]] cube Untyped Transformation

[source, scala]

cube(cols: Column*): RelationalGroupedDataset cube(col1: String, cols: String*): RelationalGroupedDataset


cube...FIXME
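
For illustration, a minimal sketch of multi-dimensional aggregation with cube, assuming a SparkSession `spark` is in scope; the sales dataset is made up:

```scala
// Minimal sketch; assumes a SparkSession `spark` is in scope (spark-shell style)
import spark.implicits._
import org.apache.spark.sql.functions.sum

val sales = Seq(
  ("US", "web", 10), ("US", "shop", 20), ("PL", "web", 5)
).toDF("country", "channel", "amount")

// cube aggregates over every combination of the grouping columns, adding
// per-country, per-channel and grand-total rows (the "all" level shows as null)
val cubed = sales.cube("country", "channel").agg(sum("amount").as("total"))
```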

=== [[drop]] Dropping One or More Columns -- drop Untyped Transformation

[source, scala]

drop(colName: String): DataFrame
drop(colNames: String*): DataFrame
drop(col: Column): DataFrame


drop...FIXME
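
For illustration, a minimal sketch of the three variants, assuming a SparkSession `spark` is in scope; the dataset is made up:

```scala
// Minimal sketch; assumes a SparkSession `spark` is in scope (spark-shell style)
import spark.implicits._
val df = Seq((1, "a", true)).toDF("id", "name", "flag")

df.drop("flag")           // one column by name
df.drop("name", "flag")   // several columns by name
df.drop(df("flag"))       // a column by Column reference
```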

=== [[groupBy]] groupBy Untyped Transformation

[source, scala]

groupBy(cols: Column*): RelationalGroupedDataset
groupBy(col1: String, cols: String*): RelationalGroupedDataset


groupBy...FIXME
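
For illustration, a minimal sketch, assuming a SparkSession `spark` is in scope; the dataset is made up:

```scala
// Minimal sketch; assumes a SparkSession `spark` is in scope (spark-shell style)
import spark.implicits._
import org.apache.spark.sql.functions.count

val words = Seq("a", "b", "a").toDF("word")

// groupBy gives a RelationalGroupedDataset; an aggregation turns it back into a DataFrame
val counts = words.groupBy("word").agg(count("*").as("n"))
```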

=== [[join]] join Untyped Transformation

[source, scala]

join(right: Dataset[_]): DataFrame
join(right: Dataset[_], usingColumn: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], joinExprs: Column): DataFrame
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame


join...FIXME
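
For illustration, a minimal sketch of a few of the variants, assuming a SparkSession `spark` is in scope; the datasets are made up:

```scala
// Minimal sketch; assumes a SparkSession `spark` is in scope (spark-shell style)
import spark.implicits._
val left  = Seq((1, "one"), (2, "two")).toDF("id", "en")
val right = Seq((1, "jeden"), (3, "trzy")).toDF("id", "pl")

left.join(right, "id")                        // USING column, inner join
left.join(right, Seq("id"), "left_outer")     // USING columns plus a join type
left.join(right, left("id") === right("id"))  // explicit join condition
```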

=== [[na]] na Untyped Transformation

[source, scala]

na: DataFrameNaFunctions

na simply creates a DataFrameNaFunctions to work with missing data.
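
For illustration, a minimal sketch, assuming a SparkSession `spark` is in scope; the dataset is made up:

```scala
// Minimal sketch; assumes a SparkSession `spark` is in scope (spark-shell style)
import spark.implicits._
val df = Seq(("a", Some(1)), ("b", None)).toDF("key", "value")

// na exposes DataFrameNaFunctions, e.g. fill (or drop) for missing values
val filled = df.na.fill(0, Seq("value"))
```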

=== [[rollup]] rollup Untyped Transformation

[source, scala]

rollup(cols: Column*): RelationalGroupedDataset
rollup(col1: String, cols: String*): RelationalGroupedDataset


rollup...FIXME
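
For illustration, a minimal sketch of the hierarchical subtotals rollup produces, assuming a SparkSession `spark` is in scope; the sales dataset is made up:

```scala
// Minimal sketch; assumes a SparkSession `spark` is in scope (spark-shell style)
import spark.implicits._
import org.apache.spark.sql.functions.sum

val sales = Seq(
  ("US", "web", 10), ("US", "shop", 20), ("PL", "web", 5)
).toDF("country", "channel", "amount")

// rollup creates hierarchical subtotals: (country, channel), (country) and the
// grand total, but unlike cube no (channel)-only level
val rolled = sales.rollup("country", "channel").agg(sum("amount").as("total"))
```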

=== [[select]] select Untyped Transformation

[source, scala]

select(cols: Column*): DataFrame
select(col: String, cols: String*): DataFrame


select...FIXME
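
For illustration, a minimal sketch of the two variants, assuming a SparkSession `spark` is in scope; the dataset is made up:

```scala
// Minimal sketch; assumes a SparkSession `spark` is in scope (spark-shell style)
import spark.implicits._
import org.apache.spark.sql.functions.upper

val df = Seq((1, "one")).toDF("id", "name")

df.select("id", "name")                      // by column names
df.select($"id", upper($"name").as("NAME"))  // by Column expressions
```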

=== [[selectExpr]] Projecting Columns using SQL Statements -- selectExpr Untyped Transformation

[source, scala]

selectExpr(exprs: String*): DataFrame

selectExpr is like select, but accepts SQL expressions instead of Column objects.

val ds = spark.range(5)

scala> ds.selectExpr("rand() as random").show
16/04/14 23:16:06 INFO HiveSqlParser: Parsing command: rand() as random
+-------------------+
|             random|
+-------------------+
|  0.887675894185651|
|0.36766085091074086|
| 0.2700020856675186|
| 0.1489033635529543|
| 0.5862990791950973|
+-------------------+

Internally, selectExpr executes select with every expression in exprs mapped to a Column (using SparkSqlParser.parseExpression).

scala> ds.select(expr("rand() as random")).show
+------------------+
|            random|
+------------------+
|0.5514319279894851|
|0.2876221510433741|
|0.4599999092045741|
|0.5708558868374893|
|0.6223314406247136|
+------------------+

=== [[stat]] stat Untyped Transformation

[source, scala]

stat: DataFrameStatFunctions

stat simply creates a DataFrameStatFunctions to work with statistic functions.
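
For illustration, a minimal sketch, assuming a SparkSession `spark` is in scope; the dataset is made up:

```scala
// Minimal sketch; assumes a SparkSession `spark` is in scope (spark-shell style)
import spark.implicits._
val df = Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0)).toDF("x", "y")

// stat exposes DataFrameStatFunctions, e.g. Pearson correlation
val r = df.stat.corr("x", "y")  // 1.0 for a perfectly linear relationship
```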

=== [[withColumn]] withColumn Untyped Transformation

[source, scala]

withColumn(colName: String, col: Column): DataFrame

withColumn...FIXME
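
For illustration, a minimal sketch, assuming a SparkSession `spark` is in scope; the dataset is made up:

```scala
// Minimal sketch; assumes a SparkSession `spark` is in scope (spark-shell style)
import spark.implicits._
import org.apache.spark.sql.functions.lit

val df = Seq((1, "one")).toDF("id", "name")

df.withColumn("flag", lit(true))  // adds a new column
df.withColumn("id", $"id" * 2)    // replaces an existing column of the same name
```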

=== [[withColumnRenamed]] withColumnRenamed Untyped Transformation

[source, scala]

withColumnRenamed(existingName: String, newName: String): DataFrame

withColumnRenamed...FIXME
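
For illustration, a minimal sketch, assuming a SparkSession `spark` is in scope; the dataset is made up:

```scala
// Minimal sketch; assumes a SparkSession `spark` is in scope (spark-shell style)
import spark.implicits._
val df = Seq((1, "one")).toDF("id", "name")

df.withColumnRenamed("name", "label")  // renames name to label
df.withColumnRenamed("nope", "other")  // a no-op when the column does not exist
```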
