Day 4 / Apr 7 (Thu)¶
Spark SQL Exercises¶
Working on Exercises for Apache Spark™ and Scala Workshops
split function with variable delimiter per row¶
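The REPL session below applies a user-defined function to a DataFrame named split_values whose construction isn't shown. As a minimal sketch (an assumption, not necessarily the workshop's reference solution), split_values could have been built with expr, since split takes its delimiter as a regular-expression string and so cannot read it from another column directly; the prepended backslash escapes metacharacters such as #, $ and ^. Note that Spark's split keeps trailing empty strings, which is why the empty-element cleanup below is needed.

import org.apache.spark.sql.functions.expr

// Assumes spark.implicits._ is in scope (as in spark-shell).
val input = Seq(
  ("50000.0#0#0#", "#"),
  ("0@1000.0@", "@"),
  ("1$", "$"),
  ("1000.00^Test_string", "^")).toDF("VALUES", "Delimiter")

// split's pattern argument is a regex, so a per-row delimiter goes through expr;
// concat('\\', Delimiter) escapes the delimiter before splitting.
val split_values = input.withColumn(
  "split_values",
  expr("split(VALUES, concat('\\\\', Delimiter))"))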
scala> val removeEmptyElements = (array: Seq[String]) => array.filter(e => !e.isEmpty)
removeEmptyElements: Seq[String] => Seq[String] = $Lambda$4378/0x0000000801843840@580792a3
scala> val removeEmptyElementsUDF = udf { (array: Seq[String]) => array.filter(e => !e.isEmpty) }
removeEmptyElementsUDF: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$4379/0x0000000801844840@23ffb040,ArrayType(StringType,true),List(Some(class[value[0]: array<string>])),Some(class[value[0]: array<string>]),None,true,true)
scala> split_values.withColumn("split_values", removeEmptyElementsUDF($"split_values")).show(false)
+-------------------+---------+----------------------+
|VALUES |Delimiter|split_values |
+-------------------+---------+----------------------+
|50000.0#0#0# |# |[50000.0, 0, 0] |
|0@1000.0@ |@ |[0, 1000.0] |
|1$ |$ |[1] |
|1000.00^Test_string|^ |[1000.00, Test_string]|
+-------------------+---------+----------------------+
Exercise: Using Dataset.flatMap Operator¶
// nums is assumed to be a Dataset with one array column (spark.implicits._ in scope):
val nums = Seq(Seq(1, 2, 3)).toDF("nums")
nums.flatMap { r =>
  val ns = r.getSeq[Int](0)
  ns.map(n => (ns, n))  // pair every element with its source array
}.toDF("nums", "num").show
Exercise: Flattening Array Columns (From Datasets of Arrays to Datasets of Array Elements)¶
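No solution was captured for this exercise; a minimal sketch, assuming the intent is to turn each array element into its own row with the built-in explode function (spark.implicits._ in scope, as in spark-shell):

import org.apache.spark.sql.functions.explode

val arrays = Seq(Seq(1, 2, 3), Seq(4, 5)).toDF("nums")
// explode produces one output row per array element
arrays.select(explode($"nums") as "num").show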
Scala / Implicit Conversions¶
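No notes were captured under this heading; as an illustration (an assumption about what was discussed), these are the implicit conversions that spark.implicits._ contributes to the Spark SQL Scala API, assuming a spark-shell session where spark is in scope:

// SQLImplicits (via spark.implicits._) provides, among others:
// - localSeqToDatasetHolder: adds toDS/toDF to local Scala collections
// - StringToColumn: enables the $"..." column syntax
import spark.implicits._

val ds = Seq(1, 2, 3).toDS()        // Seq => DatasetHolder => Dataset[Int]
ds.toDF("n").select($"n" * 2).show  // $"n" via the StringToColumn conversion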
Homework¶
- Read the scaladoc of the following types in Spark SQL:
Schedule¶
- 8:30am - 11:50am Exercises
- 11:50am - 12:40pm Lunch break
- 12:40pm - 2:30pm Exercises