Day 4 / Apr 7 (Thu)¶
Spark SQL Exercises¶
Working on Exercises for Apache Spark™ and Scala Workshops
split function with variable delimiter per row¶
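The REPL session below applies a user-defined function to a DataFrame named split_values whose construction isn't shown. As a minimal sketch (an assumption, not necessarily the workshop's reference solution), split_values could have been built with expr, since split takes its delimiter as a regular-expression string and so cannot read it from another column directly; the prepended backslash escapes metacharacters such as #, $ and ^. Note that Spark's split keeps trailing empty strings, which is why the empty-element cleanup below is needed.

import org.apache.spark.sql.functions.expr

// Assumes spark.implicits._ is in scope (as in spark-shell).
val input = Seq(
  ("50000.0#0#0#", "#"),
  ("0@1000.0@", "@"),
  ("1$", "$"),
  ("1000.00^Test_string", "^")).toDF("VALUES", "Delimiter")

// split's pattern argument is a regex, so a per-row delimiter goes through expr;
// concat('\\', Delimiter) escapes the delimiter before splitting.
val split_values = input.withColumn(
  "split_values",
  expr("split(VALUES, concat('\\\\', Delimiter))"))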
scala> val removeEmptyElements = (array: Seq[String]) => array.filter(e => !e.isEmpty)
removeEmptyElements: Seq[String] => Seq[String] = $Lambda$4378/0x0000000801843840@580792a3
scala> val removeEmptyElementsUDF = udf { (array: Seq[String]) => array.filter(e => !e.isEmpty) }
removeEmptyElementsUDF: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$4379/0x0000000801844840@23ffb040,ArrayType(StringType,true),List(Some(class[value[0]: array<string>])),Some(class[value[0]: array<string>]),None,true,true)
scala> split_values.withColumn("split_values", removeEmptyElementsUDF($"split_values")).show(false)
+-------------------+---------+----------------------+
|VALUES |Delimiter|split_values |
+-------------------+---------+----------------------+
|50000.0#0#0# |# |[50000.0, 0, 0] |
|0@1000.0@ |@ |[0, 1000.0] |
|1$ |$ |[1] |
|1000.00^Test_string|^ |[1000.00, Test_string]|
+-------------------+---------+----------------------+
Exercise: Using Dataset.flatMap Operator¶
// nums is assumed to be a Dataset with one array column (spark.implicits._ in scope):
val nums = Seq(Seq(1, 2, 3)).toDF("nums")
nums.flatMap { r =>
  val ns = r.getSeq[Int](0)
  ns.map(n => (ns, n))  // pair every element with its source array
}.toDF("nums", "num").show
Exercise: Flattening Array Columns (From Datasets of Arrays to Datasets of Array Elements)¶
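No solution was captured for this exercise; a minimal sketch, assuming the intent is to turn each array element into its own row with the built-in explode function (spark.implicits._ in scope, as in spark-shell):

import org.apache.spark.sql.functions.explode

val arrays = Seq(Seq(1, 2, 3), Seq(4, 5)).toDF("nums")
// explode produces one output row per array element
arrays.select(explode($"nums") as "num").show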
Scala / Implicit Conversions¶
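No notes were captured under this heading; as an illustration (an assumption about what was discussed), these are the implicit conversions that spark.implicits._ contributes to the Spark SQL Scala API, assuming a spark-shell session where spark is in scope:

// SQLImplicits (via spark.implicits._) provides, among others:
// - localSeqToDatasetHolder: adds toDS/toDF to local Scala collections
// - StringToColumn: enables the $"..." column syntax
import spark.implicits._

val ds = Seq(1, 2, 3).toDS()        // Seq => DatasetHolder => Dataset[Int]
ds.toDF("n").select($"n" * 2).show  // $"n" via the StringToColumn conversion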
Homework¶
- Read the scaladoc of the following types in Spark SQL:
Schedule¶
- 8:30am - 11:50am Exercises
- 11:50am - 12:40pm Lunch break
- 12:40pm - 2:30pm Exercises