Day 5 / Apr 8 (Fri)

Continuing the journey into the land of Spark SQL.

Exercises

Working on Exercises for Apache Spark™ and Scala Workshops.

  1. Converting Arrays of Strings to String
  2. Using explode Standard Function
  3. Difference in Days Between Dates As Strings
  4. How to add days (as values of a column) to date?

Limiting collect_set Standard Function
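
The transcripts below run in spark-shell and assume a DataFrame all with a single array column, also named all. A minimal setup sketch (the exact values are an assumption based on the output below; in the exercise the array would typically come from collect_set over a grouped dataset):

import org.apache.spark.sql.functions._
import spark.implicits._  // pre-imported in spark-shell

// Hypothetical input shaped like the transcripts below
val all = Seq(
  Seq(0, 1, 2, 3, 4),
  Seq(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
).toDF("all")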

slice

scala> all.withColumn("only_first_three", slice($"all", 1, 3)).show
+--------------------+----------------+
|                 all|only_first_three|
+--------------------+----------------+
|     [0, 1, 2, 3, 4]|       [0, 1, 2]|
|[0, 1, 2, 3, 4, 5...|       [0, 1, 2]|
+--------------------+----------------+
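
Note that slice(column, start, length) uses a 1-based start and a length, and a negative start counts from the end of the array:

// keep the last two elements (negative start counts from the end)
all.select(slice($"all", -2, 2) as "last_two").show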

UDF

// The limit (3) is hard-coded in the function body
val sliceUDF = udf { (ns: Seq[Int]) => ns.take(3) }
scala> all.withColumn("only_first_three", sliceUDF($"all")).show
+--------------------+----------------+
|                 all|only_first_three|
+--------------------+----------------+
|     [0, 1, 2, 3, 4]|       [0, 1, 2]|
|[0, 1, 2, 3, 4, 5...|       [0, 1, 2]|
+--------------------+----------------+

// Parameterized variant: begin and end passed in as columns (via lit)
val sliceUDF = udf { (ns: Seq[Int], begin: Int, end: Int) => ns.slice(begin, end) }
scala> all.withColumn("only_first_three", sliceUDF($"all", lit(1), lit(3))).show
+--------------------+----------------+
|                 all|only_first_three|
+--------------------+----------------+
|     [0, 1, 2, 3, 4]|          [1, 2]|
|[0, 1, 2, 3, 4, 5...|          [1, 2]|
+--------------------+----------------+
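
Beware the two slice flavours here: Scala's Seq.slice is 0-based and end-exclusive, while the slice standard function is 1-based with a length argument, which is why sliceUDF($"all", lit(1), lit(3)) drops the leading 0. A one-liner to see the Scala side:

Seq(0, 1, 2, 3, 4).slice(1, 3)  // Seq(1, 2): 0-based start, exclusive end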

// Simpler parameterized variant: just the number of leading elements to keep
val sliceUDF = udf { (ns: Seq[Int], howMany: Int) => ns.take(howMany) }
scala> all.withColumn("only_first_three", sliceUDF($"all", lit(3))).show
+--------------------+----------------+
|                 all|only_first_three|
+--------------------+----------------+
|     [0, 1, 2, 3, 4]|       [0, 1, 2]|
|[0, 1, 2, 3, 4, 5...|       [0, 1, 2]|
+--------------------+----------------+
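
A general caveat: UDFs are a black box to the Catalyst optimizer, so prefer standard functions such as slice whenever they fit. Comparing the query plans shows the difference:

all.withColumn("only_first_three", slice($"all", 1, 3)).explain
all.withColumn("only_first_three", sliceUDF($"all", lit(3))).explain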

filter Standard Function

scala> all.withColumn("only_first_three", filter($"all", (x, idx) => idx < 3 )).show
+--------------------+----------------+
|                 all|only_first_three|
+--------------------+----------------+
|     [0, 1, 2, 3, 4]|       [0, 1, 2]|
|[0, 1, 2, 3, 4, 5...|       [0, 1, 2]|
+--------------------+----------------+

import org.apache.spark.sql.Column
val krzysiekFilter: (Column, Column) => Column = (x, idx) => idx < 3
scala> all.withColumn("only_first_three", filter($"all", krzysiekFilter)).show
+--------------------+----------------+
|                 all|only_first_three|
+--------------------+----------------+
|     [0, 1, 2, 3, 4]|       [0, 1, 2]|
|[0, 1, 2, 3, 4, 5...|       [0, 1, 2]|
+--------------------+----------------+
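
The same index-based filter can also be written as a SQL lambda using expr (higher-order functions with an index argument are available since Spark 3.0):

all.withColumn("only_first_three", expr("filter(all, (x, i) -> i < 3)")).show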

Theory

  1. Basic Aggregation

Homework

Reading

  1. Read the scaladoc of the following types in Spark SQL:
  2. Read Datetime Patterns for Formatting and Parsing

Exercise

  1. Using upper Standard Function
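
A quick taste of upper (the input is made up; the exercise page has the full requirements):

Seq("hello", "spark").toDF("word").select(upper($"word") as "WORD").show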

Schedule

  1. 8:30am - 9:00am Introduction
  2. 9:00am - 10:00am Exercises
  3. 10:00am - 10:30am Discussion
  4. 10:30am - 10:40am Break
  5. 11:50am - 12:40pm Lunch break
  6. 12:40pm - 2:30pm Exercises