Day 5 / Apr 8 (Fri)

Continuing the journey into the land of Spark SQL.

Exercises

Working on Exercises for Apache Spark™ and Scala Workshops.

  1. Converting Arrays of Strings to String
  2. Using explode Standard Function
  3. Difference in Days Between Dates As Strings
  4. How to add days (as values of a column) to date?

Limiting collect_set Standard Function
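
The transcripts below run in spark-shell and assume a DataFrame all with a single array column, also named all. A minimal setup sketch (the exact values are an assumption based on the output below; in the exercise the array would typically come from collect_set over a grouped dataset):

import org.apache.spark.sql.functions._
import spark.implicits._  // pre-imported in spark-shell

// Hypothetical input shaped like the transcripts below
val all = Seq(
  Seq(0, 1, 2, 3, 4),
  Seq(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
).toDF("all")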

slice

scala> all.withColumn("only_first_three", slice($"all", 1, 3)).show
+--------------------+----------------+
|                 all|only_first_three|
+--------------------+----------------+
|     [0, 1, 2, 3, 4]|       [0, 1, 2]|
|[0, 1, 2, 3, 4, 5...|       [0, 1, 2]|
+--------------------+----------------+
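
Note that slice(column, start, length) uses a 1-based start and a length, and a negative start counts from the end of the array:

// keep the last two elements (negative start counts from the end)
all.select(slice($"all", -2, 2) as "last_two").show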

UDF

// The limit (3) is hard-coded in the function body
val sliceUDF = udf { (ns: Seq[Int]) => ns.take(3) }
scala> all.withColumn("only_first_three", sliceUDF($"all")).show
+--------------------+----------------+
|                 all|only_first_three|
+--------------------+----------------+
|     [0, 1, 2, 3, 4]|       [0, 1, 2]|
|[0, 1, 2, 3, 4, 5...|       [0, 1, 2]|
+--------------------+----------------+

// Parameterized variant: begin and end passed in as columns (via lit)
val sliceUDF = udf { (ns: Seq[Int], begin: Int, end: Int) => ns.slice(begin, end) }
scala> all.withColumn("only_first_three", sliceUDF($"all", lit(1), lit(3))).show
+--------------------+----------------+
|                 all|only_first_three|
+--------------------+----------------+
|     [0, 1, 2, 3, 4]|          [1, 2]|
|[0, 1, 2, 3, 4, 5...|          [1, 2]|
+--------------------+----------------+
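
Beware the two slice flavours here: Scala's Seq.slice is 0-based and end-exclusive, while the slice standard function is 1-based with a length argument, which is why sliceUDF($"all", lit(1), lit(3)) drops the leading 0. A one-liner to see the Scala side:

Seq(0, 1, 2, 3, 4).slice(1, 3)  // Seq(1, 2): 0-based start, exclusive end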

// Simpler parameterized variant: just the number of leading elements to keep
val sliceUDF = udf { (ns: Seq[Int], howMany: Int) => ns.take(howMany) }
scala> all.withColumn("only_first_three", sliceUDF($"all", lit(3))).show
+--------------------+----------------+
|                 all|only_first_three|
+--------------------+----------------+
|     [0, 1, 2, 3, 4]|       [0, 1, 2]|
|[0, 1, 2, 3, 4, 5...|       [0, 1, 2]|
+--------------------+----------------+
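
A general caveat: UDFs are a black box to the Catalyst optimizer, so prefer standard functions such as slice whenever they fit. Comparing the query plans shows the difference:

all.withColumn("only_first_three", slice($"all", 1, 3)).explain
all.withColumn("only_first_three", sliceUDF($"all", lit(3))).explain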

filter Standard Function

scala> all.withColumn("only_first_three", filter($"all", (x, idx) => idx < 3 )).show
+--------------------+----------------+
|                 all|only_first_three|
+--------------------+----------------+
|     [0, 1, 2, 3, 4]|       [0, 1, 2]|
|[0, 1, 2, 3, 4, 5...|       [0, 1, 2]|
+--------------------+----------------+

import org.apache.spark.sql.Column
val krzysiekFilter: (Column, Column) => Column = (x, idx) => idx < 3
scala> all.withColumn("only_first_three", filter($"all", krzysiekFilter)).show
+--------------------+----------------+
|                 all|only_first_three|
+--------------------+----------------+
|     [0, 1, 2, 3, 4]|       [0, 1, 2]|
|[0, 1, 2, 3, 4, 5...|       [0, 1, 2]|
+--------------------+----------------+
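
The same index-based filter can also be written as a SQL lambda using expr (higher-order functions with an index argument are available since Spark 3.0):

all.withColumn("only_first_three", expr("filter(all, (x, i) -> i < 3)")).show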

Theory

  1. Basic Aggregation

Homework

Reading

  1. Read the scaladoc of the following types in Spark SQL:
  2. Read Datetime Patterns for Formatting and Parsing

Exercise

  1. Using upper Standard Function
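
A quick taste of upper (the input is made up; the exercise page has the full requirements):

Seq("hello", "spark").toDF("word").select(upper($"word") as "WORD").show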

Schedule

  1. 8:30am - 9:00am Introduction
  2. 9:00am - 10:00am Exercises
  3. 10:00am - 10:30am Discussion
  4. 10:30am - 10:40am Break
  5. 11:50am - 12:40pm Lunch break
  6. 12:40pm - 2:30pm Exercises