Day 5 / Apr 8 (Fri)¶
Continuing the journey into the land of Spark SQL.
Exercises¶
Working on Exercises for Apache Spark™ and Scala Workshops.
- Converting Arrays of Strings to String
- Using explode Standard Function
- Difference in Days Between Dates As Strings
- How to add days (as values of a column) to date?
Limiting collect_set Standard Function¶
slice¶
scala> all.withColumn("only_first_three", slice($"all", 1, 3)).show
+--------------------+----------------+
| all|only_first_three|
+--------------------+----------------+
| [0, 1, 2, 3, 4]| [0, 1, 2]|
|[0, 1, 2, 3, 4, 5...| [0, 1, 2]|
+--------------------+----------------+
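Note that Spark SQL's `slice(column, start, length)` uses a 1-based start index and a length, unlike Scala's own `Seq.slice(from, until)`, which is zero-based with an exclusive end. A plain-Scala sketch (no Spark needed) of what `slice($"all", 1, 3)` computes per row:

```scala
// Sketch of Spark's slice semantics per row: 1-based start plus a length.
def sparkSlice(ns: Seq[Int], start: Int, length: Int): Seq[Int] =
  ns.slice(start - 1, start - 1 + length)

val xs = Seq(0, 1, 2, 3, 4)

println(sparkSlice(xs, 1, 3)) // List(0, 1, 2) -- start 1, length 3
println(xs.slice(0, 3))       // List(0, 1, 2) -- Scala: from 0, until 3 (exclusive)
```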
UDF¶
val sliceUDF = udf { (ns: Seq[Int]) => ns.take(3) }
scala> all.withColumn("only_first_three", sliceUDF($"all")).show
+--------------------+----------------+
| all|only_first_three|
+--------------------+----------------+
| [0, 1, 2, 3, 4]| [0, 1, 2]|
|[0, 1, 2, 3, 4, 5...| [0, 1, 2]|
+--------------------+----------------+
val sliceUDF = udf { (ns: Seq[Int], begin: Int, end: Int) => ns.slice(begin, end) }
scala> all.withColumn("only_first_three", sliceUDF($"all", lit(1), lit(3))).show
+--------------------+----------------+
| all|only_first_three|
+--------------------+----------------+
| [0, 1, 2, 3, 4]| [1, 2]|
|[0, 1, 2, 3, 4, 5...| [1, 2]|
+--------------------+----------------+
val sliceUDF = udf { (ns: Seq[Int], howMany: Int) => ns.take(howMany) }
scala> all.withColumn("only_first_three", sliceUDF($"all", lit(3))).show
+--------------------+----------------+
| all|only_first_three|
+--------------------+----------------+
| [0, 1, 2, 3, 4]| [0, 1, 2]|
|[0, 1, 2, 3, 4, 5...| [0, 1, 2]|
+--------------------+----------------+
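The `lit(...)` wrappers are needed because every argument passed to a UDF must be a `Column`. Per row, the two UDFs above reduce to ordinary Scala collection operations; since `Seq.slice(begin, end)` is zero-based with an exclusive end, `sliceUDF($"all", lit(1), lit(3))` yields `[1, 2]` while the `take`-based variant yields the first three elements:

```scala
// Per-row behaviour of the two UDFs above, in plain Scala.
val ns = Seq(0, 1, 2, 3, 4)

// udf { (ns, begin, end) => ns.slice(begin, end) } called with lit(1), lit(3):
// zero-based start, exclusive end -> elements at indices 1 and 2
println(ns.slice(1, 3)) // List(1, 2)

// udf { (ns, howMany) => ns.take(howMany) } called with lit(3):
println(ns.take(3))     // List(0, 1, 2)
```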
filter Standard Function¶
scala> all.withColumn("only_first_three", filter($"all", (x, idx) => idx < 3)).show
+--------------------+----------------+
| all|only_first_three|
+--------------------+----------------+
| [0, 1, 2, 3, 4]| [0, 1, 2]|
|[0, 1, 2, 3, 4, 5...| [0, 1, 2]|
+--------------------+----------------+
import org.apache.spark.sql.Column
val krzysiekFilter: (Column, Column) => Column = (x, idx) => idx < 3
scala> all.withColumn("only_first_three", filter($"all", krzysiekFilter)).show
+--------------------+----------------+
| all|only_first_three|
+--------------------+----------------+
| [0, 1, 2, 3, 4]| [0, 1, 2]|
|[0, 1, 2, 3, 4, 5...| [0, 1, 2]|
+--------------------+----------------+
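This is the two-argument variant of `filter`, where the lambda also receives the element's index (starting from 0). Per row it behaves like the following plain-Scala sketch:

```scala
// Plain-Scala equivalent of filter($"all", (x, idx) => idx < 3) for one row:
// keep only the elements whose 0-based index is below 3.
val ns = Seq(0, 1, 2, 3, 4, 5, 6)

val onlyFirstThree = ns.zipWithIndex.collect { case (x, idx) if idx < 3 => x }
println(onlyFirstThree) // List(0, 1, 2)
```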
Theory¶
Homework¶
Reading¶
- Read the scaladoc of the following types in Spark SQL:
- Read Datetime Patterns for Formatting and Parsing
Exercise¶
Schedule¶
- 8:30 - 9:00 Introduction
- 9:00 - 10:00 Exercises
- 10:00 - 10:30 Discussion
- 10:30 - 10:40 Break
- 11:50am - 12:40pm Lunch break
- 12:40pm - 2:30pm Exercises