spark-workshop

Adding count to the source DataFrame

Write a structured query that adds aggregations to the input dataset.

EXTRA Can you think of another query to solve it? What’s more efficient?

Module: Spark SQL

Duration: 15 mins

Input Dataset

val input = Seq(
  ("05:49:56.604899", "10.0.0.2.54880", "10.0.0.3.5001",  2),
  ("05:49:56.604900", "10.0.0.2.54880", "10.0.0.3.5001",  2),
  ("05:49:56.604899", "10.0.0.2.54880", "10.0.0.3.5001",  2),
  ("05:49:56.604900", "10.0.0.2.54880", "10.0.0.3.5001",  2),
  ("05:49:56.604899", "10.0.0.2.54880", "10.0.0.3.5001",  2),
  ("05:49:56.604900", "10.0.0.2.54880", "10.0.0.3.5001",  2),
  ("05:49:56.604899", "10.0.0.2.54880", "10.0.0.3.5001",  2),
  ("05:49:56.604900", "10.0.0.2.54880", "10.0.0.3.5001",  2),
  ("05:49:56.604899", "10.0.0.2.54880", "10.0.0.3.5001",  2),
  ("05:49:56.604900", "10.0.0.2.54880", "10.0.0.3.5001",  2),
  ("05:49:56.604899", "10.0.0.2.54880", "10.0.0.3.5001",  2),
  ("05:49:56.604900", "10.0.0.2.54880", "10.0.0.3.5001",  2),
  ("05:49:56.604899", "10.0.0.2.54880", "10.0.0.3.5001",  2),
  ("05:49:56.604908", "10.0.0.3.5001",  "10.0.0.2.54880", 2),
  ("05:49:56.604908", "10.0.0.3.5001",  "10.0.0.2.54880", 2),
  ("05:49:56.604908", "10.0.0.3.5001",  "10.0.0.2.54880", 2),
  ("05:49:56.604908", "10.0.0.3.5001",  "10.0.0.2.54880", 2),
  ("05:49:56.604908", "10.0.0.3.5001",  "10.0.0.2.54880", 2),
  ("05:49:56.604908", "10.0.0.3.5001",  "10.0.0.2.54880", 2),
  ("05:49:56.604908", "10.0.0.3.5001",  "10.0.0.2.54880", 2)).toDF("column0", "column1", "column2", "label")
scala> input.show
+---------------+--------------+--------------+-----+
|        column0|       column1|       column2|label|
+---------------+--------------+--------------+-----+
|05:49:56.604899|10.0.0.2.54880| 10.0.0.3.5001|    2|
|05:49:56.604900|10.0.0.2.54880| 10.0.0.3.5001|    2|
|05:49:56.604899|10.0.0.2.54880| 10.0.0.3.5001|    2|
|05:49:56.604900|10.0.0.2.54880| 10.0.0.3.5001|    2|
|05:49:56.604899|10.0.0.2.54880| 10.0.0.3.5001|    2|
|05:49:56.604900|10.0.0.2.54880| 10.0.0.3.5001|    2|
|05:49:56.604899|10.0.0.2.54880| 10.0.0.3.5001|    2|
|05:49:56.604900|10.0.0.2.54880| 10.0.0.3.5001|    2|
|05:49:56.604899|10.0.0.2.54880| 10.0.0.3.5001|    2|
|05:49:56.604900|10.0.0.2.54880| 10.0.0.3.5001|    2|
|05:49:56.604899|10.0.0.2.54880| 10.0.0.3.5001|    2|
|05:49:56.604900|10.0.0.2.54880| 10.0.0.3.5001|    2|
|05:49:56.604899|10.0.0.2.54880| 10.0.0.3.5001|    2|
|05:49:56.604908| 10.0.0.3.5001|10.0.0.2.54880|    2|
|05:49:56.604908| 10.0.0.3.5001|10.0.0.2.54880|    2|
|05:49:56.604908| 10.0.0.3.5001|10.0.0.2.54880|    2|
|05:49:56.604908| 10.0.0.3.5001|10.0.0.2.54880|    2|
|05:49:56.604908| 10.0.0.3.5001|10.0.0.2.54880|    2|
|05:49:56.604908| 10.0.0.3.5001|10.0.0.2.54880|    2|
|05:49:56.604908| 10.0.0.3.5001|10.0.0.2.54880|    2|
+---------------+--------------+--------------+-----+

Result

scala> solution.show
+---------------+--------------+--------------+-----+-----+
|        column0|       column1|       column2|label|count|
+---------------+--------------+--------------+-----+-----+
|05:49:56.604899|10.0.0.2.54880| 10.0.0.3.5001|    2|   13|
|05:49:56.604900|10.0.0.2.54880| 10.0.0.3.5001|    2|   13|
|05:49:56.604899|10.0.0.2.54880| 10.0.0.3.5001|    2|   13|
|05:49:56.604900|10.0.0.2.54880| 10.0.0.3.5001|    2|   13|
|05:49:56.604899|10.0.0.2.54880| 10.0.0.3.5001|    2|   13|
|05:49:56.604900|10.0.0.2.54880| 10.0.0.3.5001|    2|   13|
|05:49:56.604899|10.0.0.2.54880| 10.0.0.3.5001|    2|   13|
|05:49:56.604900|10.0.0.2.54880| 10.0.0.3.5001|    2|   13|
|05:49:56.604899|10.0.0.2.54880| 10.0.0.3.5001|    2|   13|
|05:49:56.604900|10.0.0.2.54880| 10.0.0.3.5001|    2|   13|
|05:49:56.604899|10.0.0.2.54880| 10.0.0.3.5001|    2|   13|
|05:49:56.604900|10.0.0.2.54880| 10.0.0.3.5001|    2|   13|
|05:49:56.604899|10.0.0.2.54880| 10.0.0.3.5001|    2|   13|
|05:49:56.604908| 10.0.0.3.5001|10.0.0.2.54880|    2|    7|
|05:49:56.604908| 10.0.0.3.5001|10.0.0.2.54880|    2|    7|
|05:49:56.604908| 10.0.0.3.5001|10.0.0.2.54880|    2|    7|
|05:49:56.604908| 10.0.0.3.5001|10.0.0.2.54880|    2|    7|
|05:49:56.604908| 10.0.0.3.5001|10.0.0.2.54880|    2|    7|
|05:49:56.604908| 10.0.0.3.5001|10.0.0.2.54880|    2|    7|
|05:49:56.604908| 10.0.0.3.5001|10.0.0.2.54880|    2|    7|
+---------------+--------------+--------------+-----+-----+