spark-workshop

Exercise: Finding Ids of Rows with Word in Array Column

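Develop a standalone Spark SQL application (using IntelliJ IDEA) that, for every value of the word column, finds the ids of the rows whose words column contains that value. The words column holds a comma-separated string that has to be turned into an array first.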

Protip™: Use the split and explode standard functions.
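To see what the two functions from the tip do on their own, here is a minimal sketch on a single row. The object name `SplitExplodeDemo`, the helper `tokenize`, and the `local[*]` master are illustrative assumptions, not part of the exercise.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, split}

object SplitExplodeDemo {

  // Hypothetical helper: one output row per comma-separated token in `words`.
  def tokenize(df: DataFrame): DataFrame =
    df
      // split turns the comma-separated string into an array column
      .withColumn("tokens", split(col("words"), ","))
      // explode emits one row per element of that array
      .select(col("id"), explode(col("tokens")).as("w"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("split-explode-demo")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "one,two,three")).toDF("id", "words")
    tokenize(df).show()

    spark.stop()
  }
}
```

The single input row becomes three rows, one per token, each carrying the original id.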

Module: Spark SQL

Duration: 30 mins

Input Dataset

+---+------------------+-----+
| id|             words| word|
+---+------------------+-----+
|  1|     one,two,three|  one|
|  2|     four,one,five|  six|
|  3|seven,nine,one,two|eight|
|  4|    two,three,five| five|
|  5|      six,five,one|seven|
+---+------------------+-----+

The same dataset in CSV format:

id,words,word
1,"one,two,three",one
2,"four,one,five",six
3,"seven,nine,one,two",eight
4,"two,three,five",five
5,"six,five,one",seven
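One way to load the CSV lines above without creating a file is to parse them from an in-memory `Dataset[String]`. This is a sketch under assumptions: it relies on `DataFrameReader.csv(Dataset[String])`, available since Spark 2.2, and the names `LoadInputDataset` and `load` are made up for illustration. Reading from a file path with `spark.read.option("header", true).csv(path)` would work just as well.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LoadInputDataset {

  // Hypothetical helper: parses the exercise's CSV lines into a DataFrame.
  def load(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Triple-quoted strings keep the double quotes around the words field intact.
    val csvLines = Seq(
      """id,words,word""",
      """1,"one,two,three",one""",
      """2,"four,one,five",six""",
      """3,"seven,nine,one,two",eight""",
      """4,"two,three,five",five""",
      """5,"six,five,one",seven"""
    ).toDS()

    spark.read
      .option("header", true)      // first line holds the column names
      .option("inferSchema", true) // lets Spark infer id as an integer
      .csv(csvLines)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("load-input-dataset")
      .getOrCreate()

    load(spark).show()

    spark.stop()
  }
}
```

The quotes matter here: with the default CSV quote character, `"one,two,three"` is read as a single field rather than three.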

Result

+-----+------------+
|    w|         ids|
+-----+------------+
| five|   [2, 4, 5]|
|  one|[1, 2, 3, 5]|
|seven|         [3]|
|  six|         [5]|
+-----+------------+

The word “one” is in the rows with the ids 1, 2, 3 and 5.

The word “seven” is in the row with the id 3.

The word “eight” is not in the words of any row, so it does not appear in the result.
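One possible way to arrive at this result is sketched below; it is not the only solution and not necessarily the intended one. It explodes the words of every row, keeps only the tokens that also occur in the word column, and collects the matching ids per word. The names `FindIdsOfRowsWithWord` and `findIds` are illustrative, and `collect_set` is used on the assumption that a word repeated within one row should contribute its id only once.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_set, explode, sort_array, split}

object FindIdsOfRowsWithWord {

  // Hypothetical helper: maps every value of `word` to the ids of the rows
  // whose comma-separated `words` contain it.
  def findIds(input: DataFrame): DataFrame = {
    // One row per (id, token) pair
    val exploded = input.select(col("id"), explode(split(col("words"), ",")).as("w"))

    // The distinct values to look for
    val wanted = input.select(col("word").as("w")).distinct()

    exploded
      .join(wanted, Seq("w"))                      // inner join drops tokens nobody asked for
      .groupBy("w")
      .agg(sort_array(collect_set("id")).as("ids")) // deduplicated, ascending ids
      .orderBy("w")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("find-ids-of-rows-with-word")
      .getOrCreate()
    import spark.implicits._

    val input = Seq(
      (1, "one,two,three", "one"),
      (2, "four,one,five", "six"),
      (3, "seven,nine,one,two", "eight"),
      (4, "two,three,five", "five"),
      (5, "six,five,one", "seven")
    ).toDF("id", "words", "word")

    findIds(input).show()

    spark.stop()
  }
}
```

Because the join is an inner join, a word that matches no row simply produces no output row. A left outer join from `wanted` would instead keep such words with an empty or null list of ids, if that were the desired behavior.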