spark-workshop

Exercise: Why are all fields null when querying with a schema?

Write a structured query (using spark-shell or Databricks Community Edition) that loads a dataset using a proper schema (with a timestamp column) and prints out the rows to the standard output:

2019-07-22 00:10:15,030|10.29.2.6|
2019-07-22 00:10:15,334|10.1.198.41|
2019-07-22 00:10:15,400|10.1.198.41|
2019-07-22 00:10:15,511|10.1.198.41|
2019-07-22 00:10:16,911|10.1.198.41|

Protip™: Use the CSV data source.

Module: Spark SQL

Duration: 30 mins
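
One possible solution is sketched below. It is a minimal sketch, not the only answer, and it assumes the sample rows above are saved to a file named dateTimes.csv (a made-up name). Two things need special care: the milliseconds are separated from the seconds by a comma (not the default dot), and the trailing | in every row produces an empty third column.

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("dateTime", TimestampType),
  StructField("IP", StringType),
  StructField("extra", StringType))) // absorbs the empty column after the trailing |

val solution = spark.read
  .schema(schema)
  .option("sep", "|")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss,SSS") // comma before the millis
  .csv("dateTimes.csv") // made-up file name
  .drop("extra")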

Result

scala> solution.printSchema
root
 |-- dateTime: timestamp (nullable = true)
 |-- IP: string (nullable = true)

scala> solution.show(truncate = false)
+-----------------------+-----------+
|dateTime               |IP         |
+-----------------------+-----------+
|2019-07-22 00:10:15.03 |10.29.2.6  |
|2019-07-22 00:10:15.334|10.1.198.41|
|2019-07-22 00:10:15.4  |10.1.198.41|
|2019-07-22 00:10:15.511|10.1.198.41|
|2019-07-22 00:10:16.911|10.1.198.41|
+-----------------------+-----------+

NOTE: The types!
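
As for the titular question: without a custom timestampFormat, Spark cannot parse timestamps like 2019-07-22 00:10:15,030, and under the default PERMISSIVE parse mode a record that fails to parse comes back with all fields null. A quick way to reproduce the symptom (same made-up file name and schema as in the sketch above):

val broken = spark.read
  .schema(schema)     // same schema as above
  .option("sep", "|") // no timestampFormat, so dateTime cannot be parsed
  .csv("dateTimes.csv")
broken.show(truncate = false) // all fields show up as null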
