Spark SQL

Apache Spark 3.2



@jaceklaskowski / StackOverflow / GitHub / LinkedIn
The "Internals" Books: books.japila.pl

Spark SQL

  1. Data Processing with Structured Queries on a Massive Scale
    • SQL-like Relational Queries
    • Distributed computations (RDD API)
  2. High-level "front-end" languages
    • SQL, Scala, Java, Python, R
  3. Low-level "backend" Logical Operators
    • Logical and physical query plans
  4. Dataset (and DataFrame) data abstractions
  5. Encoders for storage and performance optimizations
    • Reducing garbage collection
  6. Switch to The Internals of Spark SQL

SparkSession

  1. The entry point to Spark SQL
  2. Use the SparkSession.builder fluent API to build one (see the sketch below)
  3. Loading datasets using load (discussed later)
  4. spark-shell gives you one instance as spark
  5. Switch to The Internals of Spark SQL
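
A minimal sketch of building a SparkSession manually (outside spark-shell); the application name and master URL below are placeholders.

      import org.apache.spark.sql.SparkSession

      // Build (or reuse) the entry point to Spark SQL
      val spark = SparkSession.builder
        .appName("Spark SQL App")   // placeholder application name
        .master("local[*]")         // placeholder master for local runs
        .getOrCreate()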

DataSource API — Reading and Writing Datasets

  1. Loading datasets using SparkSession.read
    • dataset (lowercase) = data for processing
  2. Writing Datasets using Dataset.write
    • Dataset (uppercase) = a distributed computation
  3. Loading and writing operators create source and sink nodes in a data flow graph
  4. Pluggable API
  5. Switch to The Internals of Spark SQL

Reading/Loading Datasets


      val dataset = spark.read.format("csv").load("csvs/*")
  1. SparkSession.read — DataFrameReader (see the example below)
    1. format
    2. option and options
    3. schema
    4. load
    5. format-specific loading methods discussed on next slide
  2. Switch to The Internals of Spark SQL
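
A sketch of a reader that combines format, option, schema and load; the header option, the column names and the path are placeholders.

      import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

      // An explicit schema skips schema inference (column names are placeholders)
      val schema = StructType(Seq(
        StructField("id", LongType),
        StructField("name", StringType)))

      val people = spark.read
        .format("csv")
        .option("header", "true")   // the first line holds column names
        .schema(schema)
        .load("people.csv")         // placeholder path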

Built-In Data Sources (Formats)

  1. File Formats
    • csv
    • json
    • orc
    • parquet
    • text and textFile
  2. jdbc
  3. table
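
The format-specific shortcut methods are equivalent to format(...).load(...); the paths, JDBC URL and table names below are placeholders.

      // Shortcut methods on DataFrameReader
      val csvs = spark.read.option("header", "true").csv("csvs/*")
      val jsons = spark.read.json("dailies")
      val people = spark.read.table("people")   // a table registered in the catalog

      // jdbc via the generic format/load API
      val fromDb = spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost/demo")   // placeholder URL
        .option("dbtable", "public.people")                  // placeholder table
        .load()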

Writing/Saving Datasets


      dataset.write.format("json").save("dailies")
  1. Dataset.write — DataFrameWriter (see the example below)
    1. format
    2. mode
    3. option and options
    4. partitionBy, bucketBy and sortBy
    5. insertInto, save and saveAsTable
  2. Switch to The Internals of Spark SQL
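
A sketch combining several writer settings; the partition column, paths and table name are placeholders.

      import org.apache.spark.sql.SaveMode

      // Partitioned parquet output, replacing whatever is already there
      dataset.write
        .format("parquet")
        .mode(SaveMode.Overwrite)
        .partitionBy("date")          // placeholder partition column
        .save("dailies_parquet")      // placeholder path

      // Or persist the Dataset as a table in the catalog
      dataset.write.mode("append").saveAsTable("dailies")   // placeholder table name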

Schema

  1. Schema = StructType with one or many StructFields
  2. Implicit (inferred) or explicit (see the example below)
  3. dataset.printSchema
  4. Schema is your case class(es)
    
      case class Person(id: Long, name: String)

      import org.apache.spark.sql.Encoders
      val schema = Encoders.product[Person].schema
  5. Switch to The Internals of Spark SQL
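
Building on the snippet above, a sketch of plugging the explicit schema into a reader and inspecting it; the path is a placeholder.

      // Use the explicit schema instead of letting Spark infer one
      val people = spark.read.schema(schema).json("people.json")   // placeholder path

      // Print the schema as a tree of column names and types
      people.printSchema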

Ad-hoc Local Datasets

  1. Seq(...).toDF("col1", "col2", ...) for local DataFrames (examples below)
  2. Seq(...).toDS for local Datasets
  3. Use import spark.implicits._
  4. All Scala Collections supported (almost)
  5. Switch to The Internals of Spark SQL
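
Minimal sketches of both operators; the column names are placeholders and Person is the case class from the Schema slide.

      import spark.implicits._

      // Local DataFrame with explicit column names
      val df = Seq((0L, "zero"), (1L, "one")).toDF("id", "name")

      // Local strongly-typed Dataset
      val ds = Seq(Person(0, "zero"), Person(1, "one")).toDS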

Demo: Creating Spark SQL Application

  1. Use IntelliJ IDEA and sbt
  2. Define the Spark SQL dependency in build.sbt (see the sketch below)
    • libraryDependencies
  3. Write your Spark SQL code
    • spark.version
  4. Execute sbt package
  5. Run the application using spark-submit
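
A sketch of a minimal build.sbt; the project name and the Scala and Spark versions are assumptions and should match your environment.

      // build.sbt (versions below are assumptions)
      name := "spark-sql-app"
      scalaVersion := "2.12.15"
      libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.0" % "provided"

The "provided" scope keeps the Spark classes out of the packaged jar, since spark-submit supplies them at run time.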