Spark SQL

Apache Spark 3.2



@jaceklaskowski / StackOverflow / GitHub / LinkedIn
The "Internals" Books: books.japila.pl

Spark SQL

  1. Data Processing with Structured Queries on a Massive Scale
    • SQL-like Relational Queries
    • Distributed computations (RDD API)
  2. High-level "front-end" languages
    • SQL, Scala, Java, Python, R
  3. Low-level "backend" Logical Operators
    • Logical and physical query plans
  4. Dataset (and DataFrame) data abstractions
  5. Encoders for storage and performance optimizations
    • Reducing garbage collection
  6. Switch to The Internals of Spark SQL

SparkSession

  1. The entry point to Spark SQL
  2. Use the SparkSession.builder fluent API to build one (see the sketch below)
  3. Loading datasets using load (discussed later)
  4. spark-shell gives you one instance as spark
  5. Switch to The Internals of Spark SQL
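
A minimal sketch of building a SparkSession manually (outside spark-shell); the application name and master URL below are placeholders.

      import org.apache.spark.sql.SparkSession

      // Build (or reuse) the entry point to Spark SQL
      val spark = SparkSession.builder
        .appName("Spark SQL App")   // placeholder application name
        .master("local[*]")         // placeholder master for local runs
        .getOrCreate()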

DataSource API — Reading and Writing Datasets

  1. Loading datasets using SparkSession.read
    • dataset (lowercase) = data for processing
  2. Writing Datasets using Dataset.write
    • Dataset (uppercase) = a distributed computation
  3. Loading and writing operators create source and sink nodes in a data flow graph
  4. Pluggable API
  5. Switch to The Internals of Spark SQL

Reading/Loading Datasets


      val dataset = spark.read.format("csv").load("csvs/*")
  1. SparkSession.read — DataFrameReader (see the example below)
    1. format
    2. option and options
    3. schema
    4. load
    5. format-specific loading methods discussed on next slide
  2. Switch to The Internals of Spark SQL
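
A sketch of a reader that combines format, option, schema and load; the header option, the column names and the path are placeholders.

      import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

      // An explicit schema skips schema inference (column names are placeholders)
      val schema = StructType(Seq(
        StructField("id", LongType),
        StructField("name", StringType)))

      val people = spark.read
        .format("csv")
        .option("header", "true")   // the first line holds column names
        .schema(schema)
        .load("people.csv")         // placeholder path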

Built-In Data Sources (Formats)

  1. File Formats
    • csv
    • json
    • orc
    • parquet
    • text and textFile
  2. jdbc
  3. table
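
The format-specific shortcut methods are equivalent to format(...).load(...); the paths, JDBC URL and table names below are placeholders.

      // Shortcut methods on DataFrameReader
      val csvs = spark.read.option("header", "true").csv("csvs/*")
      val jsons = spark.read.json("dailies")
      val people = spark.read.table("people")   // a table registered in the catalog

      // jdbc via the generic format/load API
      val fromDb = spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost/demo")   // placeholder URL
        .option("dbtable", "public.people")                  // placeholder table
        .load()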

Writing/Saving Datasets


      dataset.write.format("json").save("dailies")
  1. Dataset.write — DataFrameWriter (see the example below)
    1. format
    2. mode
    3. option and options
    4. partitionBy, bucketBy and sortBy
    5. insertInto, save and saveAsTable
  2. Switch to The Internals of Spark SQL
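
A sketch combining several writer settings; the partition column, paths and table name are placeholders.

      import org.apache.spark.sql.SaveMode

      // Partitioned parquet output, replacing whatever is already there
      dataset.write
        .format("parquet")
        .mode(SaveMode.Overwrite)
        .partitionBy("date")          // placeholder partition column
        .save("dailies_parquet")      // placeholder path

      // Or persist the Dataset as a table in the catalog
      dataset.write.mode("append").saveAsTable("dailies")   // placeholder table name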

Schema

  1. Schema = StructType with one or many StructFields
  2. Implicit (inferred) or explicit (see the example below)
  3. dataset.printSchema
  4. Schema is your case class(es)
    
      case class Person(id: Long, name: String)

      import org.apache.spark.sql.Encoders
      val schema = Encoders.product[Person].schema
  5. Switch to The Internals of Spark SQL
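
Building on the snippet above, a sketch of plugging the explicit schema into a reader and inspecting it; the path is a placeholder.

      // Use the explicit schema instead of letting Spark infer one
      val people = spark.read.schema(schema).json("people.json")   // placeholder path

      // Print the schema as a tree of column names and types
      people.printSchema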

Ad-hoc Local Datasets

  1. Seq(...).toDF("col1", "col2", ...) for local DataFrames (examples below)
  2. Seq(...).toDS for local Datasets
  3. Use import spark.implicits._
  4. All Scala Collections supported (almost)
  5. Switch to The Internals of Spark SQL
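
Minimal sketches of both operators; the column names are placeholders and Person is the case class from the Schema slide.

      import spark.implicits._

      // Local DataFrame with explicit column names
      val df = Seq((0L, "zero"), (1L, "one")).toDF("id", "name")

      // Local strongly-typed Dataset
      val ds = Seq(Person(0, "zero"), Person(1, "one")).toDS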

Demo: Creating Spark SQL Application

  1. Use IntelliJ IDEA and sbt
  2. Define the Spark SQL dependency in build.sbt (see the sketch below)
    • libraryDependencies
  3. Write your Spark SQL code
    • spark.version
  4. Execute sbt package
  5. Run the application using spark-submit
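
A sketch of a minimal build.sbt; the project name and the Scala and Spark versions are assumptions and should match your environment.

      // build.sbt (versions below are assumptions)
      name := "spark-sql-app"
      scalaVersion := "2.12.15"
      libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.0" % "provided"

The "provided" scope keeps the Spark classes out of the packaged jar, since spark-submit supplies them at run time.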