The Core of Apache Spark

Apache Spark 2.4.1 / Spark Core

@jaceklaskowski / StackOverflow / GitHub
The "Internals" Books: Apache Spark / Spark SQL / Spark Structured Streaming

SparkContext — The Entry Point to Spark Services

  1. SparkContext is the entry point to the Spark services in your Spark application
  2. SparkContext manages the connection to a Spark execution environment
    • Defined using the master URL (see the sketch after this list)
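
A minimal sketch of the above, assuming only that Spark is on the classpath (the object and application names are made up): the master URL, here local[*] for a local execution environment using all CPU cores, defines the environment the SparkContext connects to.

  import org.apache.spark.{SparkConf, SparkContext}

  object SparkContextIntro extends App {
    // The master URL defines the Spark execution environment,
    // e.g. local[*], spark://host:port or yarn
    val conf = new SparkConf()
      .setAppName("spark-core-intro")
      .setMaster("local[*]")

    // SparkContext is the entry point to Spark services in this application
    val sc = new SparkContext(conf)

    // e.g. creating and computing over RDDs
    val numbers = sc.parallelize(1 to 100)
    println(s"count = ${numbers.count()}")

    sc.stop()
  }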

Partitions

  1. Partitions are logical buckets for data.
  2. Partitions correspond to Hadoop's splits (if the data lives on HDFS) or partitioning schemes in the source storage
  3. An RDD (and hence the data inside it) is partitioned.
  4. Spark manages data using partitions, which helps parallelize distributed data processing with minimal network traffic for sending data between executors (see the sketch after this list).
  5. Data in partitions can be skewed, i.e. unevenly distributed across partitions.
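
A minimal sketch, reusing the SparkContext sc from the sketch above: parallelize takes a target number of partitions (numSlices), and mapPartitionsWithIndex shows how many records ended up in each partition, which makes skew visible.

  // 8 logical buckets for the 1000 records
  val rdd = sc.parallelize(1 to 1000, numSlices = 8)
  println(rdd.getNumPartitions)  // 8

  // records per partition -- an uneven spread means the data is skewed
  rdd
    .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
    .collect()
    .foreach { case (idx, n) => println(s"partition $idx -> $n records") }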

RDD Operators

  1. A transformation is a lazy RDD operation that creates one or more RDDs
  2. An action is an RDD operation that triggers the computation and produces non-RDD Scala values (see the sketch after this list)
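
A minimal sketch, again reusing sc: the transformations below only build new RDDs and run nothing, while the count action triggers the computation and returns a plain Scala value.

  val lines = sc.parallelize(Seq("spark core", "rdd operators", "spark shuffle"))

  // transformations: lazy, each one creates a new RDD, nothing is computed yet
  val words  = lines.flatMap(_.split("\\s+"))
  val sparks = words.filter(_ == "spark")

  // action: triggers the computation and produces a non-RDD Scala value
  val howMany: Long = sparks.count()
  println(howMany)  // 2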

Shuffle

  1. Shuffle is Spark's mechanism for re-distributing data so that it’s grouped differently across partitions.
  2. Data is often distributed unevenly across partitions.
  3. The repartition and coalesce operators can change how a dataset is partitioned (see the sketch after this list).
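
A minimal sketch, reusing sc and the rdd from the partitions sketch: repartition always shuffles, coalesce can reduce the number of partitions without a full shuffle, and wide transformations such as reduceByKey shuffle so that all values of a key land in the same partition.

  // full shuffle into 16 partitions
  println(rdd.repartition(16).getNumPartitions)   // 16

  // shrinks to 2 partitions without a full shuffle
  println(rdd.coalesce(2).getNumPartitions)       // 2

  // reduceByKey re-distributes the data so it is grouped by key across partitions
  val counts = sc.parallelize(Seq("a", "b", "a")).map((_, 1)).reduceByKey(_ + _)
  counts.collect().foreach(println)               // (a,2) and (b,1), in any order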

DAGScheduler

Jobs

Stages

Tasks