The Core of Apache Spark

Apache Spark 2.4.1 / Spark Core

@jaceklaskowski / StackOverflow / GitHub
The "Internals" Books: Apache Spark / Spark SQL / Spark Structured Streaming

SparkContext — The Entry Point to Spark Services

  1. SparkContext is the entry point to the Spark services in your Spark application
  2. SparkContext manages the connection to a Spark execution environment
    • Defined using the master URL (see the sketch after this list)
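
A minimal sketch of the above, assuming only that Spark is on the classpath (the object and application names are made up): the master URL, here local[*] for a local execution environment using all CPU cores, defines the environment the SparkContext connects to.

  import org.apache.spark.{SparkConf, SparkContext}

  object SparkContextIntro extends App {
    // The master URL defines the Spark execution environment,
    // e.g. local[*], spark://host:port or yarn
    val conf = new SparkConf()
      .setAppName("spark-core-intro")
      .setMaster("local[*]")

    // SparkContext is the entry point to Spark services in this application
    val sc = new SparkContext(conf)

    // e.g. creating and computing over RDDs
    val numbers = sc.parallelize(1 to 100)
    println(s"count = ${numbers.count()}")

    sc.stop()
  }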

Partitions

  1. Partitions are logical buckets for data.
  2. Partitions correspond to Hadoop's splits (if the data lives on HDFS) or partitioning schemes in the source storage
  3. An RDD (and hence the data inside it) is partitioned.
  4. Spark manages data using partitions, which helps parallelize distributed data processing with minimal network traffic for sending data between executors (see the sketch after this list).
  5. Data in partitions can be skewed, i.e. unevenly distributed across partitions.
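
A minimal sketch, reusing the SparkContext sc from the sketch above: parallelize takes a target number of partitions (numSlices), and mapPartitionsWithIndex shows how many records ended up in each partition, which makes skew visible.

  // 8 logical buckets for the 1000 records
  val rdd = sc.parallelize(1 to 1000, numSlices = 8)
  println(rdd.getNumPartitions)  // 8

  // records per partition -- an uneven spread means the data is skewed
  rdd
    .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
    .collect()
    .foreach { case (idx, n) => println(s"partition $idx -> $n records") }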

RDD Operators

  1. A transformation is a lazy RDD operation that creates one or more RDDs
  2. An action is an RDD operation that triggers the computation and produces non-RDD Scala values (see the sketch after this list)
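
A minimal sketch, again reusing sc: the transformations below only build new RDDs and run nothing, while the count action triggers the computation and returns a plain Scala value.

  val lines = sc.parallelize(Seq("spark core", "rdd operators", "spark shuffle"))

  // transformations: lazy, each one creates a new RDD, nothing is computed yet
  val words  = lines.flatMap(_.split("\\s+"))
  val sparks = words.filter(_ == "spark")

  // action: triggers the computation and produces a non-RDD Scala value
  val howMany: Long = sparks.count()
  println(howMany)  // 2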

Shuffle

  1. Shuffle is Spark's mechanism for re-distributing data so that it’s grouped differently across partitions.
  2. Data is often distributed unevenly across partitions.
  3. The repartition and coalesce operators can change how a dataset is partitioned (see the sketch after this list).
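
A minimal sketch, reusing sc and the rdd from the partitions sketch: repartition always shuffles, coalesce can reduce the number of partitions without a full shuffle, and wide transformations such as reduceByKey shuffle so that all values of a key land in the same partition.

  // full shuffle into 16 partitions
  println(rdd.repartition(16).getNumPartitions)   // 16

  // shrinks to 2 partitions without a full shuffle
  println(rdd.coalesce(2).getNumPartitions)       // 2

  // reduceByKey re-distributes the data so it is grouped by key across partitions
  val counts = sc.parallelize(Seq("a", "b", "a")).map((_, 1)).reduceByKey(_ + _)
  counts.collect().foreach(println)               // (a,2) and (b,1), in any order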

DAGScheduler

Jobs

Stages

Tasks