SharedState — State Shared Across SparkSessions

SharedState holds the state that can be shared across SparkSessions, such as the ExternalCatalog, the GlobalTempViewManager and the SQLAppStatusStore.

SharedState is shared when SparkSession is created using SparkSession.newSession:

assert(spark.sharedState == spark.newSession.sharedState)
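The following is a minimal sketch of what is and is not shared between sessions. It assumes Spark 3.x, where both sharedState and sessionState are accessible (unstable) members of SparkSession. The application name and the view name nums are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate() simply returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("shared-state-demo") // hypothetical application name
  .getOrCreate()

val other = spark.newSession()

// Both sessions reference the very same SharedState instance...
assert(spark.sharedState eq other.sharedState)

// ...while every session gets its own SessionState.
assert(spark.sessionState ne other.sessionState)

// A temporary view is session-scoped, so the other session cannot see it.
spark.range(3).createOrReplaceTempView("nums")
assert(spark.catalog.tableExists("nums"))
assert(!other.catalog.tableExists("nums"))
```

The snippet can be pasted into spark-shell as is.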

Creating Instance

SharedState takes the following to be created:

  • SparkContext
  • Initial configuration properties

SharedState is created for SparkSession (and cached for later reuse).

Accessing SharedState

SharedState is available using SparkSession.sharedState.

scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.sharedState
org.apache.spark.sql.internal.SharedState

Shared SQL Services

ExternalCatalog

externalCatalog: ExternalCatalog

ExternalCatalog that is created reflectively based on the spark.sql.catalogImplementation internal configuration property.

While being initialized, externalCatalog:

  1. Creates the default database (with the default database description and the warehousePath location) unless it already exists.

  2. Registers an ExternalCatalogEventListener that propagates external catalog events to the Spark listener bus.
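As a quick check of the initialization steps above, the shared ExternalCatalog can be inspected from any session. This is a sketch that assumes a local Spark 3.x session. It only uses databaseExists and getDatabase on the catalog, and the printed values depend on your environment.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Static configuration property: readable at runtime, but not settable.
println(spark.conf.get("spark.sql.catalogImplementation"))

val catalog = spark.sharedState.externalCatalog

// The default database is created while the ExternalCatalog is initialized.
assert(catalog.databaseExists("default"))

// Its location is derived from the warehouse path.
println(catalog.getDatabase("default").locationUri)
```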

GlobalTempViewManager

globalTempViewManager: GlobalTempViewManager

When accessed for the very first time, globalTempViewManager gets the name of the global temporary view database based on spark.sql.globalTempDatabase internal static configuration property.

In the end, globalTempViewManager creates a new GlobalTempViewManager (with the configured database name).

globalTempViewManager throws a SparkException when the global temporary view database already exists in the ExternalCatalog:

[globalTempDB] is a system preserved database, please rename your existing database to resolve the name conflict, or set a different value for spark.sql.globalTempDatabase, and launch your Spark application again.

globalTempViewManager is used when BaseSessionStateBuilder and HiveSessionStateBuilder are requested for a SessionCatalog.
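Since the GlobalTempViewManager lives in SharedState, a global temporary view registered in one session should be visible in every other session that shares the state. A minimal sketch, assuming a local Spark 3.x session (the view name shared_nums is made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Name of the global temporary view database (global_temp unless overridden).
val globalTempDb = spark.conf.get("spark.sql.globalTempDatabase")

// Registers the view with the shared GlobalTempViewManager.
spark.range(3).createOrReplaceGlobalTempView("shared_nums")

// A new session shares the SharedState and so sees the same global view.
val other = spark.newSession()
assert(other.table(s"$globalTempDb.shared_nums").count() == 3)
```

Global temporary views always have to be qualified with the global temporary view database name.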

SQLAppStatusStore

statusStore: SQLAppStatusStore

SharedState creates a SQLAppStatusStore when created.

When initialized, statusStore requests the SparkContext for the AppStatusStore that is then requested for the KVStore (which is assumed to be an ElementTrackingStore).

statusStore creates a SQLAppStatusListener (with the live flag on) and registers it with the LiveListenerBus on the application status queue.

statusStore creates a SQLAppStatusStore (with the KVStore and the SQLAppStatusListener).

In the end, statusStore creates a SQLTab (with the SQLAppStatusStore and the SparkUI if available).
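The shared SQLAppStatusStore can be queried for the SQL executions recorded so far. This is only a sketch, assuming a local Spark 3.x session and the executionsList method of SQLAppStatusStore. Events are delivered asynchronously, so no particular output is promised.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Run a query so that there is at least one SQL execution to report.
spark.range(5).count()

val store = spark.sharedState.statusStore

// Events reach the store asynchronously through the listener bus,
// so the most recent execution may not be visible immediately.
store.executionsList().foreach { e =>
  println(s"${e.executionId}: ${e.description}")
}
```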

externalCatalogClassName Internal Method

externalCatalogClassName(
  conf: SparkConf): String

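externalCatalogClassName gives the name of the class of the ExternalCatalog implementation based on the spark.sql.catalogImplementation configuration property:

  • hive: org.apache.spark.sql.hive.HiveExternalCatalog
  • in-memory: org.apache.spark.sql.catalyst.catalog.InMemoryCatalog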

externalCatalogClassName is used when SharedState is requested for the ExternalCatalog.
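The lookup can be pictured as a simple pattern match. The sketch below is not the Spark source: externalCatalogClassNameSketch is a hypothetical stand-in that takes a plain Map instead of a SparkConf so that it runs without Spark on the classpath. The fallback to in-memory and the explicit error case are assumptions of the sketch.

```scala
// Hypothetical stand-in for externalCatalogClassName (not the Spark source).
def externalCatalogClassNameSketch(conf: Map[String, String]): String =
  // Assumption: the property falls back to in-memory when it is not set.
  conf.getOrElse("spark.sql.catalogImplementation", "in-memory") match {
    case "hive"      => "org.apache.spark.sql.hive.HiveExternalCatalog"
    case "in-memory" => "org.apache.spark.sql.catalyst.catalog.InMemoryCatalog"
    // Added by the sketch to make unsupported values explicit.
    case other =>
      throw new IllegalArgumentException(s"Unsupported catalog implementation: $other")
  }

println(externalCatalogClassNameSketch(Map("spark.sql.catalogImplementation" -> "hive")))
```

The returned class name is what SharedState would then instantiate reflectively.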

Warehouse Location

warehousePath: String

Warning

This is no longer part of SharedState. This section will go away once I find out where warehousePath lives now. Your help is appreciated.

warehousePath is the location of the warehouse.

warehousePath is hive.metastore.warehouse.dir (if defined) or spark.sql.warehouse.dir.

warehousePath prints out the following INFO message to the logs when SharedState is created:

Warehouse path is '[warehousePath]'.

warehousePath is used when SharedState initializes ExternalCatalog (and creates the default database in the metastore).

While initialized, warehousePath does the following:

  1. Loads hive-site.xml when found on CLASSPATH, i.e. adds it as a configuration resource to Hadoop's Configuration (of SparkContext). See http://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/conf/Configuration.html

  2. Removes hive.metastore.warehouse.dir from SparkConf (of SparkContext) and leaves it off if defined using any of the Hadoop configuration resources.

  3. Sets spark.sql.warehouse.dir or hive.metastore.warehouse.dir in the Hadoop configuration (of SparkContext):

    • If hive.metastore.warehouse.dir has been defined in any of the Hadoop configuration resources but spark.sql.warehouse.dir has not, spark.sql.warehouse.dir becomes the value of hive.metastore.warehouse.dir.

    warehousePath prints out the following INFO message to the logs:

    spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('[hiveWarehouseDir]').
    
    • Otherwise, the Hadoop configuration's hive.metastore.warehouse.dir is set to spark.sql.warehouse.dir.

    warehousePath prints out the following INFO message to the logs:

    Setting hive.metastore.warehouse.dir ('[hiveWarehouseDir]') to the value of spark.sql.warehouse.dir ('[sparkWarehouseDir]').
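The decision in step 3 above can be sketched as a small pure function. This models the described rule only and is not the Spark code: resolveWarehousePathSketch and its parameters are hypothetical names, and the spark-warehouse default is an assumption standing in for the default value of spark.sql.warehouse.dir.

```scala
// Hypothetical helper that models the decision only, not the actual Spark code.
// hiveDir:  hive.metastore.warehouse.dir as found in the Hadoop configuration resources
// sparkDir: spark.sql.warehouse.dir, if it was set explicitly
// Returns the effective warehouse path that both properties end up pointing to.
def resolveWarehousePathSketch(
    hiveDir: Option[String],
    sparkDir: Option[String],
    defaultSparkDir: String = "spark-warehouse"): String =
  (hiveDir, sparkDir) match {
    // Only the Hive property is defined: spark.sql.warehouse.dir takes its value.
    case (Some(hive), None) => hive
    // Otherwise spark.sql.warehouse.dir (explicit or default) wins.
    case _ => sparkDir.getOrElse(defaultSparkDir)
  }

assert(resolveWarehousePathSketch(Some("/hive/wh"), None) == "/hive/wh")
assert(resolveWarehousePathSketch(Some("/hive/wh"), Some("/spark/wh")) == "/spark/wh")
assert(resolveWarehousePathSketch(None, None) == "spark-warehouse")
```

Note that an explicitly set spark.sql.warehouse.dir takes precedence over hive.metastore.warehouse.dir in this rule.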
    

Logging

Enable the ALL logging level for the org.apache.spark.sql.internal.SharedState logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.internal.SharedState=ALL

Refer to Logging.


Last update: 2020-11-26