CatalogImpl

CatalogImpl is the Catalog in Spark SQL that...FIXME

.CatalogImpl uses SessionCatalog (through SparkSession)
image::images/spark-sql-CatalogImpl.png[align="center"]

NOTE: CatalogImpl is in the org.apache.spark.sql.internal package.

=== [[createTable]] Creating Table -- createTable Method

[source, scala]
----
createTable(
  tableName: String,
  source: String,
  schema: StructType,
  options: Map[String, String]): DataFrame
----

createTable...FIXME

createTable is part of Catalog Contract to...FIXME.
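
A minimal usage sketch (assuming a running SparkSession available as spark; the people table name and the schema are made up for illustration):

[source, scala]
----
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Illustrative schema for a managed table backed by the parquet data source
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

val people = spark.catalog.createTable(
  "people",                    // tableName
  "parquet",                   // source (data source format)
  schema,
  Map.empty[String, String])   // options
----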

=== [[getTable]] getTable Method

[source, scala]
----
getTable(tableName: String): Table
getTable(dbName: String, tableName: String): Table
----

NOTE: getTable is part of Catalog Contract to...FIXME.

getTable...FIXME
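
A usage sketch (assuming a running SparkSession as spark and an existing t1 table in the default database; the names are illustrative):

[source, scala]
----
// Looks the table up in the session catalog and returns its metadata as a Table
val t = spark.catalog.getTable("t1")
println(t.tableType)

// Database-qualified variant
val sameTable = spark.catalog.getTable("default", "t1")
----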

=== [[getFunction]] getFunction Method

[source, scala]
----
getFunction(
  functionName: String): Function
getFunction(
  dbName: String,
  functionName: String): Function
----

NOTE: getFunction is part of Catalog Contract to...FIXME.

getFunction...FIXME
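
A usage sketch (assuming a running SparkSession as spark; abs is a built-in function registered in the session catalog, and my_udf is a made-up name for a user-defined function):

[source, scala]
----
// Built-in functions can be looked up by name
val fn = spark.catalog.getFunction("abs")
println(fn.description)

// Database-qualified variant for user-defined functions (name is illustrative)
// val myFn = spark.catalog.getFunction("default", "my_udf")
----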

=== [[functionExists]] functionExists Method

[source, scala]
----
functionExists(
  functionName: String): Boolean
functionExists(
  dbName: String,
  functionName: String): Boolean
----

NOTE: functionExists is part of Catalog Contract to...FIXME.

functionExists...FIXME
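
A usage sketch (assuming a running SparkSession as spark; my_udf is a made-up function name):

[source, scala]
----
spark.catalog.functionExists("abs")               // true (built-in function)
spark.catalog.functionExists("default", "my_udf") // false unless such a UDF has been registered
----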

=== [[cacheTable]] Caching Table or View In-Memory -- cacheTable Method

[source, scala]
----
cacheTable(tableName: String): Unit
----

Internally, cacheTable first SparkSession.md#table[creates a DataFrame for the table] and then requests CacheManager to cache it.

NOTE: cacheTable uses the SparkSession.md#sharedState[session-scoped SharedState] to access the CacheManager.

NOTE: cacheTable is part of Catalog contract.
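
A usage sketch (assuming a running SparkSession as spark and an existing table or view named t1):

[source, scala]
----
spark.catalog.cacheTable("t1")

// Caching is lazy; an action materializes the cached data
spark.table("t1").count
----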

=== [[clearCache]] Removing All Cached Tables From In-Memory Cache -- clearCache Method

[source, scala]
----
clearCache(): Unit
----

clearCache requests CacheManager to remove all cached tables from in-memory cache.

NOTE: clearCache is part of Catalog contract.
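
A usage sketch (assuming a running SparkSession as spark; the nums view is made up for illustration):

[source, scala]
----
spark.range(5).createOrReplaceTempView("nums")
spark.catalog.cacheTable("nums")

// Removes every cached table and view from the in-memory cache
spark.catalog.clearCache()
spark.catalog.isCached("nums")   // false
----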

=== [[createExternalTable]] Creating External Table From Path -- createExternalTable Method

[source, scala]
----
createExternalTable(tableName: String, path: String): DataFrame
createExternalTable(tableName: String, path: String, source: String): DataFrame
createExternalTable(
  tableName: String,
  source: String,
  options: Map[String, String]): DataFrame
createExternalTable(
  tableName: String,
  source: String,
  schema: StructType,
  options: Map[String, String]): DataFrame
----

createExternalTable creates an external table tableName from the given path and returns the corresponding DataFrame.

[source, scala]
----
import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...

val readmeTable = spark.catalog.createExternalTable("readme", "README.md", "text")
readmeTable: org.apache.spark.sql.DataFrame = [value: string]

scala> spark.catalog.listTables.filter(_.name == "readme").show
+------+--------+-----------+---------+-----------+
|  name|database|description|tableType|isTemporary|
+------+--------+-----------+---------+-----------+
|readme| default|       null| EXTERNAL|      false|
+------+--------+-----------+---------+-----------+

scala> sql("select count(*) as count from readme").show(false)
+-----+
|count|
+-----+
|99   |
+-----+
----


The source input parameter is the name of the data source provider for the table, e.g. parquet, json, text. If not specified, createExternalTable uses the spark.sql.sources.default configuration property to determine the data source format.

NOTE: The source input parameter must not be hive as that leads to an AnalysisException.

createExternalTable sets the mandatory path option when the path is specified explicitly in the input parameter list.

createExternalTable parses tableName into a TableIdentifier (using sql/SparkSqlParser.md[SparkSqlParser]). It creates a CatalogTable and then SessionState.md#executePlan[executes] (by toRDD) a CreateTable.md[CreateTable] logical plan. The result DataFrame is a Dataset[Row] with the QueryExecution after executing a SubqueryAlias.md[SubqueryAlias] logical plan and RowEncoder.

.CatalogImpl.createExternalTable
image::images/spark-sql-CatalogImpl-createExternalTable.png[align="center"]

NOTE: createExternalTable is part of Catalog contract.

=== [[listTables]] Listing Tables in Database (as Dataset) -- listTables Method

[source, scala]
----
listTables(): Dataset[Table]
listTables(dbName: String): Dataset[Table]
----

NOTE: listTables is part of Catalog Contract.

Internally, listTables requests the SessionState.md#catalog[SessionCatalog] to list all tables in the specified dbName database and <<makeTable, converts every TableIdentifier to a Table>>.

In the end, listTables <<makeDataset, creates a Dataset>> with the tables.
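
A usage sketch (assuming a running SparkSession as spark; the demo view is made up for illustration):

[source, scala]
----
spark.range(1).createOrReplaceTempView("demo")

// All tables and views visible in the current database
spark.catalog.listTables().show

// Tables of a given database (temporary views included), as a typed Dataset[Table]
spark.catalog.listTables("default").filter(_.name == "demo").show
----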

=== [[listColumns]] Listing Columns of Table (as Dataset) -- listColumns Method

[source, scala]
----
listColumns(tableName: String): Dataset[Column]
listColumns(dbName: String, tableName: String): Dataset[Column]
----

NOTE: listColumns is part of Catalog Contract.

listColumns requests the SessionState.md#catalog[SessionCatalog] for the table metadata.

listColumns takes the schema from the table metadata and creates a Column for every field (with the optional comment as the description).

In the end, listColumns <<makeDataset, creates a Dataset>> with the columns.
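
A usage sketch (assuming a running SparkSession as spark; the demo view is made up for illustration):

[source, scala]
----
spark.range(1).withColumnRenamed("id", "n").createOrReplaceTempView("demo")

// One row per column, with name, dataType, nullable, isPartition and isBucket
spark.catalog.listColumns("demo").show
----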

=== [[makeTable]] Converting TableIdentifier to Table -- makeTable Internal Method

[source, scala]
----
makeTable(tableIdent: TableIdentifier): Table
----

makeTable creates a Table using the input TableIdentifier and the table metadata (from the current SessionCatalog) if available.

NOTE: makeTable uses the SparkSession to access the SparkSession.md#sessionState[SessionState] that is then used to access the SessionState.md#catalog[SessionCatalog].

NOTE: makeTable is used when CatalogImpl is requested to <<listTables, listTables>> or <<getTable, getTable>>.
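
makeTable itself is private to CatalogImpl, but the Table it produces is the public org.apache.spark.sql.catalog.Table. A conceptual sketch of the mapping only (the field values are made up; this is not the actual implementation):

[source, scala]
----
import org.apache.spark.sql.catalog.Table

// Fields: name, database, description, tableType, isTemporary
val table = new Table("t1", "default", null, "MANAGED", false)
----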

=== [[makeDataset]] Creating Dataset from DefinedByConstructorParams Data -- makeDataset Method

[source, scala]
----
makeDataset[T <: DefinedByConstructorParams](
  data: Seq[T],
  sparkSession: SparkSession): Dataset[T]
----

makeDataset creates an ExpressionEncoder (from DefinedByConstructorParams) and encodes elements of the input data to internal binary rows.

makeDataset then creates a LocalRelation.md#creating-instance[LocalRelation] logical operator. makeDataset requests SessionState to SessionState.md#executePlan[execute the plan] and Dataset.md#creating-instance[creates] the result Dataset.

NOTE: makeDataset is used when CatalogImpl is requested to listDatabases, <<listTables, listTables>>, listFunctions and <<listColumns, listColumns>>.
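
makeDataset is private as well; the closest user-facing equivalent of the pattern it implements (an ExpressionEncoder derived for a DefinedByConstructorParams type over a local collection) is sketched below, using the public Database class with made-up values:

[source, scala]
----
import org.apache.spark.sql.catalog.Database
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Database extends DefinedByConstructorParams, so an ExpressionEncoder can be derived for it
val enc = ExpressionEncoder[Database]()
val dbs = Seq(new Database("default", "the default database", "file:/tmp/warehouse"))

// CatalogImpl would put the encoded rows into a LocalRelation and execute the plan;
// from user code the closest equivalent is createDataset with the same encoder
val ds = spark.createDataset(dbs)(enc)
ds.show
----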

=== [[refreshTable]] Refreshing Analyzed Logical Plan of Table Query and Re-Caching It -- refreshTable Method

[source, scala]
----
refreshTable(tableName: String): Unit
----

refreshTable requests SessionState for the SessionState.md#sqlParser[SQL parser] to spark-sql-ParserInterface.md#parseTableIdentifier[parse a TableIdentifier given the table name].

NOTE: refreshTable uses the SparkSession to access the SparkSession.md#sessionState[SessionState].

refreshTable requests the SessionState.md#catalog[SessionCatalog] for the table metadata.

refreshTable then SparkSession.md#table[creates a DataFrame for the table name].

For a temporary or persistent VIEW table, refreshTable requests the analyzed logical plan of the DataFrame (for the table) to spark-sql-LogicalPlan.md#refresh[refresh] itself.

For other types of table, refreshTable requests the SessionState.md#catalog[SessionCatalog] to refresh the table metadata (i.e. invalidate the table).

If the table has been cached, refreshTable requests CacheManager to uncache and cache the table DataFrame again.

NOTE: refreshTable uses the SparkSession to access the SparkSession.md#sharedState[SharedState] that is used to access CacheManager.

refreshTable is part of the Catalog abstraction.
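
A usage sketch (assuming a running SparkSession as spark and a table t1 whose underlying files were changed outside of Spark):

[source, scala]
----
// Refreshes the analyzed logical plan of the table query and, if the table
// was cached, re-caches it lazily
spark.catalog.refreshTable("t1")
----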

=== [[refreshByPath]] refreshByPath Method

[source, scala]
----
refreshByPath(resourcePath: String): Unit
----

refreshByPath...FIXME

refreshByPath is part of the Catalog abstraction.
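
A usage sketch (assuming a running SparkSession as spark; the path is made up for illustration):

[source, scala]
----
// Refreshes (and lazily re-caches) any cached data that was created from the given path
spark.catalog.refreshByPath("/tmp/data/events")
----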

=== [[dropGlobalTempView]] dropGlobalTempView Method

[source, scala]
----
dropGlobalTempView(viewName: String): Boolean
----

dropGlobalTempView...FIXME

dropGlobalTempView is part of the Catalog abstraction.
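
A usage sketch (assuming a running SparkSession as spark; the demo view is made up for illustration):

[source, scala]
----
spark.range(1).createOrReplaceGlobalTempView("demo")

// Global temporary views live in the global_temp database
spark.catalog.dropGlobalTempView("demo")   // true
spark.catalog.dropGlobalTempView("demo")   // false (already dropped)
----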

=== [[listColumns-internal]] listColumns Internal Method

[source, scala]
----
listColumns(tableIdentifier: TableIdentifier): Dataset[Column]
----

listColumns...FIXME

NOTE: listColumns is used exclusively when CatalogImpl is requested to <<listColumns, list the columns of a table (by name)>>.

