
CatalogTable

CatalogTable is the specification (metadata) of a table.

CatalogTable is stored in a SessionCatalog (session-scoped catalog of relational entities).

scala> :type spark.sessionState.catalog
org.apache.spark.sql.catalyst.catalog.SessionCatalog

// Using high-level user-friendly catalog interface
scala> spark.catalog.listTables.filter($"name" === "t1").show
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
|  t1| default|       null|  MANAGED|      false|
+----+--------+-----------+---------+-----------+

// Using low-level internal SessionCatalog interface to access CatalogTables
val t1Tid = spark.sessionState.sqlParser.parseTableIdentifier("t1")
val t1Metadata = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(t1Tid)
scala> :type t1Metadata
org.apache.spark.sql.catalyst.catalog.CatalogTable

CatalogTable is created when:

  • SessionCatalog is requested for a table metadata

  • HiveClientImpl is requested for hive/HiveClientImpl.md#getTableOption[looking up a table in a metastore]

  • DataFrameWriter is requested to create a table

  • hive/InsertIntoHiveDirCommand.md[InsertIntoHiveDirCommand] logical command is executed

  • SparkSqlAstBuilder is requested to spark-sql-SparkSqlAstBuilder.md#visitCreateTable[visitCreateTable] and spark-sql-SparkSqlAstBuilder.md#visitCreateHiveTable[visitCreateHiveTable]

  • CreateTableLikeCommand logical command is executed

  • CreateViewCommand logical command is <> (and <>)

  • CatalogImpl is requested to createTable

[[simpleString]] The readable text representation of a CatalogTable (aka simpleString) is...FIXME

NOTE: simpleString is used exclusively when ShowTablesCommand logical command is executed (with a partition specification).

[[toString]] CatalogTable uses the following text representation (i.e. toString)...FIXME

CatalogTable is created with the optional bucketing specification (bucketSpec) that is used for the following:

  • CatalogImpl is requested to list the columns of a table

  • FindDataSourceTable logical evaluation rule is requested to readDataSourceTable (when executed for data source tables)

  • CreateTableLikeCommand logical command is executed

  • DescribeTableCommand logical command is requested to <> (when <>)

  • <> logical command is executed

  • <> and <> logical commands are executed

  • CatalogTable is requested to <>

  • HiveExternalCatalog is requested to doCreateTable, tableMetaToTableProps, doAlterTable, restoreHiveSerdeTable and restoreDataSourceTable

  • HiveClientImpl is requested to hive/HiveClientImpl.md#getTableOption[retrieve a table metadata if available] and hive/HiveClientImpl.md#toHiveTable[toHiveTable]

  • hive/InsertIntoHiveTable.md[InsertIntoHiveTable] logical command is executed

  • DataFrameWriter is requested to create a table (via saveAsTable)

  • SparkSqlAstBuilder is requested to <> and <>

Creating Instance

CatalogTable takes the following to be created:

  • [[identifier]] TableIdentifier
  • [[tableType]] Table type (CatalogTableType)
  • [[storage]] spark-sql-CatalogStorageFormat.md[CatalogStorageFormat]
  • [[schema]] Schema
  • [[provider]] Name of the table provider (optional)
  • [[partitionColumnNames]] Partition column names
  • [[bucketSpec]] Optional bucketing specification (default: None)
  • [[owner]] Owner
  • [[createTime]] Create time
  • [[lastAccessTime]] Last access time
  • [[createVersion]] Create version
  • [[properties]] Properties
  • [[stats]] Optional spark-sql-CatalogStatistics.md[table statistics]
  • [[viewText]] Optional view text
  • [[comment]] Optional comment
  • [[unsupportedFeatures]] Unsupported features
  • [[tracksPartitionsInCatalog]] tracksPartitionsInCatalog flag
  • [[schemaPreservesCase]] schemaPreservesCase flag
  • [[ignoredProperties]] Ignored properties
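
As a rough illustration of the parameters above, the following sketch builds a CatalogTable by hand in spark-shell. The table name demo, its two columns and the parquet provider are made up for the example, and the remaining parameters are left at their defaults (assuming the defaults of the case class in your Spark version).

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Hypothetical table specification: names and types are illustrative only
val demo = CatalogTable(
  identifier = TableIdentifier("demo", Some("default")),
  tableType = CatalogTableType.MANAGED,
  storage = CatalogStorageFormat.empty,
  // Partition columns are listed last in the schema
  schema = new StructType().add("id", IntegerType).add("p1", StringType),
  provider = Some("parquet"),
  partitionColumnNames = Seq("p1"))

// Optional parts that were not given stay undefined
println(demo.bucketSpec)
println(demo.stats)
```

Named arguments keep the sketch readable and make it less sensitive to parameters added in other Spark versions.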

=== [[CatalogTableType]] Table Type

The type of a table (CatalogTableType) can be one of the following:

  • EXTERNAL for external tables (hive/HiveClientImpl.md#getTableOption[EXTERNAL_TABLE] in Hive)
  • MANAGED for managed tables (hive/HiveClientImpl.md#getTableOption[MANAGED_TABLE] in Hive)
  • VIEW for views (hive/HiveClientImpl.md#getTableOption[VIRTUAL_VIEW] in Hive)
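
A small sketch of the three table types as they appear in code, assuming the values are exposed on the CatalogTableType companion object (as in the Spark versions this page describes). The helper describe is hypothetical and only exists for this example.

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTableType

// The three table types listed above
val tableTypes = Seq(
  CatalogTableType.EXTERNAL,
  CatalogTableType.MANAGED,
  CatalogTableType.VIEW)

// Hypothetical helper that turns a table type into a short description
def describe(tableType: CatalogTableType): String =
  if (tableType == CatalogTableType.VIEW) "view"
  else if (tableType == CatalogTableType.MANAGED) "managed table"
  else "external table"

tableTypes.foreach(t => println(s"${t.name}: ${describe(t)}"))
```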

CatalogTableType is included when a TreeNode is requested for a JSON representation for...FIXME

=== [[stats-metadata]] Table Statistics for Query Planning (Auto Broadcast Joins and Cost-Based Optimization)

You manage table metadata using the Catalog interface. One of the management tasks is to get the statistics of a table (which are used for spark-sql-cost-based-optimization.md[cost-based query optimization]).

[source, scala]

scala> t1Metadata.stats.foreach(println)
CatalogStatistics(714,Some(2),Map(p1 -> ColumnStat(2,Some(0),Some(1),0,4,4,None), id -> ColumnStat(2,Some(0),Some(1),0,4,4,None)))

scala> t1Metadata.stats.map(_.simpleString).foreach(println)
714 bytes, 2 rows


NOTE: The statistics are optional when CatalogTable is created.

CAUTION: FIXME When are stats specified? What if there are not?

Unless statistics are available in the table metadata (in a catalog) for a non-streaming file data source table, DataSource creates a HadoopFsRelation with the table size given by the spark.sql.defaultSizeInBytes internal property (default: Long.MaxValue) for query planning of joins (and possibly to auto-broadcast the table).

Internally, Spark alters table statistics using ExternalCatalog.doAlterTableStats.

Unless statistics are available in the table metadata (in a catalog) for a HiveTableRelation (with the hive provider), the DetermineTableStats logical resolution rule can compute the table size using HDFS (if the spark.sql.statistics.fallBackToHdfs property is turned on) or assume spark.sql.defaultSizeInBytes (which effectively disables table broadcasting).
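
A minimal sketch of why the default size estimate keeps a table from being broadcast, assuming a spark-shell session with untouched configuration. It reads the values through internal SQLConf accessors (defaultSizeInBytes, autoBroadcastJoinThreshold and fallBackToHdfsForStatsEnabled), which are not a stable public API and may differ between Spark versions.

```scala
// Internal SQLConf of the current session (spark is the spark-shell session)
val conf = spark.sessionState.conf

val defaultSize: Long = conf.defaultSizeInBytes
val broadcastThreshold: Long = conf.autoBroadcastJoinThreshold
val fallBackToHdfs: Boolean = conf.fallBackToHdfsForStatsEnabled

// A relation is a broadcast candidate only if its size estimate
// does not exceed the auto-broadcast threshold
val broadcastable = defaultSize <= broadcastThreshold

println(s"defaultSizeInBytes         = $defaultSize")
println(s"autoBroadcastJoinThreshold = $broadcastThreshold")
println(s"fallBackToHdfs             = $fallBackToHdfs")
println(s"broadcastable by default   = $broadcastable")
```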

When requested to hive/HiveClientImpl.md#getTableOption[look up a table in a metastore], HiveClientImpl hive/HiveClientImpl.md#readHiveStats[reads table or partition statistics directly from a Hive metastore].

You can use AnalyzeColumnCommand.md[AnalyzeColumnCommand], AnalyzePartitionCommand.md[AnalyzePartitionCommand], AnalyzeTableCommand.md[AnalyzeTableCommand] commands to record statistics in a catalog.
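
The following spark-shell sketch shows one way to record statistics with the ANALYZE TABLE SQL command and read them back from the table metadata. The table name stats_demo and its columns are assumptions made for the example; whether any statistics exist before the ANALYZE step depends on the session configuration.

```scala
import org.apache.spark.sql.catalyst.TableIdentifier

// Hypothetical demo table with two rows and two columns
spark.sql("DROP TABLE IF EXISTS stats_demo")
spark.range(2).selectExpr("id", "id AS p1").write.saveAsTable("stats_demo")

val tid = TableIdentifier("stats_demo")
val before = spark.sessionState.catalog.getTableMetadata(tid).stats
println(s"Before ANALYZE: $before")

// Records table-level and column-level statistics in the catalog
spark.sql("ANALYZE TABLE stats_demo COMPUTE STATISTICS FOR COLUMNS id, p1")

val after = spark.sessionState.catalog.getTableMetadata(tid).stats
after.map(_.simpleString).foreach(println)
```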

The table statistics can be automatically updated (after executing commands like AlterTableAddPartitionCommand) when spark.sql.statistics.size.autoUpdate.enabled property is turned on.

You can use DESCRIBE SQL command to show the histogram of a column if stored in a catalog.

=== [[dataSchema]] dataSchema Method

[source, scala]

dataSchema: StructType

dataSchema...FIXME

NOTE: dataSchema is used when...FIXME

=== [[partitionSchema]] partitionSchema Method

[source, scala]

partitionSchema: StructType

partitionSchema...FIXME

NOTE: partitionSchema is used when...FIXME

=== [[toLinkedHashMap]] Converting Table Specification to LinkedHashMap -- toLinkedHashMap Method

[source, scala]

toLinkedHashMap: mutable.LinkedHashMap[String, String]

toLinkedHashMap converts the table specification to a collection of pairs (LinkedHashMap[String, String]) with the following fields and their values:

  • Database with the database of the identifier

  • Table with the table of the identifier

  • Owner with the owner (if defined)

  • Created Time with the create time

  • Created By with Spark and the create version

  • Type with the name of the table type

  • Provider with the provider (if defined)

  • <> (of the <> if defined)

  • Comment with the comment (if defined)

  • View Text, View Default Database and View Query Output Columns for <>

  • Table Properties with the properties (if not empty)

  • Statistics with the statistics (if defined)

  • <> (of the <> if defined)

  • Partition Provider with Catalog if the tracksPartitionsInCatalog flag is on

  • Partition Columns with the partition column names (if not empty)

  • Schema with the schema (if not empty)

[NOTE]

toLinkedHashMap is used when:

  • DescribeTableCommand is requested to DescribeTableCommand.md#describeFormattedTableInfo[describeFormattedTableInfo] (when DescribeTableCommand is requested to DescribeTableCommand.md#run[run] for a non-temporary table and the DescribeTableCommand.md#isExtended[isExtended] flag on)

  • CatalogTable is requested for either a simpleString or a toString text representation
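
To see some of the fields listed above, this sketch builds a throwaway CatalogTable in spark-shell and prints its toLinkedHashMap. The table name demo, its columns and the parquet provider are assumptions for the example; the exact set of keys may vary with the Spark version.

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Hypothetical table specification used only to produce the map
val demoTable = CatalogTable(
  identifier = TableIdentifier("demo", Some("default")),
  tableType = CatalogTableType.MANAGED,
  storage = CatalogStorageFormat.empty,
  schema = new StructType().add("id", IntegerType).add("p1", StringType),
  provider = Some("parquet"),
  partitionColumnNames = Seq("p1"))

val fields = demoTable.toLinkedHashMap

// Entries come out in insertion order
fields.foreach { case (name, value) => println(s"$name: $value") }
```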

=== [[database]] database Method

[source, scala]

database: String

database simply returns the database (of the identifier) or throws an AnalysisException:

table [identifier] did not specify database
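
A short sketch of both outcomes, assuming spark-shell and two hypothetical table specifications that differ only in whether the identifier carries a database. The helper tableOf is made up for the example, and the failure is caught generically because the exact exception message can differ between Spark versions.

```scala
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.StructType

// Hypothetical helper that builds a bare-bones table specification
def tableOf(id: TableIdentifier): CatalogTable = CatalogTable(
  identifier = id,
  tableType = CatalogTableType.MANAGED,
  storage = CatalogStorageFormat.empty,
  schema = new StructType())

val withDb = tableOf(TableIdentifier("demo", Some("default")))
val noDb = tableOf(TableIdentifier("demo"))

// The identifier has a database, so it is simply returned
println(withDb.database)

// The identifier has no database, so the call is expected to fail
Try(noDb.database) match {
  case Success(db) => println(s"Unexpected database: $db")
  case Failure(e)  => println(s"${e.getClass.getSimpleName}: ${e.getMessage}")
}
```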

NOTE: database is used when...FIXME


Last update: 2020-11-16