CatalogTable

CatalogTable is the specification (metadata) of a table.

CatalogTable is stored in a SessionCatalog (session-scoped catalog of relational entities).
[source, scala]
scala> :type spark.sessionState.catalog
org.apache.spark.sql.catalyst.catalog.SessionCatalog

// Using the high-level user-friendly Catalog interface
scala> spark.catalog.listTables.filter($"name" === "t1").show
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
|  t1| default|       null|  MANAGED|      false|
+----+--------+-----------+---------+-----------+

// Using the low-level internal SessionCatalog interface to access CatalogTables
val t1Tid = spark.sessionState.sqlParser.parseTableIdentifier("t1")
val t1Metadata = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(t1Tid)

scala> :type t1Metadata
org.apache.spark.sql.catalyst.catalog.CatalogTable
CatalogTable is created when:

- SessionCatalog is requested for a table metadata
- HiveClientImpl is requested for hive/HiveClientImpl.md#getTableOption[looking up a table in a metastore]
- DataFrameWriter is requested to create a table
- hive/InsertIntoHiveDirCommand.md[InsertIntoHiveDirCommand] logical command is executed
- SparkSqlAstBuilder does spark-sql-SparkSqlAstBuilder.md#visitCreateTable[visitCreateTable] and spark-sql-SparkSqlAstBuilder.md#visitCreateHiveTable[visitCreateHiveTable]
- CreateTableLikeCommand logical command is executed
- CreateViewCommand logical command is executed
- CatalogImpl is requested to createTable
[[simpleString]] The readable text representation of a CatalogTable (aka simpleString) is...FIXME

NOTE: simpleString is used exclusively when ShowTablesCommand logical command is executed.
[[toString]] CatalogTable uses the following text representation (i.e. toString)...FIXME
CatalogTable is used when:

- CatalogImpl is requested to list the columns of a table
- FindDataSourceTable logical evaluation rule is requested to readDataSourceTable (when executed for data source tables)
- CreateTableLikeCommand logical command is executed
- DescribeTableCommand logical command is executed
- CatalogTable is requested to <<toLinkedHashMap, toLinkedHashMap>>
- HiveExternalCatalog is requested to doCreateTable, tableMetaToTableProps, doAlterTable, restoreHiveSerdeTable and restoreDataSourceTable
- HiveClientImpl is requested to hive/HiveClientImpl.md#getTableOption[retrieve a table metadata if available] and hive/HiveClientImpl.md#toHiveTable[toHiveTable]
- hive/InsertIntoHiveTable.md[InsertIntoHiveTable] logical command is executed
- DataFrameWriter is requested to create a table (via saveAsTable)
- SparkSqlAstBuilder is requested to spark-sql-SparkSqlAstBuilder.md#visitCreateTable[visitCreateTable] and spark-sql-SparkSqlAstBuilder.md#visitCreateHiveTable[visitCreateHiveTable]
Creating Instance

CatalogTable takes the following to be created:

- [[identifier]] TableIdentifier
- [[tableType]] <<CatalogTableType, Table type>>
- [[storage]] spark-sql-CatalogStorageFormat.md[CatalogStorageFormat]
- [[schema]] Schema
- [[provider]] Name of the table provider (optional)
- [[partitionColumnNames]] Partition column names
- [[bucketSpec]] Optional bucketing specification (default: None)
- [[owner]] Owner
- [[createTime]] Create time
- [[lastAccessTime]] Last access time
- [[createVersion]] Create version
- [[properties]] Properties
- [[stats]] Optional spark-sql-CatalogStatistics.md[table statistics]
- [[viewText]] Optional view text
- [[comment]] Optional comment
- [[unsupportedFeatures]] Unsupported features
- [[tracksPartitionsInCatalog]] tracksPartitionsInCatalog flag
- [[schemaPreservesCase]] schemaPreservesCase flag
- [[ignoredProperties]] Ignored properties
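Most of the parameters above have defaults, so a minimal CatalogTable can be constructed directly. The sketch below is illustrative only; the table name t1 and the default database are made-up values for the example:

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.StructType

// Only the identifier, table type, storage format and schema are required;
// the remaining parameters (provider, partition columns, stats, ...) keep their defaults.
val table = CatalogTable(
  identifier = TableIdentifier("t1", Some("default")),
  tableType = CatalogTableType.MANAGED,
  storage = CatalogStorageFormat.empty,
  schema = new StructType().add("id", "long"))
```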
=== [[CatalogTableType]] Table Type

The type of a table (CatalogTableType) can be one of the following:

- EXTERNAL for external tables (hive/HiveClientImpl.md#getTableOption[EXTERNAL_TABLE] in Hive)
- MANAGED for managed tables (hive/HiveClientImpl.md#getTableOption[MANAGED_TABLE] in Hive)
- VIEW for views (hive/HiveClientImpl.md#getTableOption[VIRTUAL_VIEW] in Hive)

CatalogTableType is included when a TreeNode is requested for a JSON representation for...FIXME
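As a quick check, the table type of an existing table can be read off its CatalogTable. This is a sketch that assumes a SparkSession `spark` and the managed table t1 from the earlier examples:

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTableType

// Look up the CatalogTable of t1 and inspect its CatalogTableType
val tid = spark.sessionState.sqlParser.parseTableIdentifier("t1")
val meta = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(tid)
assert(meta.tableType == CatalogTableType.MANAGED)
```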
=== [[stats-metadata]] Table Statistics for Query Planning (Auto Broadcast Joins and Cost-Based Optimization)

You manage table metadata using the Catalog interface. Among the management tasks is to get the <<stats, table statistics>> (if defined).

[source, scala]
scala> t1Metadata.stats.foreach(println)
CatalogStatistics(714,Some(2),Map(p1 -> ColumnStat(2,Some(0),Some(1),0,4,4,None), id -> ColumnStat(2,Some(0),Some(1),0,4,4,None)))

scala> t1Metadata.stats.map(_.simpleString).foreach(println)
714 bytes, 2 rows

NOTE: The <<stats, table statistics>> of a CatalogTable are optional and may be undefined.

CAUTION: FIXME When are the stats specified? What if they are not?

Unless the <<stats, table statistics>> are available, DataSource creates a HadoopFsRelation with the table size specified by the spark.sql.defaultSizeInBytes internal property (default: Long.MaxValue) for query planning of joins (and possibly to auto broadcast the table).
Internally, Spark alters table statistics using ExternalCatalog.doAlterTableStats.
Unless the <<stats, table statistics>> are available for a HiveTableRelation (with the hive provider), the DetermineTableStats logical resolution rule can compute the table size using HDFS (if the spark.sql.statistics.fallBackToHdfs property is turned on) or assume spark.sql.defaultSizeInBytes (which effectively disables table broadcasting).
When requested to hive/HiveClientImpl.md#getTableOption[look up a table in a metastore], HiveClientImpl hive/HiveClientImpl.md#readHiveStats[reads table or partition statistics directly from a Hive metastore].
You can use AnalyzeColumnCommand.md[AnalyzeColumnCommand], AnalyzePartitionCommand.md[AnalyzePartitionCommand], AnalyzeTableCommand.md[AnalyzeTableCommand] commands to record statistics in a catalog.
The table statistics can be automatically updated (after executing commands like AlterTableAddPartitionCommand) when the spark.sql.statistics.size.autoUpdate.enabled property is turned on.

You can use the DESCRIBE SQL command to show the histogram of a column if stored in a catalog.
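Putting the pieces above together, statistics can be recorded with an ANALYZE TABLE command and read back from the SessionCatalog. This is a sketch assuming a SparkSession `spark`; the table name t2 is made up for the example:

```scala
// Create a small managed table (t2 is a hypothetical name for this example)
spark.range(2).write.saveAsTable("t2")

// Record table and column statistics in the catalog
spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR COLUMNS id")

// Read the CatalogStatistics back from the table's CatalogTable
val tid = spark.sessionState.sqlParser.parseTableIdentifier("t2")
val stats = spark.sessionState.catalog.getTableMetadata(tid).stats
stats.map(_.simpleString).foreach(println)  // size in bytes and row count
```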
=== [[dataSchema]] dataSchema Method

[source, scala]
dataSchema: StructType

dataSchema...FIXME

NOTE: dataSchema is used when...FIXME
=== [[partitionSchema]] partitionSchema Method

[source, scala]
partitionSchema: StructType

partitionSchema...FIXME

NOTE: partitionSchema is used when...FIXME
=== [[toLinkedHashMap]] Converting Table Specification to LinkedHashMap -- toLinkedHashMap Method

[source, scala]
toLinkedHashMap: mutable.LinkedHashMap[String, String]

toLinkedHashMap converts the table specification to a collection of pairs (LinkedHashMap[String, String]) with the following fields and their values:

- Database with the database of the <<identifier, TableIdentifier>>
- Table with the table of the <<identifier, TableIdentifier>>
- Owner with the <<owner, owner>> (if defined)
- Created Time with the <<createTime, create time>>
- Created By with Spark and the <<createVersion, create version>>
- Type with the name of the <<tableType, table type>>
- Provider with the <<provider, provider>> (if defined)
- <<bucketSpec, Bucketing specification>> (if defined)
- Comment with the <<comment, comment>> (if defined)
- View Text, View Default Database and View Query Output Columns for views
- Table Properties with the <<properties, properties>> (if not empty)
- Statistics with the <<stats, table statistics>> (if defined)
- <<storage, Storage specification>> (if defined)
- Partition Provider with Catalog if the <<tracksPartitionsInCatalog, tracksPartitionsInCatalog>> flag is on
- Partition Columns with the <<partitionColumnNames, partition columns>> (if not empty)
- Schema with the <<schema, schema>> (if not empty)

[NOTE]
====
toLinkedHashMap is used when:

- DescribeTableCommand is requested to DescribeTableCommand.md#describeFormattedTableInfo[describeFormattedTableInfo] (when DescribeTableCommand is requested to DescribeTableCommand.md#run[run] for a non-temporary table and the DescribeTableCommand.md#isExtended[isExtended] flag is on)
- CatalogTable is requested for either a <<simpleString, simpleString>> or a <<toString, toString>> text representation
====
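For illustration, the pairs can be inspected directly. This sketch assumes a SparkSession `spark` and the t1 table from the earlier examples:

```scala
// Look up the CatalogTable of t1 and print its toLinkedHashMap pairs
// in insertion order (Database, Table, Owner, Created Time, ...)
val tid = spark.sessionState.sqlParser.parseTableIdentifier("t1")
val meta = spark.sessionState.catalog.getTableMetadata(tid)
meta.toLinkedHashMap.foreach { case (k, v) => println(s"$k: $v") }
```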
=== [[database]] database Method

[source, scala]
database: String

database simply returns the database (of the <<identifier, TableIdentifier>>) or throws an AnalysisException:

table [identifier] did not specify database

NOTE: database is used when...FIXME