Skip to content

CatalogStatistics -- Table Statistics From External Catalog (Metastore)

[[creating-instance]][[table-statistics]] CatalogStatistics are table statistics that are stored in an external catalog:

  • [[sizeInBytes]] Physical total size (in bytes)
  • [[rowCount]] Estimated number of rows (aka row count)
  • [[colStats]] Column statistics (i.e. column names and their[statistics])


CatalogStatistics is a "subset" of the statistics in Statistics (as there are no concepts of attributes and broadcast hint in metastore).

CatalogStatistics are often stored in a Hive metastore and are referred as Hive statistics while Statistics are the Spark statistics.

CatalogStatistics can be converted to Spark statistics using <> method.

CatalogStatistics is <> when:

  •[AnalyzeColumnCommand], AlterTableAddPartitionCommand and TruncateTableCommand commands are executed (and store statistics in ExternalCatalog)

  • CommandUtils is requested for updating existing table statistics, the current statistics (if changed)

  • HiveExternalCatalog is requested for restoring Spark statistics from properties (from a Hive Metastore)

  • hive/[DetermineTableStats] and[PruneFileSourcePartitions] logical optimizations are executed (i.e. applied to a logical plan)

  • HiveClientImpl is requested for a table or partition hive/[statistics from Hive's parameters]

[source, scala]

scala> :type spark.sessionState.catalog org.apache.spark.sql.catalyst.catalog.SessionCatalog

// Using higher-level interface to access CatalogStatistics // Make sure that you ran ANALYZE TABLE (as described above) val db = spark.catalog.currentDatabase val tableName = "t1" val metadata = spark.sharedState.externalCatalog.getTable(db, tableName) val stats = metadata.stats

scala> :type stats Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]

// Using low-level internal SessionCatalog interface to access CatalogTables val tid = spark.sessionState.sqlParser.parseTableIdentifier(tableName) val metadata = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(tid) val stats = metadata.stats

scala> :type stats Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]

[[simpleString]] CatalogStatistics has a text representation.

[source, scala]

scala> :type stats Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]

scala> 714 bytes, 2 rows

Converting Metastore Statistics to Spark Statistics

  planOutput: Seq[Attribute],
  cboEnabled: Boolean): Statistics

toPlanStats converts the table statistics (from an external metastore) to Spark statistics.

With[cost-based optimization] enabled and <> statistics available, toPlanStats creates a Statistics with the estimated total (output) size, <> and column statistics.

NOTE: Cost-based optimization is enabled when spark.sql.cbo.enabled configuration property is turned on, i.e. true, and is disabled by default.

Otherwise, when[cost-based optimization] is disabled, toPlanStats creates a Statistics with just the mandatory <>.

CAUTION: FIXME Why does toPlanStats compute sizeInBytes differently per CBO?


toPlanStats does the reverse of HiveExternalCatalog.statsToProperties.

toPlanStats is used when HiveTableRelation and LogicalRelation are requested for statistics.

Last update: 2020-11-07