CatalogStatistics -- Table Statistics From External Catalog (Metastore)
[[creating-instance]][[table-statistics]] CatalogStatistics are table statistics that are stored in an external catalog:
- [[sizeInBytes]] Physical total size (in bytes)
- [[rowCount]] Estimated number of rows (aka row count)
- [[colStats]] Column statistics (i.e. column names and their spark-sql-ColumnStat.md[statistics])
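The three statistics above can be pictured as a small case-class shape. The following is a self-contained sketch with stand-in types (`CatalogStatisticsSketch` and `ColumnStatSketch` are hypothetical names for illustration, not Spark's actual `org.apache.spark.sql.catalyst.catalog` classes):

```scala
// Simplified sketch of the shape of CatalogStatistics (stand-in case classes,
// not the real org.apache.spark.sql.catalyst.catalog classes).
final case class ColumnStatSketch(
  distinctCount: Option[BigInt] = None,  // e.g. number of distinct values
  nullCount: Option[BigInt] = None)      // e.g. number of nulls

final case class CatalogStatisticsSketch(
  sizeInBytes: BigInt,                                 // physical total size
  rowCount: Option[BigInt] = None,                     // estimated row count
  colStats: Map[String, ColumnStatSketch] = Map.empty) // column name -> stats

object ShapeDemo extends App {
  val stats = CatalogStatisticsSketch(
    sizeInBytes = 714,
    rowCount = Some(2),
    colStats = Map("id" -> ColumnStatSketch(distinctCount = Some(2))))
  println(s"${stats.sizeInBytes} bytes, ${stats.rowCount.get} rows")
}
```

Note that `sizeInBytes` is the only mandatory statistic; the row count and column statistics are optional and are typically populated only after `ANALYZE TABLE`-style commands run.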
[NOTE]
====
CatalogStatistics is a "subset" of the statistics in Statistics (as there are no concepts of attributes and broadcast hint in a metastore).

CatalogStatistics are often stored in a Hive metastore and are referred to as Hive statistics, while Statistics are the Spark statistics.
====
CatalogStatistics can be converted to Spark statistics using the toPlanStats method.

CatalogStatistics is created when:
- AnalyzeColumnCommand.md#run[AnalyzeColumnCommand], AlterTableAddPartitionCommand and TruncateTableCommand commands are executed (and store statistics in ExternalCatalog)

- CommandUtils is requested for updating existing table statistics, the current statistics (if changed)

- HiveExternalCatalog is requested for restoring Spark statistics from properties (from a Hive metastore)

- hive/DetermineTableStats.md#apply[DetermineTableStats] and spark-sql-SparkOptimizer-PruneFileSourcePartitions.md[PruneFileSourcePartitions] logical optimizations are executed (i.e. applied to a logical plan)

- HiveClientImpl is requested for a table or partition hive/HiveClientImpl.md#readHiveStats[statistics from Hive's parameters]
[source, scala]
----
scala> :type spark.sessionState.catalog
org.apache.spark.sql.catalyst.catalog.SessionCatalog

// Using higher-level interface to access CatalogStatistics
// Make sure that you ran ANALYZE TABLE (as described above)
val db = spark.catalog.currentDatabase
val tableName = "t1"
val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
val stats = metadata.stats

scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]

// Using low-level internal SessionCatalog interface to access CatalogTables
val tid = spark.sessionState.sqlParser.parseTableIdentifier(tableName)
val metadata = spark.sessionState.catalog.getTempViewOrPermanentTableMetadata(tid)
val stats = metadata.stats

scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]
----
[[simpleString]] CatalogStatistics has a text representation.
[source, scala]
----
scala> :type stats
Option[org.apache.spark.sql.catalyst.catalog.CatalogStatistics]

scala> stats.map(_.simpleString).foreach(println)
714 bytes, 2 rows
----
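A plausible way to derive such a text representation from the two scalar statistics is sketched below (a simplified stand-in, not Spark's actual implementation; `CatalogStatsText` is a hypothetical name):

```scala
// Hypothetical stand-in showing how a "N bytes, M rows" summary can be built:
// the row-count suffix appears only when the (optional) row count is available.
final case class CatalogStatsText(sizeInBytes: BigInt, rowCount: Option[BigInt]) {
  def simpleString: String = {
    val rows = rowCount.map(c => s", $c rows").getOrElse("")
    s"$sizeInBytes bytes$rows"
  }
}

object SimpleStringDemo extends App {
  println(CatalogStatsText(714, Some(2)).simpleString) // 714 bytes, 2 rows
  println(CatalogStatsText(714, None).simpleString)    // 714 bytes
}
```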
[[toPlanStats]] Converting Metastore Statistics to Spark Statistics
[source, scala]
----
toPlanStats(
  planOutput: Seq[Attribute],
  cboEnabled: Boolean): Statistics
----
toPlanStats converts the table statistics (from an external metastore) to Spark statistics.

With spark-sql-cost-based-optimization.md[cost-based optimization] enabled and <<rowCount, row count>> statistics available, toPlanStats creates a Statistics with the estimated total (output) size, <<rowCount, row count>> and <<colStats, column statistics>>.
NOTE: Cost-based optimization is enabled when the spark.sql.cbo.enabled configuration property is turned on, i.e. true, and is disabled by default.
Otherwise, when spark-sql-cost-based-optimization.md[cost-based optimization] is disabled, toPlanStats creates a Statistics with just the mandatory <<sizeInBytes, total size>>.
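The CBO branching described above can be sketched as follows. This is a self-contained simplification with stand-in types (not the real Spark classes); `avgRowSize` is a hypothetical substitute for the per-column output-size estimation that Spark derives from column statistics:

```scala
// Simplified, self-contained sketch of the toPlanStats branching (assumed
// shapes, not the real Spark classes): with CBO on and a row count available,
// the output size is re-estimated from the row count; otherwise only the
// mandatory sizeInBytes is carried over.
final case class Statistics(
  sizeInBytes: BigInt,
  rowCount: Option[BigInt] = None)

final case class CatalogStatistics(
  sizeInBytes: BigInt,
  rowCount: Option[BigInt] = None) {

  // avgRowSize is a hypothetical stand-in for the per-attribute size
  // estimation Spark performs from column statistics.
  def toPlanStats(avgRowSize: BigInt, cboEnabled: Boolean): Statistics =
    if (cboEnabled && rowCount.isDefined)
      Statistics(sizeInBytes = rowCount.get * avgRowSize, rowCount = rowCount)
    else
      Statistics(sizeInBytes = sizeInBytes)
}

object ToPlanStatsDemo extends App {
  val catalogStats = CatalogStatistics(sizeInBytes = 714, rowCount = Some(2))
  println(catalogStats.toPlanStats(avgRowSize = 357, cboEnabled = true))
  println(catalogStats.toPlanStats(avgRowSize = 357, cboEnabled = false))
}
```

The sketch makes the asymmetry visible: under CBO the physical size from the metastore is ignored in favor of a row-count-based estimate, which is exactly what the FIXME below asks about.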
CAUTION: FIXME Why does toPlanStats compute sizeInBytes differently per CBO?
NOTE: toPlanStats does the reverse of HiveExternalCatalog.statsToProperties.
toPlanStats is used when HiveTableRelation and LogicalRelation are requested for statistics.