= ColumnStat

[[creating-instance]]
`ColumnStat` holds the <<statistics, statistics>> of a table column.
[[statistics]]
.Column Statistics
[cols="1,2",options="header",width="100%"]
|===
| Name
| Description

| [[distinctCount]] `distinctCount`
| Number of distinct values

| [[min]] `min`
| Minimum value

| [[max]] `max`
| Maximum value

| [[nullCount]] `nullCount`
| Number of `null` values

| [[avgLen]] `avgLen`
| Average length of the values

| [[maxLen]] `maxLen`
| Maximum length of the values

| [[histogram]] `histogram`
| Histogram of values (as `Histogram` which is empty by default)
|===
`ColumnStat` is computed (and <<creating-instance, created>>) using the `ANALYZE TABLE...COMPUTE STATISTICS FOR COLUMNS` SQL command (that `SparkSqlAstBuilder` spark-sql-SparkSqlAstBuilder.md#ANALYZE-TABLE[translates] to the AnalyzeColumnCommand.md[AnalyzeColumnCommand] logical command).
[source, scala]
----
val cols = "id, p1, p2"
val analyzeTableSQL = s"ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS $cols"
spark.sql(analyzeTableSQL)
----
`ColumnStat` may optionally hold the <<histogram, histogram of values>> (which is empty by default). With the `spark.sql.statistics.histogram.enabled` configuration property turned on, the `ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS` SQL command generates column (equi-height) histograms.

NOTE: `spark.sql.statistics.histogram.enabled` is off by default.
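For example, you can flip the property at runtime and re-run the analysis (a minimal sketch that reuses the `t1` table from the example above):

[source, scala]
----
// Turn equi-height histogram generation on (off by default)
spark.sql("SET spark.sql.statistics.histogram.enabled=true")

// Re-run the analysis so the histogram statistic gets computed, too
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id")
----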
You can inspect the column statistics using the spark-sql-cost-based-optimization.md#DESCRIBE-EXTENDED[DESCRIBE EXTENDED] SQL command.
scala> sql("DESC EXTENDED t1 id").show
+--------------+----------+
|info_name |info_value|
+--------------+----------+
|col_name |id |
|data_type |int |
|comment |NULL |
|min |0 |
|max |1 |
|num_nulls |0 |
|distinct_count|2 |
|avg_col_len |4 |
|max_col_len |4 |
|histogram |NULL | <-- no histogram (spark.sql.statistics.histogram.enabled off)
+--------------+----------+
`ColumnStat` is part of the spark-sql-CatalogStatistics.md#colStats[statistics of a table].
[source, scala]
----
// Make sure that you ran ANALYZE TABLE (as described above)
val db = spark.catalog.currentDatabase
val tableName = "t1"
val metadata = spark.sharedState.externalCatalog.getTable(db, tableName)
val stats = metadata.stats.get

scala> :type stats
org.apache.spark.sql.catalyst.catalog.CatalogStatistics

val colStats = stats.colStats
scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]
----
`ColumnStat` is <<toMap, converted to properties>> (serialized) when table (column) statistics are persisted to an external catalog (e.g. a Hive metastore).
[source, scala]
----
scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]

val colName = "p1"

val p1stats = colStats(colName)
scala> :type p1stats
org.apache.spark.sql.catalyst.plans.logical.ColumnStat

import org.apache.spark.sql.types.DoubleType
val props = p1stats.toMap(colName, dataType = DoubleType)
scala> println(props)
Map(distinctCount -> 2, min -> 0.0, version -> 1, max -> 1.4, maxLen -> 8, avgLen -> 8, nullCount -> 0)
----
`ColumnStat` is <<fromMap, re-created from properties>> (deserialized) when `HiveExternalCatalog` is requested for restoring table statistics from properties (from a Hive metastore).
[source, scala]
----
scala> :type props
Map[String,String]

scala> println(props)
Map(distinctCount -> 2, min -> 0.0, version -> 1, max -> 1.4, maxLen -> 8, avgLen -> 8, nullCount -> 0)

import org.apache.spark.sql.types.StructField
val p1 = $"p1".double

import org.apache.spark.sql.catalyst.plans.logical.ColumnStat
val colStatsOpt = ColumnStat.fromMap(table = "t1", field = p1, map = props)

scala> :type colStatsOpt
Option[org.apache.spark.sql.catalyst.plans.logical.ColumnStat]
----
`ColumnStat` is also <<creating-instance, created>> when `JoinEstimation` is requested to estimateInnerOuterJoin for `Inner`, `Cross`, `LeftOuter`, `RightOuter` and `FullOuter` joins.
[source, scala]
----
val tableName = "t1"

// Make the example reproducible
import org.apache.spark.sql.catalyst.TableIdentifier
val tid = TableIdentifier(tableName)
val sessionCatalog = spark.sessionState.catalog
sessionCatalog.dropTable(tid, ignoreIfNotExists = true, purge = true)

// CREATE TABLE t1
Seq((0, 0, "zero"), (1, 1, "one")).
  toDF("id", "p1", "p2").
  write.
  saveAsTable(tableName)

// As we drop and create immediately we may face problems with unavailable partition files
// Invalidate cache
spark.sql(s"REFRESH TABLE $tableName")

// Use ANALYZE TABLE...FOR COLUMNS to compute column statistics
// that saves them in a metastore (aka an external catalog)
val df = spark.table(tableName)
val allCols = df.columns.mkString(",")
val analyzeTableSQL = s"ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS $allCols"
spark.sql(analyzeTableSQL)

// Fetch the table metadata (with column statistics) from a metastore
val metastore = spark.sharedState.externalCatalog
val db = spark.catalog.currentDatabase
val tableMeta = metastore.getTable(db, table = tableName)

// The column statistics are part of the table statistics
val colStats = tableMeta.stats.get.colStats

scala> :type colStats
Map[String,org.apache.spark.sql.catalyst.plans.logical.ColumnStat]

// the output may vary
scala> colStats.map { case (name, cs) => s"$name: $cs" }.foreach(println)
id: ColumnStat(2,Some(0),Some(1),0,4,4,None)
p1: ColumnStat(2,Some(0),Some(1),0,4,4,None)
p2: ColumnStat(2,None,None,0,4,4,None)
----
NOTE: `ColumnStat` does not support <<min, min>> and <<max, max>> column statistics for binary (i.e. `Array[Byte]`) and string types (note the `None` values for the `p2` column in the output above).
=== [[toExternalString]] Converting Value to External/Java Representation (per Catalyst Data Type) -- toExternalString Internal Method
[source, scala]
----
toExternalString(v: Any, colName: String, dataType: DataType): String
----
`toExternalString`...FIXME
NOTE: `toExternalString` is used exclusively when `ColumnStat` is requested to <<toMap, convert itself to properties>> (for the `min` and `max` statistics).
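The gist of the conversion can be sketched as follows (a hypothetical `toExternalStringSketch`, not Spark's verbatim implementation; it only hints at the per-type mapping):

[source, scala]
----
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.sql.types._

// Sketch: internal Catalyst values are turned into their
// external/Java representation before being rendered as a String
def toExternalStringSketch(v: Any, dataType: DataType): String = {
  val external = dataType match {
    case DateType      => DateTimeUtils.toJavaDate(v.asInstanceOf[Int])       // days since epoch -> java.sql.Date
    case TimestampType => DateTimeUtils.toJavaTimestamp(v.asInstanceOf[Long]) // microseconds -> java.sql.Timestamp
    case _             => v                                                   // booleans and numbers need no conversion
  }
  external.toString
}
----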
=== [[supportsHistogram]] supportsHistogram Method
[source, scala]
----
supportsHistogram(dataType: DataType): Boolean
----
`supportsHistogram`...FIXME

NOTE: `supportsHistogram` is used when...FIXME
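Given that equi-height histograms are built from percentiles, `supportsHistogram` presumably boils down to a data type check, along the lines of the following sketch (an assumption, not Spark's verbatim code):

[source, scala]
----
import org.apache.spark.sql.types._

// Sketch: histograms require an ordered, numeric-like type
def supportsHistogramSketch(dataType: DataType): Boolean = dataType match {
  case _: IntegralType           => true
  case _: DecimalType            => true
  case DoubleType | FloatType    => true
  case DateType | TimestampType  => true
  case _                         => false
}
----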
=== [[toMap]] Converting ColumnStat to Properties (ColumnStat Serialization) -- toMap Method
[source, scala]
----
toMap(colName: String, dataType: DataType): Map[String, String]
----
`toMap` converts the <<statistics, column statistics>> to the following properties.
[[properties]]
.ColumnStat.toMap's Properties
[cols="1,2",options="header",width="100%"]
|===
| Key
| Value

| `version`
| `1`

| `distinctCount`
| <<distinctCount, distinctCount>>

| `nullCount`
| <<nullCount, nullCount>>

| `avgLen`
| <<avgLen, avgLen>>

| `maxLen`
| <<maxLen, maxLen>>

| `min`
| <<min, min>>

| `max`
| <<max, max>>

| `histogram`
| Serialized version of <<histogram, histogram>> (using `HistogramSerializer.serialize`)
|===
NOTE: `toMap` adds the `min`, `max` and `histogram` entries only if they are available.
NOTE: Interestingly, the `colName` and `dataType` input parameters bring no value to `toMap` itself, but merely allow for more user-friendly error reporting when <<toExternalString, converting>> the `min` and `max` column statistics.
`toMap` is used when `HiveExternalCatalog` is requested to convert table statistics to properties (before persisting them as part of table metadata in a Hive metastore).
=== [[fromMap]] Re-Creating Column Statistics from Properties (ColumnStat Deserialization) -- fromMap Method
[source, scala]
----
fromMap(table: String, field: StructField, map: Map[String, String]): Option[ColumnStat]
----
`fromMap` creates a `ColumnStat` by fetching the <<properties, properties>> of every <<statistics, column statistic>> from the input `map`.
`fromMap` returns `None` when recovering the column statistics fails (for whatever reason), printing out the following WARN message to the logs:

----
WARN Failed to parse column statistics for column [fieldName] in table [table]
----
NOTE: Interestingly, the `table` input parameter brings no value to `fromMap` itself, but merely allows for more user-friendly error reporting when parsing the column statistics fails.
`fromMap` is used when `HiveExternalCatalog` is requested to restore table statistics from properties (from a Hive metastore).
=== [[rowToColumnStat]] Creating Column Statistics from InternalRow (Result of Computing Column Statistics) -- rowToColumnStat Method
[source, scala]
----
rowToColumnStat(
  row: InternalRow,
  attr: Attribute,
  rowCount: Long,
  percentiles: Option[ArrayData]): ColumnStat
----
`rowToColumnStat` <<creating-instance, creates>> a `ColumnStat` from the input `row` with the statistics at the following positions:

[start=0]
. <<distinctCount, distinctCount>>
. <<min, min>>
. <<max, max>>
. <<nullCount, nullCount>>
. <<avgLen, avgLen>>
. <<maxLen, maxLen>>

If the 6th field is not empty, `rowToColumnStat` uses it to create the <<histogram, histogram>> (see the sketch below).
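The positional layout can be sketched as follows (a simplified sketch that skips the histogram branch, i.e. the percentiles in the 6th field):

[source, scala]
----
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.ColumnStat

// Sketch: read the basic statistics off the row, position by position
def rowToColumnStatSketch(row: InternalRow, attr: Attribute): ColumnStat =
  ColumnStat(
    distinctCount = BigInt(row.getLong(0)),
    min = Option(row.get(1, attr.dataType)), // null (i.e. None) for string/binary types
    max = Option(row.get(2, attr.dataType)),
    nullCount = BigInt(row.getLong(3)),
    avgLen = row.getLong(4),
    maxLen = row.getLong(5))
----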
NOTE: `rowToColumnStat` is used exclusively when `AnalyzeColumnCommand` is AnalyzeColumnCommand.md#run[executed] (to AnalyzeColumnCommand.md#computeColumnStats[compute the statistics for specified columns]).
=== [[statExprs]] statExprs Method
[source, scala]
----
statExprs(
  col: Attribute,
  conf: SQLConf,
  colPercentiles: AttributeMap[ArrayData]): CreateNamedStruct
----
`statExprs`...FIXME

NOTE: `statExprs` is used when...FIXME
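Although the internals are left as FIXME here, the aggregates that end up in the named struct can be approximated at the DataFrame level (an illustrative sketch only; the column name `p1` is assumed, and the real `statExprs` builds Catalyst expressions directly and computes lengths per data type):

[source, scala]
----
import org.apache.spark.sql.functions._

// Sketch: one aggregate per column statistic, bundled in a single pass
val colName = "p1"
spark.table("t1").select(
  approx_count_distinct(col(colName))      as "distinctCount",
  min(col(colName))                        as "min",
  max(col(colName))                        as "max",
  count(when(col(colName).isNull, 1))      as "nullCount",
  avg(length(col(colName).cast("string"))) as "avgLen", // illustrative only
  max(length(col(colName).cast("string"))) as "maxLen"  // illustrative only
).show()
----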