HadoopFsRelation is a BaseRelation and FileRelation.

Creating Instance

HadoopFsRelation takes the following to be created:

HadoopFsRelation is created when:

Bucketing Specification

HadoopFsRelation can be given a bucketing specification when created.

The bucketing specification is defined for non-streaming file-based data sources and used for the following:

Files to Scan (Input Files)

inputFiles: Array[String]

inputFiles requests the FileIndex for the inputFiles.

inputFiles is part of the FileRelation abstraction.

Estimated Size

sizeInBytes: Long

sizeInBytes requests the FileIndex for its size and multiplies it by the value of the spark.sql.sources.fileCompressionFactor configuration property.

sizeInBytes is part of the BaseRelation abstraction.
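The arithmetic behind the estimate can be sketched in plain Scala. This is a minimal illustration, not Spark's code: estimatedSizeInBytes and its parameter names are hypothetical, and it assumes the product is truncated to a Long. The default value of spark.sql.sources.fileCompressionFactor is 1.0, so by default the estimate equals the size reported by the FileIndex.

```scala
// Hypothetical helper (not part of Spark): models the estimate described above,
// i.e. the FileIndex size scaled by spark.sql.sources.fileCompressionFactor.
def estimatedSizeInBytes(fileIndexSizeInBytes: Long, fileCompressionFactor: Double): Long =
  (fileIndexSizeInBytes * fileCompressionFactor).toLong // assumption: truncated to Long

// 100 MB of files on disk (illustrative value)
val onDisk = 100L * 1024 * 1024

// With the default factor of 1.0 the estimate is the on-disk size
val withDefault = estimatedSizeInBytes(onDisk, 1.0)

// A factor of 3.0 triples the estimate, e.g. for heavily compressed files
val withFactor3 = estimatedSizeInBytes(onDisk, 3.0)

println(withDefault)
println(withFactor3)
```

A larger factor makes the planner treat the relation as bigger than its files on disk, which can matter for size-based decisions such as choosing a broadcast join.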

Human-Friendly Textual Representation

toString: String

toString is the following text based on the FileFormat:
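A minimal sketch of how such a name could be chosen, assuming the rule is "the short name of a FileFormat that is also a DataSourceRegister, and HadoopFiles otherwise" (verify this against the Spark version you use). The traits below are local stand-ins, not Spark's own, and ParquetLikeFormat, UnregisteredFormat and relationName are hypothetical names used only for illustration.

```scala
// Local stand-ins for Spark's FileFormat and DataSourceRegister (not the real traits)
trait FileFormat
trait DataSourceRegister { def shortName(): String }

// Hypothetical format that registers a short name
object ParquetLikeFormat extends FileFormat with DataSourceRegister {
  def shortName(): String = "parquet"
}

// Hypothetical format without a registered short name
object UnregisteredFormat extends FileFormat

// Assumed rule: use the short name when one is registered, a fixed fallback otherwise
def relationName(fileFormat: FileFormat): String = fileFormat match {
  case source: DataSourceRegister => source.shortName()
  case _                          => "HadoopFiles"
}

println(relationName(ParquetLikeFormat))
println(relationName(UnregisteredFormat))
```

This is consistent with the plans in the demo, where the relation is printed with the names parquet and text.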


// Demo the different cases when `HadoopFsRelation` is created

import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}

// Example 1: spark.table for DataSource tables (provider != hive)
import org.apache.spark.sql.catalyst.TableIdentifier
val t1ID = TableIdentifier(tableName = "t1")
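spark.sessionState.catalog.dropTable(name = t1ID, ignoreIfNotExists = true, purge = true)
// Create t1 with the default data source so the metadata lookup below succeeds
// Assumption: spark.range is used to match the id#7L column in the plan output
spark.range(5).write.saveAsTable("t1")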

val metadata = spark.sessionState.catalog.getTableMetadata(t1ID)
scala> println(metadata.provider.get)

assert(metadata.provider.get != "hive")

val q = spark.table("t1")
// Avoid dealing with UnresolvedRelations and SubqueryAliases,
// hence going straight for optimizedPlan
val plan1 = q.queryExecution.optimizedPlan

scala> println(plan1.numberedTreeString)
00 Relation[id#7L] parquet

val LogicalRelation(rel1, _, _, _) = plan1.asInstanceOf[LogicalRelation]
val hadoopFsRel = rel1.asInstanceOf[HadoopFsRelation]

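// Example 2: spark.read with format as a `FileFormat`
// Assumption: "README.md" is a placeholder path; point it at any existing text file
val q = spark.read.text("README.md")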
val plan2 = q.queryExecution.logical

scala> println(plan2.numberedTreeString)
00 Relation[value#2] text

val LogicalRelation(relation, _, _, _) = plan2.asInstanceOf[LogicalRelation]
val hadoopFsRel = relation.asInstanceOf[HadoopFsRelation]

// Example 3: Bucketing specified
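val tableName = "bucketed_4_id"
// Assumption: any dataset with an id column works; spark.range matches id#52L below
// and the row count is illustrative. bucketBy is only supported with saveAsTable.
spark
  .range(100)
  .write
  .bucketBy(4, "id")
  .mode("overwrite")
  .saveAsTable(tableName)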

val q = spark.table(tableName)
// Avoid dealing with UnresolvedRelations and SubqueryAliases,
// hence going straight for optimizedPlan
val plan3 = q.queryExecution.optimizedPlan

scala> println(plan3.numberedTreeString)
00 Relation[id#52L] parquet

val LogicalRelation(rel3, _, _, _) = plan3.asInstanceOf[LogicalRelation]
val hadoopFsRel = rel3.asInstanceOf[HadoopFsRelation]
val bucketSpec = hadoopFsRel.bucketSpec.get

// Exercise 3: spark.table for Hive tables (provider == hive)

Last update: 2020-11-16