Skip to content

FileFormat

FileFormat is an abstraction of data sources that can read and write data stored in files.

Contract

Building Data Reader

buildReader(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]

Builds a Catalyst data reader (a function that reads a single PartitionedFile file in to produce InternalRows).

buildReader throws an UnsupportedOperationException by default (and should therefore be overriden to work):

buildReader is not supported for [this]

Used when FileFormat is requested to buildReaderWithPartitionValues.

Inferring Schema

inferSchema(
  sparkSession: SparkSession,
  options: Map[String, String],
  files: Seq[FileStatus]): Option[StructType]

Infers the schema of the given files (as Hadoop FileStatuses) if supported. Otherwise, None should be returned.

Used when:

isSplitable

isSplitable(
  sparkSession: SparkSession,
  options: Map[String, String],
  path: Path): Boolean

Controls whether the format (under the given path as Hadoop Path) is splittable or not

Default: false

Used when FileSourceScanExec physical operator is requested to create an RDD for non-bucketed reads (when requested for the inputRDD)

Preparing Write

prepareWrite(
  sparkSession: SparkSession,
  job: Job,
  options: Map[String, String],
  dataSchema: StructType): OutputWriterFactory

Prepares a write job and returns an OutputWriterFactory

Used when FileFormatWriter utility is used to write out a query result

supportBatch

supportBatch(
  sparkSession: SparkSession,
  dataSchema: StructType): Boolean

Controls whether the format supports vectorized decoding (aka columnar batch) or not

Default: false

Used when:

supportDataType

supportDataType(
  dataType: DataType): Boolean

Controls whether this format supports the given DataType in read or write paths

Default: true (all data types are supported)

Used when DataSourceUtils is used to verifySchema

Vector Types

vectorTypes(
  requiredSchema: StructType,
  partitionSchema: StructType,
  sqlConf: SQLConf): Option[Seq[String]]

Defines the fully-qualified class names (types) of the concrete ColumnVectors for every column in the input requiredSchema and partitionSchema schemas (to use in columnar processing mode)

Default: None (undefined)

Used when FileSourceScanExec physical operator is requested for the vectorTypes

Implementations

Building Data Reader With Partition Values

buildReaderWithPartitionValues(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]

buildReaderWithPartitionValues builds a data reader with partition column values appended.

Note

buildReaderWithPartitionValues is simply an enhanced buildReader that appends partition column values to the internal rows produced by the reader function.

buildReaderWithPartitionValues builds a data reader with the input parameters and gives a data reader function (of a PartitionedFile to an Iterator[InternalRow]) that does the following:

  1. Creates a converter by requesting GenerateUnsafeProjection to generate an UnsafeProjection for the attributes of the input requiredSchema and partitionSchema

  2. Applies the data reader to a PartitionedFile and converts the result using the converter on the joined row with the partition column values appended.

buildReaderWithPartitionValues is used when FileSourceScanExec physical operator is requested for the inputRDD.


Last update: 2020-11-16