MetadataLogFileIndex

MetadataLogFileIndex is a PartitioningAwareFileIndex of metadata log files (generated by FileStreamSink).

Tip

Learn more about PartitioningAwareFileIndex in The Internals of Spark SQL online book.

Creating Instance

MetadataLogFileIndex takes the following to be created:

  • SparkSession
  • Hadoop Path
  • Parameters (Map[String, String])
  • User-Defined Schema (Option[StructType])

MetadataLogFileIndex is created when:

  • DataSource is requested to resolveRelation (for FileFormat streaming data sources)
  • FileTable is requested for a PartitioningAwareFileIndex (for FileFormat streaming data sources)
  • FileStreamSource is requested to allFilesUsingMetadataLogFileIndex

While being created, MetadataLogFileIndex prints out the following INFO message to the logs (with the metadataDirectory):

Reading streaming file log from [metadataDirectory]

Metadata Directory

metadataDirectory: Path

metadataDirectory is the Hadoop Path of the _spark_metadata directory under the given path.

metadataDirectory is used to create a FileStreamSinkLog.
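
As a rough illustration of how the metadata directory relates to the given path, the following sketch resolves _spark_metadata under an output directory. It uses java.nio.file.Path as a stand-in for Hadoop's Path so it runs without Hadoop on the classpath. The MetadataDirSketch object and the /tmp/streaming-output path are hypothetical names used for this example only.

```scala
import java.nio.file.{Path, Paths}

// Sketch only: java.nio.file.Path stands in for org.apache.hadoop.fs.Path.
object MetadataDirSketch {
  // Name of the directory that holds the metadata log files.
  val MetadataDirName = "_spark_metadata"

  // Resolve the metadata directory under the given output path.
  def metadataDirectory(outputPath: String): Path =
    Paths.get(outputPath).resolve(MetadataDirName)

  def main(args: Array[String]): Unit = {
    // Hypothetical output directory of a streaming query.
    val dir = metadataDirectory("/tmp/streaming-output")
    println(s"Reading streaming file log from $dir")
  }
}
```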

FileStreamSinkLog

metadataLog: FileStreamSinkLog

metadataLog is a FileStreamSinkLog created for the Metadata Directory.

metadataLog is used to load the Metadata Log Files.

Metadata Log Files

allFilesFromLog: Array[FileStatus]

allFilesFromLog requests the FileStreamSinkLog for all files and converts every file to its Hadoop FileStatus representation.

allFilesFromLog is used for leafFiles and leafDirToChildrenFiles.
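
The conversion described above can be pictured with the following minimal sketch. LogEntry and FileStatusLike are hypothetical, simplified stand-ins (not Spark's or Hadoop's actual classes), and the fields they carry are assumptions chosen for illustration.

```scala
// Sketch only: simplified stand-ins, not Spark's or Hadoop's actual classes.
object AllFilesFromLogSketch {
  // Hypothetical entry of the sink's metadata log.
  final case class LogEntry(path: String, size: Long, modificationTime: Long)

  // Hypothetical stand-in for org.apache.hadoop.fs.FileStatus.
  final case class FileStatusLike(path: String, length: Long, modificationTime: Long)

  // Each log entry is asked for its file-status representation.
  def toFileStatus(entry: LogEntry): FileStatusLike =
    FileStatusLike(entry.path, entry.size, entry.modificationTime)

  // Same shape as allFilesFromLog: all log entries, converted, as an Array.
  def allFilesFromLog(entries: Seq[LogEntry]): Array[FileStatusLike] =
    entries.map(toFileStatus).toArray

  def main(args: Array[String]): Unit = {
    // Hypothetical log content.
    val log = Seq(
      LogEntry("/out/part-00000.parquet", 1024L, 1L),
      LogEntry("/out/part-00001.parquet", 2048L, 2L))
    allFilesFromLog(log).foreach(println)
  }
}
```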

Leaf Files

leafFiles: mutable.LinkedHashMap[Path, FileStatus]

leafFiles is a lookup table of the Metadata Log Files (allFilesFromLog) by their paths.

leafFiles is part of the PartitioningAwareFileIndex abstraction (Spark SQL).

leafDirToChildrenFiles

leafDirToChildrenFiles: Map[Path, Array[FileStatus]]

leafDirToChildrenFiles groups the Metadata Log Files (allFilesFromLog) by their parent directories.

leafDirToChildrenFiles is part of the PartitioningAwareFileIndex abstraction (Spark SQL).
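
A minimal sketch of the two structures, assuming (as their types suggest) that leafFiles keys every file by its path and leafDirToChildrenFiles groups the files by their parent directory. FileEntry is a hypothetical stand-in for Hadoop's FileStatus, and plain strings stand in for Hadoop Paths so the example runs without Spark or Hadoop.

```scala
import scala.collection.mutable

// Sketch only: FileEntry is a hypothetical stand-in for Hadoop's FileStatus,
// and plain Strings stand in for Hadoop Paths.
object LeafViewsSketch {
  final case class FileEntry(path: String, length: Long) {
    // Parent directory: everything before the last '/' (the path must contain one).
    def parent: String = path.substring(0, path.lastIndexOf('/'))
  }

  // Insertion-ordered lookup of every file by its path.
  def leafFiles(files: Array[FileEntry]): mutable.LinkedHashMap[String, FileEntry] = {
    val byPath = mutable.LinkedHashMap.empty[String, FileEntry]
    files.foreach(f => byPath += (f.path -> f))
    byPath
  }

  // Files grouped by the directory that directly contains them.
  def leafDirToChildrenFiles(files: Array[FileEntry]): Map[String, Array[FileEntry]] =
    files.groupBy(_.parent)

  def main(args: Array[String]): Unit = {
    // Hypothetical partitioned output of a streaming query.
    val files = Array(
      FileEntry("/out/d=1/part-0.parquet", 10L),
      FileEntry("/out/d=1/part-1.parquet", 20L),
      FileEntry("/out/d=2/part-0.parquet", 30L))
    leafFiles(files).keys.foreach(println)
    leafDirToChildrenFiles(files).foreach { case (dir, children) =>
      println(s"$dir -> ${children.map(_.path).mkString(", ")}")
    }
  }
}
```

Both views are derived from the same array of files, so they always describe the same set of files, only indexed differently.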

Logging

Enable ALL logging level for org.apache.spark.sql.execution.streaming.MetadataLogFileIndex logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.execution.streaming.MetadataLogFileIndex=ALL

Refer to Logging.


Last update: 2020-10-27