
DataSource

Tip

Learn more about DataSource in The Internals of Spark SQL online book.

Creating Streaming Source (Data Source V1)

createSource(
  metadataPath: String): Source

createSource creates a new instance of the data source class and branches off per the type:

createSource is used when MicroBatchExecution is requested for an analyzed logical plan.

StreamSourceProvider

For a StreamSourceProvider, createSource requests the StreamSourceProvider to create a source.

FileFormat

For a FileFormat, createSource creates a new FileStreamSource.

createSource throws an IllegalArgumentException when the path option is not specified for a FileFormat data source:

'path' is not specified

Other Types

For any other data source type, createSource simply throws an UnsupportedOperationException:

Data source [className] does not support streamed reading
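
The three branches above can be sketched with simplified stand-in types (`Provider`, `StreamProvider`, `FileBased` and the string results are illustrative only, not Spark's actual classes):

```scala
// Simplified model of createSource's per-type dispatch.
// Stand-in types for illustration; Spark's real classes differ.
sealed trait Provider
final case class StreamProvider(name: String) extends Provider
final case class FileBased(options: Map[String, String]) extends Provider
final case class Unsupported(className: String) extends Provider

def createSource(provider: Provider, metadataPath: String): String =
  provider match {
    case StreamProvider(name) =>
      // StreamSourceProvider branch: delegate source creation to the provider
      s"source created by $name"
    case FileBased(options) =>
      // FileFormat branch: the path option is mandatory
      val path = options.getOrElse("path",
        throw new IllegalArgumentException("'path' is not specified"))
      s"FileStreamSource(path=$path, metadataPath=$metadataPath)"
    case Unsupported(className) =>
      // Any other type: streamed reading is unsupported
      throw new UnsupportedOperationException(
        s"Data source $className does not support streamed reading")
  }
```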

SourceInfo

sourceInfo: SourceInfo

Metadata of a Source with the following:

  • Name (alias)
  • Schema
  • Partitioning columns

sourceInfo is initialized (lazily) using sourceSchema.
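
The shape of that metadata can be sketched as follows (the schema field is simplified to a DDL string to stay self-contained; Spark's SourceInfo carries a StructType):

```scala
// Sketch of the metadata triple; field types simplified for illustration
// (Spark's SourceInfo uses a StructType for the schema).
case class SourceInfo(
  name: String,                   // alias of the source, e.g. "FileSource[/tmp/in]"
  schema: String,                 // schema of the streaming data (simplified here)
  partitionColumns: Seq[String])  // partitioning columns, possibly empty

val info = SourceInfo("FileSource[/tmp/in]", "value STRING", Seq.empty)
```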

Lazy Value

sourceInfo is a Scala lazy value, which guarantees that the initialization code is executed only once (on first access) and the result is cached afterwards.
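
A minimal illustration of the once-only semantics (stand-in code, not Spark's):

```scala
// A lazy val's initializer runs once, on first access;
// subsequent accesses return the cached value.
var initCount = 0
class DataSourceLike {
  lazy val sourceInfo: String = {
    initCount += 1  // executed only on the first access
    "SourceInfo(name, schema, partitionColumns)"
  }
}
val ds = new DataSourceLike
ds.sourceInfo
ds.sourceInfo
ds.sourceInfo
// initCount is 1: the initializer ran exactly once
```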


Generating Metadata of Streaming Source

sourceSchema(): SourceInfo

sourceSchema creates a new instance of the data source class and branches off per the type:

sourceSchema is used when DataSource is requested for the SourceInfo.

StreamSourceProvider

For a StreamSourceProvider, sourceSchema requests the StreamSourceProvider for the name and schema (of the streaming source).

In the end, sourceSchema returns the name and the schema as part of SourceInfo (with partition columns unspecified).

FileFormat

For a FileFormat, sourceSchema...FIXME

Other Types

For any other data source type, sourceSchema simply throws an UnsupportedOperationException:

Data source [className] does not support streamed reading

Creating Streaming Sink

createSink(
  outputMode: OutputMode): Sink

createSink creates a streaming sink for StreamSinkProvider or FileFormat data sources.

Tip

Learn more about FileFormat in The Internals of Spark SQL online book.

createSink creates a new instance of the data source class and branches off per the type:

createSink throws an IllegalArgumentException when the path option is not specified for a FileFormat data source:

'path' is not specified

createSink throws an AnalysisException when the given OutputMode is not Append for a FileFormat data source:

Data source [className] does not support [outputMode] output mode

createSink throws an UnsupportedOperationException for unsupported data source formats:

Data source [className] does not support streamed writing
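
The dispatch and the three failure modes above can be sketched together with stand-in types (OutputMode is modeled as a plain string and AnalysisException as IllegalStateException; none of these are Spark's actual classes):

```scala
// Simplified model of createSink's validation and per-type dispatch.
sealed trait SinkProvider
final case class StreamSink(name: String) extends SinkProvider
final case class FileSink(options: Map[String, String]) extends SinkProvider
final case class NoSink(className: String) extends SinkProvider

def createSink(provider: SinkProvider, outputMode: String): String =
  provider match {
    case StreamSink(name) =>
      // StreamSinkProvider branch: delegate sink creation to the provider
      s"sink created by $name"
    case FileSink(options) =>
      // FileFormat branch: path is mandatory and only Append is supported
      val path = options.getOrElse("path",
        throw new IllegalArgumentException("'path' is not specified"))
      if (outputMode != "Append")
        throw new IllegalStateException(  // Spark raises an AnalysisException here
          s"Data source FileSink does not support $outputMode output mode")
      s"FileStreamSink(path=$path)"
    case NoSink(className) =>
      throw new UnsupportedOperationException(
        s"Data source $className does not support streamed writing")
  }
```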

createSink is used when DataStreamWriter is requested to start a streaming query.