# DataSource

!!! tip
    Learn more about DataSource in The Internals of Spark SQL online book.
## Creating Streaming Source (Data Source V1)

```scala
createSource(
  metadataPath: String): Source
```
`createSource` creates a new instance of the data source class and branches off per its type.

`createSource` is used when `MicroBatchExecution` is requested for an analyzed logical plan.
### StreamSourceProvider

For a `StreamSourceProvider`, `createSource` requests the `StreamSourceProvider` to create a source.
### FileFormat

For a `FileFormat`, `createSource` creates a new `FileStreamSource`.

`createSource` throws an `IllegalArgumentException` when the `path` option is not specified for a `FileFormat` data source:

```text
'path' is not specified
```
### Other Types

For any other data source type, `createSource` simply throws an `UnsupportedOperationException`:

```text
Data source [className] does not support streamed reading
```
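The type-based dispatch above can be sketched in plain Scala. This is a simplified illustration only: the `Provider`, `Other`, and `Source` types below are made-up stand-ins for Spark's `StreamSourceProvider`, `FileFormat`, and `Source` interfaces, and carry none of their real behavior.

```scala
// Simplified sketch of createSource's type-based dispatch.
// All types here are made-up stand-ins, not Spark's real interfaces.
sealed trait Provider
case object StreamSourceProvider extends Provider
case object FileFormat extends Provider
final case class Other(className: String) extends Provider

final case class Source(description: String)

def createSource(provider: Provider, options: Map[String, String]): Source =
  provider match {
    case StreamSourceProvider =>
      // the provider itself is asked to create the source
      Source("created by the StreamSourceProvider")
    case FileFormat =>
      // a file-based source requires the path option
      val path = options.getOrElse("path",
        throw new IllegalArgumentException("'path' is not specified"))
      Source(s"FileStreamSource over $path")
    case Other(className) =>
      throw new UnsupportedOperationException(
        s"Data source $className does not support streamed reading")
  }
```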
## SourceInfo

```scala
sourceInfo: SourceInfo
```
Metadata of a Source with the following:
- Name (alias)
- Schema
- Partitioning columns
`sourceInfo` is initialized (lazily) using sourceSchema.
!!! note "Lazy Value"
    `sourceInfo` is a Scala lazy value to guarantee that the code to initialize it is executed only once (when accessed for the first time), with the computed value cached afterwards.
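The lazy-initialization behavior can be demonstrated with plain Scala. The `SourceInfo` case class below is a stand-in with illustrative fields, not Spark's internal class:

```scala
// Stand-in for Spark's internal SourceInfo (name, schema, partitioning columns).
final case class SourceInfo(name: String, schema: Seq[String], partitionColumns: Seq[String])

object LazySourceInfoDemo {
  var initCount = 0

  // Scala's lazy val: the initializer runs once, on first access,
  // and the computed value is cached for every later access.
  lazy val sourceInfo: SourceInfo = {
    initCount += 1
    SourceInfo("rate", Seq("timestamp", "value"), Seq.empty)
  }
}
```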
Used when:

- `DataSource` is requested to create a streaming source for a File-Based Data Source (when `MicroBatchExecution` is requested to initialize the analyzed logical plan)
- `StreamingRelation` utility is used to create a StreamingRelation (when `DataStreamReader` is requested for a streaming query)
## Generating Metadata of Streaming Source

```scala
sourceSchema(): SourceInfo
```
`sourceSchema` creates a new instance of the data source class and branches off per its type.

`sourceSchema` is used when `DataSource` is requested for the SourceInfo.
### StreamSourceProvider

For a `StreamSourceProvider`, `sourceSchema` requests the `StreamSourceProvider` for the name and schema (of the streaming source).

In the end, `sourceSchema` returns the name and the schema as part of a `SourceInfo` (with partition columns unspecified).
### FileFormat

For a `FileFormat`, `sourceSchema`...FIXME
### Other Types

For any other data source type, `sourceSchema` simply throws an `UnsupportedOperationException`:

```text
Data source [className] does not support streamed reading
```
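`sourceSchema`'s branching can be sketched the same way as `createSource`'s. Again, the types below (`Provider`, `StreamProvider`, `Unsupported`, and this `SourceInfo`) are made-up stand-ins for illustration, not Spark's internal classes:

```scala
// Simplified sketch of sourceSchema's dispatch; all types are stand-ins.
final case class SourceInfo(name: String, schema: Seq[String], partitionColumns: Seq[String])

sealed trait Provider
final case class StreamProvider(name: String, schema: Seq[String]) extends Provider
final case class Unsupported(className: String) extends Provider

def sourceSchema(provider: Provider): SourceInfo =
  provider match {
    case StreamProvider(name, schema) =>
      // name and schema come from the provider; partition columns stay unspecified
      SourceInfo(name, schema, Seq.empty)
    case Unsupported(className) =>
      throw new UnsupportedOperationException(
        s"Data source $className does not support streamed reading")
  }
```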
## Creating Streaming Sink

```scala
createSink(
  outputMode: OutputMode): Sink
```

`createSink` creates a streaming sink for `StreamSinkProvider` or `FileFormat` data sources.
!!! tip
    Learn more about FileFormat in The Internals of Spark SQL online book.
`createSink` creates a new instance of the data source class and branches off per its type:

- For a `StreamSinkProvider`, `createSink` simply delegates the call and requests it to create a streaming sink
- For a `FileFormat`, `createSink` creates a `FileStreamSink` when the `path` option is specified and the output mode is `Append`
`createSink` throws an `IllegalArgumentException` when the `path` option is not specified for a `FileFormat` data source:

```text
'path' is not specified
```
`createSink` throws an `AnalysisException` when the given OutputMode is different from `Append` for a `FileFormat` data source:

```text
Data source [className] does not support [outputMode] output mode
```
`createSink` throws an `UnsupportedOperationException` for unsupported data source formats:

```text
Data source [className] does not support streamed writing
```
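The dispatch and the three error cases above can be sketched together in plain Scala. Everything below is a made-up stand-in: `SinkProvider`, `FileFormatSink`, and `Sink` approximate Spark's `StreamSinkProvider`, `FileFormat`, and `Sink`, and `AnalysisException` here is a plain placeholder class, not Spark's `org.apache.spark.sql.AnalysisException`:

```scala
// Simplified sketch of createSink's dispatch and validation.
// All types are made-up stand-ins, not Spark's real interfaces.
final case class AnalysisException(msg: String) extends Exception(msg)

sealed trait SinkProvider
case object StreamSinkProvider extends SinkProvider
case object FileFormatSink extends SinkProvider
final case class UnsupportedSink(className: String) extends SinkProvider

final case class Sink(description: String)

def createSink(
    provider: SinkProvider,
    options: Map[String, String],
    outputMode: String): Sink =
  provider match {
    case StreamSinkProvider =>
      // the call is simply delegated to the provider
      Sink("created by the StreamSinkProvider")
    case FileFormatSink =>
      // a FileStreamSink requires the path option and Append output mode
      val path = options.getOrElse("path",
        throw new IllegalArgumentException("'path' is not specified"))
      if (outputMode != "Append")
        throw AnalysisException(
          s"Data source fileFormat does not support $outputMode output mode")
      Sink(s"FileStreamSink to $path")
    case UnsupportedSink(className) =>
      throw new UnsupportedOperationException(
        s"Data source $className does not support streamed writing")
  }
```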
`createSink` is used when `DataStreamWriter` is requested to start a streaming query.