Source — Streaming Source in Micro-Batch Stream Processing
Source is used in Micro-Batch Stream Processing.
For fault tolerance, a Source must be able to replay an arbitrary sequence of past data in a stream using a range of offsets. This is the assumption that lets Structured Streaming achieve end-to-end exactly-once guarantees.
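The replay contract can be illustrated with a minimal sketch of a custom source. All names here are illustrative (a hypothetical `ReplayableSource` over an in-memory buffer), assuming Spark's DSv1 `Source` API and its `LongOffset` numeric offset:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical source that keeps every record it has ever produced,
// so any past offset range can be replayed after a failure.
class ReplayableSource(sqlContext: SQLContext) extends Source {
  private val log = scala.collection.mutable.ArrayBuffer.empty[String]

  override def schema: StructType =
    StructType(StructField("value", StringType) :: Nil)

  // Latest (maximum) offset, or None when no data has arrived yet.
  override def getOffset: Option[Offset] =
    if (log.isEmpty) None else Some(LongOffset(log.size - 1L))

  // Must be deterministic for a given (start, end] range: replaying the
  // same range after recovery has to yield exactly the same rows.
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
    val from  = start.map(_.asInstanceOf[LongOffset].offset + 1).getOrElse(0L)
    val until = end.asInstanceOf[LongOffset].offset + 1 // end is inclusive
    import sqlContext.implicits._
    // NOTE: a real source builds an internal *streaming* DataFrame; a plain
    // batch DataFrame is used here only to keep the sketch short.
    log.slice(from.toInt, until.toInt).toDF("value")
  }

  // Spark will never request offsets <= end again; a real source could
  // prune its log here.
  override def commit(end: Offset): Unit = ()

  override def stop(): Unit = ()
}
```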
commit(end: Offset): Unit

Commits data up to the end offset: informs the source that Spark has completed processing all data for offsets less than or equal to the end offset and will only request offsets greater than the end offset in the future.
getBatch(start: Option[Offset], end: Offset): DataFrame

Generates a streaming DataFrame with data between the start and end offsets.

The start offset can be undefined (None) to indicate that the batch should begin with the first record.
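A sketch of how an implementation might resolve the undefined start offset (the helper name is hypothetical; `LongOffset` is Spark's simple numeric offset):

```scala
import org.apache.spark.sql.execution.streaming.{LongOffset, Offset}

// Hypothetical helper: ordinal of the first record to include in a batch.
// None means the batch begins with the very first record; otherwise the
// batch resumes just after the already-processed `start` offset.
def firstOrdinal(start: Option[Offset]): Long =
  start match {
    case None                => 0L
    case Some(LongOffset(n)) => n + 1
    case Some(other)         => sys.error(s"unexpected offset type: $other")
  }
```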
getOffset: Option[Offset]

Latest (maximum) offset of the source (or None to denote no data).
schema: StructType

Schema of the data from this source.
initialOffset(): OffsetV2

initialOffset throws an IllegalStateException (should not be called).

initialOffset is part of the SparkDataStream abstraction.
deserializeOffset(json: String): OffsetV2

deserializeOffset throws an IllegalStateException (should not be called).

deserializeOffset is part of the SparkDataStream abstraction.
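Since the micro-batch engine is not expected to call these SparkDataStream methods on a DSv1 source, the trait implements them to fail fast. Roughly (a sketch based on my reading of the Spark 3.x sources, not a verbatim copy):

```scala
// Default implementations in the DSv1 Source trait: these SparkDataStream
// methods are never invoked for a V1 source, so they throw immediately.
override def initialOffset(): OffsetV2 = {
  throw new IllegalStateException("should not be called.")
}

override def deserializeOffset(json: String): OffsetV2 = {
  throw new IllegalStateException("should not be called.")
}
```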