Offsets and Metadata Checkpointing
A streaming query can be started from scratch or resumed from a checkpoint. The checkpoint provides fault tolerance, since the state is preserved even when a failure happens.
Stream execution engines use the checkpoint location to resume stream processing and to get the start offsets from which query processing begins.
StreamExecution resumes (populates the start offsets) from the latest checkpointed offsets in the Write-Ahead Log (WAL) of Offsets. These offsets may have already been processed and, if so, committed to the Offset Commit Log.
- StreamProgress and StreamExecutions (committed and available offsets)
Micro-Batch Stream Processing
In <MicroBatchExecution
stream processing engine is requested to <MicroBatchExecution
is requested to <
The available offsets are then added to the committed offsets when the latest batch ID available (as described above) is exactly the latest batch ID committed to the Offset Commit Log when MicroBatchExecution
stream processing engine is requested to <
When a streaming query is started from scratch (with no checkpoint that has offsets in the Offset Write-Ahead Log), MicroBatchExecution prints out the following INFO message:
Starting new streaming query.
When a streaming query is resumed (restarted) from a checkpoint with offsets in the Offset Write-Ahead Log, MicroBatchExecution prints out the following INFO message:
Resuming at batch [currentBatchId] with committed offsets [committedOffsets] and available offsets [availableOffsets]
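The start-up decision described above can be sketched in plain Python. This is a simplified model, not Spark's code: the function name `populate_start_offsets`, the dict-based offset log, and the set-based commit log are all hypothetical stand-ins. The sketch also assumes that the offsets of the batch before the latest one count as committed, which the text above does not state.

```python
def populate_start_offsets(offset_log, commit_log):
    """Model of resuming from a checkpoint.

    offset_log: dict of batch id -> {source name: offset} (the offset WAL).
    commit_log: set of batch ids that were fully processed.
    Returns (current_batch_id, committed_offsets, available_offsets, message).
    """
    if not offset_log:
        # No offsets in the WAL: the query starts from scratch.
        return 0, {}, {}, "Starting new streaming query."

    latest_batch_id = max(offset_log)
    available = dict(offset_log[latest_batch_id])
    # Assumption: the previous batch (if any) has already been processed.
    committed = dict(offset_log.get(latest_batch_id - 1, {}))
    current_batch_id = latest_batch_id

    if commit_log and max(commit_log) == latest_batch_id:
        # The latest batch in the WAL was also committed, so its offsets
        # join the committed offsets and processing moves to the next batch.
        committed.update(available)
        current_batch_id = latest_batch_id + 1

    message = (
        f"Resuming at batch {current_batch_id} with committed offsets "
        f"{committed} and available offsets {available}"
    )
    return current_batch_id, committed, available, message
```

With a WAL holding batches 0 and 1 and a commit log holding only batch 0, the model resumes at batch 1, so the uncommitted batch is run again. When the commit log also holds batch 1, it resumes at batch 2.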
Every time MicroBatchExecution
is requested to <
When MicroBatchExecution
is requested to <MicroBatchExecution
requested to <
MicroBatchExecution prints out the following TRACE message to the logs:
noDataBatchesEnabled = [noDataBatchesEnabled], lastExecutionRequiresAnotherBatch = [lastExecutionRequiresAnotherBatch], isNewDataAvailable = [isNewDataAvailable], shouldConstructNextBatch = [shouldConstructNextBatch]
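How the four flags in the TRACE message relate to each other is not spelled out here. The sketch below encodes one plausible reading, stated as an assumption: a batch is constructed when new data is available, or when no-data batches are enabled and the last execution asks for another batch. The function name and the `last_execution_wants_another_batch` parameter are hypothetical.

```python
def should_construct_next_batch(
    no_data_batches_enabled,
    last_execution_wants_another_batch,
    is_new_data_available,
):
    """Return (decision, trace_message) under the assumed flag relationship."""
    # Assumption: the last execution only matters when no-data batches are on.
    last_execution_requires_another_batch = (
        no_data_batches_enabled and last_execution_wants_another_batch
    )
    decision = is_new_data_available or last_execution_requires_another_batch

    def fmt(flag):
        # Render booleans the way a JVM log would (true/false).
        return "true" if flag else "false"

    trace = (
        f"noDataBatchesEnabled = {fmt(no_data_batches_enabled)}, "
        f"lastExecutionRequiresAnotherBatch = "
        f"{fmt(last_execution_requires_another_batch)}, "
        f"isNewDataAvailable = {fmt(is_new_data_available)}, "
        f"shouldConstructNextBatch = {fmt(decision)}"
    )
    return decision, trace
```

Under this reading, a query with no new data still constructs a batch when no-data batches are enabled and the previous execution has pending work (for example, state that has to be expired).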
With <MicroBatchExecution
commits (adds) the available offsets for the batch to the Write-Ahead Log (WAL) and prints out the following INFO message to the logs:
Committed offsets for batch [currentBatchId]. Metadata [offsetSeqMetadata]
When <MicroBatchExecution
requests every Source and <
In the end (of <MicroBatchExecution
commits (adds) the available offsets (to the <
MicroBatchExecution prints out the following DEBUG message to the logs:
Completed batch [currentBatchId]
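The ordering of the two writes (offsets to the WAL before the batch runs, the commit after it finishes) is what makes a restart safe. The following is a minimal model of that ordering, not Spark's implementation: `CheckpointLogs`, `run_batch`, and the `process` callback are hypothetical names, and the logs are kept in memory instead of in a checkpoint directory.

```python
class CheckpointLogs:
    """In-memory stand-in for the offset WAL and the offset commit log."""

    def __init__(self):
        self.offset_log = {}    # batch id -> offsets written before the batch runs
        self.commit_log = set() # batch ids that completed
        self.committed_offsets = {}
        self.messages = []      # log lines, collected for inspection


def run_batch(logs, batch_id, available_offsets, process, metadata=None):
    """Run one micro-batch: write-ahead, process, then commit."""
    # 1. Record the available offsets in the WAL before any processing.
    logs.offset_log[batch_id] = dict(available_offsets)
    logs.messages.append(
        f"Committed offsets for batch {batch_id}. Metadata {metadata or {}}"
    )

    # 2. Process the data up to the available offsets. If this raises,
    #    the WAL entry exists but the commit below never happens.
    process(available_offsets)

    # 3. Mark the batch as done and fold its offsets into the committed ones.
    logs.commit_log.add(batch_id)
    logs.committed_offsets.update(available_offsets)
    logs.messages.append(f"Completed batch {batch_id}")
```

If `process` fails for some batch, that batch ID is present in `offset_log` but absent from `commit_log`, which is exactly the condition under which a restarted query runs the batch again instead of skipping it.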
Limitations (Assumptions)
It is assumed that the order of streaming sources in a streaming query matches the order of the offsets of OffsetSeq (in offsetLog) and availableOffsets.
In other words, a streaming query can be modified and then restarted from a checkpoint (to maintain stream processing state) only when the number of streaming sources and their order are preserved across restarts.
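A small sketch shows why the order matters when offsets are matched to sources by position. The helper `assign_offsets` and the source names are hypothetical. Raising an error on a count mismatch is a choice made for this illustration, not a claim about how Spark reacts.

```python
def assign_offsets(sources, offset_seq):
    """Pair each source with the checkpointed offset at the same position."""
    if len(sources) != len(offset_seq):
        # Illustration only: reject a changed number of sources outright.
        raise ValueError(
            f"{len(offset_seq)} checkpointed offsets but {len(sources)} sources"
        )
    # Positional matching: nothing here checks that the i-th offset
    # really belongs to the i-th source.
    return dict(zip(sources, offset_seq))


checkpointed = [42, 7]  # written when the query read "kafka" first, then "files"

original_order = assign_offsets(["kafka", "files"], checkpointed)
swapped_order = assign_offsets(["files", "kafka"], checkpointed)
```

With the original order, `kafka` resumes at 42 and `files` at 7. After swapping the sources, the same checkpoint silently hands 42 to `files` and 7 to `kafka`, so each source would resume from an offset that was recorded for the other one.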