EventTimeWatermarkExec Unary Physical Operator
EventTimeWatermarkExec is a unary physical operator that represents EventTimeWatermark logical operator at execution time.
A unary physical operator (UnaryExecNode) is a physical operator with a single child physical operator.
EventTimeWatermarkExec takes the following to be created:
- Attribute for event time (Spark SQL)
- Delay Interval (Spark SQL)
- Child Physical Operator (Spark SQL)
EventTimeWatermarkExec registers the EventTimeStatsAccum accumulator (with the current
Since the execution (data processing) happens on Spark executors, the only way for the tasks (on the executors) to communicate with the driver is the accumulator facility.
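The task-to-driver flow described above can be sketched in plain Scala, without Spark. This is only an illustration: `MaxEventTimeSketch` and `AccumulatorFlowSketch` are hypothetical names, not Spark classes, and the sketch assumes that each task fills its own zeroed copy and that the driver merges the copies once the tasks finish.

```scala
// Hypothetical stand-in for an accumulator; this is NOT Spark's accumulator API.
final class MaxEventTimeSketch(private var current: Long = Long.MinValue) {
  // Called on the "executor" side for every observed value.
  def add(eventTime: Long): Unit = { current = math.max(current, eventTime) }

  // Called on the "driver" side to fold in a finished task's copy.
  def merge(other: MaxEventTimeSketch): Unit = { current = math.max(current, other.current) }

  // A fresh, zeroed copy for a task to work on.
  def copyAndReset(): MaxEventTimeSketch = new MaxEventTimeSketch()

  def value: Long = current
}

object AccumulatorFlowSketch {
  // A "task" never touches the driver's instance; it fills a local copy.
  def runTask(template: MaxEventTimeSketch, partition: Seq[Long]): MaxEventTimeSketch = {
    val local = template.copyAndReset()
    partition.foreach(local.add)
    local
  }

  // The "driver" merges the per-task copies into its own instance.
  def runJob(partitions: Seq[Seq[Long]]): Long = {
    val driverSide = new MaxEventTimeSketch()
    partitions.map(runTask(driverSide, _)).foreach(driverSide.merge)
    driverSide.value
  }
}
```

For example, `AccumulatorFlowSketch.runJob(Seq(Seq(5L, 9L), Seq(12L, 3L)))` merges two per-task maxima into a single driver-side value.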
eventTimeStats is registered when EventTimeWatermarkExec is created.
eventTimeStats uses no name (unnamed accumulator).
eventTimeStats is used to transfer the statistics (maximum, minimum, average and update count) of the long values in the event-time watermark column to be used for the following:
- ProgressReporter is requested for the most recent execution statistics (the event-time watermark statistics)
- WatermarkTracker is requested to updateWatermark
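The four statistics (maximum, minimum, average and update count) can be modelled with a small immutable value, as in the sketch below. `EventTimeStatsSketch` is a hypothetical name and not Spark's own class. The running-average formula and the `watermarkCandidate` helper are assumptions made for illustration; the helper assumes a watermark candidate is the largest observed event time minus the delay, with both values in the same unit.

```scala
// Hypothetical model of the four statistics; NOT Spark's own class.
final case class EventTimeStatsSketch(max: Long, min: Long, avg: Double, count: Long) {
  // Fold one event-time value in, keeping a running average.
  def add(eventTime: Long): EventTimeStatsSketch = {
    val newCount = count + 1
    EventTimeStatsSketch(
      max = math.max(max, eventTime),
      min = math.min(min, eventTime),
      avg = avg + (eventTime - avg) / newCount,
      count = newCount)
  }

  // Combine statistics gathered independently, e.g. by two tasks.
  def merge(that: EventTimeStatsSketch): EventTimeStatsSketch =
    if (that.count == 0) this
    else if (count == 0) that
    else {
      val total = count + that.count
      EventTimeStatsSketch(
        math.max(max, that.max),
        math.min(min, that.min),
        avg + (that.avg - avg) * that.count / total, // count-weighted average
        total)
    }
}

object EventTimeStatsSketch {
  // The empty statistics: no value has been observed yet.
  val zero: EventTimeStatsSketch =
    EventTimeStatsSketch(Long.MinValue, Long.MaxValue, 0.0, 0L)

  // Assumption: candidate watermark = largest event time seen minus the delay.
  def watermarkCandidate(stats: EventTimeStatsSketch, delay: Long): Option[Long] =
    if (stats.count == 0) None else Some(stats.max - delay)
}
```

Keeping `count` alongside `avg` is what makes `merge` possible: two averages can only be combined correctly when each one's weight is known.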
Executing Physical Operator
doExecute is part of the SparkPlan (Spark SQL) abstraction.
doExecute executes the child physical operator and maps over the partitions (using
doExecute creates an unsafe projection (per partition) for the column with the event time in the output schema of the child physical operator. The unsafe projection is to extract event times from the (stream of) internal rows of the child physical operator.
The event time value is recorded in milliseconds, not seconds: Spark SQL stores timestamps internally as microseconds, and the value is scaled down before it is recorded.
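The per-partition processing described above can be sketched in plain Scala, with an `Iterator` standing in for a partition. Everything here is hypothetical (`DoExecuteSketch`, `Row`, `Collector`, `observePartition`), and the `divisor` is a parameter of the sketch rather than a statement about the value Spark uses.

```scala
import scala.collection.mutable.ArrayBuffer

object DoExecuteSketch {
  // Hypothetical row type: a payload plus an event-time column stored as a Long.
  final case class Row(payload: String, eventTime: Long)

  // Hypothetical collector standing in for the accumulator.
  final class Collector {
    private val seen = ArrayBuffer.empty[Long]
    def add(v: Long): Unit = { seen += v }
    def values: List[Long] = seen.toList
  }

  // One extractor is created per partition; every row is observed
  // and then passed through unchanged.
  def observePartition(rows: Iterator[Row], collector: Collector, divisor: Long): Iterator[Row] = {
    val getEventTime: Row => Long = _.eventTime // stands in for the per-partition projection
    rows.map { row =>
      collector.add(getEventTime(row) / divisor)
      row
    }
  }
}
```

Because `Iterator.map` is lazy, nothing is recorded in the sketch until a downstream consumer actually pulls the rows, and the rows themselves come out unmodified.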
output is part of the QueryPlan (Spark SQL) abstraction.
Check out Demo: Streaming Watermark with Aggregation in Append Output Mode to deep dive into the internals of Streaming Watermark.