== WriteToContinuousDataSourceExec Unary Physical Operator
WriteToContinuousDataSourceExec is a unary physical operator that creates (and then executes) a ContinuousWriteRDD for continuous write.

A unary physical operator (UnaryExecNode) is a physical operator with a single child physical operator.

Read up on https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-SparkPlan.html[UnaryExecNode] (and physical operators in general) in https://bit.ly/spark-sql-internals[The Internals of Spark SQL] book.
WriteToContinuousDataSourceExec is created when the DataSourceV2Strategy execution planning strategy is requested to plan a WriteToContinuousDataSource unary logical operator.
TIP: Read up on https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-SparkStrategy-DataSourceV2Strategy.html[DataSourceV2Strategy Execution Planning Strategy] in https://bit.ly/spark-sql-internals[The Internals of Spark SQL] book.
WriteToContinuousDataSourceExec takes the following to be created:

- [[writer]] StreamWriter
- [[query]][[child]] Child physical operator (SparkPlan)
WriteToContinuousDataSourceExec uses an empty output schema (which is exactly to say that no output is expected whatsoever).
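The empty output schema can be sketched in plain Scala (a self-contained sketch with stand-in types, not the real Spark API; `Attribute` and `PhysicalOperator` below are simplified stand-ins):

```scala
// Self-contained sketch with stand-in types (not the real Spark API):
// an empty `output` tells the query planner that this operator produces
// no columns, i.e. no rows are expected from it downstream.
final case class Attribute(name: String)

trait PhysicalOperator {
  def output: Seq[Attribute]
}

// Mirrors WriteToContinuousDataSourceExec declaring an empty output schema.
object WriteToContinuousDataSourceSketch extends PhysicalOperator {
  override def output: Seq[Attribute] = Nil
}

object EmptySchemaDemo extends App {
  assert(WriteToContinuousDataSourceSketch.output.isEmpty)
  println("output schema is empty: " + WriteToContinuousDataSourceSketch.output)
}
```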
[[logging]]
[TIP]
====
Enable ALL logging level for org.apache.spark.sql.execution.streaming.continuous.WriteToContinuousDataSourceExec logger to see what happens inside.

Add the following line to conf/log4j.properties:

 log4j.logger.org.apache.spark.sql.execution.streaming.continuous.WriteToContinuousDataSourceExec=ALL

Refer to the Logging section.
====
=== [[doExecute]] Executing Physical Operator (Generating RDD[InternalRow]) -- doExecute Method

doExecute is part of the SparkPlan Contract to generate the runtime representation of a physical operator as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow]).
doExecute requests the StreamWriter to create a DataWriterFactory.

doExecute then requests the child physical operator to execute (which generates an RDD[InternalRow]) and uses the RDD[InternalRow] and the DataWriterFactory to create a ContinuousWriteRDD.
doExecute prints out the following INFO message to the logs:
 Start processing data source writer: [writer]. The input RDD has [partitions] partitions.
doExecute requests the EpochCoordinatorRef helper for a remote reference to the EpochCoordinator RPC endpoint.

NOTE: The EpochCoordinator RPC endpoint runs on the driver as the single point to coordinate epochs across partition tasks.

doExecute requests the EpochCoordinator RPC endpoint reference to send out a SetWriterPartitions message (with the number of partitions of the ContinuousWriteRDD) synchronously.
In the end, doExecute requests the ContinuousWriteRDD to collect (which simply runs a Spark job on all partitions in an RDD and returns the results in an array).
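The steps above can be sketched end to end in plain Scala (stand-in types only; names like StreamWriter, DataWriterFactory, ContinuousWriteRDD and EpochCoordinatorRef mirror the text above, while everything else is a simplified assumption, not the real Spark API):

```scala
// Self-contained sketch of the doExecute flow with stand-in types
// (not the real Spark API).
final case class InternalRow(values: Seq[Any])

trait DataWriterFactory
trait StreamWriter { def createWriterFactory(): DataWriterFactory }

// Stand-in for the child physical operator's execute(): partitions of rows.
final case class ChildRDD(partitions: Seq[Seq[InternalRow]])

// Stand-in for ContinuousWriteRDD: pairs the input partitions with the factory.
final case class ContinuousWriteRDD(input: ChildRDD, factory: DataWriterFactory) {
  def getNumPartitions: Int = input.partitions.size
  // collect() runs one "task" per partition and returns the results.
  def collect(): Array[Unit] = input.partitions.map(_ => ()).toArray
}

// Stand-in for the remote reference to the EpochCoordinator RPC endpoint.
final class EpochCoordinatorRefSketch {
  var writerPartitions: Int = -1
  def askSync(numPartitions: Int): Unit = writerPartitions = numPartitions
}

object DoExecuteSketch extends App {
  val writer = new StreamWriter {
    def createWriterFactory() = new DataWriterFactory {}
  }
  val child = ChildRDD(Seq(Seq(InternalRow(Seq(1))), Seq(InternalRow(Seq(2)))))

  // 1. Ask the StreamWriter for a DataWriterFactory.
  val factory = writer.createWriterFactory()
  // 2. Execute the child operator and wrap its RDD together with the factory.
  val rdd = ContinuousWriteRDD(child, factory)
  // 3. Tell the epoch coordinator how many writer partitions there are.
  val coordinator = new EpochCoordinatorRefSketch
  coordinator.askSync(rdd.getNumPartitions)
  // 4. Run the Spark job: one task per partition.
  val results = rdd.collect()

  assert(coordinator.writerPartitions == 2)
  assert(results.length == 2)
  println("ran " + results.length + " write tasks")
}
```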
NOTE: Requesting the ContinuousWriteRDD to collect is how a Spark job is run that in turn runs tasks (one per partition) that are described by the ContinuousWriteRDD. While collect is meant to run a Spark job (with tasks on executors), it is at the discretion of the tasks themselves to decide when to finish (so if they want to run indefinitely, so be it). What a clever trick!
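That trick can be illustrated in plain Scala (a hypothetical sketch with no Spark involved): collect merely drains whatever iterator each task's compute returns, so a task that keeps producing epochs does not let collect return until the task itself decides it is done.

```scala
// Plain-Scala sketch of the trick: a "task" is an iterator whose hasNext
// decides when the task is done. collect() only returns once every task's
// iterator is exhausted -- which the task controls, not the caller.
object CollectTrickSketch extends App {

  // Stand-in for ContinuousWriteRDD.compute: loops over epochs; bounded at
  // three epochs here only so the demo terminates (a real continuous task
  // could keep going indefinitely).
  def compute(partition: Int): Iterator[String] = new Iterator[String] {
    private var epoch = 0
    def hasNext: Boolean = epoch < 3
    def next(): String = { epoch += 1; s"partition $partition epoch $epoch" }
  }

  // Stand-in for collect(): drain each partition's iterator into an array.
  def collect(numPartitions: Int): Array[String] =
    (0 until numPartitions).flatMap(p => compute(p)).toArray

  val results = collect(numPartitions = 2)
  assert(results.length == 6) // 2 partitions x 3 epochs each
  results.foreach(println)
}
```

Note how the caller of collect has no say in the number of iterations; it simply blocks until every per-partition iterator reports it has nothing more to produce.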