
DataSourceRDD

DataSourceRDD is an RDD[InternalRow] that acts as a thin adapter between Spark SQL's DataSource V2 and Spark Core's RDD API.

DataSourceRDD uses DataSourceRDDPartitions for the partitions (that are mere wrappers of the InputPartitions).
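The wrapper can be sketched as follows (a simplified stand-in for Spark's actual DataSourceRDDPartition, which lives in org.apache.spark.sql.execution.datasources.v2; field names here are illustrative):

```scala
import org.apache.spark.Partition
import org.apache.spark.sql.connector.read.InputPartition

// A partition that carries the partition index (required by the RDD contract)
// and the wrapped DataSource V2 InputPartition.
case class DataSourceRDDPartition(index: Int, inputPartition: InputPartition)
  extends Partition
```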

Creating Instance

DataSourceRDD takes the following to be created:

  • SparkContext

  • InputPartitions (Seq[InputPartition])

  • PartitionReaderFactory

  • columnarReads flag

DataSourceRDD is created when:

  • BatchScanExec physical operator is requested for an input RDD

  • MicroBatchScanExec physical operator is requested for an input RDD

Preferred Locations For Partition

getPreferredLocations(
    split: Partition): Seq[String]

getPreferredLocations requests the given split (a DataSourceRDDPartition) for its InputPartition, which in turn is requested for the preferred locations.
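The delegation can be sketched as follows (a minimal sketch, assuming a DataSourceRDDPartition(index, inputPartition) wrapper; InputPartition.preferredLocations() returns an Array[String], hence the toSeq):

```scala
// Delegate preferred locations to the wrapped InputPartition.
override def getPreferredLocations(split: Partition): Seq[String] =
  split.asInstanceOf[DataSourceRDDPartition]
    .inputPartition
    .preferredLocations()
    .toSeq
```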

getPreferredLocations is part of Spark Core's RDD abstraction.

RDD Partitions

getPartitions: Array[Partition]

getPartitions simply creates a DataSourceRDDPartition for every InputPartition.
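A minimal sketch, assuming the InputPartitions are available as an inputPartitions: Seq[InputPartition] constructor argument and a DataSourceRDDPartition(index, inputPartition) wrapper:

```scala
// Wrap every InputPartition in a DataSourceRDDPartition,
// using its position in the sequence as the partition index.
override protected def getPartitions: Array[Partition] =
  inputPartitions.zipWithIndex.map { case (inputPartition, index) =>
    DataSourceRDDPartition(index, inputPartition)
  }.toArray
```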

getPartitions is part of Spark Core's RDD abstraction.

Computing Partition (in TaskContext)

compute(
    split: Partition,
    context: TaskContext): Iterator[T]

compute requests the given split (a DataSourceRDDPartition) for its InputPartition and then requests the PartitionReaderFactory to create a PartitionReader for it (row-based or columnar, based on the columnarReads flag). compute returns an iterator over the rows of the partition and makes sure that the reader is closed when the task completes.
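The row-based (non-columnar) path can be sketched as follows (a simplified sketch, assuming a DataSourceRDDPartition(index, inputPartition) wrapper and a partitionReaderFactory: PartitionReaderFactory constructor argument):

```scala
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.sql.catalyst.InternalRow

override def compute(
    split: Partition,
    context: TaskContext): Iterator[InternalRow] = {
  val inputPartition = split.asInstanceOf[DataSourceRDDPartition].inputPartition
  val reader = partitionReaderFactory.createReader(inputPartition)
  // Close the reader when the task finishes (successfully or not).
  context.addTaskCompletionListener[Unit](_ => reader.close())
  // Adapt the pull-based PartitionReader (next()/get()) to an Iterator.
  new Iterator[InternalRow] {
    private var valuePrepared = false
    override def hasNext: Boolean = {
      if (!valuePrepared) {
        valuePrepared = reader.next()
      }
      valuePrepared
    }
    override def next(): InternalRow = {
      if (!hasNext) {
        throw new NoSuchElementException("End of stream")
      }
      valuePrepared = false
      reader.get()
    }
  }
}
```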

compute is part of Spark Core's RDD abstraction.
