
DataSourceScanExec -- Leaf Physical Operators to Scan Over BaseRelation

DataSourceScanExec is the contract of leaf physical operators that represent scans over BaseRelation.

NOTE: There are two DataSourceScanExec implementations, i.e. FileSourceScanExec and RowDataSourceScanExec, with a scan over data in HadoopFsRelation and generic BaseRelation relations, respectively.

DataSourceScanExec supports Java code generation (aka codegen).

[[contract]]
[source, scala]

package org.apache.spark.sql.execution

trait DataSourceScanExec extends LeafExecNode with CodegenSupport {
  // only required vals and methods that have no implementation
  // the others follow
  def metadata: Map[String, String]
  val relation: BaseRelation
  val tableIdentifier: Option[TableIdentifier]
}

.(Subset of) DataSourceScanExec Contract
[cols="1,2",options="header",width="100%"]
|===
| Property
| Description

| metadata
| [[metadata]] Metadata (as a collection of key-value pairs) that describes the scan when requested for the simple text representation.

| relation
| [[relation]] BaseRelation that is used in the node name and...FIXME

| tableIdentifier
| [[tableIdentifier]] Optional TableIdentifier
|===

NOTE: The prefix for variable names for DataSourceScanExec operators in a generated Java source code is scan.

[[nodeNamePrefix]] The default node name prefix is an empty string (that is used in the simple text description).

[[nodeName]] DataSourceScanExec uses the BaseRelation and the TableIdentifier as the node name in the following format:

Scan [relation] [tableIdentifier]
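The node name format above can be sketched without any Spark dependency. This is a minimal illustration only: NodeNameSketch and TableId are hypothetical names (TableId stands in for TableIdentifier), and rendering an undefined table identifier as an empty string is an assumption of the sketch.

```scala
// Standalone sketch (no Spark dependency); all names below are hypothetical.
object NodeNameSketch {
  // Hypothetical stand-in for TableIdentifier: a table name with an optional database
  final case class TableId(table: String, database: Option[String] = None) {
    def unquotedString: String = database.map(db => s"$db.$table").getOrElse(table)
  }

  // Builds "Scan [relation] [tableIdentifier]".
  // Assumption: an undefined identifier contributes an empty string.
  def nodeName(relation: Any, tableIdentifier: Option[TableId]): String =
    s"Scan $relation ${tableIdentifier.map(_.unquotedString).getOrElse("")}"
}
```

With a relation that prints as JDBCRelation(people) and the identifier db.people, the sketch gives Scan JDBCRelation(people) db.people.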

[[implementations]]
.DataSourceScanExecs
[width="100%",cols="1,2",options="header"]
|===
| DataSourceScanExec
| Description

| FileSourceScanExec
| [[FileSourceScanExec]] Scan over data in HadoopFsRelation

| RowDataSourceScanExec
| [[RowDataSourceScanExec]] Scan over data in a generic BaseRelation
|===

=== [[simpleString]] Simple (Basic) Text Node Description (in Query Plan Tree) -- simpleString Method

[source, scala]

simpleString: String

NOTE: simpleString is part of the QueryPlan Contract to give the simple text description of a TreeNode in a query plan tree.

simpleString creates a text representation of every key-value entry in the metadata...FIXME

Internally, simpleString sorts the metadata and concatenates the keys and the values (separated by :). While doing so, simpleString redacts sensitive information in every value and abbreviates it to the first 100 characters.

simpleString uses truncatedString of Spark Core's Utils.

In the end, simpleString returns a text representation that is made up of the nodeNamePrefix, the nodeName, the output (schema attributes) and the metadata, as in the following example:


[source, scala]

// Paste in spark-shell (:paste mode); sc and spark are the shell's built-in values
def basicDataSourceScanExec = {
  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.catalyst.expressions.AttributeReference
  val output = Seq.empty[AttributeReference]
  val requiredColumnsIndex = output.indices
  import org.apache.spark.sql.sources.Filter
  val filters, handledFilters = Set.empty[Filter]
  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.UnsafeRow
  val row: InternalRow = new UnsafeRow(0)
  val rdd: RDD[InternalRow] = sc.parallelize(row :: Nil)

  import org.apache.spark.sql.sources.{BaseRelation, TableScan}
  val baseRelation: BaseRelation = new BaseRelation with TableScan {
    import org.apache.spark.sql.SQLContext
    val sqlContext: SQLContext = spark.sqlContext

    import org.apache.spark.sql.types.StructType
    val schema: StructType = new StructType()

    import org.apache.spark.sql.Row
    // never called in this example, so it is left unimplemented
    def buildScan(): RDD[Row] = ???
  }

  val tableIdentifier = None
  import org.apache.spark.sql.execution.RowDataSourceScanExec
  RowDataSourceScanExec(
    output, requiredColumnsIndex, filters, handledFilters, rdd, baseRelation, tableIdentifier)
}

val scanExec = basicDataSourceScanExec
scala> println(scanExec.simpleString)
Scan $line143.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anon$1@57d94b26 [] PushedFilters: [], ReadSchema: struct<>
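The steps that simpleString is described to take (sort the metadata, redact and abbreviate every value, then concatenate everything) can be sketched in plain Scala. This is only an approximation under stated assumptions: SimpleStringSketch, its abbreviate helper and the parameter names are hypothetical, the output attributes are modelled as plain strings, and the abbreviation is assumed to end with an ellipsis when a value is cut.

```scala
// Standalone sketch (no Spark dependency); all names below are hypothetical.
object SimpleStringSketch {
  // Shortens a value to at most maxLen characters.
  // Assumption: a cut value ends with "...".
  def abbreviate(value: String, maxLen: Int): String =
    if (value.length <= maxLen) value else value.take(maxLen - 3) + "..."

  def simpleString(
      nodeNamePrefix: String,
      nodeName: String,
      output: Seq[String],
      metadata: Map[String, String],
      redact: String => String = identity): String = {
    // Sort entries by key, then redact and abbreviate every value
    val entries = metadata.toSeq.sorted.map { case (key, value) =>
      s"$key: ${abbreviate(redact(value), 100)}"
    }
    // prefix + node name + [output] + metadata entries
    s"$nodeNamePrefix$nodeName${output.mkString("[", ",", "]")} ${entries.mkString(", ")}"
  }
}
```

Called with an empty prefix, the node name "Scan rel " (with its trailing space), no output attributes and the PushedFilters and ReadSchema entries from the example above, the sketch produces a line of the same shape as the spark-shell output.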

=== [[verboseString]] verboseString Method

[source, scala]

verboseString: String

NOTE: verboseString is part of the QueryPlan Contract to...FIXME.

verboseString simply redacts sensitive information in the verboseString of the parent QueryPlan.

=== [[treeString]] Text Representation of All Nodes in Tree -- treeString Method

[source, scala]

treeString(
  verbose: Boolean,
  addSuffix: Boolean): String

treeString simply redacts sensitive information in the text representation of all nodes (in the query plan tree) of the parent TreeNode.

treeString is part of the TreeNode abstraction.
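Both verboseString and treeString follow the same pattern: ask the parent for its text and pass the result through redact. A minimal sketch of that pattern, with hypothetical PlanNode and ScanNode classes standing in for the parent and for DataSourceScanExec, and with the redaction function passed in as a parameter:

```scala
// Standalone sketch (no Spark dependency); all names below are hypothetical.
object RedactingOverridesSketch {
  // Hypothetical stand-in for the parent plan node
  class PlanNode {
    def verboseString: String = "Scan jdbc url=jdbc:postgresql://host/db?password=secret"
    // addSuffix is ignored in this toy parent
    def treeString(verbose: Boolean, addSuffix: Boolean): String =
      "+- " + (if (verbose) verboseString else "Scan jdbc")
  }

  // Every text representation of the parent goes through redact before it is returned
  class ScanNode(redact: String => String) extends PlanNode {
    override def verboseString: String = redact(super.verboseString)
    override def treeString(verbose: Boolean, addSuffix: Boolean): String =
      redact(super.treeString(verbose, addSuffix))
  }
}
```

The design keeps redaction in one place, so no text representation can leak a value that another one hides.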

=== [[redact]] Redacting Sensitive Information -- redact Internal Method

[source, scala]

redact(text: String): String


NOTE: redact is used when DataSourceScanExec is requested for the simple, verbose and tree text representations.
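The page does not describe how redact works internally. One plausible shape, shown here purely as an assumption, is regex-based replacement: every match of an optional pattern is replaced with a fixed placeholder. RedactSketch, the Replacement text and the sample pattern are all hypothetical.

```scala
import scala.util.matching.Regex

// Standalone sketch (no Spark dependency); all names below are hypothetical.
object RedactSketch {
  // Placeholder that replaces every match; the exact text is an assumption of this sketch
  val Replacement = "*********(redacted)"

  // Replaces every match of the pattern in text; without a pattern the text is returned as-is
  def redact(pattern: Option[Regex], text: String): String =
    pattern match {
      case Some(regex) if text != null && text.nonEmpty =>
        regex.replaceAllIn(text, Regex.quoteReplacement(Replacement))
      case _ => text
    }
}
```

For example, RedactSketch.redact(Some("secret\\w*".r), "password=secret123") replaces secret123 with the placeholder and leaves the rest of the text untouched.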

Last update: 2020-11-13