ColumnarBatch¶
ColumnarBatch
allows to work with multiple ColumnVectors as a row-wise table.
Example¶
import org.apache.spark.sql.types._
val schema = new StructType()
.add("intCol", IntegerType)
.add("doubleCol", DoubleType)
.add("intCol2", IntegerType)
.add("string", BinaryType)
val capacity = 4 * 1024 // 4k
import org.apache.spark.memory.MemoryMode
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
val columns = schema.fields.map { field =>
new OnHeapColumnVector(capacity, field.dataType)
}
import org.apache.spark.sql.vectorized.ColumnarBatch
val batch = new ColumnarBatch(columns.toArray)
// Add a row [1, 1.1, NULL]
columns(0).putInt(0, 1)
columns(1).putDouble(0, 1.1)
columns(2).putNull(0)
columns(3).putByteArray(0, "Hello".getBytes(java.nio.charset.StandardCharsets.UTF_8))
batch.setNumRows(1)
assert(batch.getRow(0).numFields == 4)
Creating Instance¶
ColumnarBatch
takes the following to be created:
- ColumnVectors
- Number of Rows
ColumnarBatch
immediately creates the internal MutableColumnarRow
.
ColumnarBatch
is created when:
RowToColumnarExec
unary physical operator is requested todoExecuteColumnar
- InMemoryTableScanExec leaf physical operator is requested for a RDD[ColumnarBatch]
MapInPandasExec
unary physical operator is requested todoExecute
OrcColumnarBatchReader
andVectorizedParquetRecordReader
are requested toinitBatch
PandasGroupUtils
utility is requested toexecutePython
ArrowConverters
utility is requested tofromBatchIterator
=== [[rowIterator]] Iterator Over InternalRows (in Batch) -- rowIterator
Method
[source, java]¶
Iterator rowIterator()¶
rowIterator
...FIXME
[NOTE]¶
rowIterator
is used when:
-
ArrowConverters
is requested tofromBatchIterator
-
AggregateInPandasExec
,WindowInPandasExec
, andFlatMapGroupsInPandasExec
physical operators are requested to execute (doExecute
)
* ArrowEvalPythonExec
physical operator is requested to evaluate
¶
=== [[setNumRows]] Specifying Number of Rows (in Batch) -- setNumRows
Method
[source, java]¶
void setNumRows(int numRows)¶
In essence, setNumRows
resets the batch and makes it available for reuse.
Internally, setNumRows
simply sets the <numRows
.
setNumRows
is used when:
-
OrcColumnarBatchReader
is requested tonextBatch
-
VectorizedParquetRecordReader
is requested to nextBatch (whenVectorizedParquetRecordReader
is requested to nextKeyValue) -
ColumnVectorUtils
is requested totoBatch
(for testing only) -
ArrowConverters
is requested tofromBatchIterator
-
InMemoryTableScanExec
physical operator is requested to <> -
ArrowPythonRunner is requested for a
ReaderIterator
(newReaderIterator
)