
Configuration Properties

Configuration properties (aka settings) allow you to fine-tune a Spark SQL application.

Configuration properties are set on a SparkSession while creating a new instance using the config method (e.g. spark.sql.warehouse.dir).

import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder
  .master("local[*]")
  .appName("My Spark Application")
  .config("spark.sql.warehouse.dir", "c:/Temp") // <1>
  .getOrCreate

You can also set a property using the SQL SET command.

assert(spark.conf.getOption("spark.sql.hive.metastore.version").isEmpty)

scala> spark.sql("SET spark.sql.hive.metastore.version=2.3.2").show(truncate = false)
+--------------------------------+-----+
|key                             |value|
+--------------------------------+-----+
|spark.sql.hive.metastore.version|2.3.2|
+--------------------------------+-----+

assert(spark.conf.get("spark.sql.hive.metastore.version") == "2.3.2")

spark.sql.adaptive.forceApply

(internal) When true (and spark.sql.adaptive.enabled is enabled), Spark will force-apply adaptive query execution to all supported queries.

Default: false

Since: 3.0.0

Use SQLConf.ADAPTIVE_EXECUTION_FORCE_APPLY to reference the property (in a type-safe way).

spark.sql.adaptive.logLevel

(internal) Log level for adaptive execution logging of plan changes. The value can be TRACE, DEBUG, INFO, WARN or ERROR.

Default: DEBUG

Since: 3.0.0

Use SQLConf.adaptiveExecutionLogLevel method to access the current value.

spark.sql.adaptive.coalescePartitions.enabled

Controls coalescing shuffle partitions

When true and spark.sql.adaptive.enabled is enabled, Spark will coalesce contiguous shuffle partitions according to the target size (specified by spark.sql.adaptive.advisoryPartitionSizeInBytes), to avoid too many small tasks.

Default: true

Since: 3.0.0

Use SQLConf.coalesceShufflePartitionsEnabled method to access the current value.

spark.sql.adaptive.advisoryPartitionSizeInBytes

The advisory size in bytes of the shuffle partition during adaptive optimization (when spark.sql.adaptive.enabled is enabled). It takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partitions.

Default: 64MB

Since: 3.0.0

Fallback Property: spark.sql.adaptive.shuffle.targetPostShuffleInputSize

Use SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES to reference the name.
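
For illustration, a minimal sketch of turning on adaptive coalescing of shuffle partitions for a session; the property values below are arbitrary, not recommendations.

import org.apache.spark.sql.functions.col

spark.conf.set("spark.sql.adaptive.enabled", true)
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", true)
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "32m")

// With AQE on, the number of post-shuffle partitions is decided at runtime
// based on the advisory partition size, not spark.sql.shuffle.partitions.
val counts = spark.range(0, 1000000, 1, 10)
  .groupBy((col("id") % 7).as("bucket"))
  .count()
counts.collect()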

spark.sql.adaptive.coalescePartitions.minPartitionNum

The minimum number of shuffle partitions after coalescing. If not set, the default value is the default parallelism of the Spark cluster. This configuration only has an effect when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled.

Default: (undefined)

Since: 3.0.0

spark.sql.adaptive.coalescePartitions.initialPartitionNum

The initial number of shuffle partitions before coalescing.

If not set, it defaults to spark.sql.shuffle.partitions. This configuration only has an effect when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled.

Default: (undefined)

Since: 3.0.0

spark.sql.adaptive.enabled

Enables Adaptive Query Execution

Default: false

Since: 1.6.0

Use SQLConf.adaptiveExecutionEnabled method to access the current value.

spark.sql.adaptive.fetchShuffleBlocksInBatch

(internal) Whether to fetch contiguous shuffle blocks in batch. Instead of fetching blocks one by one, fetching contiguous shuffle blocks for the same map task in batch can reduce IO and improve performance. Note that multiple contiguous blocks exist in a single fetch request only when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled. This feature also depends on a relocatable serializer, the concatenation-supporting codec in use, and the new version shuffle fetch protocol.

Default: true

Since: 3.0.0

Use SQLConf.fetchShuffleBlocksInBatch method to access the current value.

spark.sql.adaptive.localShuffleReader.enabled

When true and spark.sql.adaptive.enabled is enabled, Spark tries to use local shuffle reader to read the shuffle data when the shuffle partitioning is not needed, for example, after converting sort-merge join to broadcast-hash join.

Default: true

Since: 3.0.0

spark.sql.adaptive.skewJoin.enabled

When true and spark.sql.adaptive.enabled is enabled, Spark dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed partitions.

Default: true

Since: 3.0.0

Use SQLConf.SKEW_JOIN_ENABLED to reference the property.

spark.sql.adaptive.skewJoin.skewedPartitionFactor

A partition is considered skewed if its size is larger than this factor multiplying the median partition size and also larger than spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes.

Default: 5

Since: 3.0.0

spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes

A partition is considered skewed if its size in bytes is larger than this threshold and also larger than spark.sql.adaptive.skewJoin.skewedPartitionFactor multiplying the median partition size. Ideally this config should be set larger than spark.sql.adaptive.advisoryPartitionSizeInBytes.

Default: 256MB

Since: 3.0.0
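
For illustration only, the skew-detection conditions described by the two properties above can be expressed as a tiny helper (hypothetical names; this is not Spark's internal code):

// Hypothetical helper mirroring the skew conditions described above
def isSkewed(
    partitionSize: Long,
    medianSize: Long,
    skewedPartitionFactor: Int = 5,                      // skewJoin.skewedPartitionFactor
    skewedPartitionThreshold: Long = 256L * 1024 * 1024  // skewJoin.skewedPartitionThresholdInBytes
  ): Boolean =
  partitionSize > skewedPartitionFactor * medianSize &&
    partitionSize > skewedPartitionThreshold

// A 2 GB partition against a 100 MB median is skewed with the defaults
assert(isSkewed(2L * 1024 * 1024 * 1024, 100L * 1024 * 1024))
// A 300 MB partition against a 100 MB median is not (the factor condition fails)
assert(!isSkewed(300L * 1024 * 1024, 100L * 1024 * 1024))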

spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin

(internal) A relation with a non-empty partition ratio (the number of non-empty partitions to all partitions) lower than this config will not be considered as the build side of a broadcast-hash join in Adaptive Query Execution regardless of the size.

This configuration only has an effect when spark.sql.adaptive.enabled is true.

Default: 0.2

Since: 3.0.0

Use SQLConf.nonEmptyPartitionRatioForBroadcastJoin method to access the current value.

spark.sql.analyzer.maxIterations

(internal) The max number of iterations the analyzer runs.

Default: 100

Since: 3.0.0

spark.sql.analyzer.failAmbiguousSelfJoin

(internal) When true, fail the Dataset query if it contains an ambiguous self-join.

Default: true

Since: 3.0.0

spark.sql.ansi.enabled

When true, Spark tries to conform to the ANSI SQL specification:

  1. Spark will throw a runtime exception if an overflow occurs in any operation on an integral/decimal field.
  2. Spark will forbid using the reserved keywords of ANSI SQL as identifiers in the SQL parser.

Default: false

Since: 3.0.0
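
A minimal sketch of the runtime-overflow behavior, assuming Spark 3.0 semantics (the query is illustrative):

import scala.util.Try

spark.conf.set("spark.sql.ansi.enabled", false)
spark.sql("SELECT 2147483647 + 1").show()  // wraps around silently to -2147483648

spark.conf.set("spark.sql.ansi.enabled", true)
// With ANSI mode on, the same query is expected to fail with an arithmetic (overflow) error
assert(Try(spark.sql("SELECT 2147483647 + 1").collect()).isFailure)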

spark.sql.codegen.wholeStage

(internal) Whether the whole stage (of multiple physical operators) will be compiled into a single Java method (true) or not (false).

Default: true

Use SQLConf.wholeStageEnabled method to access the current value.

spark.sql.codegen.methodSplitThreshold

(internal) The threshold of source-code splitting in the codegen. When the number of characters in a single Java function (without comments) exceeds the threshold, the function will be automatically split into multiple smaller ones. We cannot know how much bytecode will be generated, so the code length is used as the metric. When running on HotSpot, a function's bytecode should not go beyond 8KB, otherwise it will not be JITted; it also should not be too small, otherwise there will be many function calls.

Default: 1024

Since: 3.0.0

spark.sql.debug.maxToStringFields

Maximum number of fields of sequence-like entries that can be converted to strings in debug output. Any elements beyond the limit will be dropped and replaced by a "... N more fields" placeholder.

Default: 25

Since: 3.0.0

Use SQLConf.maxToStringFields method to access the current value.

spark.sql.defaultCatalog

Name of the default catalog

Default: spark_catalog

Use SQLConf.DEFAULT_CATALOG to access the current value.

Since: 3.0.0

spark.sql.execution.arrow.pyspark.enabled

When true, make use of Apache Arrow for columnar data transfers in PySpark. This optimization applies to:

  1. pyspark.sql.DataFrame.toPandas
  2. pyspark.sql.SparkSession.createDataFrame when its input is a Pandas DataFrame

The following data types are unsupported: BinaryType, MapType, ArrayType of TimestampType, and nested StructType.

Default: false

Since: 3.0.0

spark.sql.execution.reuseSubquery

(internal) When true, the planner will try to find out duplicated subqueries and re-use them.

Default: true

Since: 3.0.0

spark.sql.execution.sortBeforeRepartition

(internal) When performing a repartition following a shuffle, the output row ordering is nondeterministic. If some downstream stages fail and some tasks of the repartition stage retry, these tasks may generate different data, and that can lead to correctness issues. Turn on this config to insert a local sort before actually doing the repartition to generate consistent repartition results. The performance of repartition() may go down since an extra local sort is inserted before it.

Default: true

Since: 2.1.4

Use SQLConf.sortBeforeRepartition method to access the current value.

spark.sql.execution.rangeExchange.sampleSizePerPartition

(internal) Number of points to sample per partition in order to determine the range boundaries for range partitioning, typically used in global sorting (without limit).

Default: 100

Since: 2.3.0

Use SQLConf.rangeExchangeSampleSizePerPartition method to access the current value.

spark.sql.execution.arrow.pyspark.fallback.enabled

When true, optimizations enabled by spark.sql.execution.arrow.pyspark.enabled will fall back automatically to non-optimized implementations if an error occurs.

Default: true

Since: 3.0.0

spark.sql.execution.arrow.sparkr.enabled

When true, make use of Apache Arrow for columnar data transfers in SparkR. This optimization applies to:

  1. createDataFrame when its input is an R DataFrame
  2. collect
  3. dapply
  4. gapply

The following data types are unsupported: FloatType, BinaryType, ArrayType, StructType and MapType.

Default: false

Since: 3.0.0

spark.sql.execution.pandas.udf.buffer.size

Same as spark.buffer.size but only applies to Pandas UDF executions. If it is not set, the fallback is spark.buffer.size. Note that Pandas execution requires more than 4 bytes. Lowering this value could make small Pandas UDF batches iterated and pipelined; however, it might degrade performance. See SPARK-27870.

Default: 65536

Since: 3.0.0

spark.sql.execution.pandas.convertToArrowArraySafely

(internal) When true, Arrow performs safe type conversion when converting pandas.Series to an Arrow array during serialization, and raises errors when it detects unsafe type conversions like overflow. When false, Arrow's type checks are disabled and type conversions are done anyway. This config only works for Arrow 0.11.0+.

Default: false

Since: 3.0.0

spark.sql.statistics.histogram.enabled

Enables generating histograms for the ANALYZE TABLE SQL statement

Default: false

Equi-Height Histogram

Histograms can provide better estimation accuracy. Currently, Spark only supports equi-height histogram. Note that collecting histograms takes extra cost. For example, collecting column statistics usually takes only one table scan, but generating equi-height histogram will cause an extra table scan.

Use SQLConf.histogramEnabled method to access the current value.
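
A sketch of generating an equi-height histogram with ANALYZE TABLE (the table name is hypothetical):

spark.conf.set("spark.sql.statistics.histogram.enabled", true)

spark.range(1000).write.saveAsTable("demo_numbers")
spark.sql("ANALYZE TABLE demo_numbers COMPUTE STATISTICS FOR COLUMNS id")

// Column-level statistics (including the histogram, when enabled) show up in DESC EXTENDED
spark.sql("DESC EXTENDED demo_numbers id").show(truncate = false)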

spark.sql.session.timeZone

The ID of session-local timezone (e.g. "GMT", "America/Los_Angeles")

Default: Java's TimeZone.getDefault.getID

Use SQLConf.sessionLocalTimeZone method to access the current value.
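
A sketch of how the session time zone affects how a timestamp is rendered (the underlying instant does not change):

spark.conf.set("spark.sql.session.timeZone", "UTC")
val ts = spark.sql("SELECT timestamp'2020-01-01 00:00:00' AS ts")
ts.show()  // 2020-01-01 00:00:00

spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
ts.show()  // the same instant, expected to render as 2019-12-31 16:00:00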

spark.sql.sources.commitProtocolClass

(internal) Fully-qualified class name of the FileCommitProtocol

Default: SQLHadoopMapReduceCommitProtocol

Use SQLConf.fileCommitProtocolClass method to access the current value.

spark.sql.sources.ignoreDataLocality

(internal) When true, Spark will not fetch the block locations for each file on listing files. This speeds up file listing, but the scheduler cannot schedule tasks to take advantage of data locality. It can be particularly useful if data is read from a remote cluster so the scheduler could never take advantage of locality anyway.

Default: false

Since: 3.0.0

spark.sql.sources.validatePartitionColumns

(internal) When this option is set to true, partition column values will be validated against the user-specified schema. If the validation fails, a runtime exception is thrown. When this option is set to false, the partition column value will be converted to null if it cannot be cast to the corresponding user-specified schema.

Default: true

Since: 3.0.0

spark.sql.sources.useV1SourceList

(internal) A comma-separated list of data source short names (DataSourceRegisters) or fully-qualified canonical class names of the data sources (TableProviders) for which DataSource V2 code path is disabled (and Data Source V1 code path used).

Default: avro,csv,json,kafka,orc,parquet,text

Since: 3.0.0


spark.sql.storeAssignmentPolicy

When inserting a value into a column with a different data type, Spark performs type coercion. Three policies for the type coercion rules are currently supported:

  • ANSI - Spark performs the type coercion as per ANSI SQL. In practice, the behavior is mostly the same as PostgreSQL. It disallows certain unreasonable type conversions, such as converting string to int or double to boolean.

  • LEGACY - Spark allows the type coercion as long as it is a valid Cast, which is very loose, e.g. converting string to int or double to boolean is allowed. It is the only behavior in Spark 2.x and is compatible with Hive.

  • STRICT - Spark doesn't allow any possible precision loss or data truncation in type coercion, e.g. converting double to int or decimal to double is not allowed.

Possible values: ANSI, LEGACY, STRICT

Default: ANSI

Since: 3.0.0
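
A sketch of the difference between the LEGACY and ANSI policies when storing a string into an INT column (the table name is hypothetical):

import scala.util.Try

spark.sql("CREATE TABLE store_demo (i INT) USING parquet")

spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
spark.sql("INSERT INTO store_demo VALUES ('1')")  // loose Cast: allowed

spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
// Under ANSI, the string-to-int store assignment is expected to be rejected at analysis time
assert(Try(spark.sql("INSERT INTO store_demo VALUES ('1')")).isFailure)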

spark.sql.optimizer.inSetSwitchThreshold

(internal) Configures the max set size in InSet for which Spark will generate code with switch statements. This is applicable only to bytes, shorts, ints, dates.

Must be non-negative and less than or equal to 600.

Default: 400

Since: 3.0.0

spark.sql.optimizer.planChangeLog.level

(internal) Configures the log level for logging the change from the original plan to the new plan after a rule or batch is applied. The value can be TRACE, DEBUG, INFO, WARN or ERROR.

Default: TRACE

Since: 3.0.0

spark.sql.optimizer.planChangeLog.rules

(internal) Configures a list of rules to be logged in the optimizer, in which the rules are specified by their rule names and separated by comma.

Default: (undefined)

Since: 3.0.0

spark.sql.optimizer.planChangeLog.batches

(internal) Configures a list of batches to be logged in the optimizer, in which the batches are specified by their batch names and separated by comma.

Default: (undefined)

Since: 3.0.0

spark.sql.optimizer.dynamicPartitionPruning.enabled

When true (default), Spark SQL generates predicates for partition columns used as join keys.

Default: true

Use SQLConf.dynamicPartitionPruningEnabled to access the current value.

Since: 3.0.0

spark.sql.optimizer.dynamicPartitionPruning.useStats

(internal) When true, distinct count statistics will be used for computing the data size of the partitioned table after dynamic partition pruning, in order to evaluate if it is worth adding an extra subquery as the pruning filter if broadcast reuse is not applicable.

Default: true

Since: 3.0.0

Use SQLConf.dynamicPartitionPruningUseStats method to access the current value.

spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio

(internal) When statistics are not available or configured not to be used, this config will be used as the fallback filter ratio for computing the data size of the partitioned table after dynamic partition pruning, in order to evaluate if it is worth adding an extra subquery as the pruning filter if broadcast reuse is not applicable.

Default: 0.5

Since: 3.0.0

Use SQLConf.dynamicPartitionPruningFallbackFilterRatio method to access the current value.

spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly

(internal) When true, dynamic partition pruning will only apply when the broadcast exchange of a broadcast hash join operation can be reused as the dynamic pruning filter.

Default: true

Since: 3.0.0

Use SQLConf.dynamicPartitionPruningReuseBroadcastOnly method to access the current value.

spark.sql.optimizer.nestedPredicatePushdown.supportedFileSources

(internal) A comma-separated list of data source short names or fully qualified data source implementation class names for which Spark tries to push down predicates for nested columns and/or names containing dots to data sources. This configuration is only effective with file-based data source in DSv1. Currently, Parquet implements both optimizations while ORC only supports predicates for names containing dots. The other data sources don't support this feature yet.

Default: parquet,orc

Since: 3.0.0

spark.sql.optimizer.serializer.nestedSchemaPruning.enabled

(internal) Prune nested fields from object serialization operator which are unnecessary in satisfying a query. This optimization allows object serializers to avoid executing unnecessary nested expressions.

Default: true

Since: 3.0.0

spark.sql.optimizer.expression.nestedPruning.enabled

(internal) Prune nested fields from expressions in an operator which are unnecessary in satisfying a query. Note that this optimization doesn't prune nested fields from physical data source scanning. For pruning nested fields from scanning, please use spark.sql.optimizer.nestedSchemaPruning.enabled config.

Default: true

Since: 3.0.0

spark.sql.orc.mergeSchema

When true, the Orc data source merges schemas collected from all data files, otherwise the schema is picked from a random data file.

Default: false

Since: 3.0.0

spark.sql.datetime.java8API.enabled

When true, java.time.Instant and java.time.LocalDate classes of Java 8 API are used as external types for Catalyst's TimestampType and DateType. When false, java.sql.Timestamp and java.sql.Date are used for the same purpose.

Default: false

Since: 3.0.0

spark.sql.sources.binaryFile.maxLength

(internal) The max length of a file that can be read by the binary file data source. Spark will fail fast and not attempt to read the file if its length exceeds this value. The theoretical max is Int.MaxValue, though VMs might implement a smaller max.

Default: Int.MaxValue

Since: 3.0.0

spark.sql.mapKeyDedupPolicy

The policy to deduplicate map keys in the built-in functions CreateMap, MapFromArrays, MapFromEntries, StringToMap, MapConcat and TransformKeys. When EXCEPTION, the query fails if duplicated map keys are detected. When LAST_WIN, the map key that is inserted last takes precedence.

Possible values: EXCEPTION, LAST_WIN

Default: EXCEPTION

Since: 3.0.0

spark.sql.maven.additionalRemoteRepositories

A comma-delimited string config of the optional additional remote Maven mirror repositories. This is only used for downloading Hive jars in IsolatedClientLoader if the default Maven Central repo is unreachable.

Default: https://maven-central.storage-download.googleapis.com/maven2/

Since: 3.0.0

spark.sql.maxPlanStringLength

Maximum number of characters to output for a plan string. If the plan is longer, further output will be truncated. The default setting always generates a full plan. Set this to a lower value such as 8k if plan strings are taking up too much memory or are causing OutOfMemory errors in the driver or UI processes.

Default: Integer.MAX_VALUE - 15

Since: 3.0.0

spark.sql.addPartitionInBatch.size

(internal) The number of partitions to be handled in one turn when using AlterTableAddPartitionCommand to add partitions to a table. The smaller the batch size, the less memory is required for the real handler, e.g. Hive Metastore.

Default: 100

Since: 3.0.0

spark.sql.scriptTransformation.exitTimeoutInSeconds

(internal) Timeout for an executor to wait for the termination of a transformation script on EOF.

Default: 10 seconds

Since: 3.0.0

spark.sql.autoBroadcastJoinThreshold

Maximum size (in bytes) for a table that will be broadcast to all worker nodes when performing a join.

Default: 10L * 1024 * 1024 (10M)

If the size of the statistics of the logical plan of a table is at most the setting, the DataFrame is broadcast for join.

Negative values or 0 disable broadcasting.

Use SQLConf.autoBroadcastJoinThreshold method to access the current value.
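
A sketch of how the threshold changes the join strategy (the datasets are illustrative):

val left = spark.range(1000000).withColumnRenamed("id", "k")
val right = spark.range(100).withColumnRenamed("id", "k")

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
left.join(right, "k").explain()  // expect BroadcastHashJoin (the small side fits the threshold)

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)
left.join(right, "k").explain()  // broadcasting disabled; expect SortMergeJoin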

spark.sql.avro.compression.codec

The compression codec to use when writing Avro data to disk

Default: snappy

The supported codecs are:

  • uncompressed
  • deflate
  • snappy
  • bzip2
  • xz

Use SQLConf.avroCompressionCodec method to access the current value.

spark.sql.broadcastTimeout

Timeout in seconds for the broadcast wait time in broadcast joins.

Default: 5 * 60

When negative, it is assumed infinite (i.e. Duration.Inf)

Use SQLConf.broadcastTimeout method to access the current value.

spark.sql.caseSensitive

(internal) Controls whether the query analyzer should be case sensitive (true) or not (false).

Default: false

It is highly discouraged to turn on case sensitive mode.

Use SQLConf.caseSensitiveAnalysis method to access the current value.

spark.sql.catalog.spark_catalog

A catalog implementation that will be used as the v2 interface to Spark's built-in v1 catalog: spark_catalog. This catalog shares its identifier namespace with the spark_catalog and must be consistent with it; for example, if a table can be loaded by the spark_catalog, this catalog must also return the table metadata. To delegate operations to the spark_catalog, implementations can extend 'CatalogExtension'.

Default: (undefined)

Since: 3.0.0

spark.sql.cbo.enabled

Enables Cost-Based Optimization (CBO) for estimation of plan statistics when true.

Default: false

Use SQLConf.cboEnabled method to access the current value.

spark.sql.cbo.joinReorder.enabled

Enables join reorder for cost-based optimization (CBO).

Default: false

Use SQLConf.joinReorderEnabled method to access the current value.

spark.sql.cbo.planStats.enabled

When true, the logical plan will fetch row counts and column statistics from catalog.

Default: false

Since: 3.0.0

spark.sql.cbo.starSchemaDetection

Enables join reordering based on star schema detection for cost-based optimization (CBO) in ReorderJoin logical plan optimization.

Default: false

Use SQLConf.starSchemaDetection method to access the current value.

spark.sql.codegen.aggregate.map.vectorized.enable

(internal) Enables vectorized aggregate hash map. This is for testing/benchmarking only.

Default: false

Since: 3.0.0

spark.sql.codegen.aggregate.splitAggregateFunc.enabled

(internal) When true, the code generator would split aggregate code into individual methods instead of a single big method. This can be used to avoid oversized functions that miss the opportunity for JIT optimization.

Default: true

Since: 3.0.0

spark.sql.codegen.comments

Controls whether CodegenContext should register comments (true) or not (false).

Default: false

spark.sql.codegen.factoryMode

(internal) Determines the codegen generator fallback behavior

Default: FALLBACK

Acceptable values:

  • CODEGEN_ONLY - disable fallback mode
  • FALLBACK - try codegen first and, if any compile error happens, fallback to interpreted mode
  • NO_CODEGEN - skips codegen and always uses interpreted path

Used when CodeGeneratorWithInterpretedFallback is requested to createObject (when UnsafeProjection is requested to create an UnsafeProjection for Catalyst expressions)

spark.sql.codegen.fallback

(internal) Whether the whole-stage codegen could be temporarily disabled for the part of a query that has failed to compile generated code (true) or not (false).

Default: true

Use SQLConf.wholeStageFallback method to access the current value.

spark.sql.codegen.hugeMethodLimit

(internal) The maximum bytecode size of a single compiled Java function generated by whole-stage codegen.

Default: 65535

The default value 65535 is the largest bytecode size possible for a valid Java method. When running on HotSpot, it may be preferable to set the value to 8000 (which is the value of HugeMethodLimit in the OpenJDK JVM settings)

Use SQLConf.hugeMethodLimit method to access the current value.

spark.sql.codegen.useIdInClassName

(internal) Controls whether to embed the (whole-stage) codegen stage ID into the class name of the generated class as a suffix (true) or not (false)

Default: true

Use SQLConf.wholeStageUseIdInClassName method to access the current value.

spark.sql.codegen.maxFields

(internal) Maximum number of output fields (including nested fields) that whole-stage codegen supports. Going above the number deactivates whole-stage codegen.

Default: 100

Use SQLConf.wholeStageMaxNumFields method to access the current value.

spark.sql.codegen.splitConsumeFuncByOperator

(internal) Controls whether whole-stage codegen puts the logic of consuming rows of each physical operator into individual methods, instead of a single big method. This can be used to avoid oversized functions that miss the opportunity for JIT optimization.

Default: true

Use SQLConf.wholeStageSplitConsumeFuncByOperator method to access the current value.

spark.sql.columnVector.offheap.enabled

(internal) Enables OffHeapColumnVector in ColumnarBatch (true) or not (false). When false, OnHeapColumnVector is used instead.

Default: false

Use SQLConf.offHeapColumnVectorEnabled method to access the current value.

spark.sql.columnNameOfCorruptRecord

spark.sql.constraintPropagation.enabled

(internal) When true, the query optimizer will infer and propagate data constraints in the query plan to optimize them. Constraint propagation can sometimes be computationally expensive for certain kinds of query plans (such as those with a large number of predicates and aliases) which might negatively impact overall runtime.

Default: true

Use SQLConf.constraintPropagationEnabled method to access the current value.

spark.sql.csv.filterPushdown.enabled

(internal) When true, enable filter pushdown to CSV datasource.

Default: true

Since: 3.0.0

spark.sql.defaultSizeInBytes

(internal) Estimated size of a table or relation used in query planning

Default: Java's Long.MaxValue

Set to Java's Long.MaxValue which is larger than spark.sql.autoBroadcastJoinThreshold to be more conservative. That is to say by default the optimizer will not choose to broadcast a table unless it knows for sure that the table size is small enough.

Used by the planner to decide when it is safe to broadcast a relation. By default, the system will assume that tables are too large to broadcast.

Use SQLConf.defaultSizeInBytes method to access the current value.

spark.sql.dialect

spark.sql.exchange.reuse

(internal) When enabled (true), the Spark planner will find duplicated exchanges and subqueries and re-use them.

When disabled (false), ReuseExchange and ReuseSubquery physical optimizations (that the Spark planner uses for physical query plan optimization) do nothing.

Default: true

Use SQLConf.exchangeReuseEnabled method to access the current value.

spark.sql.execution.useObjectHashAggregateExec

Enables ObjectHashAggregateExec when Aggregation execution planning strategy is executed.

Default: true

Use SQLConf.useObjectHashAggregation method to access the current value.

spark.sql.files.ignoreCorruptFiles

Controls whether to ignore corrupt files (true) or not (false). If true, the Spark jobs will continue to run when encountering corrupted files and the contents that have been read will still be returned.

Default: false

Use SQLConf.ignoreCorruptFiles method to access the current value.

spark.sql.files.ignoreMissingFiles

Controls whether to ignore missing files (true) or not (false). If true, the Spark jobs will continue to run when encountering missing files and the contents that have been read will still be returned.

Default: false

Use SQLConf.ignoreMissingFiles method to access the current value.

spark.sql.files.maxRecordsPerFile

Maximum number of records to write out to a single file. If this value is 0 or negative, there is no limit.

Default: 0

Use SQLConf.maxRecordsPerFile method to access the current value.

spark.sql.files.maxPartitionBytes

The maximum number of bytes to pack into a single partition when reading files.

Default: 128 * 1024 * 1024 (which corresponds to parquet.block.size)

Use SQLConf.filesMaxPartitionBytes method to access the current value.

spark.sql.files.openCostInBytes

(internal) The estimated cost to open a file, measured by the number of bytes that could be scanned at the same time (used to pack multiple files into a partition).

Default: 4 * 1024 * 1024

It is better to overestimate it; then the partitions with small files will be faster than the partitions with bigger files (which are scheduled first).

Use SQLConf.filesOpenCostInBytes method to access the current value.

spark.sql.inMemoryColumnarStorage.compressed

When enabled, Spark SQL will automatically select a compression codec for each column based on statistics of the data.

Default: true

Use SQLConf.useCompression method to access the current value.

spark.sql.inMemoryColumnarStorage.batchSize

Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization and compression, but risk OOMs when caching data.

Default: 10000

Use SQLConf.columnBatchSize method to access the current value.

spark.sql.inMemoryTableScanStatistics.enable

(internal) When true, enable in-memory table scan accumulators.

Default: false

Since: 3.0.0

spark.sql.inMemoryColumnarStorage.enableVectorizedReader

Enables vectorized reader for columnar caching.

Default: true

Use SQLConf.cacheVectorizedReaderEnabled method to access the current value.

spark.sql.inMemoryColumnarStorage.partitionPruning

(internal) Enables partition pruning for in-memory columnar tables

Default: true

Use SQLConf.inMemoryPartitionPruning method to access the current value.

spark.sql.join.preferSortMergeJoin

(internal) Controls whether JoinSelection execution planning strategy prefers sort merge join over shuffled hash join.

Default: true

Use SQLConf.preferSortMergeJoin method to access the current value.

spark.sql.jsonGenerator.ignoreNullFields

Whether to ignore null fields when generating JSON objects in JSON data source and JSON functions such as to_json. If false, it generates null for null fields in JSON objects.

Default: true

Since: 3.0.0

spark.sql.legacy.doLooseUpcast

(internal) When true, the upcast will be loose and allows string to atomic types.

Default: false

Since: 3.0.0

spark.sql.legacy.ctePrecedencePolicy

(internal) When LEGACY, outer CTE definitions take precedence over inner definitions. When CORRECTED, inner CTE definitions take precedence. When EXCEPTION (the default), an AnalysisException is thrown if a name conflict is detected in nested CTEs. This config will be removed in future versions, and CORRECTED will be the only behavior.

Possible values: EXCEPTION, LEGACY, CORRECTED

Default: EXCEPTION

Since: 3.0.0

spark.sql.legacy.timeParserPolicy

(internal) When LEGACY, java.text.SimpleDateFormat is used for formatting and parsing dates/timestamps in a locale-sensitive manner, which is the approach before Spark 3.0. When CORRECTED, classes from the java.time.* packages are used for the same purpose. When EXCEPTION (the default), a RuntimeException is thrown when the two approaches would produce different results.

Possible values: EXCEPTION, LEGACY, CORRECTED

Default: EXCEPTION

Since: 3.0.0

spark.sql.legacy.followThreeValuedLogicInArrayExists

(internal) When true, the ArrayExists will follow the three-valued boolean logic.

Default: true

Since: 3.0.0

spark.sql.legacy.fromDayTimeString.enabled

(internal) When true, the from bound is not taken into account in the conversion of a day-time string to an interval, and the to bound is used to skip all interval units out of the specified range. When false, a ParseException is thrown if the input does not match the pattern defined by from and to.

Default: false

Since: 3.0.0

spark.sql.legacy.notReserveProperties

(internal) When true, database and table properties are not reserved, and all of them are available for CREATE/ALTER syntaxes. Be aware, however, that the reserved properties will be silently removed.

Default: false

Since: 3.0.0

spark.sql.legacy.addSingleFileInAddFile

(internal) When true, only a single file can be added using ADD FILE. If false, then users can add directory by passing directory path to ADD FILE.

Default: false

Since: 3.0.0

spark.sql.legacy.exponentLiteralAsDecimal.enabled

(internal) When true, a literal with an exponent (e.g. 1E-30) would be parsed as Decimal rather than Double.

Default: false

Since: 3.0.0

spark.sql.legacy.allowNegativeScaleOfDecimal

(internal) When true, negative scale of Decimal type is allowed. For example, the type of number 1E10BD under legacy mode is DecimalType(2, -9), but is Decimal(11, 0) in non legacy mode.

Default: false

Since: 3.0.0

spark.sql.legacy.bucketedTableScan.outputOrdering

(internal) When true, the bucketed table scan will list files during planning to figure out the output ordering, which is expensive and may make the planning quite slow.

Default: false

Since: 3.0.0

spark.sql.legacy.json.allowEmptyString.enabled

(internal) When true, the parser of JSON data source treats empty strings as null for some data types such as IntegerType.

Default: false

Since: 3.0.0

spark.sql.legacy.createEmptyCollectionUsingStringType

(internal) When true, Spark returns an empty collection with StringType as element type if the array/map function is called without any parameters. Otherwise, Spark returns an empty collection with NullType as element type.

Default: false

Since: 3.0.0

spark.sql.legacy.allowUntypedScalaUDF

(internal) When true, user is allowed to use org.apache.spark.sql.functions.udf(f: AnyRef, dataType: DataType). Otherwise, an exception will be thrown at runtime.

Default: false

Since: 3.0.0

spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue

(internal) When true, the key attribute resulting from running Dataset.groupByKey on a non-struct key type is named value, following the behavior of Spark 2.4 and earlier.

Default: false

Since: 3.0.0

spark.sql.legacy.setCommandRejectsSparkCoreConfs

(internal) If it is set to true, SET command will fail when the key is registered as a SparkConf entry.

Default: true

Since: 3.0.0

spark.sql.legacy.typeCoercion.datetimeToString.enabled

(internal) When true, dates/timestamps are cast to string in binary comparisons with strings.

Default: false

Since: 3.0.0

spark.sql.legacy.allowHashOnMapType

(internal) When true, hash expressions can be applied on elements of MapType. Otherwise, an analysis exception will be thrown.

Default: false

Since: 3.0.0

spark.sql.legacy.parquet.datetimeRebaseModeInWrite

(internal) When LEGACY, Spark will rebase dates/timestamps from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar when writing Parquet files. When CORRECTED, Spark will not rebase and will write the dates/timestamps as they are. When EXCEPTION (the default), Spark will fail the writing if it sees ancient dates/timestamps that are ambiguous between the two calendars.

Possible values: EXCEPTION, LEGACY, CORRECTED

Default: EXCEPTION

Since: 3.0.0

spark.sql.legacy.parquet.datetimeRebaseModeInRead

(internal) When LEGACY, Spark will rebase dates/timestamps from the legacy hybrid (Julian + Gregorian) calendar to the Proleptic Gregorian calendar when reading Parquet files. When CORRECTED, Spark will not rebase and will read the dates/timestamps as they are. When EXCEPTION (the default), Spark will fail the reading if it sees ancient dates/timestamps that are ambiguous between the two calendars. This config is only effective if the writer info (like Spark, Hive) of the Parquet files is unknown.

Possible values: EXCEPTION, LEGACY, CORRECTED

Default: EXCEPTION

Since: 3.0.0

spark.sql.legacy.avro.datetimeRebaseModeInWrite

(internal) When LEGACY, Spark will rebase dates/timestamps from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar when writing Avro files. When CORRECTED, Spark will not rebase and will write the dates/timestamps as they are. When EXCEPTION (the default), Spark will fail the writing if it sees ancient dates/timestamps that are ambiguous between the two calendars.

Possible values: EXCEPTION, LEGACY, CORRECTED

Default: EXCEPTION

Since: 3.0.0

spark.sql.legacy.avro.datetimeRebaseModeInRead

(internal) When LEGACY, Spark will rebase dates/timestamps from the legacy hybrid (Julian + Gregorian) calendar to the Proleptic Gregorian calendar when reading Avro files. When CORRECTED, Spark will not rebase and will read the dates/timestamps as they are. When EXCEPTION (the default), Spark will fail the reading if it sees ancient dates/timestamps that are ambiguous between the two calendars. This config is only effective if the writer info (like Spark, Hive) of the Avro files is unknown.

Possible values: EXCEPTION, LEGACY, CORRECTED

Default: EXCEPTION

Since: 3.0.0

spark.sql.legacy.rdd.applyConf

(internal) Enables propagation of SQL configurations when executing operations on the RDD that represents a structured query. This is the (buggy) behavior up to 2.4.4.

Default: true

This is for cases not tracked by SQL execution, when a Dataset is converted to an RDD either using the Dataset.rdd operation or QueryExecution, and then the returned RDD is used to invoke actions on it.

This config is deprecated and will be removed in 3.0.0.

spark.sql.legacy.replaceDatabricksSparkAvro.enabled

Enables resolving (mapping) the data source provider com.databricks.spark.avro to the built-in (but external) Avro data source module for backward compatibility.

Default: true

Use SQLConf.replaceDatabricksSparkAvroEnabled method to access the current value.

spark.sql.limit.scaleUpFactor

(internal) Minimal increase rate in the number of partitions between attempts when executing take operator on a structured query. Higher values lead to more partitions read. Lower values might lead to longer execution times as more jobs will be run.

Default: 4

Use SQLConf.limitScaleUpFactor method to access the current value.

spark.sql.optimizer.excludedRules

Comma-separated list of fully-qualified class names of the optimization rules that should be disabled (excluded) from logical query optimization.

Default: (empty)

Use SQLConf.optimizerExcludedRules method to access the current value.

Important

It is not guaranteed that all the rules to be excluded will eventually be excluded, as some rules are non-excludable.
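
A sketch of excluding a single rule (ConstantFolding is used purely as an example of a fully-qualified rule name):

spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.ConstantFolding")

// With the rule excluded, literal expressions such as 1 + 1 are no longer
// pre-computed during logical optimization (visible in the optimized plan).
spark.sql("SELECT 1 + 1 AS two").explain(true)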

spark.sql.optimizer.inSetConversionThreshold

(internal) The threshold of set size for InSet conversion.

Default: 10

Use SQLConf.optimizerInSetConversionThreshold method to access the current value.

spark.sql.optimizer.maxIterations

Maximum number of iterations for Analyzer and Logical Optimizer.

Default: 100

spark.sql.optimizer.replaceExceptWithFilter

(internal) When true, the apply function of the rule verifies whether the right node of the except operation is of type Filter or Project followed by Filter. If yes, the rule further verifies the following conditions:

  1. Excluding the filter operations from the right (as well as the left node, if any) on the top, both nodes evaluate to the same result.
  2. The left and right nodes don't contain any SubqueryExpressions.
  3. The output column names of the left node are distinct.

If all the conditions are met, the rule replaces the except operation with a Filter by flipping the filter condition(s) of the right node.

Default: true

spark.sql.optimizer.nestedSchemaPruning.enabled

(internal) Prune nested fields from the output of a logical relation that are not necessary in satisfying a query. This optimization allows columnar file format readers to avoid reading unnecessary nested column data.

Default: true

Use SQLConf.nestedSchemaPruningEnabled method to access the current value.

spark.sql.orc.impl

(internal) When native, use the native version of ORC support instead of the ORC library in Hive 1.2.1.

Default: native

Acceptable values:

  • hive
  • native

spark.sql.pyspark.jvmStacktrace.enabled

When true, it shows the JVM stacktrace in the user-facing PySpark exception together with Python stacktrace. By default, it is disabled and hides JVM stacktrace and shows a Python-friendly exception only.

Default: false

Since: 3.0.0

spark.sql.parquet.binaryAsString

Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.

Default: false

Use SQLConf.isParquetBinaryAsString method to access the current value.

spark.sql.parquet.columnarReaderBatchSize

The number of rows to include in a parquet vectorized reader batch (the capacity of VectorizedParquetRecordReader).

Default: 4096 (4k)

The number should be carefully chosen to minimize overhead and avoid OOMs while reading data.

Use SQLConf.parquetVectorizedReaderBatchSize method to access the current value.

spark.sql.parquet.int96AsTimestamp

Some Parquet-producing systems, in particular Impala, store Timestamp into INT96. Spark would also store Timestamp as INT96 because we need to avoid precision loss of the nanoseconds field. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.

Default: true

Use SQLConf.isParquetINT96AsTimestamp method to access the current value.

spark.sql.parquet.enableVectorizedReader

Enables vectorized parquet decoding.

Default: true

Use SQLConf.parquetVectorizedReaderEnabled method to access the current value.

spark.sql.parquet.filterPushdown

Controls the filter predicate push-down optimization for data sources using parquet file format

Default: true

Use SQLConf.parquetFilterPushDown method to access the current value.

spark.sql.parquet.filterPushdown.date

(internal) Enables parquet filter push-down optimization for Date (when spark.sql.parquet.filterPushdown is enabled)

Default: true

Use SQLConf.parquetFilterPushDownDate method to access the current value.

spark.sql.parquet.int96TimestampConversion

Controls whether timestamp adjustments should be applied to INT96 data when converting to timestamps, for data written by Impala.

Default: false

This is necessary because Impala stores INT96 data with a different timezone offset than Hive and Spark.

Use SQLConf.isParquetINT96TimestampConversion method to access the current value.

spark.sql.parquet.recordLevelFilter.enabled

Enables Parquet's native record-level filtering using the pushed down filters (when spark.sql.parquet.filterPushdown is enabled).

Default: false

Use SQLConf.parquetRecordFilterEnabled method to access the current value.

spark.sql.parser.quotedRegexColumnNames

Controls whether quoted identifiers (using backticks) in SELECT statements should be interpreted as regular expressions.

Default: false

Use SQLConf.supportQuotedRegexColumnName method to access the current value.

spark.sql.pivotMaxValues

Maximum number of (distinct) values that will be collected without error (when doing a pivot without specifying the values for the pivot column)

Default: 10000

Use SQLConf.dataFramePivotMaxValues method to access the current value.
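
A sketch of a pivot without an explicit value list, which is what triggers collecting up to spark.sql.pivotMaxValues distinct values of the pivot column (the dataset is illustrative):

import org.apache.spark.sql.functions.sum
import spark.implicits._

val sales = Seq(("2019", "Q1", 10), ("2019", "Q2", 20), ("2020", "Q1", 30))
  .toDF("year", "quarter", "amount")

// The distinct values of "quarter" are collected first; with more than
// spark.sql.pivotMaxValues of them the query would fail.
sales.groupBy("year").pivot("quarter").agg(sum("amount")).show()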

spark.sql.redaction.options.regex

Regular expression to find options of a Spark SQL command with sensitive information

Default: (?i)secret|password

The values of the options matched will be redacted in the explain output.

This redaction is applied on top of the global redaction configuration defined by spark.redaction.regex configuration.

Used exclusively when SQLConf is requested to redactOptions.

spark.sql.redaction.string.regex

Regular expression to point at sensitive information in text output

Default: (undefined)

When this regex matches a string part, it is replaced by a dummy value (i.e. *********(redacted)). This is currently used to redact the output of SQL explain commands.

NOTE: When this conf is not set, the value of spark.redaction.string.regex is used instead.

Use SQLConf.stringRedactionPattern method to access the current value.

spark.sql.retainGroupColumns

Controls whether to retain columns used for aggregation or not (in RelationalGroupedDataset operators).

Default: true

Use SQLConf.dataFrameRetainGroupColumns method to access the current value.

spark.sql.runSQLOnFiles

(internal) Controls whether Spark SQL could use datasource.path as a table in a SQL query.

Default: true

Use SQLConf.runSQLonFile method to access the current value.

spark.sql.selfJoinAutoResolveAmbiguity

Controls whether to resolve ambiguity in join conditions for self-joins automatically (true) or not (false)

Default: true

spark.sql.sort.enableRadixSort

(internal) Controls whether to use radix sort (true) or not (false) in ShuffleExchangeExec and SortExec physical operators

Default: true

Radix sort is much faster but requires additional memory to be reserved up-front. The memory overhead may be significant when sorting very small rows (up to 50% more).

Use SQLConf.enableRadixSort method to access the current value.

spark.sql.sources.bucketing.enabled

Enables bucketing support. When disabled (i.e. false), bucketed tables are considered regular (non-bucketed) tables.

Default: true

Use SQLConf.bucketingEnabled method to access the current value.

spark.sql.sources.default

Default data source to use for loading or saving data

Default: parquet

Use SQLConf.defaultDataSourceName method to access the current value.

spark.sql.statistics.fallBackToHdfs

Enables automatic calculation of table size statistic by falling back to HDFS if the table statistics are not available from table metadata.

Default: false

This can be useful in determining if a table is small enough for auto broadcast joins in query planning.

Use SQLConf.fallBackToHdfsForStatsEnabled method to access the current value.

spark.sql.statistics.histogram.numBins

(internal) The number of bins when generating histograms.

Default: 254

NOTE: The number of bins must be greater than 1.

Use SQLConf.histogramNumBins method to access the current value.

spark.sql.statistics.parallelFileListingInStatsComputation.enabled

(internal) Enables parallel file listing in SQL commands, e.g. ANALYZE TABLE (as opposed to single thread listing that can be particularly slow with tables with hundreds of partitions)

Default: true

Use SQLConf.parallelFileListingInStatsComputation method to access the current value.

spark.sql.statistics.ndv.maxError

(internal) The maximum estimation error allowed in HyperLogLog++ algorithm when generating column level statistics.

Default: 0.05

spark.sql.statistics.percentile.accuracy

(internal) Accuracy of percentile approximation when generating equi-height histograms. Larger value means better accuracy. The relative error can be deduced by 1.0 / PERCENTILE_ACCURACY.

Default: 10000

spark.sql.statistics.size.autoUpdate.enabled

Enables automatic update of the table size statistic of a table after the table has changed.

Default: false

IMPORTANT: If the total number of files of the table is very large this can be expensive and slow down data change commands.

Use SQLConf.autoSizeUpdateEnabled method to access the current value.

spark.sql.subexpressionElimination.enabled

(internal) Enables subexpression elimination

Default: true

Use SQLConf.subexpressionEliminationEnabled method to access the current value.

spark.sql.shuffle.partitions

The default number of partitions to use when shuffling data for joins or aggregations.

Default: 200

Note

Corresponds to Apache Hive's mapred.reduce.tasks property that Spark SQL considers deprecated.

Spark Structured Streaming

spark.sql.shuffle.partitions cannot be changed in Spark Structured Streaming between query restarts from the same checkpoint location.

Use SQLConf.numShufflePartitions method to access the current value.
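
A sketch showing the property in action (with AQE disabled so the partition count is not coalesced at runtime):

spark.conf.set("spark.sql.adaptive.enabled", false)
spark.conf.set("spark.sql.shuffle.partitions", 4L)

// The post-shuffle (post-aggregation) RDD is expected to have exactly 4 partitions
val grouped = spark.range(1000).groupBy("id").count()
assert(grouped.rdd.getNumPartitions == 4)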

spark.sql.sources.fileCompressionFactor

(internal) When estimating the output data size of a table scan, multiply the file size by this factor to get the estimated data size, in case the data is compressed in the file, which would otherwise lead to a heavily underestimated result.

Default: 1.0

Use SQLConf.fileCompressionFactor method to access the current value.

spark.sql.sources.partitionOverwriteMode

Enables dynamic partition inserts when set to dynamic

Default: static

When INSERT OVERWRITE a partitioned data source table with dynamic partition columns, Spark SQL supports two modes (case-insensitive):

  • static - Spark deletes all the partitions that match the partition specification (e.g. PARTITION(a=1,b)) in the INSERT statement, before overwriting

  • dynamic - Spark doesn't delete partitions ahead, and only overwrites those partitions that have data written into it

The default static overwrite mode keeps the same behavior as Spark prior to 2.3. Note that this config doesn't affect Hive serde tables, as they are always overwritten with dynamic mode.

Use SQLConf.partitionOverwriteMode method to access the current value.
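
A sketch of a dynamic-partition overwrite of a path-based parquet table (the path and column names are hypothetical):

import spark.implicits._

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

Seq((1, "2020-01-01"), (2, "2020-01-02"))
  .toDF("id", "day")
  .write
  .partitionBy("day")
  .mode("overwrite")
  .parquet("/tmp/events")

// Overwriting again with only day=2020-01-01 replaces just that partition;
// day=2020-01-02 is left untouched (under static mode it would be deleted).
Seq((3, "2020-01-01"))
  .toDF("id", "day")
  .write
  .partitionBy("day")
  .mode("overwrite")
  .parquet("/tmp/events")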

spark.sql.truncateTable.ignorePermissionAcl.enabled

(internal) Disables setting back original permission and ACLs when re-creating the table/partition paths for TRUNCATE TABLE command.

Default: false

Use SQLConf.truncateTableIgnorePermissionAcl method to access the current value.

spark.sql.ui.retainedExecutions

The number of SQLExecutionUIData entries to keep in failedExecutions and completedExecutions internal registries.

Default: 1000

When a query execution finishes, the execution is removed from the internal activeExecutions registry and stored in failedExecutions or completedExecutions depending on the final execution status. That is when SQLListener makes sure that the number of SQLExecutionUIData entries does not exceed the spark.sql.ui.retainedExecutions Spark property and removes the excess entries.

spark.sql.windowExec.buffer.in.memory.threshold

(internal) Threshold for number of rows guaranteed to be held in memory by WindowExec physical operator.

Default: 4096

Use SQLConf.windowExecBufferInMemoryThreshold method to access the current value.

spark.sql.windowExec.buffer.spill.threshold

(internal) Threshold for number of rows buffered in a WindowExec physical operator.

Default: 4096

Use SQLConf.windowExecBufferSpillThreshold method to access the current value.


Last update: 2021-03-18