DataFrameReader supports many <> natively and offers the <>.
NOTE: DataFrameReader assumes the <> data source file format by default, which you can change using the spark.sql.sources.default configuration property.
After you have described the loading pipeline (i.e. the "Extract" part of ETL in Spark SQL), you eventually "trigger" the loading using format-agnostic <> or format-specific (e.g. <>, <>, <>) operators.
[source, scala]
----
import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...

import org.apache.spark.sql.Dataset
val lines: Dataset[String] = spark
  .read
  .textFile("README.md")
----
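As a sketch (with people.csv as a made-up input path), the format-agnostic load operator and a format-specific operator such as csv express the same load:

[source, scala]
----
import org.apache.spark.sql.{DataFrame, SparkSession}
val spark: SparkSession = SparkSession.builder.getOrCreate()

// Format-agnostic: name the format explicitly and trigger loading with load
val viaLoad: DataFrame = spark
  .read
  .format("csv")
  .option("header", "true")
  .load("people.csv") // made-up path

// Format-specific: the csv operator implies the format
val viaCsv: DataFrame = spark
  .read
  .option("header", "true")
  .csv("people.csv")
----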
NOTE: Loading datasets using <> methods allows for additional preprocessing before final processing of the string values as <> or <> lines.
[[loading-dataset-of-string]] DataFrameReader can load datasets from Dataset[String] (with lines being complete "files") using format-specific <> and <> operators.
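A minimal sketch of both operators reading an in-memory Dataset[String] (the sample lines are made up):

[source, scala]
----
import org.apache.spark.sql.{Dataset, SparkSession}
val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._

// CSV lines as a Dataset[String] (the first line is used as the header)
val csvLines: Dataset[String] = Seq("id,name", "0,hello", "1,world").toDS
val csvDF = spark.read.option("header", "true").csv(csvLines)

// JSON lines as a Dataset[String]
val jsonLines: Dataset[String] = Seq("""{"id":0,"name":"hello"}""").toDS
val jsonDF = spark.read.json(jsonLines)
----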
[source, scala]
----
val parquetWriter = tokens.write
parquetWriter.option("compression", "none").save("hello-none")
----

[source, scala]
----
// The exception is mostly for my learning purposes
// so I know where and how to find the trace to the compressions
// Sorry...
scala> parquetWriter.option("compression", "unsupported").save("hello-unsupported")
java.lang.IllegalArgumentException: Codec [unsupported] is not available. Available codecs are uncompressed, gzip, lzo, snappy, none.
  at org.apache.spark.sql.execution.datasources.parquet.ParquetOptions.<init>(ParquetOptions.scala:43)
  at org.apache.spark.sql.execution.datasources.parquet.DefaultSource.prepareWrite(ParquetRelation.scala:77)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$4.apply(InsertIntoHadoopFsRelation.scala:122)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$4.apply(InsertIntoHadoopFsRelation.scala:122)
  at org.apache.spark.sql.execution.datasources.BaseWriterContainer.driverSideSetup(WriterContainer.scala:103)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:141)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:116)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:116)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:116)
  at org.apache.spark.sql.execution.command.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:61)
  at org.apache.spark.sql.execution.command.ExecutedCommand.sideEffectResult(commands.scala:59)
  at org.apache.spark.sql.execution.command.ExecutedCommand.doExecute(commands.scala:73)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:137)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:134)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:65)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:65)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:390)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:247)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:230)
  ... 48 elided
----
Optimized Row Columnar (ORC) is a highly efficient columnar file format for storing Hive data, designed to handle tables with more than 1,000 columns and improve read performance. The ORC format was introduced in Hive 0.11 to use and retain the type information from the table definition.
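A sketch of loading ORC data (the path is made up):

[source, scala]
----
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder.getOrCreate()

// Format-specific operator...
val orcDF = spark.read.orc("people.orc") // made-up path

// ...or the format-agnostic equivalent
val orcViaLoad = spark.read.format("orc").load("people.orc")
----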
jdbc loads data from an external table using the JDBC data source.
Internally, jdbc creates a spark-sql-JDBCOptions.md#creating-instance[JDBCOptions] from the input url and table, with the extraOptions and connectionProperties merged together.

jdbc then creates one JDBCPartition per predicate.
In the end, jdbc requests the <> to SparkSession.md#baseRelationToDataFrame[create a DataFrame] for a JDBCRelation (with JDBCPartitions and JDBCOptions created earlier).
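A sketch of the predicates variant (the connection URL, table and credentials are made up); every predicate becomes the WHERE clause of its own JDBCPartition:

[source, scala]
----
import java.util.Properties
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder.getOrCreate()

val url = "jdbc:postgresql://localhost:5432/mydb" // made-up URL
val connectionProperties = new Properties()
connectionProperties.put("user", "dbuser")        // made-up credentials
connectionProperties.put("password", "secret")

// One JDBCPartition per predicate
val predicates = Array("id < 1000", "id >= 1000")

val people = spark.read.jdbc(url, "people", predicates, connectionProperties)
----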
[source, scala]
----
import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...

import org.apache.spark.sql.Dataset
val lines: Dataset[String] = spark
  .read
  .textFile("README.md")
----
NOTE: The textFile methods are similar to the <> family of methods in that they both read text files, but the text methods return an untyped DataFrame while the textFile methods return a typed Dataset[String].
Internally, textFile passes calls on to the <> method and Dataset.md#select[selects] the only value column before it applies the Encoders.STRING encoder.
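The difference in the return types, side by side:

[source, scala]
----
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
val spark: SparkSession = SparkSession.builder.getOrCreate()

// text gives an untyped DataFrame with a single "value" column
val asDataFrame: DataFrame = spark.read.text("README.md")

// textFile selects the value column and applies Encoders.STRING
val asDataset: Dataset[String] = spark.read.textFile("README.md")
----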
Once defined explicitly (using the <> method) or implicitly (via the spark.sql.sources.default configuration property), source is resolved using the DataSource utility.
NOTE: source is used exclusively when DataFrameReader is requested to <> (explicitly or using <>).
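For example (a sketch with made-up paths), the two loads below resolve to the same parquet source, explicitly in the first case and through the default in the second (parquet is the default value of spark.sql.sources.default):

[source, scala]
----
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder.getOrCreate()

// source set explicitly with format
val explicitSource = spark.read.format("parquet").load("people.parquet")

// source falls back to spark.sql.sources.default, i.e. parquet unless changed
val implicitSource = spark.read.load("people.parquet")
----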
=== [[internal-properties]] Internal Properties
[cols="30m,70",options="header",width="100%"] |=== | Name | Description