Description
Currently, the ORC data source fails to write and then read back empty data.
The code below:
val emptyDf = spark.range(10).limit(0)
emptyDf.write
  .format("orc")
  .save(path.getCanonicalPath)
val copyEmptyDf = spark.read
  .format("orc")
  .load(path.getCanonicalPath)
copyEmptyDf.show()
throws the exception below:
Unable to infer schema for ORC at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. It must be specified manually;
org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. It must be specified manually;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
  at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
  at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
  at org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
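The exception message says the schema "must be specified manually". A minimal sketch of doing that is below. It assumes a local SparkSession and a temporary output path (both hypothetical, not part of the failing test), and it has not been confirmed as a workaround on the affected version. On Spark versions where ORC support lives in the Hive module, spark-hive probably also needs to be on the classpath.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object ReadEmptyOrcWithSchema {
  // spark.range produces a single non-nullable LongType column named "id".
  val schema: StructType =
    StructType(Seq(StructField("id", LongType, nullable = false)))

  // Writes the empty DataFrame from the report, then reads it back with an
  // explicitly supplied schema so that no schema inference is attempted.
  def writeAndReadBack(spark: SparkSession, path: String): DataFrame = {
    val emptyDf = spark.range(10).limit(0)
    emptyDf.write.format("orc").save(path)
    spark.read.schema(schema).format("orc").load(path)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (an assumption of this sketch).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("read-empty-orc-with-schema")
      .getOrCreate()

    // Hypothetical location; save() expects the target not to exist yet,
    // so a fresh child of a new temporary directory is used.
    val path = Files.createTempDirectory("spark-orc-").resolve("empty").toString

    writeAndReadBack(spark, path).show()
    spark.stop()
  }
}
```

This only sidesteps inference on the read side. The written output still carries no schema of its own, which is the actual problem reported here.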
Note that this is a different case from the data below:
val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
In this case, no writer is initialised or created (there are no calls to WriterContainer.writeRows()).
This works for Parquet and JSON, but not for ORC.
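To compare the formats side by side, something like the sketch below could be used. It assumes a local SparkSession and temporary directories, and the `roundTrip` helper is hypothetical, not part of Spark or its test suite. It only reports what each format does; the outcome depends on the Spark version in use.

```scala
import java.nio.file.Files

import scala.util.Try

import org.apache.spark.sql.SparkSession

object EmptyRoundTripCheck {
  // Hypothetical helper: writes an empty DataFrame in the given format,
  // reads it back without supplying a schema, and returns the column names.
  // A failure (for example an AnalysisException) is captured in the Try.
  def roundTrip(spark: SparkSession, format: String): Try[Seq[String]] = Try {
    val path = Files.createTempDirectory(s"spark-$format-").resolve("empty").toString
    spark.range(10).limit(0).write.format(format).save(path)
    spark.read.format(format).load(path).columns.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (an assumption of this sketch).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("empty-round-trip-check")
      .getOrCreate()

    // Print the result for each format mentioned in this report.
    Seq("parquet", "json", "orc").foreach { format =>
      println(s"$format -> ${roundTrip(spark, format)}")
    }

    spark.stop()
  }
}
```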
Issue Links
- blocks: SPARK-20901 Feature parity for ORC with Parquet (Open)
- is broken by: ORC-152 Saving empty Spark DataFrame via ORC does not preserve schema (Closed)
- relates to: SPARK-8501 ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery (Resolved)
- links to