[SPARK-15474] ORC data source fails to write and read back empty dataframe

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0, 2.1.1, 2.2.0
Fix Version/s: 2.3.0
Component/s: SQL
Labels: None

Description

Currently, the ORC data source fails to write and read back an empty DataFrame.

The following code:

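// `path` is a java.io.File for a temporary directory
// (the stack trace shows the test runs inside SQLTestUtils.withTempPath).
// A DataFrame with a schema but zero rows.
val emptyDf = spark.range(10).limit(0)
emptyDf.write
  .format("orc")
  .save(path.getCanonicalPath)

// Reading it back triggers schema inference, which is where the failure occurs.
val copyEmptyDf = spark.read
  .format("orc")
  .load(path.getCanonicalPath)

copyEmptyDf.show()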

throws the following exception:

Unable to infer schema for ORC at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. It must be specified manually;
org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. It must be specified manually;
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
	at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
	at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
	at org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)

Note that this differs from the case below:

val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

In that case, no writer is ever initialised or created (WriterContainer.writeRows() is never called).

The same round trip works for Parquet and JSON, but not for ORC.
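
The exception message says the schema "must be specified manually". As a minimal sketch of that, assuming the failure is limited to schema inference on read, the reader can be given the original DataFrame's schema explicitly. The object name EmptyOrcRoundTrip and the helper roundTrip are hypothetical and exist only for illustration. On the affected versions the ORC source is assumed to require the spark-hive module on the classpath. This has not been checked against every affected version.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object EmptyOrcRoundTrip {
  // Hypothetical helper: writes `df` as ORC, then reads it back while
  // supplying df's own schema so that no schema inference is attempted.
  def roundTrip(spark: SparkSession, df: DataFrame, dir: String): DataFrame = {
    df.write.format("orc").mode("overwrite").save(dir)
    spark.read.schema(df.schema).format("orc").load(dir)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to reproduce the round trip.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("empty-orc-round-trip")
      .getOrCreate()
    try {
      // Target directory inside a fresh temp directory (does not exist yet).
      val dir = Files.createTempDirectory("empty-orc").resolve("out").toString
      val emptyDf = spark.range(10).limit(0)
      val copyEmptyDf = roundTrip(spark, emptyDf, dir)
      copyEmptyDf.printSchema()
      copyEmptyDf.show()
    } finally {
      spark.stop()
    }
  }
}
```

This only sidesteps the read-side inference error. It does not make the written ORC output carry the schema, which is what the issue itself is about.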

Issue Links

blocks
  SPARK-20901 Feature parity for ORC with Parquet (Open)

is broken by
  ORC-152 Saving empty Spark DataFrame via ORC does not preserve schema (Closed)

relates to
  SPARK-8501 ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery (Resolved)

links to
  [Github] Pull Request #13257 (sbcd90)
  [Github] Pull Request #19571 (dongjoon-hyun)
  [Github] Pull Request #19651 (dongjoon-hyun)

People

Assignee: Dongjoon Hyun

Reporter: Hyukjin Kwon

Votes: 0

Watchers: 4

Dates

Created: 22/May/16 11:45

Updated: 12/Dec/22 18:10

Resolved: 03/Dec/17 14:25