Spark / SPARK-15474

ORC data source fails to write and read back empty dataframe

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 2.1.1, 2.2.0
    • Fix Version/s: 2.3.0
    • Component/s: SQL
    • Labels:
      None

      Description

      Currently, the ORC data source cannot write an empty DataFrame and read it back.

      The code below:

      // `path` is a java.io.File for a temporary directory (the test obtains it via withTempPath)
      val emptyDf = spark.range(10).limit(0)
      emptyDf.write
        .format("orc")
        .save(path.getCanonicalPath)
      
      val copyEmptyDf = spark.read
        .format("orc")
        .load(path.getCanonicalPath)
      
      copyEmptyDf.show()
      

      throws the following exception:

      Unable to infer schema for ORC at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. It must be specified manually;
      org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. It must be specified manually;
      	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
      	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
      	at scala.Option.getOrElse(Option.scala:121)
      	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
      	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
      	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
      	at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
      	at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
      	at org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
      
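      The error message itself points at a possible workaround: pass the schema explicitly when reading, so that Spark does not have to infer it from the (missing) ORC files. The following is a minimal sketch under that assumption; `OrcEmptyRead` and `readOrcWithSchema` are hypothetical names, not part of Spark, and this report does not confirm that the workaround behaves the same on every affected version.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

object OrcEmptyRead {
  // Hypothetical helper (not a Spark API): reads ORC from `dir` using a
  // caller-supplied schema instead of relying on schema inference.
  def readOrcWithSchema(spark: SparkSession, dir: String, schema: StructType): DataFrame =
    spark.read
      .schema(schema) // skip inference, which is the step that fails above
      .format("orc")
      .load(dir)
}
```

      With the names from the snippet above, a call might look like `OrcEmptyRead.readOrcWithSchema(spark, path.getCanonicalPath, emptyDf.schema)`.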

      Note that this is a different case from the following data:

      // `Row` is org.apache.spark.sql.Row; `schema` is a StructType defined elsewhere
      val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
      

      In this case, no writer is initialised or created (WriterContainer.writeRows() is never called).

      This works for Parquet and JSON, but not for ORC.
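      To compare formats side by side, a small round-trip check can be sketched as below. `EmptyRoundTrip` and `roundTrips` are hypothetical names introduced for illustration; the check uses the `spark.range(10).limit(0)` DataFrame from the reproduction and writes to a fresh temporary location. Which formats pass depends on the Spark version in use.

```scala
import java.nio.file.Files
import scala.util.Try
import org.apache.spark.sql.SparkSession

object EmptyRoundTrip {
  // Hypothetical helper: writes an empty DataFrame in `format` to a new
  // temporary location, then tries to read it back.
  // Returns true only if both the write and the read succeed.
  def roundTrips(spark: SparkSession, format: String): Boolean = {
    // Resolve a child path that does not exist yet, because the default
    // save mode fails when the target path already exists.
    val dir = Files.createTempDirectory("empty-df-").resolve(format).toString
    val emptyDf = spark.range(10).limit(0)
    Try {
      emptyDf.write.format(format).save(dir)
      spark.read.format(format).load(dir).count()
    }.isSuccess
  }
}
```

      For example, `Seq("parquet", "json", "orc").map(f => f -> EmptyRoundTrip.roundTrips(spark, f))` gives one result per format.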

              People

              • Assignee: Dongjoon Hyun (dongjoon)
              • Reporter: Hyukjin Kwon (hyukjin.kwon)