SPARK-15474: ORC data source fails to write and read back empty DataFrame

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 2.1.1, 2.2.0
    • Fix Version/s: 2.3.0
    • Component/s: SQL
    • Labels: None

    Description

      Currently, the ORC data source fails to write an empty DataFrame and read it back.

      The code below:

      val emptyDf = spark.range(10).limit(0)
      emptyDf.write
        .format("orc")
        .save(path.getCanonicalPath)
      
      val copyEmptyDf = spark.read
        .format("orc")
        .load(path.getCanonicalPath)
      
      copyEmptyDf.show()
      

      throws the following exception:

      Unable to infer schema for ORC at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. It must be specified manually;
      org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. It must be specified manually;
      	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
      	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
      	at scala.Option.getOrElse(Option.scala:121)
      	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
      	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
      	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
      	at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
      	at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
      	at org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
      

      Note that this is a different case from the DataFrame below:

      val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
      

      In this case, no writer is initialised or created (there are no calls to WriterContainer.writeRows()).
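
One way to see why the two cases differ is to compare partition counts: a DataFrame built on emptyRDD has no partitions, so a write job launches no tasks and no task-side writer is ever constructed. The sketch below assumes a local SparkSession; the one-column schema is a hypothetical stand-in, since the snippet above does not show how schema was defined.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Assumption: a local session, for illustration only.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("empty-df-partitions")
  .getOrCreate()

// Hypothetical schema; the issue does not show the one actually used.
val schema = StructType(Seq(StructField("id", LongType, nullable = false)))

// Case from the note above: backed by an RDD that has no partitions.
val emptyFromRdd = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

// Case from the reproduction: a non-empty plan truncated to zero rows.
val emptyFromLimit = spark.range(10).limit(0)

// With no partitions, a write job has no tasks in which to create a writer.
println(s"emptyRDD partitions: ${emptyFromRdd.rdd.getNumPartitions}")

// This count depends on the Spark version and optimizer, so no value is claimed here.
println(s"limit(0) partitions: ${emptyFromLimit.rdd.getNumPartitions}")
```

Both DataFrames contain zero rows; only the physical layout differs, which is why the issue treats them as separate cases.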

      This works for Parquet and JSON, but not for ORC.
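
Because the failure happens during schema inference, one possible workaround on affected versions is to supply the schema explicitly when reading, as the error message suggests. The following is a sketch under that assumption, not a verified fix: the temporary directory and session setup are illustrative and not part of the original report, and older Spark versions may additionally require a session with Hive support to use the ORC source.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session, for illustration only.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("orc-empty-roundtrip")
  .getOrCreate()

// Illustrative output location; Spark creates the "orc" subdirectory on write.
val path = Files.createTempDirectory("spark-orc-empty").resolve("orc").toString

val emptyDf = spark.range(10).limit(0)
emptyDf.write.format("orc").save(path)

// Passing the schema skips inference, which is the step that fails in the report.
val copyEmptyDf = spark.read
  .schema(emptyDf.schema)
  .format("orc")
  .load(path)

copyEmptyDf.show()
```

The same round trip can be tried with format("parquet") or format("json") to compare behaviour across data sources.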

            People

              Assignee: Dongjoon Hyun (dongjoon)
              Reporter: Hyukjin Kwon (hyukjin.kwon)
              Votes: 0
              Watchers: 6
