[SPARK-24204] Verify a write schema in Json/Orc/ParquetFileFormat


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: SQL
    • Labels: None

    Description

      SUMMARY

      • CSV: raises an analysis exception.
      • JSON: drops columns with null types (see the sketch below).
      • Parquet/ORC: raise runtime exceptions.
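
      For the JSON case, a rough repro sketch of the behaviour described above (the path and the printed schema are illustrative, not captured output): writing a DataFrame with a NullType column and reading it back shows that the column has been dropped;

      scala> import org.apache.spark.sql.Row
      scala> import org.apache.spark.sql.types._
      scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, null)))
      scala> val schema = StructType(StructField("a", IntegerType) :: StructField("b", NullType) :: Nil)
      scala> spark.createDataFrame(rdd, schema).write.json("/tmp/json")
      scala> spark.read.json("/tmp/json").printSchema()
      root
       |-- a: long (nullable = true)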

      The native ORC file format throws an exception with a meaningless message on the executor side when unsupported types are passed:

      
      scala> import org.apache.spark.sql.Row
      scala> import org.apache.spark.sql.types._
      scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, null)))
      scala> val schema = StructType(StructField("a", IntegerType) :: StructField("b", NullType) :: Nil)
      scala> val df = spark.createDataFrame(rdd, schema)
      scala> df.write.orc("/tmp/orc")
      java.lang.IllegalArgumentException: Can't parse category at 'struct<a:int,b:null^>'
              at org.apache.orc.TypeDescription.parseCategory(TypeDescription.java:223)
              at org.apache.orc.TypeDescription.parseType(TypeDescription.java:332)
              at org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:327)
              at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385)
              at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406)
              at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializer.scala:226)
              at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.<init>(OrcSerializer.scala:36)
              at org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:36)
              at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108)
              at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
              at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
              at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278)
      

      It seems better to verify a write schema on the driver side for users, as is already done for the CSV format:
      https://github.com/apache/spark/blob/76ecd095024a658bf68e5db658e4416565b30c17/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L65
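
      For illustration, a driver-side check could look roughly like the sketch below. verifyWriteSchema is a hypothetical helper written for this ticket, not existing Spark code, and the set of rejected types is only an assumption (here just NullType, including nested occurrences). The point is to fail fast on the driver with a clear, format-specific message before any write task is launched, in the same spirit as the CSV check linked above;

      import org.apache.spark.sql.types._

      // Hypothetical sketch: walk the write schema on the driver and fail fast
      // with a clear message when an unsupported type is found.
      def verifyWriteSchema(schema: StructType, format: String): Unit = {
        def verifyType(dataType: DataType): Unit = dataType match {
          case NullType =>
            throw new UnsupportedOperationException(
              s"$format data source does not support ${dataType.simpleString} data type.")
          case ArrayType(elementType, _) =>
            verifyType(elementType)
          case MapType(keyType, valueType, _) =>
            verifyType(keyType)
            verifyType(valueType)
          case StructType(fields) =>
            fields.foreach(f => verifyType(f.dataType))
          case _ =>
            // all other types are assumed supported in this sketch
        }
        schema.foreach(field => verifyType(field.dataType))
      }

      // e.g. a file format could call this before launching write tasks:
      // verifyWriteSchema(dataSchema, "ORC")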

          People

            Assignee: maropu Takeshi Yamamuro
            Reporter: maropu Takeshi Yamamuro
            Votes: 0
            Watchers: 4
