SPARK-33940

Allow configuring the max column name length in CSV writer


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.1
    • Component/s: SQL
    • Labels: None

    Description

      The CSV writer has an implicit limit on column name length imposed by univocity-parsers.

      When Spark initializes a writer (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211), the constructor calls toIdentifierGroupArray, which eventually reaches valueOf in NormalizedString.java (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209).

      Inside that valueOf call, stringCache.get enforces a maxStringLength cap (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104), which defaults to 1024.
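
      For illustration, here is a simplified Scala model of that cache cap. This is a sketch paraphrasing the linked StringCache.java, not the library's actual code; the null-return path is an assumption consistent with the stack trace below:

      ```
      // Simplified model of univocity's StringCache cap (assumption: inputs
      // longer than maxStringLength are not cached and come back as null,
      // which later blows up in AbstractWriter.submitRow).
      class StringCacheSketch(maxStringLength: Int = 1024) {
        private val cache = scala.collection.mutable.Map.empty[String, String]

        def get(input: String): String = {
          if (input == null || input.length > maxStringLength) {
            null // an over-long column name falls out here as a null header
          } else {
            cache.getOrElseUpdate(input, input)
          }
        }
      }

      // A 1025-character name exceeds the default cap:
      // new StringCacheSketch().get("c" * 1025) returns null
      ```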

       

      Spark does not expose this cap as a configurable option, so a column name longer than 1024 characters leads to a NullPointerException:

       

      ```
      [info]   Cause: java.lang.NullPointerException:
      [info]   at com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)
      [info]   at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)
      [info]   at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)
      [info]   at org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87)
      [info]   at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58)
      [info]   at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CsvOutputWriter.scala:44)
      [info]   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86)
      [info]   at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
      [info]   at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
      [info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269)
      [info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
      ```
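
      Until the cap is exposed as an option, a possible workaround (a hypothetical sketch, not code from Spark) is to shorten over-long column names before writing. Given any DataFrame df and output path dataPath:

      ```
      // Hypothetical workaround: truncate column names to univocity's default
      // 1024-character cap before writing. Note: truncation can collide names,
      // so real code would need to deduplicate.
      val maxLen = 1024
      val safeDf = df.columns.foldLeft(df) { (d, name) =>
        if (name.length > maxLen) d.withColumnRenamed(name, name.take(maxLen)) else d
      }
      safeDf.write.option("header", "true").csv(dataPath)
      ```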

       

      It can be reproduced with a simple unit test:

       

      ```
      import org.apache.spark.sql.Row
      // assumes a SparkSession named spark is in scope; toDF needs its implicits
      import spark.implicits._

      // dataPath is any writable output directory (e.g. a temp dir in the test)
      val row1 = Row("a")
      // a 1025-character header, one character past univocity's default 1024 cap
      val superLongHeader = (0 until 1025).map(_ => "c").mkString("")
      val df = Seq(s"${row1.getString(0)}").toDF(superLongHeader)
      df.repartition(1)
        .write
        .option("header", "true")
        .option("maxColumnNameLength", 1025)
        .csv(dataPath)
      ```
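
      After the fix, a quick sanity check (hypothetical, assuming the write above succeeds) is to read the output back and confirm the header round-trips:

      ```
      // Read the file back and confirm the 1025-character header survived.
      val readBack = spark.read.option("header", "true").csv(dataPath)
      assert(readBack.columns.head.length == 1025)
      ```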

       


          People

            Assignee: Nan Zhu (codingcat)
            Reporter: Nan Zhu (codingcat)
            Votes: 0
            Watchers: 2
