Spark / SPARK-33940

Allow configuring the max column name length in the CSV writer


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 3.1.0
    • Fix Version: 3.1.1
    • Component: SQL
    • Labels: None

    Description

      The CSV writer has an implicit limit on column name length, inherited from univocity-parsers.

      When a writer is initialized (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211), it calls toIdentifierGroupArray, which eventually calls valueOf in NormalizedString.java (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209).

      Inside that call, stringCache.get enforces a maxStringLength cap (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104), which defaults to 1024.

      Spark does not expose this cap as a configurable option, so any column name longer than 1024 characters triggers a NullPointerException when writing headers.

       

      ```

      [info]   Cause: java.lang.NullPointerException:
      [info]   at com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)
      [info]   at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)
      [info]   at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)
      [info]   at org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87)
      [info]   at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58)
      [info]   at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CsvOutputWriter.scala:44)
      [info]   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86)
      [info]   at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
      [info]   at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
      [info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269)
      [info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)

      ```
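
The trace bottoms out in submitRow dereferencing a null header. The failure mode can be sketched without univocity at all: a cache that declines to store over-long strings returns null, and the caller later dereferences that null. This is a simplified model of the capped lookup, not the library's actual code:

```java
import java.util.HashMap;
import java.util.Map;

public class CacheCapDemo {
    // Mirrors the default cap in univocity's StringCache (1024 characters).
    static final int MAX_STRING_LENGTH = 1024;
    static final Map<String, String> cache = new HashMap<>();

    // Simplified model: strings over the cap are never cached and the
    // lookup returns null instead of a normalized value.
    static String get(String s) {
        if (s == null || s.length() > MAX_STRING_LENGTH) {
            return null;
        }
        return cache.computeIfAbsent(s, String::toLowerCase);
    }

    public static void main(String[] args) {
        System.out.println(get("c".repeat(1024)) != null); // true: within the cap
        String header = get("c".repeat(1025));
        // The writer then calls methods on this null header, which is the
        // NullPointerException surfacing in submitRow above.
        System.out.println(header == null);                // true
    }
}
```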

       

      It can be reproduced with a simple unit test:

       

      ```

      // Requires a running SparkSession; toDF comes from spark.implicits._
      import org.apache.spark.sql.Row
      import spark.implicits._

      // Build a single-column DataFrame whose header is 1025 characters long,
      // one more than univocity's default 1024-character cache cap.
      val row1 = Row("a")
      val superLongHeader = (0 until 1025).map(_ => "c").mkString("")
      val df = Seq(s"${row1.getString(0)}").toDF(superLongHeader)

      // Writing with headers enabled triggers the NullPointerException.
      // dataPath is the test's output directory.
      df.repartition(1)
        .write
        .option("header", "true")
        .option("maxColumnNameLength", 1025)
        .csv(dataPath)

      ```
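
Until the cap is configurable, one workaround is to keep header names at or under 1024 characters before writing. A minimal sketch of the truncation step (hypothetical helper, not part of Spark's API):

```java
import java.util.Arrays;

public class SafeHeaders {
    // univocity's default NormalizedString cache cap is 1024 characters.
    static final int MAX_COLUMN_NAME_LENGTH = 1024;

    // Truncate each header so the writer's string cache never sees an
    // over-long name (hypothetical helper, not part of Spark's API).
    static String[] truncate(String[] headers) {
        return Arrays.stream(headers)
                .map(h -> h.length() > MAX_COLUMN_NAME_LENGTH
                        ? h.substring(0, MAX_COLUMN_NAME_LENGTH)
                        : h)
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] safe = truncate(new String[] { "c".repeat(1025), "ok" });
        System.out.println(safe[0].length()); // 1024
        System.out.println(safe[1]);          // ok
    }
}
```

In Spark this would translate to renaming columns with withColumnRenamed before the write, at the cost of losing the tail of each long name.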

       

      Attachments

        Activity


          People

            Assignee: Nan Zhu (codingcat)
            Reporter: Nan Zhu (codingcat)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved:
