Spark / SPARK-32614

Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2.3, 2.4.5, 3.0.0
    • Fix Version/s: 3.0.1, 3.1.0
    • Component/s: Spark Core, SQL
    • Labels:

      Description

      In most data-warehousing scenarios, files do not contain comment records, and every line needs to be treated as a valid record even if it starts with the default comment character, \u0000 (the null character). Although the user can set a comment character other than \u0000, there is still a chance that an actual record starts with that character.

      Currently, for the code below, run against test data whose first row starts with the null (\u0000) character, the following error is thrown:

      Example:
      val df = spark.read.option("delimiter", ",").csv("file:/E:/Data/Testdata.dat")
      df.show(false)

      Test data: see the attached screenshot-1.png.

      Internal state when error was thrown: line=1, column=0, record=0, charIndex=7
      at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
      at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552)
      at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160)
      at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148)
      at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
      at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
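
      The partial workaround mentioned below can be sketched in spark-shell as follows (the file path and the choice of '#' are illustrative; any row that actually begins with '#' would then be silently dropped instead):

      // Override the default \u0000 comment character so that rows starting
      // with \u0000 are no longer treated as comments.
      val df = spark.read
        .option("delimiter", ",")
        .option("comment", "#")   // illustrative choice; shifts the problem to '#'
        .csv("file:/E:/Data/Testdata.dat")
      df.show(false)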

      Note:

      This is a limitation of the univocity parser. The workaround is to set a different comment character via .option("comment", "#"), but if the actual data starts with that character, the corresponding row will be discarded.
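
      The behaviour described above can be modelled with a small self-contained Scala sketch (a simplified illustration only, not univocity's actual implementation; the object and method names are invented for this example):

      // Simplified model of comment-character handling in a CSV reader.
      object CommentModel {
        // '\u0000' mirrors Spark CSV's default comment character.
        def parse(lines: Seq[String], comment: Char = '\u0000'): Seq[String] =
          lines.filterNot(line => line.nonEmpty && line.head == comment)

        def main(args: Array[String]): Unit = {
          val data = Seq("\u0000a,b,c", "1,2,3", "#x,y,z")
          // With the default comment char, the first record is dropped:
          println(parse(data))
          // Switching the comment char to '#' drops the last record instead:
          println(parse(data, comment = '#'))
        }
      }

      Whatever single character is chosen as the comment marker, any genuine record beginning with it is lost, which is why an option to disable comment handling entirely is proposed.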

      I have pushed a fix to the univocity parser to handle this scenario in the following PR:
      https://github.com/uniVocity/univocity-parsers/pull/412

      Please accept this JIRA so that we can enable this feature in spark-csv by adding a parameter to Spark's CSV options.

        Attachments

        1. screenshot-1.png
          2 kB
          Chandan Ray

          Activity

            People

            • Assignee:
              srowen Sean R. Owen
              Reporter:
              Chanduhawk Chandan Ray
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: