Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.3.0, 3.2.2, 3.4.0
Description
I have found that, depending on the name of the corrupt record column in CSV, the field is populated incorrectly. Here is an example:
/tmp/file.csv contains a single line:

1,a

val df = spark.read
  .schema("c1 int, c2 string, x string, _corrupt_record string")
  .csv("file:/tmp/file.csv")
  .withColumn("x", lit("A"))

Result:

+---+---+---+---------------+
|c1 |c2 |x  |_corrupt_record|
+---+---+---+---------------+
|1  |a  |A  |1,a            |
+---+---+---+---------------+
However, if you rename the _corrupt_record column to something else, the result is different:
val df = spark.read
  .option("columnNameOfCorruptRecord", "corrupt_record")
  .schema("c1 int, c2 string, x string, corrupt_record string")
  .csv("file:/tmp/file.csv")
  .withColumn("x", lit("A"))

Result:

+---+---+---+--------------+
|c1 |c2 |x  |corrupt_record|
+---+---+---+--------------+
|1  |a  |A  |null          |
+---+---+---+--------------+
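A possible workaround, sketched here but not verified on the affected versions, is to also set the session-level conf spark.sql.columnNameOfCorruptRecord to the same custom name, so that the check in CSVFileFormat and the CSV reader look for the same column:

// Sketch of a possible workaround (assumption, not a verified fix):
// align the session-level corrupt record column name with the per-read
// option so both code paths agree on the column name.
import org.apache.spark.sql.functions.lit  // already in scope in spark-shell

spark.conf.set("spark.sql.columnNameOfCorruptRecord", "corrupt_record")

val df2 = spark.read
  .option("columnNameOfCorruptRecord", "corrupt_record")
  .schema("c1 int, c2 string, x string, corrupt_record string")
  .csv("file:/tmp/file.csv")
  .withColumn("x", lit("A"))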
This is due to an inconsistency in CSVFileFormat: when deciding whether to enable column pruning, we check the SQLConf setting for the corrupt record column name, but the CSV reader relies on the columnNameOfCorruptRecord data source option instead.
Also, this disables column pruning, which used to work in Spark versions prior to https://github.com/apache/spark/commit/959694271e30879c944d7fd5de2740571012460a.
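For anyone reproducing this, CSV column pruning is gated by the session conf spark.sql.csv.parser.columnPruning.enabled (default true); the sketch below only shows how to inspect and toggle it to check whether the null corrupt record above follows the pruning code path, and is not proposed as a fix.

// Inspect the CSV column pruning flag (default: true).
println(spark.conf.get("spark.sql.csv.parser.columnPruning.enabled"))

// Turning pruning off forces the non-pruned parsing path; re-running the
// read above with this setting can help confirm the behavior is tied to
// column pruning (assumption: useful for diagnosis only).
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")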