Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.3.0, 3.2.2, 3.4.0
Description
I have found that, depending on the name of the corrupt record column in CSV, the field is populated incorrectly. Here is an example:
/tmp/file.csv contains a single line:

1,a

val df = spark.read
  .schema("c1 int, c2 string, x string, _corrupt_record string")
  .csv("file:/tmp/file.csv")
  .withColumn("x", lit("A"))

Result:

+---+---+---+---------------+
|c1 |c2 |x  |_corrupt_record|
+---+---+---+---------------+
|1  |a  |A  |1,a            |
+---+---+---+---------------+
However, if you rename the _corrupt_record column to something else, the result is different:
val df = spark.read
  .option("columnNameOfCorruptRecord", "corrupt_record")
  .schema("c1 int, c2 string, x string, corrupt_record string")
  .csv("file:/tmp/file.csv")
  .withColumn("x", lit("A"))

Result:

+---+---+---+--------------+
|c1 |c2 |x  |corrupt_record|
+---+---+---+--------------+
|1  |a  |A  |null          |
+---+---+---+--------------+
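A possible workaround, sketched here but not verified on the affected versions, is to also set the session-level conf spark.sql.columnNameOfCorruptRecord to the same custom name, so that the check in CSVFileFormat and the CSV reader look for the same column:

// Sketch of a possible workaround (assumption, not a verified fix):
// align the session-level corrupt record column name with the per-read
// option so both code paths agree on the column name.
import org.apache.spark.sql.functions.lit  // already in scope in spark-shell

spark.conf.set("spark.sql.columnNameOfCorruptRecord", "corrupt_record")

val df2 = spark.read
  .option("columnNameOfCorruptRecord", "corrupt_record")
  .schema("c1 int, c2 string, x string, corrupt_record string")
  .csv("file:/tmp/file.csv")
  .withColumn("x", lit("A"))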
This is due to an inconsistency in CSVFileFormat: when deciding whether to enable column pruning, we check the SQLConf setting for the corrupt record column name, but the CSV reader relies on the columnNameOfCorruptRecord data source option instead.
Also, this disables column pruning, which used to work in Spark versions prior to https://github.com/apache/spark/commit/959694271e30879c944d7fd5de2740571012460a.
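For anyone reproducing this, CSV column pruning is gated by the session conf spark.sql.csv.parser.columnPruning.enabled (default true); the sketch below only shows how to inspect and toggle it to check whether the null corrupt record above follows the pruning code path, and is not proposed as a fix.

// Inspect the CSV column pruning flag (default: true).
println(spark.conf.get("spark.sql.csv.parser.columnPruning.enabled"))

// Turning pruning off forces the non-pruned parsing path; re-running the
// read above with this setting can help confirm the behavior is tied to
// column pruning (assumption: useful for diagnosis only).
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")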