Spark / SPARK-40468

Column pruning is not handled correctly in CSV when _corrupt_record is used

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.0, 3.2.2, 3.4.0
    • Fix Version/s: 3.3.1, 3.4.0
    • Component/s: SQL

    Description

      I have found that the corrupt record field in CSV is populated differently depending on the name of the corrupt record column. Here is an example:

      echo "1,a" > /tmp/file.csv
      
      ===
      
      import org.apache.spark.sql.functions.lit
      
      val df = spark.read
        .schema("c1 int, c2 string, x string, _corrupt_record string")
        .csv("file:/tmp/file.csv")
        .withColumn("x", lit("A"))
      
      Result:
      
      +---+---+---+---------------+
      |c1 |c2 |x  |_corrupt_record|
      +---+---+---+---------------+
      |1  |a  |A  |1,a            |
      +---+---+---+---------------+

       

      However, if you rename the _corrupt_record column to something else, the result is different:

      val df = spark.read
        .option("columnNameOfCorruptRecord", "corrupt_record")
        .schema("c1 int, c2 string, x string, corrupt_record string")
        .csv("file:/tmp/file.csv")
        .withColumn("x", lit("A"))
      
      Result:
      
      +---+---+---+--------------+
      |c1 |c2 |x  |corrupt_record|
      +---+---+---+--------------+
      |1  |a  |A  |null          |
      +---+---+---+--------------+

       

      This is due to an inconsistency in CSVFileFormat: when deciding whether to enable column pruning, we check the SQLConf option for the corrupt record column name, but the CSV reader relies on the columnNameOfCorruptRecord reader option instead.
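
      The mismatch can be illustrated with a small standalone model. This is a sketch and not Spark's actual source: the object and method names are hypothetical, "_corrupt_record" is taken as the session default, only the token-count check is modelled, and it assumes that x is not read from the file because withColumn replaces it.

```scala
// Simplified model of the behaviour described above; NOT Spark's actual code.
// All names here are hypothetical and exist only for illustration.
object CorruptRecordPruningModel {
  // Assumed session-level corrupt record column name (the SQLConf default).
  val sessionCorruptName = "_corrupt_record"

  // Inconsistent decision: only the session-level name is checked.
  def pruningBeforeFix(required: Seq[String]): Boolean =
    !required.contains(sessionCorruptName)

  // Consistent decision: check the name the CSV reader will actually use
  // (the reader option if set, otherwise the session-level name).
  def pruningConsistent(required: Seq[String], readerOption: Option[String]): Boolean =
    !required.contains(readerOption.getOrElse(sessionCorruptName))

  // Models only the token-count check: without pruning, a row is malformed when
  // its token count differs from the data schema; with pruning, only the selected
  // columns are tokenized, so a short row is not flagged in this model.
  def corruptValue(line: String, dataColumns: Seq[String], pruning: Boolean): Option[String] = {
    val tokenCount = line.split(",", -1).length
    if (!pruning && tokenCount != dataColumns.length) Some(line) else None
  }

  def main(args: Array[String]): Unit = {
    val dataColumns = Seq("c1", "c2", "x")
    val defaultName = Seq("c1", "c2", "_corrupt_record")
    val renamed = Seq("c1", "c2", "corrupt_record")

    // Default name: pruning is off, so the short row is reported as corrupt.
    println(corruptValue("1,a", dataColumns, pruningBeforeFix(defaultName)))
    // Renamed column: pruning stays on, so the corrupt record is lost.
    println(corruptValue("1,a", dataColumns, pruningBeforeFix(renamed)))
    // Consistent check: the renamed column also disables pruning.
    println(corruptValue("1,a", dataColumns,
      pruningConsistent(renamed, Some("corrupt_record"))))
  }
}
```

      In this model the first and third calls yield the raw line and the second yields no value, mirroring the two results shown in the description.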

      Also, this disables column pruning, which used to work in Spark versions prior to https://github.com/apache/spark/commit/959694271e30879c944d7fd5de2740571012460a.

          People

            Assignee: ivan.sadikov Ivan Sadikov
            Reporter: ivan.sadikov Ivan Sadikov