Spark / SPARK-28058

Reading CSV with DROPMALFORMED sometimes doesn't drop malformed records


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4.1, 2.4.3
    • Fix Version/s: 2.4.4, 3.0.0
    • Component/s: SQL
    • Labels:
      None

      Description

      The Spark SQL CSV reader is not dropping malformed records as expected.

      Consider this file (fruit.csv). It contains a header record, three valid records, and one malformed record.

      fruit,color,price,quantity
      apple,red,1,3
      banana,yellow,2,4
      orange,orange,3,5
      xxx
      

      If I read this file using the spark sql csv reader as follows, everything looks good. The malformed record is dropped.

      scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
      +------+------+-----+--------+                                                  
      |fruit |color |price|quantity|
      +------+------+-----+--------+
      |apple |red   |1    |3       |
      |banana|yellow|2    |4       |
      |orange|orange|3    |5       |
      +------+------+-----+--------+
      

      However, if I select a subset of the columns, the malformed record is not dropped. The malformed data is placed in the first column, and the remaining column(s) are filled with nulls.
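      A plausible explanation (my assumption, not confirmed in this ticket) is CSV column pruning: when only a subset of columns is requested, the parser validates only those positions, so a record with a single token looks complete for a one-column projection. A minimal pure-Scala sketch of that idea (a simplified model, not Spark's actual parser):

      ```scala
      object PruningSketch {
        // Hypothetical simplified model of DROPMALFORMED under column pruning:
        // a record is kept if it has a token for every *requested* column index.
        def keep(record: Array[String], requested: Seq[Int]): Boolean =
          requested.forall(i => i < record.length)

        def main(args: Array[String]): Unit = {
          val malformed = "xxx".split(",")          // only one token
          println(keep(malformed, Seq(0)))          // pruned to column 0: looks valid
          println(keep(malformed, Seq(0, 1, 2, 3))) // full schema: detected as malformed
        }
      }
      ```

      Under this model, selecting fewer columns than the schema defines makes the malformed line pass validation, matching the behavior shown below.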

      scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
      +------+
      |fruit |
      +------+
      |apple |
      |banana|
      |orange|
      |xxx   |
      +------+
      
      scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
      +------+------+
      |fruit |color |
      +------+------+
      |apple |red   |
      |banana|yellow|
      |orange|orange|
      |xxx   |null  |
      +------+------+
      
      scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price).show(truncate=false)
      +------+------+-----+
      |fruit |color |price|
      +------+------+-----+
      |apple |red   |1    |
      |banana|yellow|2    |
      |orange|orange|3    |
      |xxx   |null  |null |
      +------+------+-----+
      

      And finally, if I manually select all of the columns, the malformed record is once again dropped.

      scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 'quantity).show(truncate=false)
      +------+------+-----+--------+
      |fruit |color |price|quantity|
      +------+------+-----+--------+
      |apple |red   |1    |3       |
      |banana|yellow|2    |4       |
      |orange|orange|3    |5       |
      +------+------+-----+--------+
      

      I would expect the malformed record(s) to be dropped regardless of which columns are being selected from the file.
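      A possible workaround, assuming the behavior is tied to CSV column pruning (a Spark 2.4 optimization controlled by the `spark.sql.csv.parser.columnPruning.enabled` setting, which defaults to true), is to disable that setting so the full record is always parsed. This is a configuration sketch, not a confirmed fix:

      ```scala
      // Assumption: disabling CSV column pruning forces full-record parsing,
      // restoring DROPMALFORMED semantics even when columns are selected.
      spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")
      spark.read.option("header", "true").option("mode", "DROPMALFORMED")
        .csv("fruit.csv").select('fruit).show(truncate = false)
      ```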

              People

               • Assignee: L. C. Hsieh (viirya)
               • Reporter: Stuart White (stwhit)
               • Votes: 1
               • Watchers: 4
