Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-29101

CSV datasource returns incorrect .count() from file with malformed records

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
    • Fix Version/s: 2.4.5, 3.0.0
    • Component/s: SQL
    • Labels:

      Description

      Spark 2.4 introduced a change to the way csv files are read.  See Upgrading From Spark SQL 2.3 to 2.4 for more details.

      In that document, it states: To restore the previous behavior, set spark.sql.csv.parser.columnPruning.enabled to false.

      I am configuring Spark 2.4.4 as such, yet I'm still getting results inconsistent with pre-2.4.  For example:

      Consider this file (fruit.csv).  Notice it contains a header record, 3 valid records, and one malformed record.

      fruit,color,price,quantity
      apple,red,1,3
      banana,yellow,2,4
      orange,orange,3,5
      xxx
      

       
      With Spark 2.1.1, if I call .count() on a DataFrame created from this file (using option DROPMALFORMED), "3" is returned.

      (using Spark 2.1.1)
      scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count
      19/09/16 14:28:01 WARN CSVRelation: Dropping malformed line: xxx
      res1: Long = 3
      

      With Spark 2.4.4, I set the "spark.sql.csv.parser.columnPruning.enabled" option to false to restore the pre-2.4 behavior for handling malformed records, then call .count() and "4" is returned.

      (using spark 2.4.4)
      scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)
      scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count
      res1: Long = 4
      

      So, using the spark.sql.csv.parser.columnPruning.enabled option did not actually restore previous behavior.

      How can I, using Spark 2.4+, get a count of the records in a .csv which excludes malformed records?

        Attachments

          Activity

            People

            • Assignee:
              sandeep.katta2007 Sandeep Katta
              Reporter:
              stwhit Stuart White
            • Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: