[SPARK-29101] CSV datasource returns incorrect .count() from file with malformed records - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
Fix Version/s: 2.4.5, 3.0.0
Component/s: SQL
Labels:
- correctness

Description

Spark 2.4 introduced a change to the way csv files are read. See Upgrading From Spark SQL 2.3 to 2.4 for more details.

In that document, it states: To restore the previous behavior, set spark.sql.csv.parser.columnPruning.enabled to false.

I am configuring Spark 2.4.4 as such, yet I'm still getting results inconsistent with pre-2.4. For example:

Consider this file (fruit.csv). Notice it contains a header record, 3 valid records, and one malformed record.

fruit,color,price,quantity
apple,red,1,3
banana,yellow,2,4
orange,orange,3,5
xxx

With Spark 2.1.1, if I call .count() on a DataFrame created from this file (using option DROPMALFORMED), "3" is returned.

(using Spark 2.1.1)
scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count
19/09/16 14:28:01 WARN CSVRelation: Dropping malformed line: xxx
res1: Long = 3

With Spark 2.4.4, I set the "spark.sql.csv.parser.columnPruning.enabled" option to false to restore the pre-2.4 behavior for handling malformed records, then call .count() and "4" is returned.

(using spark 2.4.4)
scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)
scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count
res1: Long = 4

So, using the spark.sql.csv.parser.columnPruning.enabled option did not actually restore previous behavior.

How can I, using Spark 2.4+, get a count of the records in a .csv which excludes malformed records?

Attachments

Issue Links

links to

GitHub Pull Request #25820

GitHub Pull Request #25843

Activity

People

Assignee:: Sandeep Katta

Reporter:: Stuart White

Votes:: 1 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 16/Sep/19 19:38

Updated:: 12/Dec/22 18:10

Resolved:: 18/Sep/19 14:34