Details
Type: Bug
Priority: Minor
Status: Resolved
Resolution: Fixed
Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
Description
Spark 2.4 introduced a change to the way CSV files are read. See "Upgrading From Spark SQL 2.3 to 2.4" for more details.
That document states: "To restore the previous behavior, set spark.sql.csv.parser.columnPruning.enabled to false."
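For reference, the flag can also be set when the session is built rather than afterwards; a minimal sketch, assuming a standalone application instead of the shell (the app name is made up):

import org.apache.spark.sql.SparkSession

// Setting the pruning flag at session creation; this sets the same SQL conf
// as the spark.conf.set(...) call used in the shell session further down.
val spark = SparkSession.builder()
  .appName("csv-count-repro")  // hypothetical app name
  .config("spark.sql.csv.parser.columnPruning.enabled", "false")
  .getOrCreate()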
I am setting that flag in Spark 2.4.4, yet I am still getting results inconsistent with pre-2.4 behavior. For example:
Consider this file (fruit.csv). Notice it contains a header record, three valid records, and one malformed record.
fruit,color,price,quantity
apple,red,1,3
banana,yellow,2,4
orange,orange,3,5
xxx
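To reproduce, the file can be created directly from the shell; a minimal sketch (the path is relative to the working directory):

import java.nio.file.{Files, Paths}

// Write the sample file used throughout this report.
val content = Seq(
  "fruit,color,price,quantity",
  "apple,red,1,3",
  "banana,yellow,2,4",
  "orange,orange,3,5",
  "xxx"
).mkString("\n")

Files.write(Paths.get("fruit.csv"), content.getBytes("UTF-8"))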
With Spark 2.1.1, if I call .count() on a DataFrame created from this file (using option DROPMALFORMED), "3" is returned.
(using Spark 2.1.1)
scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count
19/09/16 14:28:01 WARN CSVRelation: Dropping malformed line: xxx
res1: Long = 3
With Spark 2.4.4, I set the "spark.sql.csv.parser.columnPruning.enabled" option to false to restore the pre-2.4 behavior for handling malformed records, then call .count() and "4" is returned.
(using Spark 2.4.4)
scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)
scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count
res1: Long = 4
So setting spark.sql.csv.parser.columnPruning.enabled to false did not actually restore the previous behavior.
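For what it's worth, the discrepancy appears to be specific to the count() path, which can answer without parsing any columns; here is a hedged sketch of a check that should expose it, assuming the same fruit.csv as above (the collect() figure is what I would expect, not captured output):

val df = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .csv("fruit.csv")

// Materializing the rows forces the parser to process every column,
// so the malformed line can actually be detected and dropped.
println(df.collect().length)  // expected: 3

// count() does not need any column values, so the malformed line is
// never flagged and the result stays at 4.
println(df.count())           // observed above: 4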
How can I, using Spark 2.4+, get a count of the records in a CSV file that excludes malformed records?
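One workaround I am considering (a sketch, not verified against this bug) is to avoid DROPMALFORMED entirely: read in PERMISSIVE mode with an explicit schema plus a corrupt-record column, and count only the rows that parsed cleanly. The schema below is an assumption based on the sample file; note that Spark 2.3+ disallows queries that reference only the corrupt-record column on the raw file, hence the cache() call.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Explicit schema for fruit.csv, plus a column to capture malformed input.
// "_corrupt_record" is Spark's default corrupt-record column name; it must
// be declared in the schema so the parser has somewhere to put the raw line.
val schema = new StructType()
  .add("fruit", StringType)
  .add("color", StringType)
  .add("price", IntegerType)
  .add("quantity", IntegerType)
  .add("_corrupt_record", StringType)

val df = spark.read
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .csv("fruit.csv")

// Cache before filtering on the corrupt-record column, then count the
// rows that parsed cleanly.
df.cache()
val validCount = df.filter(col("_corrupt_record").isNull).count()  // expected: 3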