Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 2.2.0
- Fix Version/s: None
- Component/s: None
- Environment: Same behavior on Debian and MS Windows 8.1 systems; JRE 1.8
Description
Filtering on the parser-created columnNameOfCorruptRecord column and counting afterwards does not recognize any invalid rows that were put into this special column.
Filtering on members of the actual schema works fine and yields correct counts.
Input CSV example:
val1, cat1, 1.337
val2, cat1, 1.337
val3, cat2, 42.0
some, invalid, line
Code snippet:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Schema for the three data columns plus the FALLBACK column
// that should receive corrupt records.
StructType schema = new StructType(new StructField[] {
    new StructField("s1", DataTypes.StringType, true, Metadata.empty()),
    new StructField("s2", DataTypes.StringType, true, Metadata.empty()),
    new StructField("d1", DataTypes.DoubleType, true, Metadata.empty()),
    new StructField("FALLBACK", DataTypes.StringType, true, Metadata.empty())
});

// PERMISSIVE mode should route unparsable rows into FALLBACK.
Dataset<Row> csv = sqlContext.read()
    .option("header", "false")
    .option("parserLib", "univocity")
    .option("mode", "PERMISSIVE")
    .option("maxCharsPerColumn", 10000000)
    .option("ignoreLeadingWhiteSpace", "false")
    .option("ignoreTrailingWhiteSpace", "false")
    .option("comment", null)
    .option("columnNameOfCorruptRecord", "FALLBACK")
    .schema(schema)
    .csv("path/to/csv/file");

long validCount = csv.filter("FALLBACK IS NULL").count();
long invalidCount = csv.filter("FALLBACK IS NOT NULL").count();
Expected:
validCount is 3
invalidCount is 1
Actual:
validCount is 4
invalidCount is 0
Caching the Dataset right after load works around the problem and yields the correct counts, as sketched below.
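A minimal sketch of that workaround, reusing the csv Dataset from the snippet above (the counts in the comments are the ones observed in this report):

// Workaround: cache the Dataset right after load. With the cache in
// place, the FALLBACK column is populated and both filters return
// the expected counts.
csv.cache();

long validCount = csv.filter("FALLBACK IS NULL").count();       // 3
long invalidCount = csv.filter("FALLBACK IS NOT NULL").count(); // 1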
Issue Links
- duplicates SPARK-21610: Corrupt records are not handled properly when creating a dataframe from a file (Resolved)