Spark / SPARK-22580

Count after filtering uncached CSV for isnull(columnNameOfCorruptRecord) always 0


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None
    • Environment: Same behavior on Debian and MS Windows (8.1) systems; JRE 1.8

    Description

      Filtering on the parser-created columnNameOfCorruptRecord column and counting afterwards does not recognize any invalid row that was put into this special column.

      Filtering on members of the actual schema works fine and yields correct counts.

      Input CSV example:

      val1, cat1, 1.337
      val2, cat1, 1.337
      val3, cat2, 42.0
      some, invalid, line
      

      Code snippet:

        StructType schema = new StructType(new StructField[] {
                new StructField("s1", DataTypes.StringType, true, Metadata.empty()),
                new StructField("s2", DataTypes.StringType, true, Metadata.empty()),
                new StructField("d1", DataTypes.DoubleType, true, Metadata.empty()),
                new StructField("FALLBACK", DataTypes.StringType, true, Metadata.empty()) });

        Dataset<Row> csv = sqlContext.read()
                .option("header", "false")
                .option("parserLib", "univocity")
                .option("mode", "PERMISSIVE")
                .option("maxCharsPerColumn", 10000000)
                .option("ignoreLeadingWhiteSpace", "false")
                .option("ignoreTrailingWhiteSpace", "false")
                .option("comment", (String) null)
                .option("columnNameOfCorruptRecord", "FALLBACK")
                .schema(schema)
                .csv("path/to/csv/file");

        long validCount = csv.filter("FALLBACK IS NULL").count();
        long invalidCount = csv.filter("FALLBACK IS NOT NULL").count();
      

      Expected:
      validCount is 3
      invalidCount is 1

      Actual:
      validCount is 4
      invalidCount is 0

      Caching the Dataset right after loading works around the problem and shows the correct counts.
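      A minimal, self-contained sketch of that caching workaround, assuming Spark 2.x with spark-sql on the classpath; the class name, method name, and path argument are illustrative and not part of the report:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class CorruptRecordCounts {

    /** Returns {validCount, invalidCount} for the given CSV file, caching before filtering. */
    public static long[] countValidInvalid(SparkSession spark, String path) {
        // Three data columns plus the corrupt-record column, as in the report.
        StructType schema = new StructType(new StructField[] {
                new StructField("s1", DataTypes.StringType, true, Metadata.empty()),
                new StructField("s2", DataTypes.StringType, true, Metadata.empty()),
                new StructField("d1", DataTypes.DoubleType, true, Metadata.empty()),
                new StructField("FALLBACK", DataTypes.StringType, true, Metadata.empty()) });

        Dataset<Row> csv = spark.read()
                .option("mode", "PERMISSIVE")
                .option("columnNameOfCorruptRecord", "FALLBACK")
                .schema(schema)
                .csv(path)
                .cache(); // materialize all columns, including FALLBACK, before filtering

        long valid = csv.filter("FALLBACK IS NULL").count();
        long invalid = csv.filter("FALLBACK IS NOT NULL").count();
        return new long[] { valid, invalid };
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("corrupt-record-counts")
                .master("local[*]")
                .getOrCreate();
        long[] counts = countValidInvalid(spark, args[0]);
        System.out.println("valid=" + counts[0] + " invalid=" + counts[1]);
        spark.stop();
    }
}
```

      Without the `.cache()` call this sketch should reproduce the reported wrong counts on Spark 2.2.0; with it, rows whose values do not match the schema (for example a non-numeric `d1`) end up in FALLBACK and are counted correctly.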

      People

        Assignee: Unassigned
        Reporter: Florian Kaspar (OL_Flogge)