Spark / SPARK-22580

Count after filtering uncached CSV for isnull(columnNameOfCorruptRecord) always 0


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None
    • Environment: Same behavior on Debian and MS Windows (8.1) systems; JRE 1.8

    Description

      Filtering on the parser-created columnNameOfCorruptRecord column and counting afterwards does not recognize any invalid row that was put into this special column.

      Filtering on members of the actual schema works fine and yields correct counts.

      Input CSV example:

      val1, cat1, 1.337
      val2, cat1, 1.337
      val3, cat2, 42.0
      some, invalid, line
      

      Code snippet:

        StructType schema = new StructType(new StructField[] {
                new StructField("s1", DataTypes.StringType, true, Metadata.empty()),
                new StructField("s2", DataTypes.StringType, true, Metadata.empty()),
                new StructField("d1", DataTypes.DoubleType, true, Metadata.empty()),
                new StructField("FALLBACK", DataTypes.StringType, true, Metadata.empty()) });

        Dataset<Row> csv = sqlContext.read()
                .option("header", "false")
                .option("parserLib", "univocity")
                .option("mode", "PERMISSIVE")
                .option("maxCharsPerColumn", 10000000)
                .option("ignoreLeadingWhiteSpace", "false")
                .option("ignoreTrailingWhiteSpace", "false")
                .option("comment", (String) null)
                .option("columnNameOfCorruptRecord", "FALLBACK")
                .schema(schema)
                .csv("path/to/csv/file");

        long validCount = csv.filter("FALLBACK IS NULL").count();
        long invalidCount = csv.filter("FALLBACK IS NOT NULL").count();
      

      Expected:
      validCount is 3
      invalidCount is 1

      Actual:
      validCount is 4
      invalidCount is 0

      Caching the Dataset right after loading works around the problem and shows the correct counts.
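      A minimal, self-contained sketch of that caching workaround, assuming Spark 2.x with spark-sql on the classpath; the class name, method name, and path argument are illustrative and not part of the report:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class CorruptRecordCounts {

    /** Returns {validCount, invalidCount} for the given CSV file, caching before filtering. */
    public static long[] countValidInvalid(SparkSession spark, String path) {
        // Three data columns plus the corrupt-record column, as in the report.
        StructType schema = new StructType(new StructField[] {
                new StructField("s1", DataTypes.StringType, true, Metadata.empty()),
                new StructField("s2", DataTypes.StringType, true, Metadata.empty()),
                new StructField("d1", DataTypes.DoubleType, true, Metadata.empty()),
                new StructField("FALLBACK", DataTypes.StringType, true, Metadata.empty()) });

        Dataset<Row> csv = spark.read()
                .option("mode", "PERMISSIVE")
                .option("columnNameOfCorruptRecord", "FALLBACK")
                .schema(schema)
                .csv(path)
                .cache(); // materialize all columns, including FALLBACK, before filtering

        long valid = csv.filter("FALLBACK IS NULL").count();
        long invalid = csv.filter("FALLBACK IS NOT NULL").count();
        return new long[] { valid, invalid };
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("corrupt-record-counts")
                .master("local[*]")
                .getOrCreate();
        long[] counts = countValidInvalid(spark, args[0]);
        System.out.println("valid=" + counts[0] + " invalid=" + counts[1]);
        spark.stop();
    }
}
```

      Without the `.cache()` call this sketch should reproduce the reported wrong counts on Spark 2.2.0; with it, rows whose values do not match the schema (for example a non-numeric `d1`) end up in FALLBACK and are counted correctly.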

      People

        Assignee: Unassigned
        Reporter: Florian Kaspar (OL_Flogge)