Spark / SPARK-22580

Count after filtering uncached CSV for isnull(columnNameOfCorruptRecord) always 0


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels:
      None
    • Environment:

      Same behavior on Debian and MS Windows (8.1) system. JRE 1.8

      Description

      Filtering on the parser-created columnNameOfCorruptRecord column and counting afterwards does not recognize any invalid row that was put into this special column.

      Filtering on members of the actual schema works fine and yields correct counts.

      Input CSV example:

      val1, cat1, 1.337
      val2, cat1, 1.337
      val3, cat2, 42.0
      some, invalid, line
      

      Code snippet:

        StructType schema = new StructType(new StructField[] {
                new StructField("s1", DataTypes.StringType, true, Metadata.empty()),
                new StructField("s2", DataTypes.StringType, true, Metadata.empty()),
                new StructField("d1", DataTypes.DoubleType, true, Metadata.empty()),
                new StructField("FALLBACK", DataTypes.StringType, true, Metadata.empty())});
        Dataset<Row> csv = sqlContext.read()
                .option("header", "false")
                .option("parserLib", "univocity")
                .option("mode", "PERMISSIVE")
                .option("maxCharsPerColumn", 10000000)
                .option("ignoreLeadingWhiteSpace", "false")
                .option("ignoreTrailingWhiteSpace", "false")
                .option("comment", null)
                .option("columnNameOfCorruptRecord", "FALLBACK")
                .schema(schema)
                .csv("path/to/csv/file");
        long validCount = csv.filter("FALLBACK IS NULL").count();
        long invalidCount = csv.filter("FALLBACK IS NOT NULL").count();
      

      Expected:
      validCount is 3
      invalidCount is 1

      Actual:
      validCount is 4
      invalidCount is 0

      Caching the CSV after loading works around the problem and shows the correct counts (likely because, without caching, only the corrupt-record column is required by the query, so the parser never materializes the fields whose parse failures would populate it).
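
      The workaround can be sketched as a minimal, self-contained example. It assumes a local SparkSession and writes a temporary copy of the sample CSV above; the class name, file handling, and trimmed-down option list are illustrative, not part of the original report:

      ```java
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.util.Arrays;

      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;
      import org.apache.spark.sql.SparkSession;
      import org.apache.spark.sql.types.DataTypes;
      import org.apache.spark.sql.types.Metadata;
      import org.apache.spark.sql.types.StructField;
      import org.apache.spark.sql.types.StructType;

      public class CorruptRecordWorkaround {

          /** Reads the CSV permissively and counts valid/invalid rows after caching. */
          static long[] countValidInvalid(SparkSession spark, String path) {
              StructType schema = new StructType(new StructField[] {
                      new StructField("s1", DataTypes.StringType, true, Metadata.empty()),
                      new StructField("s2", DataTypes.StringType, true, Metadata.empty()),
                      new StructField("d1", DataTypes.DoubleType, true, Metadata.empty()),
                      new StructField("FALLBACK", DataTypes.StringType, true, Metadata.empty())});

              Dataset<Row> csv = spark.read()
                      .option("header", "false")
                      .option("mode", "PERMISSIVE")
                      .option("columnNameOfCorruptRecord", "FALLBACK")
                      .schema(schema)
                      .csv(path)
                      .cache();   // workaround: materialize all columns before filtering on FALLBACK

              long valid = csv.filter("FALLBACK IS NULL").count();
              long invalid = csv.filter("FALLBACK IS NOT NULL").count();
              return new long[] { valid, invalid };
          }

          public static void main(String[] args) throws Exception {
              // Recreate the sample input from the report in a temp file.
              Path tmp = Files.createTempFile("corrupt-record", ".csv");
              Files.write(tmp, Arrays.asList(
                      "val1,cat1,1.337",
                      "val2,cat1,1.337",
                      "val3,cat2,42.0",
                      "some,invalid,line"));   // third field is not a double -> corrupt record

              SparkSession spark = SparkSession.builder()
                      .master("local[1]")
                      .appName("corrupt-record-workaround")
                      .getOrCreate();
              long[] counts = countValidInvalid(spark, tmp.toString());
              System.out.println(counts[0] + " valid, " + counts[1] + " invalid");
              spark.stop();
          }
      }
      ```

      With the sample file above this prints "3 valid, 1 invalid", matching the expected counts from the report; without the `.cache()` call the reported behavior (4/0) appears instead.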


              People

              • Assignee:
                Unassigned
              • Reporter:
                OL_Flogge Florian Kaspar
              • Votes:
                0
              • Watchers:
                2
