Spark / SPARK-28079

CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.3.2, 2.4.3
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      When reading a CSV file with mode = "PERMISSIVE", corrupt records are read in without being flagged as corrupt. The only way to get them flagged is to set "columnNameOfCorruptRecord" and also specify the schema manually, including that column. Example:

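      import java.io.File
      import java.nio.charset.StandardCharsets
      import org.apache.commons.io.FileUtils

      // The second data row has a 4th value that is not declared in the header/schema
      val csvText = """
                       | FieldA, FieldB, FieldC
                       | a1,b1,c1
                       | a2,b2,c2,d*""".stripMargin
      
      // Write the sample text to a temporary file using an explicit charset
      val csvFile = new File("/tmp/file.csv")
      FileUtils.write(csvFile, csvText, StandardCharsets.UTF_8)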
      
      val reader = sqlContext.read
        .format("csv")
        .option("header", "true")
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", "corrupt")
        .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING")
      
      reader.load(csvFile.getAbsolutePath).show(truncate = false)
      

      This produces the correct result:

      +------------+------+------+------+
      |corrupt     |fieldA|fieldB|fieldC|
      +------------+------+------+------+
      |null        | a1   |b1    |c1    |
      | a2,b2,c2,d*| a2   |b2    |c2    |
      +------------+------+------+------+
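
      Once the corrupt column is populated as in the table above, the flagged rows can be separated from the clean ones. The following is only a minimal sketch: the object and method names (`CorruptRows`, `split`) are hypothetical, and the `cache()` call is an assumption based on Spark 2.3+ rejecting queries on raw CSV files that reference only the corrupt-record column.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object CorruptRows {
  // Hypothetical helper: split a PERMISSIVE-mode read into (clean, corrupt) rows.
  def split(df: DataFrame, corruptCol: String = "corrupt"): (DataFrame, DataFrame) = {
    // Cache the parsed rows first, so later queries that touch only the
    // corrupt-record column do not run directly against the raw CSV file.
    val cached = df.cache()
    // Rows that parsed cleanly have a null corrupt-record column.
    val clean = cached.filter(col(corruptCol).isNull).drop(corruptCol)
    // Rows that failed to parse keep the raw line in the corrupt-record column.
    val corrupt = cached.filter(col(corruptCol).isNotNull)
    (clean, corrupt)
  }
}
```

      For the reader defined above, this could be called as `CorruptRows.split(reader.load(csvFile.getAbsolutePath))`.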
      

      However, removing the explicit schema() call and running:

      val reader = sqlContext.read
        .format("csv")
        .option("header", "true")
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", "corrupt")
      
      reader.load(csvFile.getAbsolutePath).show(truncate = false)
      

      Yields:

      +-------+-------+-------+
      | FieldA| FieldB| FieldC|
      +-------+-------+-------+
      | a1    |b1     |c1     |
      | a2    |b2     |c2     |
      +-------+-------+-------+
      

      The fourth value "d*" in the second data row has been silently dropped, and the row has not been marked as corrupt.
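
      Based on the behaviour described above, one possible workaround is to read the file once to obtain the column names and then re-read it with the corrupt-record column appended to that schema. This is a sketch under assumptions, not a confirmed fix: the names `CorruptRecordWorkaround`, `withCorruptColumn` and `readFlaggingCorrupt` are hypothetical, a `SparkSession` is assumed to be available, and the file is read twice.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructType}

object CorruptRecordWorkaround {
  // Hypothetical helper: append a nullable string column for corrupt records,
  // unless the schema already contains a field with that name.
  def withCorruptColumn(schema: StructType, name: String): StructType =
    if (schema.fieldNames.contains(name)) schema
    else schema.add(name, StringType, nullable = true)

  // Hypothetical helper: the first pass only collects the column names from
  // the header; the second pass declares the corrupt-record column explicitly.
  def readFlaggingCorrupt(
      spark: SparkSession,
      path: String,
      corruptCol: String = "corrupt"): DataFrame = {
    val headerSchema = spark.read
      .option("header", "true")
      .csv(path)
      .schema

    spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", corruptCol)
      .schema(withCorruptColumn(headerSchema, corruptCol))
      .csv(path)
  }
}
```

      Here the corrupt column is appended at the end of the schema rather than placed first as in the example above; whether its position matters has not been verified.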

       

      People

        Assignee: Unassigned
        Reporter: F Jimenez (jimenefe)