  1. Spark
  2. SPARK-34422

CSV(/JSON?) files with corrupt row + Permissive mode can yield wrong partial result row

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.4.7, 3.0.1, 3.1.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      (This was actually found and fixed in spark-xml, which copied some Spark code for handling bad records. See https://github.com/databricks/spark-xml/issues/517)

      When CSV parsing (and, I think, JSON parsing) encounters a bad record in Permissive mode, it can return a partial result containing the values that were parsed successfully, along with the problem input in a new 'corrupt record' column.

      However, the logic in FailureSafeParser that copies the partial results into the resulting Row has an off-by-one error that arises when the catalyst projection puts the 'corrupt record' column anywhere but last, which can readily happen. As a result, the partial results may be wrong, or, if the misplaced elements don't happen to match the schema of the result, processing the bad record in permissive mode may fail entirely.
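
      To make the failure mode concrete, here is a minimal, self-contained sketch in plain Python. It is not the FailureSafeParser code; the function names (data_fields, assemble_positional, assemble_by_name) and the list-based row model are hypothetical, chosen only to show how copying partial values by position goes wrong once the 'corrupt record' column is not last, and how looking up each field's position by name avoids it.

```python
# Illustrative model only: rows are plain lists and a schema is a list of
# column names. None of these helpers exist in Spark or spark-xml.

def data_fields(schema, corrupt_col):
    """Schema columns that hold parsed data (everything but the corrupt column)."""
    return [name for name in schema if name != corrupt_col]


def assemble_positional(schema, corrupt_col, partial, raw_record):
    """Copy partial value i into result slot i.

    Only correct when the corrupt-record column is the last column; otherwise
    every value at or after that column's index lands one slot too early.
    """
    row = [None] * len(schema)
    for i in range(len(data_fields(schema, corrupt_col))):
        row[i] = partial[i]
    row[schema.index(corrupt_col)] = raw_record
    return row


def assemble_by_name(schema, corrupt_col, partial, raw_record):
    """Copy each partial value into the slot of the column it belongs to."""
    row = [None] * len(schema)
    for i, name in enumerate(data_fields(schema, corrupt_col)):
        row[schema.index(name)] = partial[i]
    row[schema.index(corrupt_col)] = raw_record
    return row


if __name__ == "__main__":
    schema = ["id", "_corrupt_record", "price"]  # corrupt column in the middle
    partial = [1, 9.5]                           # values parsed before the failure
    raw = "1,9.5,extra"                          # the offending input line

    print(assemble_positional(schema, "_corrupt_record", partial, raw))
    print(assemble_by_name(schema, "_corrupt_record", partial, raw))
```

      In this model the positional version silently drops the price value (it is written into the corrupt-record slot and then overwritten). In a typed row, a value landing in a slot of a different type is presumably how the "fails entirely" case would arise.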

      The partial results are usually not that useful, so their being wrong isn't a huge deal, but failing entirely in permissive mode is a problem.

          People

            Assignee: Sean R. Owen (srowen)
            Reporter: Sean R. Owen (srowen)