SPARK-34042

Column pruning is not working as expected for PERMISSIVE mode


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.7
    • Fix Version/s: None
    • Component/s: Java API
    • Labels: None

    Description

      In PERMISSIVE mode:

      Given a CSV with multiple columns per row: if your file schema has a single column and you run a SQL SELECT with a condition like '<corrupt_record_column_name> IS NULL', the row is marked as corrupted.
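
      A minimal sketch of this first scenario (my reconstruction, not the exact linked test; the file path, the id column, and the sample row "1,John" are illustrative, and _corrupt_record is Spark's default corrupt-record column name):

      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;
      import org.apache.spark.sql.SparkSession;
      import org.apache.spark.sql.types.DataTypes;
      import org.apache.spark.sql.types.StructType;

      public class PermissiveSingleColumn {
          public static void main(String[] args) {
              SparkSession spark = SparkSession.builder()
                      .master("local[*]")
                      .appName("permissive-pruning")
                      .getOrCreate();

              // File schema: a single data column plus the corrupt-record column.
              StructType schema = new StructType()
                      .add("id", DataTypes.StringType)
                      .add("_corrupt_record", DataTypes.StringType);

              Dataset<Row> rows = spark.read()
                      .schema(schema)
                      .option("mode", "PERMISSIVE")
                      .option("columnNameOfCorruptRecord", "_corrupt_record")
                      .csv("/tmp/people.csv"); // each row has two values, e.g. "1,John"

              rows.createOrReplaceTempView("people");

              // Filtering on the corrupt-record column makes the parser validate the
              // whole row; the two-token row fails against the one-column schema,
              // _corrupt_record is populated, and the row is filtered out here.
              spark.sql("SELECT id FROM people WHERE _corrupt_record IS NULL").show();
          }
      }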

       

      BUT if you add an extra column to the file schema and do not include that column in the SQL SELECT, the row is not marked as corrupted.
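
      Continuing the sketch above (same SparkSession, imports, and file), this second scenario declares a hypothetical extra "name" column in the file schema but leaves it out of the SELECT list; the same row then comes back with _corrupt_record = null:

      // Same file, but the schema now also declares the second column.
      StructType widerSchema = new StructType()
              .add("id", DataTypes.StringType)
              .add("name", DataTypes.StringType)
              .add("_corrupt_record", DataTypes.StringType);

      Dataset<Row> rows2 = spark.read()
              .schema(widerSchema)
              .option("mode", "PERMISSIVE")
              .option("columnNameOfCorruptRecord", "_corrupt_record")
              .csv("/tmp/people.csv");

      rows2.createOrReplaceTempView("people2");

      // "name" is never selected, so column pruning can skip it, and the same
      // row is no longer reported as corrupted.
      spark.sql("SELECT id FROM people2 WHERE _corrupt_record IS NULL").show();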

       

      PS: I don't know exactly what the right behavior is; I could not find it documented for PERMISSIVE mode.

      What I found in the migration guide is: As an example, CSV file contains the "id,name" header and one row "1234". In Spark 2.4, the selection of the id column consists of a row with one column value 1234, but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous behavior, set spark.sql.csv.parser.columnPruning.enabled to false.

       

      https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html
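
      If the pre-2.4 behavior is what you need, the workaround named in that guide is to turn CSV column pruning off; a sketch (the config key is taken verbatim from the quote above):

      // Disable CSV column pruning for the current session at runtime...
      spark.conf().set("spark.sql.csv.parser.columnPruning.enabled", "false");

      // ...or set it when the session is first built.
      SparkSession sparkNoPruning = SparkSession.builder()
              .master("local[*]")
              .config("spark.sql.csv.parser.columnPruning.enabled", "false")
              .getOrCreate();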

       

      I made a "unit" test to demonstrate the issue: https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java

       

       

          People

            Assignee: Unassigned
            Reporter: Marius Butan