  Spark / SPARK-11153

Turns off Parquet filter push-down for string and binary columns

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.5.0, 1.5.1
    • Fix Version/s: 1.5.2, 1.6.0
    • Component/s: SQL
    • Labels:
      None
    • Target Version/s:

      Description

      Due to PARQUET-251, BINARY columns in existing Parquet files may have been written with corrupted statistics. These statistics are used by the filter push-down optimization. Since Spark 1.5 enables Parquet filter push-down by default, we may end up with wrong query results. PARQUET-251 has been fixed in parquet-mr 1.8.1, but Spark 1.5 still depends on parquet-mr 1.7.0.

      Note that corrupted Parquet files of this kind can be produced by any Parquet data model, not just Spark SQL's.
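      The failure mode can be sketched in a few lines. This is an illustrative model, not parquet-mr's actual code: the reader skips any row group whose recorded [min, max] statistics cannot contain the predicate value, and PARQUET-251 (roughly) made parquet-mr 1.7 compute binary min/max using signed byte comparison instead of the unsigned order the format requires, so the recorded range can exclude values that are actually present:

      ```python
      # Illustrative sketch (NOT parquet-mr's actual code) of why corrupted
      # min/max statistics break filter push-down: the reader skips any row
      # group whose [min, max] range cannot contain the predicate value.

      def signed_key(value: bytes):
          # PARQUET-251 (roughly): binary min/max computed by comparing
          # bytes as *signed* integers instead of unsigned ones.
          return [b - 256 if b >= 0x80 else b for b in value]

      def stats(rows, key=None):
          """Compute (min, max) statistics for one row group."""
          return min(rows, key=key), max(rows, key=key)

      def might_contain(lo, hi, value):
          """Push-down keeps a row group only if lo <= value <= hi."""
          return lo <= value <= hi

      rows = [b"a", "\u00e9".encode("utf-8"), b"z"]   # b"\xc3\xa9" is 'é'
      needle = "\u00e9".encode("utf-8")

      good_lo, good_hi = stats(rows)                  # correct unsigned order
      bad_lo, bad_hi = stats(rows, key=signed_key)    # corrupted signed order

      print(might_contain(good_lo, good_hi, needle))  # True: group is scanned
      print(might_contain(bad_lo, bad_hi, needle))    # False: wrongly skipped
      ```

      With correct statistics the row group containing the matching value is scanned; with the corrupted ones it is silently skipped, which is exactly a wrong query result.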

      This affects all Spark SQL data types that can be mapped to Parquet BINARY, namely:

      • StringType
      • BinaryType
      • DecimalType (though Spark SQL doesn't currently support pushing down filters over DecimalType columns)

      To avoid wrong query results, we should disable filter push-down for columns of StringType and BinaryType until we upgrade to parquet-mr 1.8.1.
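      Until that upgrade lands, affected users can also disable Parquet filter push-down entirely as a workaround. A minimal sketch for PySpark 1.5 follows; the `spark.sql.parquet.filterPushdown` conf key is the real setting, while the app name, path, and column name are placeholders:

      ```python
      # Workaround sketch for Spark 1.5 (PySpark): disable Parquet filter
      # push-down globally so corrupted statistics can never skip a row group.
      from pyspark import SparkContext
      from pyspark.sql import SQLContext

      sc = SparkContext(appName="pushdown-workaround")   # placeholder app name
      sqlContext = SQLContext(sc)

      # 'spark.sql.parquet.filterPushdown' defaults to true since Spark 1.5
      sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

      # String predicates are now evaluated by Spark itself instead of being
      # pushed into the Parquet reader.
      df = sqlContext.read.parquet("/path/to/data")      # placeholder path
      df.filter(df["name"] == "foo").show()
      ```

      The same flag can be passed as `--conf spark.sql.parquet.filterPushdown=false` on the command line; the trade-off is that all Parquet row groups are scanned, so queries that previously benefited from push-down will read more data.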

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Cheng Lian
              • Reporter:
                Cheng Lian
              • Votes:
                0
              • Watchers:
                10

                Dates

                • Created:
                  Updated:
                  Resolved: