IMPALA-11414

Off-by-one error in Parquet late materialization

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • Impala 4.2.0, Impala 4.1.1
    • Backend

    Description

      PARQUET_LATE_MATERIALIZATION sets the minimum number of consecutive rows that must be filtered out before we skip materializing the corresponding rows in the other Parquet columns.

      E.g. if PARQUET_LATE_MATERIALIZATION is 10 and we find at least 10 consecutive rows in a filtered column that don't pass the predicates, we avoid materializing the corresponding rows in the other columns.

      However, due to an off-by-one error, only (PARQUET_LATE_MATERIALIZATION - 1) consecutive filtered-out rows are actually needed. So if PARQUET_LATE_MATERIALIZATION is set to 1, zero consecutive filtered-out rows suffice, which leads to a crash/DCHECK. The bug is in the GetMicroBatches() algorithm, which produces the micro batches based on the selected rows.

      Setting PARQUET_LATE_MATERIALIZATION to 0 doesn't make sense, so it shouldn't be allowed.
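
The intended threshold semantics and the off-by-one can be illustrated with a small Python sketch. This is not Impala's GetMicroBatches() code; the function `micro_batches`, its `buggy` flag, and the boolean-list input are hypothetical names chosen for illustration. It assumes a micro batch is an inclusive range of row indexes, and that a run of filtered-out rows splits two batches only when it is at least `threshold` rows long.

```python
from typing import List, Tuple


def micro_batches(selected: List[bool], threshold: int,
                  buggy: bool = False) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) row ranges to materialize.

    selected[i] is True when row i passed the predicates.
    A run of filtered-out rows separates two batches only if it is at
    least `threshold` rows long; shorter runs are absorbed into the batch.
    With buggy=True the comparison is off by one (an assumption made to
    mimic the reported behaviour): a run of threshold - 1 rows splits.
    """
    if threshold < 1:
        # Mirrors the report's point that 0 is not a meaningful setting.
        raise ValueError("threshold must be at least 1")
    min_gap = threshold - 1 if buggy else threshold
    batches: List[Tuple[int, int]] = []
    start = None  # first selected row of the currently open batch
    last = None   # most recent selected row seen so far
    for i, keep in enumerate(selected):
        if not keep:
            continue
        if start is None:
            start = i
        elif i - last - 1 >= min_gap:
            # The gap of filtered-out rows is long enough: close the batch.
            batches.append((start, last))
            start = i
        last = i
    if start is not None:
        batches.append((start, last))
    return batches


rows = [True, False, False, True]
print(micro_batches(rows, 3))              # gap of 2 < 3: one batch
print(micro_batches(rows, 3, buggy=True))  # gap of 2 >= 3 - 1: split
```

In this sketch, a threshold of 1 with `buggy=True` makes a gap of zero rows count as a split, so even adjacent selected rows end up in separate batches. The sketch only shows that degenerate condition; it does not reproduce the crash/DCHECK itself.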

          People

            boroknagyz Zoltán Borók-Nagy