IMPALA-8431

Parquet STRING column memory reservation seems underestimated

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: Impala 3.2.0
    • Fix Version/s: None
    • Component/s: Frontend

      Description

      https://github.com/apache/impala/blob/5fa076e95cfbfcc044dc14cbb20af825936af82a/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L1698

      computeMinScalarColumnMemReservation() uses the avg_size column statistic to estimate the memory needed per value during scanning. However, avg_size does not include the 4-byte length field that plain encoding stores with each value, and this overhead can dominate columns with very short strings. (Compression can probably negate this effect.)
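To illustrate the size of the gap, here is a minimal Java sketch. The class and method names are hypothetical and are not taken from HdfsScanNode; the only facts assumed are the ones stated above, namely that avg_size excludes the 4-byte length field that plain encoding writes per value.

```java
// Hypothetical helper, not part of Impala: compares the per-value estimate
// with and without the plain-encoding length field.
final class PlainStringEstimate {
  // Plain encoding stores a 4-byte length before each string value.
  static final int LENGTH_FIELD_BYTES = 4;

  // Estimate based on avg_size alone, as described in this issue.
  static double bytesPerValueFromStats(double avgSize) {
    return avgSize;
  }

  // Estimate that also counts the per-value length field.
  static double bytesPerValuePlain(double avgSize) {
    return avgSize + LENGTH_FIELD_BYTES;
  }

  public static void main(String[] args) {
    double avgSize = 2.0; // e.g. a column of two-character codes
    double ratio = bytesPerValuePlain(avgSize) / bytesPerValueFromStats(avgSize);
    System.out.println("plain-encoded size / avg_size estimate = " + ratio);
  }
}
```

With avg_size = 2, the uncompressed plain-encoded data is three times the size that avg_size alone suggests; the ratio approaches 1 as strings get longer.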

      For the dictionary decoding estimate:

      • the same 4 bytes per distinct value (NDV) should also be added, because the dictionary itself is plain encoded
      • the backend uses an additional 12 bytes per distinct value for the StringValues that serve as indirection into the dictionary, but I am not sure whether this should be added to the reservation
      • a more pessimistic estimate would use max_size instead of avg_size for dictionary entries: it is possible that most distinct values are long while the short ones are much more frequent, which keeps avg_size small
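The three options above can be compared with a small Java sketch. Everything here is an assumption for illustration: the class, the method names, and the sample statistics are hypothetical, and sizes are rounded to whole bytes (avg_size is generally fractional). Only the 4-byte and 12-byte constants come from the description.

```java
// Hypothetical helper, not part of Impala: dictionary size estimates for a
// STRING column under the assumptions listed in this issue.
final class DictStringEstimate {
  static final long LENGTH_FIELD_BYTES = 4;  // length field of a plain-encoded entry
  static final long STRING_VALUE_BYTES = 12; // per-entry indirection cited for the backend

  // Plain-encoded dictionary page: length field plus payload per distinct value.
  static long dictPageBytes(long ndv, long entrySize) {
    return ndv * (entrySize + LENGTH_FIELD_BYTES);
  }

  // Dictionary page plus the in-memory StringValue indirection.
  static long decodedDictBytes(long ndv, long entrySize) {
    return dictPageBytes(ndv, entrySize) + ndv * STRING_VALUE_BYTES;
  }

  public static void main(String[] args) {
    long ndv = 1000, avgSize = 3, maxSize = 100; // made-up column stats
    System.out.println("avg_size only:        " + ndv * avgSize);
    System.out.println("with length fields:   " + dictPageBytes(ndv, avgSize));
    System.out.println("with StringValues:    " + decodedDictBytes(ndv, avgSize));
    System.out.println("pessimistic max_size: " + decodedDictBytes(ndv, maxSize));
  }
}
```

For these made-up statistics the estimate grows from 3000 bytes (avg_size only) to 7000 bytes with length fields, 19000 bytes with the StringValue indirection, and 116000 bytes when max_size replaces avg_size.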

      Another small underestimate is that NULL values are ignored. NULLs (i.e. definition levels) could be accounted for as 1 bit/value.
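As a rough sketch of the 1 bit/value suggestion, the extra bytes could be computed as below. The class and method are hypothetical, and the figure deliberately ignores how definition levels are actually encoded on disk; it only turns the proposed 1 bit per value into whole bytes.

```java
// Hypothetical helper, not part of Impala: definition-level overhead at
// 1 bit per value, rounded up to whole bytes.
final class DefLevelEstimate {
  static long defLevelBytes(long numValues) {
    return (numValues + 7) / 8; // integer ceiling of numValues / 8
  }

  public static void main(String[] args) {
    System.out.println(defLevelBytes(1000)); // bytes for 1000 nullable values
  }
}
```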

            People

            • Assignee: Unassigned
            • Reporter: Csaba Ringhofer (csringhofer)