IMPALA-6700: Improve error handling of Parquet RLE decoding

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: Backend

    Description

      In my opinion, RleBatchDecoder's error handling should be stricter and more explicit. My main problem with the current solution is readability: it is hard to tell at first glance what the consequences of the different decoding errors will be.

      There are 4 things that can go wrong during RLE decoding:
      1. a repeated or literal run is found with count 0 (currently this leads to an error in some cases and is skipped in others)
      2. a literal run exceeds the input (currently GetLiteralValues returns false in this case)
      3. more values are expected, but there are no more runs in the input (currently this has to be checked by the caller)
      4. no more values are expected, but the input still has more bytes (this is currently not checked)

      Cases 1 and 2 should put RleBatchDecoder into an error state, and there should be a function to check case 4 (e.g. STATUS CheckIfAtTheEnd() ). The handling of case 3 is OK as it is.
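
      The proposal above could look roughly like the following sketch. It is not Impala's RleBatchDecoder: the class StrictRleDecoder, the RleError enum and all member names are hypothetical, and Impala itself would report errors through its own status type. The sketch assumes Parquet's RLE / bit-packed hybrid layout (a varint run header whose lowest bit selects a repeated or a literal run) and a bit width of at most 32.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical error codes, one per failure mode discussed in the issue.
enum class RleError { kOk, kZeroLengthRun, kTruncatedInput, kBadHeader, kTrailingBytes };

// Hypothetical strict decoder; assumes bit_width <= 32.
class StrictRleDecoder {
 public:
  StrictRleDecoder(const uint8_t* data, size_t len, uint64_t bit_width)
      : pos_(data), end_(data + len), bit_width_(bit_width) {}

  // Appends up to n values to out and returns how many were appended.
  // A short count with error() == kOk means the input ran out of runs (case 3).
  size_t GetValues(size_t n, std::vector<uint32_t>* out) {
    size_t done = 0;
    while (done < n && error_ == RleError::kOk) {
      if (repeat_left_ == 0 && literal_left_ == 0 && !NextRun()) break;
      if (repeat_left_ > 0) {
        out->push_back(repeat_value_);
        --repeat_left_;
      } else {
        out->push_back(ReadLiteral());
        --literal_left_;
      }
      ++done;
    }
    return done;
  }

  // The error is sticky: once set, GetValues() decodes nothing more.
  RleError error() const { return error_; }

  // Case 4: call once all expected values were read; reports unread run bytes.
  RleError CheckIfAtTheEnd() const {
    if (error_ != RleError::kOk) return error_;
    return pos_ < end_ ? RleError::kTrailingBytes : RleError::kOk;
  }

 private:
  bool Fail(RleError e) { error_ = e; return false; }

  // Parses the next run header. Returns false on clean end of input or on error.
  bool NextRun() {
    if (pos_ == end_) return false;  // case 3: no more runs, not an error
    uint64_t header = 0;
    for (int shift = 0;; shift += 7) {  // varint, least significant group first
      if (pos_ == end_) return Fail(RleError::kTruncatedInput);
      if (shift > 28) return Fail(RleError::kBadHeader);
      const uint8_t byte = *pos_++;
      header |= static_cast<uint64_t>(byte & 0x7F) << shift;
      if ((byte & 0x80) == 0) break;
    }
    const uint64_t count = header >> 1;
    if (count == 0) return Fail(RleError::kZeroLengthRun);  // case 1
    const uint64_t remaining = static_cast<uint64_t>(end_ - pos_);
    if ((header & 1) == 0) {
      // Repeated run: count copies of one little-endian value.
      const uint64_t value_bytes = (bit_width_ + 7) / 8;
      if (remaining < value_bytes) return Fail(RleError::kTruncatedInput);
      repeat_value_ = 0;
      for (uint64_t i = 0; i < value_bytes; ++i) {
        repeat_value_ |= static_cast<uint32_t>(pos_[i]) << (8 * i);
      }
      pos_ += value_bytes;
      repeat_left_ = count;
    } else {
      // Literal run: count groups of 8 bit-packed values.
      const uint64_t data_bytes = count * bit_width_;
      if (remaining < data_bytes) return Fail(RleError::kTruncatedInput);  // case 2
      lit_ = pos_;
      bit_off_ = 0;
      pos_ += data_bytes;
      literal_left_ = count * 8;
    }
    return true;
  }

  // Reads one bit-packed value, least significant bit first.
  uint32_t ReadLiteral() {
    uint32_t value = 0;
    for (uint64_t i = 0; i < bit_width_; ++i, ++bit_off_) {
      const uint32_t bit = (lit_[bit_off_ / 8] >> (bit_off_ % 8)) & 1u;
      value |= bit << i;
    }
    return value;
  }

  const uint8_t* pos_;
  const uint8_t* end_;
  const uint8_t* lit_ = nullptr;
  uint64_t bit_width_;
  uint64_t bit_off_ = 0;
  uint64_t repeat_left_ = 0;
  uint64_t literal_left_ = 0;
  uint32_t repeat_value_ = 0;
  RleError error_ = RleError::kOk;
};
```

      With this shape, a caller reads the expected number of values, treats a short count with error() == kOk as case 3, and calls CheckIfAtTheEnd() afterwards to detect case 4. Unread padding values inside the last literal run are deliberately not reported; only unread bytes are.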

      Changing the handling of cases 1 and 4 would mean that some Parquet pages that are currently accepted would be considered corrupt.

          People

            Assignee: Unassigned
            Reporter: Csaba Ringhofer (csringhofer)