Description
RleBatchDecoder's error handling should be stricter and more explicit in my opinion. My main problem with the current solution is readability - it is hard to tell at the first glance what will be the consequences of different decoding errors.
There are 4 things that can go wrong during RLE decoding:
1. repeated / literal run found with count 0 (now this leads to error in some cases, while is skipped in some others)
2. literal run that exceeds the input (currently GetLiteralValues returns false in this case)
3. more values are expected, but there are no more runs in the input (currently this has to be checked by the caller)
4. there shouldn't be any more value, but the input still has more bytes (this is currently not checked)
1 and 2 should set RleBatchDecoder to an error state, while there should be a function to check 4 (e.g. STATUS CheckIfAtTheEnd() ). Handling of 3 is ok as it is.
Changing 1 and 4 would mean that some Parquet pages that are currently accepted would be considered corrupt.
- 1 is a valid change in my opinion, as parquet-mr also doesn't handle 0 counts - 0 sized literal runs wil lead to exception, while 0 sizes repeated runs will lead to incorrect results, see https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java
- 4 could be useful to detect errors in encoders / corrupted data - nearly any random buffer can be decoded as RLE, but it is unlikely that the number of encoded values will match the number of expected values