Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
Description
There are several imprecise/inconsistent formulations in the specification of the Parquet Delta Encoding (https://github.com/apache/parquet-format/blob/master/Encodings.md).
- In the beginning of the Delta Encoding section, it is written that
When there are not enough values to encode a full block we pad with zeros (added to the frame of reference).
From the parquet-mr implementation of Delta Encoding (https://github.com/apache/parquet-mr/blob/dc61e510126aaa1a95a46fe39bf1529f394147e9/parquet-column/src/main/java/org/apache/parquet/column/values/delta/DeltaBinaryPackingValuesWriterForInteger.java), it seems that when the number of elements does not fill a complete miniblock, we do use padding (otherwise the data would not always end on a byte boundary), but that short blocks are not padded, i.e. we do not add empty/unspecified miniblocks to the block and do not even set the bit width to zero for the remaining miniblocks (which is not very good in my opinion). The specification should be clearer on this point.
- In the description of the header, it is written that
the block size is a multiple of 128 stored as VLQ int
According to Wikipedia, VLQ is big-endian and the corresponding little-endian encoding us ULEB128 (https://en.wikipedia.org/wiki/Variable-length_quantity, https://en.wikipedia.org/wiki/LEB128). The parquet-mr implementation uses the little-endian format. The number encoding is called VLQ in the whole Delta Encoding specification, not just here.
As the implementaion is already in use, the best would be to update the specification to match the implementation. - The next line is:
the miniblock count per block is a diviser of the block size stored as VLQ int the number of values in the miniblock is a multiple of 32.
This should be stylistically improved. Also, divisor is spelled with an ‘o’. For example:
the miniblock count per block is a divisor of the block size such that their quotient, the number of values in a miniblock, is a multiple of 32; it is stored as a ULEB128 int
- In the section describing the block:
the min delta is a VLQ int
I think it should be more precise and say that the min delta is a zigzag VLQ/ULEB int as plain VLQ and ULEB are unsigned and the zigzag version is actually used in parquet-mr.
- Later in the same section:
Having multiple blocks allows us to escape values and restart from a new base value.
The reader may think that in each block, we have a new base value according to which we compute the delta of the next element, but it is not true. The base value is the very first value in the page, which is stored in the header. What the author meant is that we have a new min delta in each block that is the frame of reference for the deltas in that block (we subtract it from the deltas to make them non-negative), but in my opinion it is not clear from this sentence.
- In the section describing the algorithm to encode the values (beginning with “To encode each delta block...“), in step 2, it says this:
Encode the first value as zigzag VLQ int
This is misleading as we do not store the first value of each block as a VLQ/ULEB int, only the very first value in the page is stored in such a way, in the header, not in each block. Generally I think the description of the algorithm could be more straightforward, I find it a little difficult to understand.
- In the examples, the block sizes are not multiples of 128, but the specification requires that. Either the examples should be replaced with valid ones or it should be noted that this is to keep the examples shorter. Also, it would be useful to include examples with multiple blocks.
- In the ‘Characteristics’ section, miniblock is written in two words, while in the rest of the specification, it is written as one.