Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version: 1.12.0
- Fix Version: None
- Component: None
Description
We recently moved from Parquet 1.8.x to 1.12.
The dataset has more than 20k null values to be written to a complex type column. With 1.8.x this produced a single page, but with 1.12 it produces 20 pages (PARQUET-1414). Writing nulls to complex types was optimised to be cached (a "null cache") that is flushed on the next non-null value or on an explicit flush/close. With 1.8, the explicit close flushed the null cache and wrote the page. With 1.12, after a fixed number of values the page is ended prematurely, while the values are still sitting in the null cache.
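The premature page-ending can be illustrated with a toy model of the null cache. This is not parquet-mr code; `PAGE_LIMIT`, the function name, and the exact trigger are illustrative assumptions, but the shape matches the dumps below: when a page boundary fires while every buffered value is still in the null cache, the page goes out empty (VC:0), and close finally dumps the whole cache into the last page.

```python
PAGE_LIMIT = 20_000  # hypothetical per-page value-count trigger

def write_column(num_nulls, end_page_on_limit):
    """Simulate writing `num_nulls` nulls to one column.

    Returns the value count recorded for each emitted page.
    """
    pages = []
    null_cache = 0        # nulls are buffered here, not written immediately
    since_page_start = 0
    for _ in range(num_nulls):
        null_cache += 1
        since_page_start += 1
        if end_page_on_limit and since_page_start >= PAGE_LIMIT:
            # Page boundary fires while all values still sit in the
            # null cache, so the page body is emitted empty (VC:0).
            pages.append(0)
            since_page_start = 0
    # Explicit close finally flushes the null cache into the last page.
    pages.append(null_cache)
    return pages
```

With `end_page_on_limit=False` (1.8-like behaviour), 111396 nulls yield a single page of 111396 values; with `end_page_on_limit=True` (1.12-like), the same input yields several empty pages followed by one page carrying all 111396 values, mirroring the VC:0 / VC:111396 split in the metadata dump.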
Below is the metadata dump in both cases.
1.8 :
index._id TV=111396 RL=0 DL=2
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[num_nulls: 111396, min/max not defined] SZ:8 VC:111396
1.12 :
index._index TV=111396 RL=0 DL=2
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:4 VC:0
......
page 19: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:8 VC:111396
All the pages in 1.12 except the last have the same metadata. The issue surfaces when the Parquet reader kicks in: it sees a bit-packed run and tries to read 8 bytes, which goes past the end of the stream because the page size is only 4 (reading past the RLE/BitPacking stream).
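The read failure can be reproduced in miniature. Below is a simplified decoder for one bit-packed run of the Parquet RLE/bit-packed hybrid encoding (run-header varint with the low bit set, then `groups * bit_width` bytes of packed data); the function name and signature are illustrative, not the real reader's API. A run header that promises 8 bytes of packed data inside a 4-byte page underflows exactly as described:

```python
def read_bitpacked_run(buf, pos, header, bit_width):
    """Slice one bit-packed run out of an RLE/bit-packed hybrid stream.

    `header` is the run-header varint already consumed from the stream;
    its low bit set means bit-packed, and the remaining bits give the
    number of 8-value groups in the run (per the Parquet encoding spec).
    """
    assert header & 1, "not a bit-packed run"
    groups = header >> 1             # number of 8-value groups
    nbytes = groups * bit_width      # bytes the run claims to occupy
    if pos + nbytes > len(buf):
        # This is the failure mode from the report: the header promises
        # more packed bytes than the (4-byte) page actually contains.
        raise EOFError(f"run needs {nbytes} bytes at offset {pos}, "
                       f"stream has only {len(buf) - pos}")
    return buf[pos:pos + nbytes], pos + nbytes
```

Calling this on a 4-byte buffer with a header claiming eight 8-value groups at bit width 1 raises EOFError, i.e. "reading past the RLE/BitPacking stream".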