Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Description
PARQUET-306 added the ability to pad row groups so that they align with HDFS blocks to avoid remote reads. The ParquetFileWriter will now either pad the remaining space in the block or target a row group for the remaining size.
The padding maximum is the threshold that decides between these two behaviors. If the space left in the block is under this threshold, it is padded. If it is greater than this threshold, the next row group is sized to fit into the remaining space. The current padding maximum is 0.
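The decision described above can be sketched as follows. This is only an illustration, not the ParquetFileWriter code: the class, enum, and method names are hypothetical, and treating a remainder exactly equal to the threshold as padding is an assumption, since the description only covers "under" and "greater than".

```java
// Hypothetical sketch of the pad-or-fit decision; none of these names come from Parquet.
class PaddingSketch {
    static final long MB = 1024L * 1024L;

    /** What the writer does with the space left in the current block. */
    enum Action { PAD, FIT_ROW_GROUP }

    /**
     * Pad when the remaining space is at or under the padding maximum,
     * otherwise size the next row group to fit the remaining space.
     * Assumption: the boundary case (remaining == maxPadding) pads.
     */
    static Action decide(long remainingInBlock, long maxPadding) {
        return remainingInBlock <= maxPadding ? Action.PAD : Action.FIT_ROW_GROUP;
    }

    /** Target size of the next row group under the same rule. */
    static long nextRowGroupTarget(long remainingInBlock, long maxPadding, long rowGroupSize) {
        if (decide(remainingInBlock, maxPadding) == Action.PAD) {
            // The rest of the block is padded, so the next row group
            // starts on a fresh block and can use the full configured size.
            return rowGroupSize;
        }
        // Never target more than the configured row group size.
        return Math.min(remainingInBlock, rowGroupSize);
    }

    public static void main(String[] args) {
        long rowGroupSize = 128 * MB;
        long remaining = 4 * MB;
        // With a padding maximum of 0, a 4MB remainder becomes a 4MB row group.
        System.out.println(decide(remaining, 0) + " " + nextRowGroupTarget(remaining, 0, rowGroupSize) / MB);
        // With a padding maximum of 8MB, the same remainder is padded instead.
        System.out.println(decide(remaining, 8 * MB) + " " + nextRowGroupTarget(remaining, 8 * MB, rowGroupSize) / MB);
    }
}
```

With a padding maximum of 0, any non-empty remainder becomes a row group target, however small, which is the behavior this issue proposes to change.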
I think we should change the padding maximum to 8MB. My reasoning is this: we want this number to be small enough that it won't prevent the library from writing reasonable row groups, but larger than the smallest row group we would want to write. 8MB is 1/16th of the default row group size (128MB), so I think it is reasonable: we don't want a row group to be smaller than 8MB.
We also want this to be large enough that a few slightly under-size row groups in a block don't cause a tiny row group to be written in the leftover space. 8MB covers 4 row groups that are each 2MB under-size.
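The arithmetic behind the 8MB figure can be checked with a small sketch. The 512MB block size is an assumption chosen so that four default-size row groups fit in one block; the description does not state a block size, and the class and method names are hypothetical.

```java
// Hypothetical arithmetic check; the 512MB block size is an assumed example value.
class PaddingArithmetic {
    static final long MB = 1024L * 1024L;

    /** Space left in a block after `count` row groups that each fall `shortfall` bytes short. */
    static long leftover(long blockSize, long rowGroupSize, int count, long shortfall) {
        return blockSize - count * (rowGroupSize - shortfall);
    }

    public static void main(String[] args) {
        long rowGroupSize = 128 * MB; // default row group size
        long blockSize = 512 * MB;    // assumed block size: room for 4 row groups
        long maxPadding = rowGroupSize / 16;

        long left = leftover(blockSize, rowGroupSize, 4, 2 * MB);

        System.out.println("padding maximum (MB): " + maxPadding / MB);
        System.out.println("leftover (MB): " + left / MB);
        // The leftover does not exceed the padding maximum, so it would be
        // padded rather than filled with a tiny row group.
        System.out.println("padded: " + (left <= maxPadding));
    }
}
```

Under these assumptions the four row groups occupy 504MB and leave 8MB, which matches the proposed padding maximum.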