Details
Type: New Feature
Status: Reopened
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.10.0
Fix Version/s: None
Description
The current implementation of the merge command in parquet-tools doesn't merge row groups; it just places them one after the other. Add an API and a command option to merge small blocks into larger ones, up to a specified size limit.
Implementation details:
Blocks are not reordered, so that any predicate pushdown optimizations relying on the original order keep working.
Blocks are not split to fit the upper bound exactly.
This is an intentional performance optimization:
it allows new blocks to be formed by copying the full content of smaller blocks column by column rather than row by row.
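As a rough illustration of the column-wise copy, the toy model below represents a block as a mapping from column name to a list of values. This is not the real Parquet layout or any parquet-mr API; `Block` and `merge_blocks_by_column` are hypothetical names used only for this sketch. Because source blocks are never split, each column of the merged block is just the concatenation of whole source columns, in the original block order.

```python
from typing import Dict, List

# Hypothetical toy model: a block maps each column name to its values.
Block = Dict[str, List[object]]


def merge_blocks_by_column(blocks: List[Block]) -> Block:
    """Concatenate whole blocks column by column, preserving block order."""
    if not blocks:
        return {}
    columns = list(blocks[0].keys())
    merged: Block = {name: [] for name in columns}
    for block in blocks:
        # All blocks must share the same set of columns (same schema).
        if set(block.keys()) != set(columns):
            raise ValueError("blocks have different columns")
        for name in columns:
            # Whole-column append; no per-row reassembly is needed.
            merged[name].extend(block[name])
    return merged


small_blocks = [
    {"id": [1, 2], "name": ["a", "b"]},
    {"id": [3], "name": ["c"]},
]
merged_block = merge_blocks_by_column(small_blocks)
```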
Examples:
- Input files with block sizes:
[128 | 35], [128 | 40], [120]
Expected output file block sizes:
merge
[128 | 35 | 128 | 40 | 120]
merge -b
[128 | 35 | 128 | 40 | 120]
merge -b -l 256
[163 | 168 | 120]
- Input files with block sizes:
[128 | 35], [40], [120], [6]
Expected output file block sizes:
merge
[128 | 35 | 40 | 120 | 6]
merge -b
[128 | 75 | 126]
merge -b -l 256
[203 | 126]
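The expected sizes above are consistent with a simple greedy pass over the blocks in their original order. The sketch below is one possible reading, not the actual parquet-tools implementation: `merge_block_sizes` is a hypothetical helper, and the default limit of 128 is an assumption inferred from the `merge -b` results rather than something stated in this issue.

```python
from typing import Iterable, List

# Assumption: the default limit of 128 is inferred from the "merge -b" examples.
DEFAULT_LIMIT = 128


def merge_block_sizes(files: Iterable[Iterable[int]],
                      limit: int = DEFAULT_LIMIT) -> List[int]:
    """Greedily merge consecutive blocks while the running total fits the limit.

    Blocks are never reordered or split, so a block larger than the limit
    is kept as a block of its own.
    """
    merged: List[int] = []
    current = None  # size of the output block being built, if any
    for blocks in files:
        for size in blocks:
            if current is not None and current + size <= limit:
                current += size  # the block still fits: append it whole
            else:
                if current is not None:
                    merged.append(current)  # close the current output block
                current = size  # start a new output block
    if current is not None:
        merged.append(current)
    return merged


first_input = [[128, 35], [128, 40], [120]]
second_input = [[128, 35], [40], [120], [6]]

print(merge_block_sizes(first_input))        # [128, 35, 128, 40, 120]
print(merge_block_sizes(first_input, 256))   # [163, 168, 120]
print(merge_block_sizes(second_input))       # [128, 75, 126]
print(merge_block_sizes(second_input, 256))  # [203, 126]
```

Under these assumptions, blocks from different input files can end up in the same output block, as with 35 and 40 in the second example.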