Formats like Parquet and ORC are great at compressing data and making it fast to scan/filter/project the data.
However, these formats are only efficient, if they can columnarize and compress a significant amount of data in their columnar format. If they compress only a few rows at a time, they produce many short column vecors and are thus much less efficient.
The Bucketing Sink has the requirement that data is persistent on the target FileSystem on each checkpoint.
Pushing data through a Parquet or ORC encoder and flushing on each checkpoint means that for frequent checkpoints, the amount of data compressed/columnarized in a block is small. Hence, the result is an inefficiently compressed file.
Making this efficient independently of the checkpoint interval would mean that the sink needs to first collect (and persist) a good amount of data and then push it through the Parquet/ORC writers.
I would suggest to approach this as follows:
- When writing to the "in progress files" write the raw records (TypeSerializer encoding)
- When the "in progress file" is rolled over (published), the sink pushes the data through the encoder.
- This is not much work on top of the new abstraction and will result in large blocksand hence in efficient compression.
Alternatively, we can support directly encoding the stream to the "in progress files" via Parque/ORC, if users know that their combination of data rate and checkpoint interval will result in large enough chunks of data per checkpoint interval.