Details
- Type: Improvement
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- None
Description
Context:
Submitted a Spark job that reads 3-4B ORC records and writes them out in Hudi format. The table below lists the runs carried out with different configuration options (a sketch of the kind of job used follows the table).
| CONFIG | Number of Files Created | Size of Each File |
| --- | --- | --- |
| PARQUET_FILE_MAX_BYTES=DEFAULT | 30K | 21MB |
| PARQUET_FILE_MAX_BYTES=1GB | 3700 | 178MB |
| PARQUET_FILE_MAX_BYTES=1GB, COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE=1100000 | Same as before | Same as before |
| PARQUET_FILE_MAX_BYTES=1GB, BULKINSERT_PARALLELISM=100 | Same as before | Same as before |
| PARQUET_FILE_MAX_BYTES=4GB | 1600 | 675MB |
| PARQUET_FILE_MAX_BYTES=6GB | 669 | 1012MB |
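For reference, a minimal Scala/Spark sketch of the kind of bulk-insert job these runs correspond to. The config keys, input/output paths, and field/table names are assumptions based on the option names in the table and may differ by Hudi version; they are not taken from the actual job.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiBulkInsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("orc-to-hudi-bulk-insert")
      .getOrCreate()

    // 3-4B records in ORC (placeholder input path)
    val input = spark.read.orc("s3://bucket/path/orc-input")

    input.write
      .format("org.apache.hudi")
      // The knob varied across the runs in the table (1GB shown here)
      .option("hoodie.parquet.max.file.size", (1L * 1024 * 1024 * 1024).toString)
      // Other knobs tried in combination with the 1GB limit
      .option("hoodie.copyonwrite.insert.split.size", "1100000")
      .option("hoodie.bulkinsert.shuffle.parallelism", "100")
      // Standard write options (placeholder field/table names)
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .option("hoodie.datasource.write.recordkey.field", "record_key")
      .option("hoodie.datasource.write.partitionpath.field", "partition_path")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.table.name", "target_table")
      .mode(SaveMode.Overwrite)
      .save("s3://bucket/path/hudi-output")

    spark.stop()
  }
}
```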
Based on these runs, it looks like the compression ratio assumed when sizing files is off: regardless of the insert-split or parallelism settings, the files produced reach only a small fraction of the configured PARQUET_FILE_MAX_BYTES.
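The gap is roughly constant across the 1GB/4GB/6GB runs; a back-of-the-envelope check using only the numbers from the table above (assuming 1GB = 1024MB):

```scala
object CompressionRatioCheck extends App {
  // (configured max file size in MB, observed average file size in MB) from the table
  val runs = Seq(("1GB", 1024.0, 178.0), ("4GB", 4096.0, 675.0), ("6GB", 6144.0, 1012.0))

  runs.foreach { case (label, limitMb, actualMb) =>
    val fraction = actualMb / limitMb
    println(f"$label%-4s limit: files close at ${fraction * 100}%.1f%% of the limit " +
      f"(~${1 / fraction}%.1fx overestimate of on-disk size per record)")
  }
  // Prints roughly 16-17% / ~6x for all three runs, i.e. the shortfall is constant
  // and independent of split size or parallelism, which points at the assumed
  // size/compression estimate rather than at the other knobs.
}
```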
Issue Links
- relates to HUDI-64: Estimation of compression ratio & other dynamic storage knobs based on historical stats (Open)