Details
- Type: Wish
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
While using ParquetWriter, there is no way to check or estimate the size of the output file before closing the writer to flush the content to disk. This would be useful when we want to close files and upload them based on a minimum size threshold. Since ParquetWriter keeps everything in memory and only writes it out to disk at the very end, when the writer is closed, it is not possible to estimate the output file size beforehand.
Based on the Parquet documentation, the data is written into the in-memory object in its final format, meaning that the size of the object in memory is very close to the final size on disk. It would be great to expose the current in-memory size of the ParquetWriter object. It is true that such a size will differ from the final output size, because the schema and other metadata are appended at the end of the file, but it still gives a close estimate of the output file size, which would be very useful when reading and writing streams.
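To illustrate the intended use, here is a minimal Java sketch of rolling files by size threshold. The `getDataSize()` accessor is the hypothetical method this wish asks ParquetWriter to expose; a stub writer stands in for a real ParquetWriter so the rotation logic is self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical accessor requested by this ticket: expose the writer's
// current buffered (in-memory) data size before close().
interface SizeAwareWriter {
    void write(byte[] record);
    long getDataSize(); // estimated bytes buffered so far (excludes footer metadata)
    void close();
}

// Stub standing in for a real ParquetWriter; tracks raw bytes written.
class StubWriter implements SizeAwareWriter {
    private long size = 0;
    public void write(byte[] record) { size += record.length; }
    public long getDataSize() { return size; }
    public void close() { /* a real writer would flush pages and footer to disk here */ }
}

public class RollBySize {
    // Close the current file and start a new one whenever the buffered
    // size crosses the threshold; returns the number of files produced.
    static int writeAll(List<byte[]> records, long thresholdBytes) {
        int files = 0;
        SizeAwareWriter writer = new StubWriter();
        for (byte[] r : records) {
            writer.write(r);
            if (writer.getDataSize() >= thresholdBytes) {
                writer.close();            // close and upload this file
                files++;
                writer = new StubWriter(); // open the next file
            }
        }
        writer.close(); // close the final, possibly smaller, file
        return files + 1;
    }

    public static void main(String[] args) {
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) records.add(new byte[40]);
        System.out.println(writeAll(records, 100)); // rolls after every 3 records
    }
}
```

The estimate from `getDataSize()` would undercount slightly (footer metadata is added at close), so in practice the threshold should be treated as approximate.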