Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
The ORCWriter's new self-tuning feature slows down writes when ingesting datasets with a low volume of records.
The primary cause is the assumption that the native ORC writer is saturated, which leads to an assumed memory footprint of STRIPE_SIZE + avgSizeOfRecord * rowsBetweenMemoryCheck.
However, this assumption generally does not hold when a low-volume dataset provides only a few records to write, and the result is slow writes. We should use the newer ORCWriter API introduced in https://github.com/apache/orc/pull/1057.
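
To illustrate the gap, here is a minimal Java sketch of the saturated-writer formula above, compared against a deliberately simplified view of what a low-volume writer might actually hold. The class name, method names, and all numeric values (64 MiB stripe, 1 KiB records, 500 rows between checks, 100 records written) are hypothetical and are not taken from the ORCWriter code. The "actual" figure ignores ORC encoding and compression, and the sketch does not call the API from the linked pull request.

```java
public final class OrcMemoryEstimate {
    private OrcMemoryEstimate() {}

    /**
     * Footprint under the saturation assumption described above:
     * STRIPE_SIZE + avgSizeOfRecord * rowsBetweenMemoryCheck.
     */
    public static long saturatedEstimate(long stripeSizeBytes,
                                         long avgRecordSizeBytes,
                                         long rowsBetweenMemoryCheck) {
        return stripeSizeBytes + avgRecordSizeBytes * rowsBetweenMemoryCheck;
    }

    /**
     * Simplified stand-in for the real footprint of a low-volume writer:
     * only the raw bytes of the records written so far (an assumption that
     * ignores encoding, compression, and writer overhead).
     */
    public static long actualBufferedBytes(long avgRecordSizeBytes,
                                           long recordsWritten) {
        return avgRecordSizeBytes * recordsWritten;
    }

    public static void main(String[] args) {
        long stripeSize = 64L * 1024 * 1024; // hypothetical stripe size
        long avgRecordSize = 1024L;          // hypothetical average record size
        long rowsBetweenCheck = 500L;        // hypothetical check interval
        long recordsWritten = 100L;          // low-volume dataset

        long estimate = saturatedEstimate(stripeSize, avgRecordSize, rowsBetweenCheck);
        long actual = actualBufferedBytes(avgRecordSize, recordsWritten);

        System.out.println("saturated estimate (bytes): " + estimate);
        System.out.println("simplified actual (bytes):  " + actual);
    }
}
```

With these assumed inputs, the stripe-size term dominates the estimate regardless of how few records have been written, which is why a footprint reported by the writer itself would likely be a better basis for tuning.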