Apache Gobblin / GOBBLIN-1898

Improve performance of Selftune ORC Writer


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: gobblin-core
    • Labels: None

    Description

      The ORCWriter's new self-tuning feature lowers write frequency when ingesting datasets with a low volume of records.

      This is primarily caused by the assumption that the native ORC writer is always saturated, which gives an estimated memory footprint of STRIPE_SIZE + avgSizeOfRecord * rowsBetweenMemoryCheck.

      However, this assumption generally does not hold when a low-volume dataset has only a few records to write, and the resulting overestimate causes slow writes. We should use the newer ORCWriter API introduced in https://github.com/apache/orc/pull/1057
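
      To make the gap concrete, here is a minimal sketch of the footprint formula from the description next to an estimate based on an observed writer size. The class name, method names, and all numbers are hypothetical and are not taken from Gobblin or ORC. It also assumes that the writer's currently buffered bytes can be observed in some way, which the description implies but does not spell out.

```java
// Illustrative only: every name and number here is hypothetical, not Gobblin's actual code.
final class OrcMemoryEstimate {

    /** Saturated-writer assumption: STRIPE_SIZE + avgSizeOfRecord * rowsBetweenMemoryCheck. */
    static long saturatedFootprintBytes(long stripeSizeBytes,
                                        long avgRecordSizeBytes,
                                        long rowsBetweenMemoryCheck) {
        return stripeSizeBytes + avgRecordSizeBytes * rowsBetweenMemoryCheck;
    }

    /** Same formula, but with an observed writer size in place of a full stripe (an assumption). */
    static long observedFootprintBytes(long observedWriterBytes,
                                       long avgRecordSizeBytes,
                                       long rowsBetweenMemoryCheck) {
        return observedWriterBytes + avgRecordSizeBytes * rowsBetweenMemoryCheck;
    }

    public static void main(String[] args) {
        long stripeSize = 64L * 1024 * 1024;    // hypothetical 64 MiB stripe
        long avgRecordSize = 1024L;             // hypothetical 1 KiB average record
        long rowsBetweenChecks = 500L;          // hypothetical check interval
        long observedWriter = 2L * 1024 * 1024; // hypothetical 2 MiB actually buffered

        long saturated = saturatedFootprintBytes(stripeSize, avgRecordSize, rowsBetweenChecks);
        long observed = observedFootprintBytes(observedWriter, avgRecordSize, rowsBetweenChecks);

        System.out.println("saturated estimate (bytes): " + saturated);
        System.out.println("observed estimate (bytes):  " + observed);
    }
}
```

      With these made-up inputs, the full-stripe term dominates the saturated estimate, which is the overestimate the description attributes to low-volume datasets.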


          People

            Assignee: Abhishek Tiwari (abti)
            Reporter: William Lo (wlo)

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 20m