  Apache Gobblin / GOBBLIN-1891

Create self-tuning ORC Writer

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: gobblin-core
    • Labels: None

    Description

      In Gobblin streaming, the Avro-to-ORC converter and writer frequently hit OOM errors when records are large because they contain large arrays or maps.
      Because streaming pipelines run indefinitely, static configurations are usually insufficient to handle varying data sizes, growth of the converter buffers, increases in the number of partitions, and so on. As a result, pipelines often stall and make no progress once the incoming data size exceeds the memory limits of the container.

      We want to implement a bufferedORCWriter that reuses many of the components of the current ORC Writer, except that its batchSize adapts to larger record sizes. To avoid OOM issues, it takes into account the memory available to the JVM, the memory the converter uses, and the number of partitioned writers. It should be enabled only through configuration and expose knobs so that one can tune the sensitivity and the performance of the writer.
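
      The batch-size adjustment described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed implementation: the class name, method names, and parameters (safetyFactor, minBatch, maxBatch) are hypothetical and are not existing Gobblin or ORC APIs. It assumes the writer can obtain an average record size estimate and an estimate of the converter's memory usage from elsewhere.

```java
// Hypothetical sketch: all names here are illustrative, not Gobblin or ORC APIs.
final class AdaptiveBatchSizeSketch {

    private AdaptiveBatchSizeSketch() {}

    /** Heap the JVM could still claim: max heap minus what is currently in use. */
    static long availableJvmMemoryBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    /**
     * Picks a row-batch size so that one batch per partitioned writer fits in a
     * fraction (safetyFactor) of the memory left after the converter's share.
     * The result is clamped to [minBatch, maxBatch].
     */
    static int computeBatchSize(long availableBytes, long converterBytes, int numWriters,
                                long avgRecordBytes, double safetyFactor,
                                int minBatch, int maxBatch) {
        if (numWriters <= 0 || avgRecordBytes <= 0) {
            throw new IllegalArgumentException("numWriters and avgRecordBytes must be positive");
        }
        if (safetyFactor <= 0.0 || safetyFactor > 1.0 || minBatch < 1 || maxBatch < minBatch) {
            throw new IllegalArgumentException("invalid safetyFactor or batch bounds");
        }
        // Memory left once the converter's buffers are accounted for.
        long headroom = Math.max(0L, availableBytes - converterBytes);
        // Only use a fraction of the headroom; this is the "sensitivity" knob.
        long usable = (long) (headroom * safetyFactor);
        // Every partitioned writer holds its own batch, so split the budget evenly.
        long perWriter = usable / numWriters;
        long rows = perWriter / avgRecordBytes;
        return (int) Math.max(minBatch, Math.min(maxBatch, rows));
    }
}
```

      Under these assumptions, a writer could call computeBatchSize with availableJvmMemoryBytes() before filling each batch, so that larger records or additional partitioned writers shrink the batch instead of exhausting the heap. A lower safetyFactor trades throughput for more headroom.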

      Future improvements include changing the converter so that each resize leaves less allocated but unused memory, and estimating memory usage in the ORC writer more accurately.

          People

            abti Abhishek Tiwari
            wlo William Lo
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated:

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 1h 20m