CarbonData / CARBONDATA-742

Add batch sort to improve the loading performance

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: None
    • Labels: None

    Description

      Current Problem:
      The sort step is the main bottleneck because it is a blocking step: it must receive all the data and write the sort temp files to disk, and only then can the data writer step start.

      Solution:
      Make the sort step non-blocking so that the data writer step does not have to wait for it.
      Process the data in the sort step in batches sized to the machine's in-memory capacity. For example, if the machine can allocate 4 GB to process data in memory, the sort step can sort the data in batches of 2 GB and hand each batch to the data writer step. While the data writer step consumes one batch, the sort step receives and sorts the next. All steps therefore work continuously, and there is no disk I/O in the sort step.

      The data writer step therefore never waits for the whole sort to finish: as soon as the sort step has sorted a batch in memory, the data writer can start writing it.
      This can significantly improve loading performance.
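
      The batch flow described above can be illustrated with a minimal sketch. It is not CarbonData's implementation: the names `batch_sort_load`, `write_batch`, and `max_batches_in_memory` are hypothetical, the batch size is counted in rows rather than bytes, and a plain thread stands in for the data writer step. A semaphore caps the number of batches held in memory at two, mirroring the 4 GB / 2 GB example.

```python
import queue
import threading


def batch_sort_load(rows, batch_size, write_batch, key=None, max_batches_in_memory=2):
    """Sort rows in fixed-size in-memory batches and hand each batch to a writer thread.

    Hypothetical sketch: batch_size is a row count, and write_batch is any
    callable that persists one sorted batch.
    """
    pending = queue.Queue()                 # sorted batches waiting for the writer
    slots = threading.Semaphore(max_batches_in_memory)  # caps batches alive at once
    done = object()                         # sentinel that tells the writer to stop
    errors = []                             # first writer failure, re-raised at the end

    def writer():
        while True:
            batch = pending.get()
            if batch is done:
                return
            try:
                if not errors:              # after a failure, just drain the queue
                    write_batch(batch)
            except Exception as exc:
                errors.append(exc)
            finally:
                slots.release()             # the batch's memory can be reused

    thread = threading.Thread(target=writer)
    thread.start()
    batch = None
    try:
        for row in rows:
            if batch is None:
                slots.acquire()             # wait until a batch slot is free
                batch = []
            batch.append(row)
            if len(batch) >= batch_size:
                batch.sort(key=key)         # in-memory sort, no temp files
                pending.put(batch)          # writer works while the next batch fills
                batch = None
        if batch:                           # final, possibly smaller batch
            batch.sort(key=key)
            pending.put(batch)
    finally:
        pending.put(done)
        thread.join()
    if errors:
        raise errors[0]
```

      For example, `batch_sort_load([5, 3, 8, 1, 9, 2, 7], 3, out.append)` with `out = []` passes three independently sorted batches to the writer. Each batch is sorted only within itself, which is why the output is a set of sorted runs rather than one globally sorted file.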

      Advantages:
      Loading performance increases because there is no intermediate I/O and the sort step no longer blocks.
      Compaction needs no extra effort; the current flow can handle it.

      Disadvantages:
      The number of driver-side B-trees will increase, so memory usage might increase, but this can be controlled by the current LRU cache implementation.
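
      To illustrate how an LRU cache bounds the memory used by a growing number of index structures, here is a generic sketch. It does not reflect CarbonData's actual cache: the class `LruIndexCache`, the `load_index` callback, and the `block_id` keys are all hypothetical, and capacity is counted in entries rather than bytes.

```python
from collections import OrderedDict


class LruIndexCache:
    """Keep at most `capacity` loaded indexes, evicting the least recently used."""

    def __init__(self, capacity, load_index):
        self.capacity = capacity
        self.load_index = load_index      # hypothetical callback: block_id -> index
        self._entries = OrderedDict()     # oldest entry first, newest last

    def get(self, block_id):
        if block_id in self._entries:
            self._entries.move_to_end(block_id)   # mark as most recently used
            return self._entries[block_id]
        index = self.load_index(block_id)         # cache miss: load the index
        self._entries[block_id] = index
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)     # evict the least recently used
        return index
```

      With a capacity of 2, looking up blocks `a`, `b`, `a`, `c` loads three indexes and evicts `b`, so a later lookup of `b` reloads it while `a` stays cached.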

            People

              Assignee: ravi.pesala Ravindra Pesala
              Reporter: ravi.pesala Ravindra Pesala
              Votes: 0
              Watchers: 2

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 8h 20m