  1. CarbonData
  2. CARBONDATA-3976

CarbonData Update operation enhancement

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: data-load
    • Labels: None

    Description

      Background
      Before an update runs, the operation cleans up delta files (see
      cleanUpDeltaFiles(carbonTable, false)). This cleanup traverses the
      metadata path and the segment path several times. When those paths
      contain many files, the traversal overhead grows and the update takes
      longer.

      Motivation & Goal
      During the update process, reduce the number of traversals, or move
      cleanUpDeltaFiles out of the update path into a separate method.

      Modification
      Possible solutions are as follows.

      Solution 1:

      cleanUpDeltaFiles makes several file-listing calls that differ only in
      the file type, for example
      updateStatusManager.getUpdateDeltaFilesList(segment, false,
      CarbonCommonConstants.UPDATE_DELTA_FILE_EXT, true, allSegmentFiles, true)
      and
      updateStatusManager.getUpdateDeltaFilesList(segment, false,
      CarbonCommonConstants.UPDATE_INDEX_FILE_EXT, true, allSegmentFiles, true).
      Each call traverses the segment path, so the path is traversed twice. The
      two calls can be merged into a single traversal.
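
      A minimal sketch of the single-traversal idea, not CarbonData's actual
      code. The class DeltaFileClassifier, the enum Kind and the extension
      strings are hypothetical stand-ins; the real values come from
      CarbonCommonConstants, and the real getUpdateDeltaFilesList applies
      further filtering that is omitted here.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Sketch: group segment files by extension in one pass over the file list. */
class DeltaFileClassifier {

  // Hypothetical stand-ins for the extension constants. The strings are
  // placeholders, not confirmed values from CarbonCommonConstants.
  enum Kind {
    UPDATE_DELTA(".carbondata"),
    UPDATE_INDEX(".carbonindex");

    final String ext;

    Kind(String ext) {
      this.ext = ext;
    }
  }

  /** Visits every file name once and puts it into the bucket of its kind. */
  static Map<Kind, List<String>> classify(List<String> allSegmentFiles) {
    Map<Kind, List<String>> result = new EnumMap<>(Kind.class);
    for (Kind kind : Kind.values()) {
      result.put(kind, new ArrayList<>());
    }
    for (String name : allSegmentFiles) {
      for (Kind kind : Kind.values()) {
        if (name.endsWith(kind.ext)) {
          result.get(kind).add(name);
          break; // a file belongs to at most one kind
        }
      }
      // Files matching no known extension are ignored.
    }
    return result;
  }
}
```

      With this shape, the caller lists the segment files once and reads both
      buckets from the returned map instead of calling the listing method once
      per file type.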

      Solution 2:

      Building on Solution 1, use Spark or MapReduce to distribute the cleanup
      tasks to other nodes.

      Solution 3:

      Submit cleanUpDeltaFiles as a separate task and run it in the early
      morning or when the cluster is not busy.
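
      One way to defer the cleanup, sketched with the JDK's
      ScheduledExecutorService. The class DeferredCleanup and its methods are
      hypothetical; the cleanup itself is passed in as a Runnable, and how
      CarbonData would actually schedule it is not decided in this issue.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Sketch: run a cleanup task once a day at an off-peak time. */
class DeferredCleanup {

  /** Time from 'now' until the next occurrence of 'target' (today or tomorrow). */
  static Duration delayUntil(LocalDateTime now, LocalTime target) {
    LocalDateTime next = now.toLocalDate().atTime(target);
    if (!now.isBefore(next)) {
      // Target time has already passed (or is exactly now): use tomorrow.
      next = next.plusDays(1);
    }
    return Duration.between(now, next);
  }

  /** Schedules 'cleanup' at the next 'target' time, then every 24 hours. */
  static ScheduledFuture<?> scheduleDaily(
      ScheduledExecutorService pool, Runnable cleanup, LocalTime target) {
    long initialDelayMs = delayUntil(LocalDateTime.now(), target).toMillis();
    long periodMs = TimeUnit.DAYS.toMillis(1);
    return pool.scheduleAtFixedRate(
        cleanup, initialDelayMs, periodMs, TimeUnit.MILLISECONDS);
  }
}
```

      A caller could pass Executors.newSingleThreadScheduledExecutor(), a
      Runnable wrapping the cleanup, and LocalTime.of(2, 0). The fixed 24-hour
      period ignores daylight-saving shifts, and running "when the cluster is
      not busy" would need a load check that this sketch does not include.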

      Solution 4:

      Set up a garbage collection bin that exposes interfaces through which
      the program decides when files enter the bin and how they are handled
      afterwards.
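
      To make the discussion concrete, here is one possible shape for such a
      bin, kept in memory for brevity. The class TrashBin, its method names
      and the retention-based purge policy are all assumptions for
      illustration; a real implementation would move files on the file system
      and persist the bin's state.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: a bin that holds stale file paths until a retention period expires. */
class TrashBin {

  private final long retentionMs;
  // path -> time (ms) at which the file entered the bin, in insertion order
  private final Map<String, Long> entries = new LinkedHashMap<>();

  TrashBin(long retentionMs) {
    this.retentionMs = retentionMs;
  }

  /** Records that 'path' entered the bin at 'nowMs' instead of deleting it. */
  void moveToTrash(String path, long nowMs) {
    entries.put(path, nowMs);
  }

  /** Takes 'path' back out of the bin; returns false if it was not there. */
  boolean restore(String path) {
    return entries.remove(path) != null;
  }

  /** Removes and returns every path held for at least the retention period. */
  List<String> purgeExpired(long nowMs) {
    List<String> purged = new ArrayList<>();
    Iterator<Map.Entry<String, Long>> it = entries.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Long> entry = it.next();
      if (nowMs - entry.getValue() >= retentionMs) {
        purged.add(entry.getKey());
        it.remove();
      }
    }
    return purged;
  }

  int size() {
    return entries.size();
  }
}
```

      Under this design the update path only calls moveToTrash, which is
      cheap, while purgeExpired can run from a separate task such as the
      deferred one proposed in Solution 3.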

      Please vote on these solutions.

      Best Regards,
      LinWood

          People

            Assignee: Unassigned
            Reporter: Linwood TangLin
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated:

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 1h 20m