  1. CarbonData
  2. CARBONDATA-1373

Enhance update performance in carbondata


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0
    • Component/s: data-load
    • Labels: None

    Description

      1. Scenario

      I recently tested the update feature in CarbonData and found that it performs poorly.

      My table contained about 14 million records and about 370 columns (none of them dictionary columns). The data files were about 3.8 GB in total, and all of them were in one segment.

      I ran an update statement that modifies one column for every record: `UPDATE myTable SET (col1)=(col1+1000) WHERE TRUE`. In my environment, the update job failed with 'executor lost' errors, and I found 'spill data' related messages in the container logs.

      2. Analysis
        I read about the implementation of update-delete in CarbonData in ISSUE#440. An update consists of a delete operation followed by an insert operation, and the error occurred during the insert operation.

      After studying the code, I found that during the insert the updated records are grouped by `segmentId`. As a result, all the records of one segment are processed by a single task, and that task fails when the amount of input data is large.
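
      The effect of grouping by `segmentId` can be illustrated without Spark. The following is a minimal sketch in plain Python, not CarbonData's actual code: `group_by_key` and the `(segment_id, row_id)` record layout are hypothetical, and each resulting group stands in for the input of one task.

```python
from collections import defaultdict

def group_by_key(records, key_fn):
    """Group records by key; each group stands in for one task's input."""
    groups = defaultdict(list)
    for rec in records:
        groups[key_fn(rec)].append(rec)
    return dict(groups)

# Hypothetical records: (segment_id, row_id). All rows live in segment "0",
# as in the scenario where every data file belongs to a single segment.
records = [("0", i) for i in range(1000)]

# Grouping by segment id alone yields exactly one group, however many
# records there are, so a single task receives the whole input.
groups = group_by_key(records, lambda rec: rec[0])
largest = max(len(g) for g in groups.values())
```

      With a single segment, the number of groups stays at one no matter how large the table is, which is consistent with the spill messages and the lost executor described above.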

      3. Solution
        We should increase the parallelism of an update within a single segment.

      I append a random key to the `segmentId` to increase the number of partitions before the insert stage, and then remove that suffix when performing the actual insert.
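
      The idea of salting the key can be sketched as follows. This is an illustration in plain Python under stated assumptions, not the patch itself: `add_salt`, `strip_salt`, `partition_by_salted_key`, the `_` separator, and the choice of 8 salts are all hypothetical.

```python
import random
from collections import defaultdict

SEPARATOR = "_"  # hypothetical separator between segment id and salt

def add_salt(segment_id, num_salts, rng):
    """Append a random suffix so one segment maps to several keys."""
    return f"{segment_id}{SEPARATOR}{rng.randrange(num_salts)}"

def strip_salt(salted_key):
    """Recover the original segment id by dropping the last suffix."""
    return salted_key.rsplit(SEPARATOR, 1)[0]

def partition_by_salted_key(records, num_salts, seed=0):
    """Group (segment_id, row) pairs by salted key; one group per task."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for segment_id, row in records:
        groups[add_salt(segment_id, num_salts, rng)].append(row)
    return dict(groups)

# Hypothetical input: 1000 rows, all in segment "0".
records = [("0", i) for i in range(1000)]
groups = partition_by_salted_key(records, num_salts=8)

# Before the real insertion, the suffix is removed again so that every
# group is written back to its original segment.
restored = {strip_salt(key) for key in groups}
```

      Splitting on the last separator only means the suffix can be removed safely even if a segment id itself contained the separator character. The number of salts bounds the number of tasks per segment, so it trades parallelism against the number of smaller write tasks.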

      I tested this change on the example above: the job finished successfully in about 13 minutes, and the records were updated as expected.

          People

            xuchuanyin Chuanyin Xu

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 4h 10m