  1. CarbonData
  2. CARBONDATA-2091

Enhance data loading performance by specifying range bounds for sort columns

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • None
    • 2.0.1
    • None
    • None

    Description

      Currently, data loading in CarbonData with node_sort (also known as local_sort) goes through the following steps:

      1. Convert the input data in batches. (Convert)
      2. Sort each batch and write it to a sort temp file. (TempSort)
      3. Combine several sort temp files with a merge sort to get a bigger ordered sort temp file. (MergeSort)
      4. Combine all the sort temp files in a final sort; its output feeds the next step. (FinalSort)
      5. Read the rows in order and convert them to CarbonData columnar format pages. (Produce)
      6. Write bundles of pages to the data files and write the corresponding index file. (Consume)

      Steps 1 to 3 run concurrently on multiple threads. Step 4 runs on a single thread. Step 5 again runs on multiple threads. Step 4 is therefore the bottleneck of the whole procedure: when observing data loading performance, we can see that CPU usage is low after Step 3.
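
The shape of Steps 2 to 4 can be sketched as below. This is only an illustrative model, not CarbonData's implementation: in-memory lists stand in for sort temp files, and the names `sort_batch`, `merge_runs`, `local_sort` and the parameter `merge_fan_in` are hypothetical. It shows why the final sort is a single sequential pass even though the earlier steps are spread over a thread pool.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def sort_batch(batch):
    """Step 2 (TempSort): sort one batch; the list stands in for a sort temp file."""
    return sorted(batch)


def merge_runs(runs):
    """K-way merge of already sorted runs into one sorted run."""
    return list(heapq.merge(*runs))


def local_sort(batches, merge_fan_in=2, threads=4):
    """Hypothetical model of Steps 2 to 4 of the load pipeline."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Step 2: batches are sorted concurrently.
        runs = list(pool.map(sort_batch, batches))
        # Step 3 (MergeSort): groups of runs are merged concurrently
        # into fewer, bigger runs.
        groups = [runs[i:i + merge_fan_in]
                  for i in range(0, len(runs), merge_fan_in)]
        bigger_runs = list(pool.map(merge_runs, groups))
    # Step 4 (FinalSort): one sequential merge over all remaining runs.
    # Nothing here is spread over the pool, which is the bottleneck
    # described above.
    return merge_runs(bigger_runs)


if __name__ == "__main__":
    print(local_sort([[5, 1], [4, 2], [9, 0], [3]]))
```

Python threads do not give real CPU parallelism for this kind of work, so the sketch only mirrors the structure of the pipeline, not its performance.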

       

      We can enhance data loading performance by parallelizing Step 4.

       

      Users can specify range bounds for the sort columns. CarbonData then distributes the records into the corresponding ranges internally and processes the ranges concurrently, so each range gets its own final sort.
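
A minimal sketch of this idea follows, assuming a single numeric sort column and half-open ranges (a key equal to a bound goes to the higher range). These conventions, and the names `range_index`, `final_sort_one_range` and `range_partitioned_sort`, are assumptions made for illustration and do not come from the issue or from CarbonData's code. Each record is routed to a range by binary search over the bounds, and every range then runs its own final merge independently of the others.

```python
import bisect
import heapq
from concurrent.futures import ThreadPoolExecutor


def range_index(key, bounds):
    """Return the index of the range that `key` falls into.

    With sorted bounds [b0, b1, ...], range 0 holds keys < b0, range i holds
    b(i-1) <= key < b(i), and the last range holds keys >= the last bound.
    """
    return bisect.bisect_right(bounds, key)


def final_sort_one_range(runs):
    """Final sort for one range: k-way merge of that range's sorted runs."""
    return list(heapq.merge(*runs))


def range_partitioned_sort(batches, bounds, threads=4):
    """Hypothetical model: split each batch by range, then merge per range."""
    n_ranges = len(bounds) + 1
    # runs_per_range[r] collects the sorted runs that belong to range r.
    runs_per_range = [[] for _ in range(n_ranges)]
    for batch in batches:
        buckets = [[] for _ in range(n_ranges)]
        for key in batch:
            buckets[range_index(key, bounds)].append(key)
        for r, bucket in enumerate(buckets):
            if bucket:
                runs_per_range[r].append(sorted(bucket))
    # One independent final merge per range; the ranges do not overlap,
    # so no merge across ranges is needed afterwards.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(final_sort_one_range, runs_per_range))


if __name__ == "__main__":
    print(range_partitioned_sort([[5, 1, 10], [4, 20, 2], [9, 0]], bounds=[5, 10]))
```

Because the ranges are disjoint and ordered, reading the per-range results one after another yields a globally sorted sequence. How evenly the work is spread depends on how well the chosen bounds match the data distribution.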

            People

              Assignee: Chuanyin Xu (xuchuanyin)
              Reporter: Chuanyin Xu (xuchuanyin)

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 8h 40m