  1. CarbonData
  2. CARBONDATA-1805

Optimize pruning for dictionary loading

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.3.0
    • None

    Description

      1. SCENARIO

      Recently I tried the dictionary feature in CarbonData and found that the dictionary generation phase of data loading is quite slow. My scenario is as follows:

      + Input data: 35.8GB CSV file with 199 columns and 126 million lines

      + Dictionary columns: 3 columns containing 19213, 4, and 9 distinct values respectively

      The whole data load takes about 2.9min for dictionary generation and 4.6min for fact data loading, so about 39% of the time is spent on the dictionary.

      From the nmon results, I found that CPU usage was quite high during the dictionary generation phase, while disk and network usage were normal.

      1. ANALYZE

      After going through the dictionary generation code, I found that CarbonData already prunes non-dictionary columns before generating the dictionary. The problem is that `the pruning comes after data file reading`, which causes overhead: in my scenario all 199 columns are read before the 196 non-dictionary columns are discarded. We can optimize this by `pruning while reading the data file`.
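
      The difference between the two orderings can be sketched in plain Python. This is only an illustration and not CarbonData code: the sample data, the column names, and the functions `prune_after_reading` and `prune_while_reading` are all hypothetical. The first function keeps every full-width row before dropping columns; the second projects each line to the dictionary columns as soon as it is read.

```python
import csv
import io

# Hypothetical sample: 5 columns, of which only "city" and "gender"
# are treated as dictionary columns.
SAMPLE_CSV = """id,city,gender,amount,comment
1,Shenzhen,M,10.5,first
2,Beijing,F,20.0,second
3,Shenzhen,F,7.25,third
"""
DICTIONARY_COLUMNS = ["city", "gender"]


def prune_after_reading(text, wanted):
    """Materialize every column of every row first, then drop the rest."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    all_rows = [row for row in reader]  # full-width rows are all kept in memory
    keep = [header.index(name) for name in wanted]
    return [[row[i] for i in keep] for row in all_rows]


def prune_while_reading(text, wanted):
    """Project each line to the wanted columns as soon as it is read."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    keep = [header.index(name) for name in wanted]
    for row in reader:
        yield [row[i] for i in keep]  # only the narrow row leaves the reader


late = prune_after_reading(SAMPLE_CSV, DICTIONARY_COLUMNS)
early = list(prune_while_reading(SAMPLE_CSV, DICTIONARY_COLUMNS))
print(late == early)  # compares the pruned rows from both approaches
```

      Both functions return the same pruned rows; what changes is how much data has to be held and passed along before the pruning happens.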

      1. RESOLVE

      Refactor the `loadDataFrame` method in `GlobalDictionaryUtil` so that the non-dictionary columns are pruned while the data file is being read.
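
      As a rough sketch of what dictionary generation on top of such a pruned read could look like, the Python below collects the distinct values of the dictionary columns while streaming a CSV. It does not reproduce `loadDataFrame` or any CarbonData API: the names `collect_distinct_values` and `assign_keys`, the sample data, and the scheme of numbering sorted values from 1 are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical input: only "country" and "level" are dictionary columns.
SAMPLE_CSV = """id,country,level,score
1,CN,high,0.9
2,US,low,0.4
3,CN,low,0.7
4,IN,high,0.8
"""


def collect_distinct_values(lines, dictionary_columns):
    """Stream the CSV once, touching only the dictionary columns of each row."""
    reader = csv.reader(lines)
    header = next(reader)
    positions = {name: header.index(name) for name in dictionary_columns}
    distinct = {name: set() for name in dictionary_columns}
    for row in reader:
        for name, pos in positions.items():
            distinct[name].add(row[pos])  # non-dictionary fields are never stored
    return distinct


def assign_keys(distinct):
    """Illustrative only: number the sorted distinct values starting at 1."""
    return {
        name: {value: key for key, value in enumerate(sorted(values), start=1)}
        for name, values in distinct.items()
    }


distinct = collect_distinct_values(io.StringIO(SAMPLE_CSV), ["country", "level"])
dictionary = assign_keys(distinct)
print(dictionary)
```

      The memory held during the pass grows with the number of distinct values per dictionary column, not with the width of the input rows.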

      After implementing the above optimization, dictionary generation takes only `29s`, `about 6 times faster than before` (2.9min), while fact data loading takes the same time as before (4.6min), so about 10% of the total time is now spent on the dictionary.
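
      The percentages and the speedup quoted in this issue can be checked with a few lines of Python. The timings are the ones reported above, converted to seconds; the helper name `share` is just for this sketch.

```python
# Timings reported in this issue, converted to seconds.
dict_before = 174   # 2.9min dictionary generation before the change
dict_after = 29     # 29s dictionary generation after the change
fact_load = 276     # 4.6min fact data loading, unchanged


def share(part, other):
    """Fraction of the total load time spent on `part`."""
    return part / (part + other)


share_before = share(dict_before, fact_load)
share_after = share(dict_after, fact_load)
speedup = dict_before / dict_after

print(f"{share_before:.1%}")  # 38.7%
print(f"{share_after:.1%}")   # 9.5%
print(f"{speedup:.1f}x")      # 6.0x
```

      These round to the figures of about 39%, about 10%, and about 6 times given in the description.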

      1. NOTE

      + Currently only `load data file` benefits from this optimization; `load data frame` does not.

      + Before implementing this solution, I tried another one: caching the dataframe of the data file. The performance was even worse, with dictionary generation taking 5.6min.

            People

              xuchuanyin Chuanyin Xu

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 11h 20m