Details
- Improvement
- Status: Resolved
- Major
- Resolution: Fixed
Description
- SCENARIO
Recently I tried the dictionary feature in CarbonData and found that the dictionary generation phase of data loading is quite slow. My scenario is as follows:
+ Input data: 35.8GB CSV file with 199 columns and 126 million lines
+ Dictionary columns: 3 columns containing 19213, 4, and 9 distinct values respectively
The whole data load takes about 2.9min for dictionary generation and 4.6min for fact data loading, so about 39% of the time is spent on the dictionary.
Looking at the nmon results, I found that CPU usage was quite high during the dictionary generation phase, while disk and network usage were normal.
- ANALYZE
After going through the dictionary generation code, I found that CarbonData already prunes non-dictionary columns before generating the dictionary. The problem is that `the pruning comes after data file reading`, which causes unnecessary overhead. We can optimize this by `pruning while reading the data file`.
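To illustrate the idea of pruning while reading, here is a minimal sketch in plain Python. It is not the CarbonData code; the function names `read_pruned` and `distinct_values` and the sample data are hypothetical, and it only assumes a CSV input with a header row. Each row is projected down to the dictionary columns as soon as it is parsed, so the wide row is never carried into the distinct-value step.

```python
import csv
import io


def read_pruned(lines, dict_columns):
    """Yield only the dictionary columns of each CSV row (hypothetical helper).

    The projection happens right after a row is parsed, so downstream
    steps never see the non-dictionary columns.
    """
    reader = csv.reader(lines)
    header = next(reader)
    # Resolve column names to positions once, not per row.
    indices = [header.index(name) for name in dict_columns]
    for row in reader:
        yield tuple(row[i] for i in indices)


def distinct_values(lines, dict_columns):
    """Collect the distinct values per dictionary column from pruned rows."""
    distinct = {name: set() for name in dict_columns}
    for pruned_row in read_pruned(lines, dict_columns):
        for name, value in zip(dict_columns, pruned_row):
            distinct[name].add(value)
    return distinct


# Made-up sample: 4 columns, of which only 2 are dictionary columns.
SAMPLE = (
    "id,city,payload,status\n"
    "1,Paris,aaa,ok\n"
    "2,Rome,bbb,ok\n"
    "3,Paris,ccc,fail\n"
)

result = distinct_values(io.StringIO(SAMPLE), ["city", "status"])
print(result)
```

In my scenario only 3 of the 199 columns are dictionary columns, so the distinct-value step would handle a small fraction of each row.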
- RESOLVE
Refactor the `loadDataFrame` method in `GlobalDictionaryUtil` so that non-dictionary columns are pruned while the data file is being read.
After implementing this optimization, dictionary generation takes only `29s`, `about 6 times faster than before` (2.9min), while fact data loading takes the same as before (4.6min). About 10% of the time is now spent on the dictionary.
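The percentages and the speedup can be checked from the reported timings. The helper name `dictionary_share` below is only for illustration; the numbers are the ones measured above.

```python
def dictionary_share(dict_seconds, fact_seconds):
    """Fraction of the total load time spent on dictionary generation."""
    return dict_seconds / (dict_seconds + fact_seconds)


before_dict = 2.9 * 60   # dictionary generation before the change, in seconds
after_dict = 29.0        # dictionary generation after the change, in seconds
fact_load = 4.6 * 60     # fact data loading, unchanged, in seconds

share_before = dictionary_share(before_dict, fact_load)
share_after = dictionary_share(after_dict, fact_load)
speedup = before_dict / after_dict

print(f"share before: {share_before:.1%}")
print(f"share after:  {share_after:.1%}")
print(f"speedup:      {speedup:.1f}x")
```

This gives roughly 39% before, roughly 10% after, and a speedup of about 6 times for the dictionary phase.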
- NOTE
+ Currently only `load data file` benefits from this optimization; `load data frame` does not.
+ Before implementing this solution, I tried another approach: caching the DataFrame of the data file. The performance was even worse, with dictionary generation taking 5.6min.