[CARBONDATA-1805] Optimize pruning for dictionary loading - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.3.0
Component/s: data-load, spark-integration
Labels:
None

Description

SCENARIO

Recently I tried the dictionary feature in CarbonData and found that the dictionary generation phase of data loading is quite slow. My scenario is as follows:

+ Input data: a 35.8 GB CSV file with 199 columns and 126 million lines

+ Dictionary columns: 3 columns containing 19213, 4, and 9 distinct values respectively

The whole data load takes about 2.9 min for dictionary generation and 4.6 min for fact data loading, so about 39% of the time is spent on the dictionary.

Looking at the nmon results, I found that CPU usage was quite high during the dictionary generation phase, while disk and network usage were normal.

ANALYZE

After going through the dictionary generation code, I found that CarbonData already prunes non-dictionary columns before generating the dictionary. The problem is that `the pruning comes after data file reading`, which causes some overhead. We can optimize it by `pruning while reading the data file`.

RESOLVE

Refactor the `loadDataFrame` method in `GlobalDictionaryUtil` so that the non-dictionary columns are pruned while the data file is being read, rather than afterwards.
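The difference between the two pruning strategies can be illustrated with a minimal, standalone Python sketch. This is not the CarbonData implementation and does not use Spark; the function names, the sample data, and the column names are all hypothetical. It only shows the general idea: in the first variant every full row is materialized before the unwanted columns are dropped, while in the second only the dictionary-column fields are kept as each record is read.

```python
import csv
import io

# Hypothetical sample input: three columns, two of which are dictionary columns.
SAMPLE = "1,cn,a\n2,us,b\n3,cn,a\n"
HEADER = ["id", "country", "grade"]


def distinct_prune_after_read(lines, header, dict_columns):
    """Baseline: build every full row first, then drop non-dictionary columns."""
    rows = [dict(zip(header, rec)) for rec in csv.reader(lines)]  # all columns kept
    pruned = [{c: row[c] for c in dict_columns} for row in rows]  # pruning happens late
    distinct = {c: set() for c in dict_columns}
    for row in pruned:
        for c in dict_columns:
            distinct[c].add(row[c])
    return distinct


def distinct_prune_while_reading(lines, header, dict_columns):
    """Optimized idea: keep only dictionary-column fields as each record is read."""
    positions = [(c, header.index(c)) for c in dict_columns]
    distinct = {c: set() for c in dict_columns}
    for rec in csv.reader(lines):
        for c, i in positions:  # non-dictionary fields are never stored
            distinct[c].add(rec[i])
    return distinct


if __name__ == "__main__":
    cols = ["country", "grade"]
    print(distinct_prune_after_read(io.StringIO(SAMPLE), HEADER, cols))
    print(distinct_prune_while_reading(io.StringIO(SAMPLE), HEADER, cols))
```

Both functions return the same distinct values per dictionary column; the second simply avoids carrying the other columns (196 of 199 in the scenario above) through the intermediate steps.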

After implementing this optimization, dictionary generation takes only `29s`, `about 6 times faster than before` (2.9 min), while fact data loading takes the same time as before (4.6 min). About 10% of the time is now spent on the dictionary.
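The percentages and the speedup quoted in this issue can be checked with a few lines of arithmetic. The sketch below uses only the timings reported here (2.9 min, 4.6 min, 29 s); the function and variable names are illustrative.

```python
def dictionary_share(dict_seconds, fact_seconds):
    """Fraction of total load time spent on dictionary generation."""
    return dict_seconds / (dict_seconds + fact_seconds)


dict_before_s = 2.9 * 60  # dictionary generation before the change
dict_after_s = 29.0       # dictionary generation after the change
fact_load_s = 4.6 * 60    # fact data loading, unchanged

share_before = dictionary_share(dict_before_s, fact_load_s)
share_after = dictionary_share(dict_after_s, fact_load_s)
speedup = dict_before_s / dict_after_s

print(f"share before: {share_before:.1%}")  # 38.7%, i.e. about 39%
print(f"share after:  {share_after:.1%}")   # 9.5%, i.e. about 10%
print(f"speedup:      {speedup:.1f}x")      # 6.0x
```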

NOTE

+ Currently only `load data file` benefits from this optimization; `load data frame` does not.

+ Before implementing this solution, I tried another approach: caching the DataFrame of the data file. The performance was even worse, with a dictionary generation time of 5.6 min.

Issue Links

links to

GitHub Pull Request #1559

People

Assignee: Chuanyin Xu

Reporter: Chuanyin Xu

Votes: 0

Watchers: 1

Dates

Created: 24/Nov/17 06:29

Updated: 18/Dec/17 08:25

Resolved: 18/Dec/17 08:25

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 11h 20m