[CARBONDATA-1366] When sort_scope=global_sort, use 'StorageLevel.MEMORY_AND_DISK_SER' instead of 'StorageLevel.MEMORY_AND_DISK' for 'convertRDD' persisting to improve loading performance - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.2.0
Fix Version/s: 1.2.0
Component/s: data-load, spark-integration
Labels: None

Description

My test environment and configuration are as follows:

Env:
6 executors, 9G mem + 6 cores per executor

Configs:
SINGLE_PASS=true
SORT_SCOPE=GLOBAL_SORT
spark.memory.fraction=0.5

With 'convertRDD.persist(StorageLevel.MEMORY_AND_DISK_SER)' in method 'org.apache.carbondata.spark.load.DataLoadProcessBuilderOnSpark.loadDataUsingGlobalSort', loading 144136697 lines (10.9 GB of Parquet files) takes about 7.2 min. With 'convertRDD.persist(StorageLevel.MEMORY_AND_DISK)', loading the same 144136697 lines takes about 9.5 min.
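
The change described above amounts to swapping the storage level passed to 'persist'. Below is a minimal, standalone sketch of that pattern in plain Spark. It is not the CarbonData implementation: the object name 'PersistLevelSketch', the helper 'persistAndCount', and the synthetic data are hypothetical stand-ins, and only 'persist', 'getStorageLevel', 'sortByKey', 'unpersist', and the 'StorageLevel' constants are real Spark APIs. The second action is there because a cached RDD only pays off when it is read more than once.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch; not the code in DataLoadProcessBuilderOnSpark.
object PersistLevelSketch {

  // Persists a synthetic pair RDD in serialized form, runs two actions
  // over it, and returns the row count plus the storage level in effect.
  def persistAndCount(spark: SparkSession): (Long, StorageLevel) = {
    // Assumed stand-in for 'convertRDD': 100000 synthetic (key, value) rows.
    val convertRDD = spark.sparkContext
      .parallelize(1 to 100000, numSlices = 8)
      .map(i => (i % 1000, s"row-$i"))

    // Serialized caching: partitions are kept as byte arrays, trading extra
    // CPU on read for a smaller memory footprint than MEMORY_AND_DISK.
    convertRDD.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // First action materializes the cache.
    val rowCount = convertRDD.count()

    // Second job reads the cached partitions instead of recomputing them.
    convertRDD.sortByKey().take(5)

    val level = convertRDD.getStorageLevel
    convertRDD.unpersist()
    (rowCount, level)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("persist-level-sketch")
      .getOrCreate()
    try {
      val (rowCount, level) = persistAndCount(spark)
      println(s"rows=$rowCount, storageLevel=${level.description}")
    } finally {
      spark.stop()
    }
  }
}
```

Whether the serialized level is faster depends on the workload. With spark.memory.fraction=0.5 and 9G per executor, as in the test above, the smaller serialized footprint plausibly reduces spilling and GC pressure, but that explanation is an assumption and was not measured in this issue.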

Issue Links

links to

GitHub Pull Request #1245

People

Assignee: Zhichao Zhang

Reporter: Zhichao Zhang

Votes: 0

Watchers: 1

Dates

Created: 08/Aug/17 09:42

Updated: 11/Aug/17 02:59

Resolved: 11/Aug/17 02:59

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

4h 20m