  1. CarbonData
  2. CARBONDATA-1366

When sort_scope=global_sort, use 'StorageLevel.MEMORY_AND_DISK_SER' instead of 'StorageLevel.MEMORY_AND_DISK' for 'convertRDD' persisting to improve loading performance


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.0
    • None

    Description

      My test environment and configuration are as follows:

      Env:
      6 executors, each with 9 GB of memory and 6 cores

      Configs:
      SINGLE_PASS=true
      SORT_SCOPE=GLOBAL_SORT
      spark.memory.fraction=0.5

      With 'convertRDD.persist(StorageLevel.MEMORY_AND_DISK_SER)' in the method 'org.apache.carbondata.spark.load.DataLoadProcessBuilderOnSpark.loadDataUsingGlobalSort', loading 144136697 lines (10.9 GB of Parquet files) takes about 7.2 min. With 'convertRDD.persist(StorageLevel.MEMORY_AND_DISK)', loading the same 144136697 lines takes about 9.5 min.
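
      As a rough illustration of the change (not the actual CarbonData code), the sketch below persists an RDD with the serialized storage level before sorting it. The object name, the local master, and the toy data standing in for 'convertRDD' are assumptions made for this example; only 'StorageLevel.MEMORY_AND_DISK_SER' and 'spark.memory.fraction=0.5' come from the report. Serialized storage is generally more compact than deserialized storage at the cost of extra CPU on read, which may explain the shorter load time when executor memory is tight.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical stand-alone sketch; not taken from DataLoadProcessBuilderOnSpark.
object PersistSerializedSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("persist-serialized-sketch")
      .master("local[2]")                      // assumption: local mode for illustration
      .config("spark.memory.fraction", "0.5")  // value used in the report
      .getOrCreate()

    // Toy stand-in for convertRDD: key/value rows to be globally sorted.
    val convertRDD = spark.sparkContext
      .parallelize(1 to 1000000)
      .map(i => (i % 1000, s"row-$i"))

    // Cache in serialized form; partitions that do not fit in memory spill to disk.
    convertRDD.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // sortByKey samples the input to build range partitions and then shuffles it,
    // so the cached RDD is read more than once.
    val sorted = convertRDD.sortByKey()
    println(s"sorted rows: ${sorted.count()}")

    convertRDD.unpersist()
    spark.stop()
  }
}
```

      Switching the argument to 'StorageLevel.MEMORY_AND_DISK' reproduces the other variant in the comparison; timings on toy data like this will not reflect the 144136697-line load.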

            People

              Assignee: zzcclp Zhichao Zhang
              Reporter: zzcclp Zhichao Zhang
              Votes: 0
              Watchers: 1

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 4h 20m