Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.2.0
    • Component/s: core, data-load
    • Labels:
      None

      Description

      1. Scenario

      Recently we performed a massive data loading. The input data is about 71 GB in CSV format and contains about 88 million records. When using CarbonData, we do not use any dictionary encoding. Our testing environment has three nodes, each with 11 disks serving as YARN executor directories. We submit the loading command through JDBCServer. The JDBCServer instance has three executors in total, one on each node. The loading takes about 10 minutes (±3 min, varying from run to run).

      We observed the nmon metrics during loading and found:

      1. lots of CPU wait in the first half of the loading;

      2. only one disk has many writes and almost reaches its bottleneck (avg. 80 MB/s, max. 150 MB/s on a SAS disk);

      3. the other disks are quite idle.

      2. Analysis

      During data loading, CarbonData reads and sorts data locally (the default sort scope) and writes the temporary files to local disk. In my case, there is only one executor per node, so CarbonData writes all the temp files to one disk (the container directory or YARN local directory), resulting in a single-disk hotspot.

      3. Modification

      We should support multiple directories for writing temp files to avoid the disk hotspot.

      PS: I have implemented this improvement in my environment and the result is promising: the loading now takes about 6 minutes (vs. about 10 minutes before).
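
      The modification above can be sketched as follows. This is a minimal illustration, not CarbonData's actual implementation; `TempDirPicker` is a hypothetical helper that rotates temp-file writes across the configured local directories so no single disk becomes a hotspot:

      ```java
      import java.util.List;
      import java.util.concurrent.atomic.AtomicInteger;

      // Hypothetical sketch: pick temp-file directories round-robin across
      // multiple configured local dirs (e.g. the YARN local dirs), so that
      // sort temp files are spread over all disks instead of one.
      public class TempDirPicker {
          private final List<String> dirs;
          private final AtomicInteger next = new AtomicInteger(0);

          public TempDirPicker(List<String> dirs) {
              this.dirs = dirs;
          }

          // Returns the next directory in round-robin order; floorMod keeps
          // the index valid even after the counter wraps around.
          public String nextDir() {
              int i = Math.floorMod(next.getAndIncrement(), dirs.size());
              return dirs.get(i);
          }

          public static void main(String[] args) {
              TempDirPicker picker = new TempDirPicker(
                  List.of("/data1/tmp", "/data2/tmp", "/data3/tmp"));
              // Successive temp files land on different disks.
              for (int k = 0; k < 4; k++) {
                  System.out.println(picker.nextDir());
              }
          }
      }
      ```

      In a real loader each sort-temp writer would call `nextDir()` when creating a new intermediate file, so write bandwidth is spread roughly evenly across all configured disks.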

        Issue Links

          Activity

          Bjangir Babulal added a comment -

          Hi,
          Can you please try the option carbon.tempstore.locations
          in carbon.properties? It will accept multiple disks for local store/sort.

          Thanks,
          Babu

          xuchuanyin xuchuanyin added a comment - Reporter

          Babulal, I've checked the code and cannot find the property `carbon.tempstore.locations`.

          Do you mean the property `carbon.tempstore.location` in the CarbonData source code? That property does not resolve the hotspot problem during loading.

          xuchuanyin xuchuanyin added a comment - - edited Reporter

          Here is the configuration used in my test, for others to reference.

          1. ENV

          3 HUAWEI RH2288 nodes, each with 24 cores (E5-2667 @ 2.90GHz), 256 GB memory, and 11 SAS disks

          We use JDBCServer to run the loading test.
          We have 4 executors in total (3 executors, one on each node, plus 1 driver executor).
          executor: 20 cores, 128 GB each
          driver executor: 1 core, 20 GB

          2. USE CASE

          88 million records in CSV format

          340+ columns per record

          no dictionary columns

          TABLE_BLOCKSIZE 64

          INVERTED_INDEX on about 9 columns

          3. CONF

          parameter                              value   origin-value
          carbon.number.of.cores                 20
          carbon.number.of.cores.while.loading   14
          sort.inmemory.size.inmb                2048    1024
          offheap.sort.chunk.size.inmb           128     64
          carbon.sort.intermediate.files.limit   20      20
          carbon.sort.file.buffer.size           50      20
          carbon.use.local.dir                   true    false
          carbon.use.multiple.dir                true    false
          4. RESULT

          Using `LOAD DATA INPATH `, the loading takes about 6 minutes.

          Observing nmon, the I/O usage across disks is quite even.


            People

            • Assignee:
              xuchuanyin
            • Reporter:
              xuchuanyin
            • Votes:
              0
            • Watchers:
              2


                Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 17h 40m