CarbonData / CARBONDATA-2023

Optimization in data loading for skewed data


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.3.0
    • Affects Version/s: None
    • Component/s: data-load
    • Labels: None

    Description

      In one of my cases, CarbonData has to load skewed data files, with file sizes ranging from 1 KB to about 5 GB.

      In the current implementation, CarbonData distributes the file blocks (splits) among the nodes to maximize data locality and spread the data evenly; we call this `block-node-assignment` for short.

      However, the current implementation has some problems.

      The assignment is based on block count: the goal is that every node handles the same number of blocks. In the skewed-data scenario described above, the block of a small file and the block of a big file differ greatly in size (1 KB vs. 64 MB). As a result, the total data size assigned to each node can differ enormously.

      In order to solve this problem, the size of each block should be considered during block-node-assignment: one node can handle more blocks than another as long as the total size of its blocks is about the same.
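      The size-based assignment described above can be sketched with a greedy longest-first heuristic: sort the blocks by size descending and repeatedly give the next block to the node with the smallest total so far. This is only an illustrative sketch, not CarbonData's actual implementation; the function name is hypothetical and data locality is ignored here.

      ```python
      import heapq

      def assign_blocks_by_size(block_sizes, node_names):
          """Greedily assign blocks so that the total size per node is
          balanced, even if the block counts differ (locality ignored)."""
          # Min-heap of (total_size, node_name, assigned_blocks); unique
          # node names break ties before the lists are ever compared.
          heap = [(0, name, []) for name in node_names]
          heapq.heapify(heap)
          for size in sorted(block_sizes, reverse=True):
              total, name, blocks = heapq.heappop(heap)
              blocks.append(size)
              heapq.heappush(heap, (total + size, name, blocks))
          return {name: (total, blocks) for total, name, blocks in heap}

      # Skewed input (sizes in MB): one 5 GB file block plus many small ones.
      result = assign_blocks_by_size([5120, 1, 64, 1, 64, 64, 1], ["n1", "n2"])
      # n1 gets a single huge block; n2 gets all six small ones, so the
      # block counts are 1 vs. 6 while the byte totals are as balanced
      # as the data allows.
      ```

      A pure count-based assignment would instead give each node roughly half the blocks, leaving one node with several gigabytes more data than the other.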


            People

              Assignee: xuchuanyin Chuanyin Xu
              Reporter: xuchuanyin Chuanyin Xu
              Votes: 0
              Watchers: 1

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 16h 40m