CarbonData / CARBONDATA-2023

Optimization in data loading for skewed data


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.3.0
    • Affects Version/s: None
    • Component/s: data-load
    • Labels: None

    Description

      In one of my cases, CarbonData has to load skewed data files, with file sizes ranging from 1 KB to about 5 GB.

      In the current implementation, CarbonData distributes the file blocks (splits) among the nodes to maximize data locality and spread the data evenly; we call this `block-node-assignment` for short.

      However, the current implementation has some problems.

      The assignment is based on block count: the goal is that every node handles the same number of blocks. In the skewed-data scenario described above, the block of a small file and the block of a big file differ greatly in size (1 KB vs. 64 MB). As a result, the total data size assigned to each node can differ enormously.

      In order to solve this problem, the size of each block should be considered during block-node-assignment: one node can handle more blocks than another as long as the total size of its blocks is about the same.
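      The size-based assignment described above can be sketched with a greedy longest-first heuristic: sort the blocks by size descending and repeatedly give the next block to the node with the smallest total so far. This is only an illustrative sketch, not CarbonData's actual implementation; the function name is hypothetical and data locality is ignored here.

      ```python
      import heapq

      def assign_blocks_by_size(block_sizes, node_names):
          """Greedily assign blocks so that the total size per node is
          balanced, even if the block counts differ (locality ignored)."""
          # Min-heap of (total_size, node_name, assigned_blocks); unique
          # node names break ties before the lists are ever compared.
          heap = [(0, name, []) for name in node_names]
          heapq.heapify(heap)
          for size in sorted(block_sizes, reverse=True):
              total, name, blocks = heapq.heappop(heap)
              blocks.append(size)
              heapq.heappush(heap, (total + size, name, blocks))
          return {name: (total, blocks) for total, name, blocks in heap}

      # Skewed input (sizes in MB): one 5 GB file block plus many small ones.
      result = assign_blocks_by_size([5120, 1, 64, 1, 64, 64, 1], ["n1", "n2"])
      # n1 gets a single huge block; n2 gets all six small ones, so the
      # block counts are 1 vs. 6 while the byte totals are as balanced
      # as the data allows.
      ```

      A pure count-based assignment would instead give each node roughly half the blocks, leaving one node with several gigabytes more data than the other.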


            People

              Assignee: xuchuanyin Chuanyin Xu
              Reporter: xuchuanyin Chuanyin Xu
              Votes: 0
              Watchers: 1

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 16h 40m