[TEZ-3391] Optimize single split MR split reader - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.10.0, 0.9.3
Component/s: None
Labels: None

Description

During initialization, each task creates a full TaskSplitMetaInfo[] array. This is unnecessary space and time overhead, because each task needs only the split object that corresponds to it. Besides the resulting n^2 space complexity (n tasks each materializing n entries), the current implementation also leaks the input stream.

We need to optimize that implementation by returning only a single object instead of an entire array.

Rohini suggested the following:

In the vertex, construct TaskSplitMetaInfo only for the split of that task instead of constructing it for all splits, i.e. change
public static TaskSplitMetaInfo[] readSplitMetaInfo(Configuration conf, FileSystem fs)
to
public static TaskSplitMetaInfo getSplitMetaInfo(Configuration conf, FileSystem fs, int index)
and skip over the splits below the index. With 1000 splits, the first task reads 1 split, the second task reads 2 splits, and so on, instead of every task reading all 1000 splits as happens now.
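The suggestion above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the attached patches: the class SingleSplitMetaReader, the SplitMeta type, the helper writeMetaFile, and the toy file layout (an int count followed by two longs per split) are all assumptions made for the example. The sketch takes a plain InputStream in place of the Configuration and FileSystem arguments so that it runs without Hadoop or Tez on the classpath. It reads records only up to the requested index, builds one object, and closes the stream with try-with-resources.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class SingleSplitMetaReader {

    /** Hypothetical stand-in for TaskSplitMetaInfo: only an offset and a length. */
    static final class SplitMeta {
        final long startOffset;
        final long length;

        SplitMeta(long startOffset, long length) {
            this.startOffset = startOffset;
            this.length = length;
        }
    }

    /** Writes the toy meta file (assumed layout): int count, then (long, long) per split. */
    static byte[] writeMetaFile(long[][] splits) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(splits.length);
            for (long[] split : splits) {
                out.writeLong(split[0]); // start offset
                out.writeLong(split[1]); // length
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /**
     * Returns the meta info for one split. A task with index i reads i + 1
     * records and allocates a single SplitMeta, not an array of all splits.
     */
    static SplitMeta getSplitMetaInfo(InputStream raw, int index) {
        // try-with-resources closes the stream on every path, including errors.
        try (DataInputStream in = new DataInputStream(raw)) {
            int numSplits = in.readInt();
            if (index < 0 || index >= numSplits) {
                throw new IndexOutOfBoundsException(
                    "split index " + index + " out of range [0, " + numSplits + ")");
            }
            // Walk past the records below the index without building objects for them.
            for (int i = 0; i < index; i++) {
                in.readLong();
                in.readLong();
            }
            return new SplitMeta(in.readLong(), in.readLong());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] file = writeMetaFile(new long[][] {{0, 100}, {100, 250}, {350, 75}});
        SplitMeta meta = getSplitMetaInfo(new ByteArrayInputStream(file), 1);
        System.out.println(meta.startOffset + " " + meta.length);
    }
}
```

The records are read sequentially up to the index, which matches the "first task reads 1 split, second task reads 2 splits" behavior described above. Each task then holds one object, so the per-task cost no longer grows with the total number of splits.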

Attachments

- TEZ-3391.002.patch, 22/Jan/20 14:31, 8 kB, Ahmed Hussein
- TEZ-3391.001.patch, 21/Jan/20 21:56, 9 kB, Ahmed Hussein

People

Assignee: Ahmed Hussein

Reporter: Rohini Palaniswamy

Votes: 0

Watchers: 7

Dates

Created: 01/Aug/16 22:15

Updated: 22/Jan/20 22:58

Resolved: 22/Jan/20 22:58