[HADOOP-7404] Data block splitting should be record-oriented, or provide an option to supply the split locations (offsets) in an input file - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

Old bug: https://issues.apache.org/jira/browse/HADOOP-106

Padding the existing records is difficult, for the following reasons:

1. Records within the same file vary widely in size (some are a few bytes, others are gigabytes).
2. Padding causes compatibility issues with other standard tools.
3. Padding increases the file size, which other tools (those not running on Hadoop) have no need for.

I think the splitting process should offer an option like this:

1. A file lists the offsets at which the splits should be made (for example 10, 100, 120, and so on).
2. Hadoop does the splitting according to those offsets (split sizes 10 - 0 = 10, 100 - 10 = 90, etc.).
3. This offsets file can easily be generated by other tools.
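
To make the arithmetic in step 2 concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names `parse_offsets` and `splits_from_offsets`, the comma-separated offsets format, and the rule that any bytes after the last offset form a final split are all assumptions made for illustration.

```python
from typing import List, Tuple


def parse_offsets(text: str) -> List[int]:
    """Parse a comma-separated offsets line such as '10,100,120' (assumed format)."""
    return [int(tok) for tok in text.split(",") if tok.strip()]


def splits_from_offsets(offsets: List[int], file_size: int) -> List[Tuple[int, int]]:
    """Turn split offsets into (start, length) pairs that cover the file."""
    splits = []
    start = 0
    for end in sorted(set(offsets)):
        if end <= start or end > file_size:
            # Ignore offsets that are out of range or would give an empty split.
            continue
        splits.append((start, end - start))
        start = end
    if start < file_size:
        # Assumption: whatever remains after the last offset is the final split.
        splits.append((start, file_size - start))
    return splits


if __name__ == "__main__":
    # Offsets from the description, applied to a hypothetical 150-byte file.
    for start, length in splits_from_offsets(parse_offsets("10,100,120"), 150):
        print(f"split at byte {start}, length {length}")
```

For the offsets 10, 100, 120 this yields lengths 10, 90 and 20, matching the subtraction in step 2, plus a trailing split for whatever remains of the file.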

Issue Links

is related to: HADOOP-106 Data blocks should be record-oriented. (Closed)

People

Assignee: Unassigned

Reporter: Sunil Goyal

Votes: 0

Watchers: 3

Dates

Created: 19/Jun/11 08:19

Updated: 20/Jun/11 18:34