[HADOOP-3144] better fault tolerance for corrupted text files - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.15.3
Fix Version/s: 0.18.0
Component/s: None
Labels: None
Hadoop Flags: Reviewed

Description

Every once in a while we encounter corrupted text files (corrupted at the source, before being copied into Hadoop). Inevitably, some of the corrupted data looks like one extremely long line, and Hadoop trips over it: it tries to stuff the whole line into an in-memory object and fails with an out-of-memory error. The code looks the same in trunk as well.

So we are looking for an option on TextInputFormat (and similar input formats) to ignore long lines. Ideally, we would just skip errant lines above a certain size limit.
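
To make the request concrete, here is a minimal sketch of the behaviour being asked for: a line reader that never buffers more than a fixed number of bytes per line and silently drops any line that exceeds the limit. It is an illustration only, not the code from the attached patches. The class `BoundedLineReader`, its `maxLineLength` parameter, and the `readAll` helper are hypothetical names, and the sketch reads from a plain `java.io.InputStream` rather than through Hadoop's input format classes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Hadoop): reads '\n'-terminated lines and
// skips any line longer than maxLineLength bytes without buffering it.
final class BoundedLineReader {
    private final InputStream in;
    private final int maxLineLength;
    private final byte[] buf; // never grows, so memory use is bounded

    BoundedLineReader(InputStream in, int maxLineLength) {
        if (maxLineLength < 0) {
            throw new IllegalArgumentException("maxLineLength must be >= 0");
        }
        this.in = in;
        this.maxLineLength = maxLineLength;
        this.buf = new byte[maxLineLength];
    }

    // Returns the next line of at most maxLineLength bytes,
    // or null once the input is exhausted.
    String readLine() throws IOException {
        while (true) {
            int len = 0;
            boolean sawAny = false;
            boolean tooLong = false;
            int b;
            while ((b = in.read()) != -1 && b != '\n') {
                sawAny = true;
                if (len < maxLineLength) {
                    buf[len++] = (byte) b;
                } else {
                    tooLong = true; // keep consuming, but store nothing more
                }
            }
            if (b == -1 && !sawAny) {
                return null; // clean end of input
            }
            if (!tooLong) {
                return new String(buf, 0, len, StandardCharsets.UTF_8);
            }
            if (b == -1) {
                return null; // the oversized line was the last one
            }
            // Otherwise the oversized line is dropped; read the next one.
        }
    }

    // Convenience wrapper for experiments: all lines that fit within the limit.
    static List<String> readAll(String text, int maxLineLength) {
        InputStream src =
            new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        BoundedLineReader reader = new BoundedLineReader(src, maxLineLength);
        List<String> lines = new ArrayList<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

With a limit of 16 bytes, an input consisting of `ok`, a 1000-character garbage line, and `also ok` yields only the two short lines. The limit is counted in bytes, not characters, and the sketch reads one byte at a time, so a real stream should be wrapped in a `java.io.BufferedInputStream` first.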

Attachments

3144-7.patch | 06/May/08 06:44 | 8 kB | Zheng Shao
3144-6.patch | 06/May/08 01:11 | 7 kB | Zheng Shao
3144-5.patch | 05/May/08 21:26 | 7 kB | Zheng Shao
3144-4.patch | 01/May/08 21:29 | 8 kB | Zheng Shao
3144-ignore-spaces-3.patch | 22/Apr/08 18:14 | 8 kB | Zheng Shao
3144-ignore-spaces-2.patch | 09/Apr/08 09:53 | 8 kB | Zheng Shao

People

Assignee: Zheng Shao

Reporter: Joydeep Sen Sarma

Votes: 0

Watchers: 1

Dates

Created: 31/Mar/08 23:50

Updated: 08/Jul/09 16:52

Resolved: 06/May/08 20:22