Issue Details (XML | Word | Printable)

Key: HADOOP-3144
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Zheng Shao
Reporter: Joydeep Sen Sarma
Votes: 0
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

better fault tolerance for corrupted text files

Created: 31/Mar/08 11:50 PM   Updated: 08/Jul/09 04:52 PM
Component/s: None
Affects Version/s: 0.15.3
Fix Version/s: 0.18.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works 3144-4.patch 2008-05-01 09:29 PM Zheng Shao 8 kB
Text File Licensed for inclusion in ASF works 3144-5.patch 2008-05-05 09:26 PM Zheng Shao 7 kB
Text File Licensed for inclusion in ASF works 3144-6.patch 2008-05-06 01:11 AM Zheng Shao 7 kB
Text File Licensed for inclusion in ASF works 3144-7.patch 2008-05-06 06:44 AM Zheng Shao 8 kB
Text File Licensed for inclusion in ASF works 3144-ignore-spaces-2.patch 2008-04-09 09:53 AM Zheng Shao 8 kB
Text File Licensed for inclusion in ASF works 3144-ignore-spaces-3.patch 2008-04-22 06:14 PM Zheng Shao 8 kB

Hadoop Flags: Reviewed
Resolution Date: 06/May/08 08:22 PM


 Description  « Hide
every once in a while - we encounter corrupted text files (corrupted at source prior to copying into hadoop). inevitably - some of the data looks like a really really long line and hadoop trips over trying to stuff it into an in memory object and gets outofmem error. Code looks same way in trunk as well ..

so looking for an option to the textinputformat (and like) to ignore long lines. ideally - we would just skip errant lines above a certain size limit.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Zheng Shao added a comment - 05/Apr/08 12:13 AM
This patch also makes sure that the LineReader does not read too far into the next split when it's not necessary.

Zheng Shao added a comment - 05/Apr/08 04:35 AM
Updated.

Hadoop QA added a comment - 05/Apr/08 07:02 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12379465/3144.patch
against trunk revision 643282.

@author +1. The patch does not contain any @author tags.

tests included -1. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs -1. The patch appears to cause Findbugs to fail.

core tests -1. The patch failed core unit tests.

contrib tests -1. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2173/testReport/
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2173/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2173/console

This message is automatically generated.


Zheng Shao added a comment - 05/Apr/08 08:19 AM
Added a test.

Zheng Shao added a comment - 05/Apr/08 06:36 PM
Corrected a warning in "ant compile-contrib"

dhruba borthakur added a comment - 07/Apr/08 06:22 AM
Hi Zheng, I see that there are many changes in the patch that are changes only in whitespace characters. This makes the patch longer and is harder to review. Will it be possible for you to re-submit this patch with whitespace-only changes to lines removed? thanks.

Zheng Shao added a comment - 07/Apr/08 07:57 AM
Generated by "svn diff -x -uw"

Zheng Shao added a comment - 09/Apr/08 09:53 AM
Fixed a bug.

Zheng Shao added a comment - 09/Apr/08 09:53 AM
fixed a bug.

Zheng Shao added a comment - 11/Apr/08 09:10 PM
Hi Dhruba,

Did you get a chance to look at the new diff?

Zheng


Zheng Shao added a comment - 22/Apr/08 06:14 PM
changing name to make sure hadoop tester pick it up.

Hadoop QA added a comment - 23/Apr/08 10:25 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12380714/3144-ignore-spaces-3.patch
against trunk revision 645773.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 3 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2311/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2311/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2311/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2311/console

This message is automatically generated.


Chris Douglas added a comment - 25/Apr/08 11:00 PM
  • Do these limits really need to be a longs? Changing the public API of readLine seems unnecessary when an int should be- and has been- sufficient.
  • There is some odd spacing around LineRecordReader::157,268 that make it difficult to tell which block the closing brace belongs to
  • I'm not sure I understand the skip logic. For the case where a line is larger than 64k (the buffer size), it looks like this reads up to a threshold, then discards input that exceeds what was requested, then returns the next record as the segment between the point in the threshold and the following newline (i.e. the trailing bytes of the too-long record). Is this accurate? Instead of getting a random segment of a record, wouldn't it be preferred to discard input until the next record boundary is found?

Zheng Shao added a comment - 26/Apr/08 01:43 AM
  • It used to be sufficient does not mean that they will be sufficient in the future - that's why we have open64. The cost of using a long instead of an int is minimal, while we do avoid potential overflow problems. The only interesting usage of this return value is accumulating the number of bytes read, which definitely should be stored in a long. So I don't see a problem here.
  • I will fix the spacing problem when we get a consensus on other problems.
  • The skip logic is to skip the whole long line - not just "maxLineLength" of bytes.

The reason for "maxBytesToConsume" is to tell readLine the end of this block - there is no reason for the readLine to go through tens of gigs of data search for an end of line, while the current block is only 128MB. This is actually what was happening on our cluster - for binary file that a user mistakenly treats as a text file. All map jobs just swamped the cluster. The only use of maxBytesToConsume is to let readLine know when to stop. What would be the best way to fix this?


Chris Douglas added a comment - 26/Apr/08 02:35 AM

It used to be sufficient does not mean that they will be sufficient in the future - that's why we have open64. The cost of using a long instead of an int is minimal, while we do avoid potential overflow problems

True, but it's accumulating bytes read from a text file into memory for a single record. It's not at all obvious to me that this requires a long. Future-proofing a case that will be a total disaster for the rest of the framework seems premature, particularly when the change is to a generic text parser. If someone truly needs to slurp >2GB of text data per record, surely their requirements justify a less general RecordReader. It's not the cost of the int that concerns me, but rather it's the API change to support a case that's not only degenerate, but implausible.

The reason for "maxBytesToConsume" is to tell readLine the end of this block - there is no reason for the readLine to go through tens of gigs of data search for an end of line, while the current block is only 128MB.

A far more portable solution for what this expresses would be an InputFormat generating a subclass of FileSplit annotated with a hard limit enforced by the RecordReader (i.e. returns EOF at some position within the file). Some of this will inevitably be done as part of the Hadoop archive work (HADOOP-3307). As a workaround, don't point text readers at binary data.


Joydeep Sen Sarma added a comment - 26/Apr/08 06:54 AM
we did not point the text reader at a binary file. we had a corrupted text file filled with long section of junk.

given that this is a problem that can happen to anyone (we just happen to be the lucky first) - and everyone uses textinputformat to read text files - why shouldn't the safeguard be built into textinputformat? what's the downside? (does it make sense to buy insurance after an accident? - we wait for people to hit such a problem and then say - oh, but u should have used 'SafeTextInputFormat'?)

portable across what? i looked at 3307 earlier today. i don't know how it's remotely related. enlighten us.

i am sorry - but i am mildly irritated by the comments here. we are aware of the concept of subclassing. and we can write our own inputformat - thank u so much. the whole point of going through this procedure is to contribute back to the community something that is of general benefit. Either the argument is that this is not of general benefit - or that the cost outweighs the benefit. Neither argument has been made.


Chris Douglas added a comment - 26/Apr/08 09:52 AM
I was going by Zheng's last comment, i.e. "This is actually what was happening on our cluster - for binary file that a user mistakenly treats as a text file." I didn't mean anything by it.

Assuming we're discussing the third bullet in the last two comments:

If one is loading corrupted data into HDFS, then I don't think it's fair to assume that the most generic of text readers can do anything with it. I mention the archive format because it seems unavoidable that opening a file in an archive will return bounded stream within a large, composite file, i.e. be agnostic to the particular InputFormat employed, but act a lot like a bounded FileSplit. If that's the sort of thing you could use to deal with sketchy data, then it seemed to be a useful issue to monitor. Alternatively, a new InputFormat that generated bounded splits for Text files to recover from this condition might work for your case, and probably for others' if you felt like contributing it.

The insurance analogy doesn't seem to describe this error. It's not like a car accident; it's like filling one's gas tank with creamed corn. Though he had every reason to believe it was gasoline- and is understandably angry that his engine is full of creamed corn- anger at the car for failing to run on the creamed corn is misspent. Though I like the idea in general- i.e. skipping unexpectedly long lines, or even just truncating records- my original question was trying to determine whether it skipped to the next record, continued reading bytes into the next record from wherever it stopped, or quit for extremely long lines. At a glance, it looked like it continued reading at wherever it left off in the stream, but I haven't looked at it as closely as the contributor and wanted to ask after its behavior. I'm still curious how, exactly, this patch effects its solution.


Joydeep Sen Sarma added a comment - 26/Apr/08 04:16 PM
one of the founding principles of map-reduce as described in the google paper (and perhaps one of the most remarkable differences with general database systems) was the notion of being tolerant of bad data. if u see a few rows of bad data - skip it.

we try to do this in the application land as much as possible. however, it is not possible for us to do anything if Hadoop throws an out of memory error. Hence this fix belongs in hadoop core. Zheng's fix does skip to the next available record (if it falls within the split). Otherwise an EOF is returned.

3307 is way out there man - it's a solution for small files. if the file was small - we wouldn't have a problem to begin with (as u say - the input is bounded). this problem only affects large files. if u read the description of 3307 carefully - u will notice it says that it has no impact on map-reduce. The problem we are trying to solve is a map-reduce problem - it applies whether the file comes from an archive (3307) or from local file system or hdfs (or any file system for that matter).


Chris Douglas added a comment - 28/Apr/08 12:04 AM

one of the founding principles of map-reduce as described in the google paper (and perhaps one of the most remarkable differences with general database systems) was the notion of being tolerant of bad data. if u see a few rows of bad data - skip it.

It's been awhile since I read the paper, but isn't recovery effected by skipping the record that caused a failure on the map (HADOOP-153)? Recovery from corrupted data without re-executing the map sounds like a solution for a less generic format than LineRecordReader; detecting and failing/discarding a map because its output is corrupt is application code, I agree, and this looks like Zheng has a very reasonable, general workaround (more below).

Given the re-execution model, the "correct" and more general fix would be to fail the map- with an OOM exception- and skip the range that had already been read. If it read into the following split, then it need not be rescheduled because we know that another task had already scanned up to the next record boundary (or failed trying). If one wants to fail the task earlier, then specifying a "SafeTextInputFormat" isn't a terrible burden, but you have a point: a property that controls special cases for TextInputFormat is more usable. Without HADOOP-153, the point is moot, and perhaps this fix is more pressing as a consequence.

Zheng's fix does skip to the next available record (if it falls within the split). Otherwise an EOF is returned.

That's not a full description of what it does, though. I took a closer look, and it doesn't do what I had assumed, i.e. define both a max line length and force a hard limit for reading into the following split (which is why the archive format didn't seem like a non sequitur). It defines a single new property that defines the maximum line length, which prevents the situation in this JIRA by terminating the record reader if it's past the end of the split, having consumed the maximum line length. Since it takes the maximum of what remains in the split and the aforementioned length as the limit, the situation I asked after (i.e. returning the trailing part of a record as a single record) doesn't occur. Since it defaults to Long.MAX_VALUE, there's no issue with existing code. That's all I was trying to determine. The API change (changing the return type of readLine from int to long) makes more sense in this context, but it still seems unnecessary.


Joydeep Sen Sarma added a comment - 28/Apr/08 04:00 AM
> isn't recovery effected by skipping the record that caused a failure on the map (HADOOP-153)?

thanks for pointing this out. this jira is not fixed and looks like there's still a debate on what the right approach is .. it seems that even if the jira were fixed - the linerecordreader would have to implement an additional api to skip to the next record boundary (to skip the bad record on map re-try) - so looks like we would need similar code - albeit under a different api.

that said - i am not sure i agree with the design of 153. it's not clear to me why it doesn't suffice to let the recordreaders skip bad records (as they must be able to even with 153's additional apis). but that's a separate discussion ..

what's the status of 153? seems like depending on where it goes - these changes may conflict or overlap ..


Chris Douglas added a comment - 30/Apr/08 06:10 PM
I don't think HADOOP-153 can completely replace this fix, particularly in the near term. The *-3 patch looks good, but the API change should be reverted and the indentation fixed. It might also make sense to upgrade the warning from INFO to WARN.

Zheng Shao added a comment - 01/May/08 09:29 PM
Fixed the problems mentioned in Chris's comment.

Hadoop QA added a comment - 02/May/08 01:30 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381276/3144-4.patch
against trunk revision 645773.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 3 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2362/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2362/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2362/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2362/console

This message is automatically generated.


Zheng Shao added a comment - 02/May/08 07:12 PM
Chris, will you have time to take a look at this today?

Chris Douglas added a comment - 03/May/08 12:25 AM
This looks good, but I don't understand this change:
@@ -166,7 +207,7 @@
     boolean skipFirstLine = false;
     if (codec != null) {
       in = new LineReader(codec.createInputStream(fileIn), job);
-      end = Long.MAX_VALUE;
+      end = Integer.MAX_VALUE;
     } else {
       if (start != 0) {
         skipFirstLine = true;

Is this to avoid the overflow for the cast to int in next()?

Instead, in the call to readLine in next:

+      int newSize = in.readLine(value, maxLineLength,
+                                (int)Math.max(end-pos, (long)maxLineLength));

It might be better to use (with appropriate casts):

+      int newSize = in.readLine(value, maxLineLength,
+                                Math.max(Math.min(end - pos, Integer.MAX_VALUE), maxLineLength));

Which makes Long.MAX_VALUE correct while avoiding the overflow, right?


Zheng Shao added a comment - 05/May/08 09:26 PM
Fixed the 2 problems in Chris's comments:

1. Restored Long.MAX_VALUE;
2. That line contains a bug: I should have used min instead of max.


Hadoop QA added a comment - 05/May/08 10:52 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381447/3144-5.patch
against trunk revision 653607.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2399/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2399/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2399/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2399/console

This message is automatically generated.


Zheng Shao added a comment - 05/May/08 10:56 PM
Chris, got time to take a look again?

Chris Douglas added a comment - 06/May/08 12:34 AM
The actual calculated for maxBytesToConsume in next() can still cause the overflow error you mentioned above; it's not sufficient to cast it to int, unfortunately. The purpose of this patch is to prevent LineRecordReader from reading absurdly far into the next split, right? Is there something wrong with max(min(end-pos, MAX_INT), maxLineLen) ?

Zheng Shao added a comment - 06/May/08 01:06 AM
You are right. I was looking at line 221 and fixed a bug there in the last patch.

The new patch is coming.


Chris Douglas added a comment - 06/May/08 02:30 AM
I didn't see that bytesConsumed in readLine had changed from long to int in the last patch. If maxBytesToConsume is set to Integer.MAX_VALUE, as many cases would have it, then the overflow might still be missed (in the unlikely event this is hit before an OOM exception). Either bytesConsumed should be a long and min(bytesConsumed, Integer.MAX_VALUE) returned, or the overflow should be detected.

The indentation at LineRecordReader:157 is still funky.

Has this been tested against the corrupted data?


Hadoop QA added a comment - 06/May/08 02:31 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381468/3144-6.patch
against trunk revision 653638.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2402/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2402/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2402/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2402/console

This message is automatically generated.


Zheng Shao added a comment - 06/May/08 06:44 AM
changed bytesConsumed back to long.

Zheng Shao added a comment - 06/May/08 06:45 AM
The code is tested with a corrupted file of 1GB (contains all ^A), with a streaming job of -mapper cat.
It correctly ignored all the data.

Chris Douglas added a comment - 06/May/08 07:24 AM
+1 Looks good

Hadoop QA added a comment - 06/May/08 02:14 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381485/3144-7.patch
against trunk revision 653749.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2410/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2410/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2410/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2410/console

This message is automatically generated.


Chris Douglas added a comment - 06/May/08 08:22 PM
I just committed this. Thanks, Zheng

Hudson added a comment - 07/May/08 12:22 PM