|
[
Permlink
| « Hide
]
Zheng Shao added a comment - 05/Apr/08 12:13 AM
This patch also makes sure that the LineReader does not read too far into the next split when it's not necessary.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12379465/3144.patch against trunk revision 643282. @author +1. The patch does not contain any @author tags. tests included -1. The patch doesn't appear to include any new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs -1. The patch appears to cause Findbugs to fail. core tests -1. The patch failed core unit tests. contrib tests -1. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2173/testReport/ This message is automatically generated. Corrected a warning in "ant compile-contrib"
Hi Zheng, I see that there are many changes in the patch that are changes only in whitespace characters. This makes the patch longer and is harder to review. Will it be possible for you to re-submit this patch with whitespace-only changes to lines removed? thanks.
Generated by "svn diff -x -uw"
Hi Dhruba,
Did you get a chance to look at the new diff? Zheng changing name to make sure hadoop tester pick it up.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12380714/3144-ignore-spaces-3.patch against trunk revision 645773. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 3 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2311/testReport/ This message is automatically generated.
The reason for "maxBytesToConsume" is to tell readLine the end of this block - there is no reason for the readLine to go through tens of gigs of data search for an end of line, while the current block is only 128MB. This is actually what was happening on our cluster - for binary file that a user mistakenly treats as a text file. All map jobs just swamped the cluster. The only use of maxBytesToConsume is to let readLine know when to stop. What would be the best way to fix this?
True, but it's accumulating bytes read from a text file into memory for a single record. It's not at all obvious to me that this requires a long. Future-proofing a case that will be a total disaster for the rest of the framework seems premature, particularly when the change is to a generic text parser. If someone truly needs to slurp >2GB of text data per record, surely their requirements justify a less general RecordReader. It's not the cost of the int that concerns me, but rather it's the API change to support a case that's not only degenerate, but implausible.
A far more portable solution for what this expresses would be an InputFormat generating a subclass of FileSplit annotated with a hard limit enforced by the RecordReader (i.e. returns EOF at some position within the file). Some of this will inevitably be done as part of the Hadoop archive work ( we did not point the text reader at a binary file. we had a corrupted text file filled with long section of junk.
given that this is a problem that can happen to anyone (we just happen to be the lucky first) - and everyone uses textinputformat to read text files - why shouldn't the safeguard be built into textinputformat? what's the downside? (does it make sense to buy insurance after an accident? - we wait for people to hit such a problem and then say - oh, but u should have used 'SafeTextInputFormat'?) portable across what? i looked at 3307 earlier today. i don't know how it's remotely related. enlighten us. i am sorry - but i am mildly irritated by the comments here. we are aware of the concept of subclassing. and we can write our own inputformat - thank u so much. the whole point of going through this procedure is to contribute back to the community something that is of general benefit. Either the argument is that this is not of general benefit - or that the cost outweighs the benefit. Neither argument has been made. I was going by Zheng's last comment, i.e. "This is actually what was happening on our cluster - for binary file that a user mistakenly treats as a text file." I didn't mean anything by it.
Assuming we're discussing the third bullet in the last two comments: If one is loading corrupted data into HDFS, then I don't think it's fair to assume that the most generic of text readers can do anything with it. I mention the archive format because it seems unavoidable that opening a file in an archive will return bounded stream within a large, composite file, i.e. be agnostic to the particular InputFormat employed, but act a lot like a bounded FileSplit. If that's the sort of thing you could use to deal with sketchy data, then it seemed to be a useful issue to monitor. Alternatively, a new InputFormat that generated bounded splits for Text files to recover from this condition might work for your case, and probably for others' if you felt like contributing it. The insurance analogy doesn't seem to describe this error. It's not like a car accident; it's like filling one's gas tank with creamed corn. Though he had every reason to believe it was gasoline- and is understandably angry that his engine is full of creamed corn- anger at the car for failing to run on the creamed corn is misspent. Though I like the idea in general- i.e. skipping unexpectedly long lines, or even just truncating records- my original question was trying to determine whether it skipped to the next record, continued reading bytes into the next record from wherever it stopped, or quit for extremely long lines. At a glance, it looked like it continued reading at wherever it left off in the stream, but I haven't looked at it as closely as the contributor and wanted to ask after its behavior. I'm still curious how, exactly, this patch effects its solution. one of the founding principles of map-reduce as described in the google paper (and perhaps one of the most remarkable differences with general database systems) was the notion of being tolerant of bad data. if u see a few rows of bad data - skip it.
we try to do this in the application land as much as possible. however, it is not possible for us to do anything if Hadoop throws an out of memory error. Hence this fix belongs in hadoop core. Zheng's fix does skip to the next available record (if it falls within the split). Otherwise an EOF is returned. 3307 is way out there man - it's a solution for small files. if the file was small - we wouldn't have a problem to begin with (as u say - the input is bounded). this problem only affects large files. if u read the description of 3307 carefully - u will notice it says that it has no impact on map-reduce. The problem we are trying to solve is a map-reduce problem - it applies whether the file comes from an archive (3307) or from local file system or hdfs (or any file system for that matter).
It's been awhile since I read the paper, but isn't recovery effected by skipping the record that caused a failure on the map ( Given the re-execution model, the "correct" and more general fix would be to fail the map- with an OOM exception- and skip the range that had already been read. If it read into the following split, then it need not be rescheduled because we know that another task had already scanned up to the next record boundary (or failed trying). If one wants to fail the task earlier, then specifying a "SafeTextInputFormat" isn't a terrible burden, but you have a point: a property that controls special cases for TextInputFormat is more usable. Without
That's not a full description of what it does, though. I took a closer look, and it doesn't do what I had assumed, i.e. define both a max line length and force a hard limit for reading into the following split (which is why the archive format didn't seem like a non sequitur). It defines a single new property that defines the maximum line length, which prevents the situation in this JIRA by terminating the record reader if it's past the end of the split, having consumed the maximum line length. Since it takes the maximum of what remains in the split and the aforementioned length as the limit, the situation I asked after (i.e. returning the trailing part of a record as a single record) doesn't occur. Since it defaults to Long.MAX_VALUE, there's no issue with existing code. That's all I was trying to determine. The API change (changing the return type of readLine from int to long) makes more sense in this context, but it still seems unnecessary. > isn't recovery effected by skipping the record that caused a failure on the map (
thanks for pointing this out. this jira is not fixed and looks like there's still a debate on what the right approach is .. it seems that even if the jira were fixed - the linerecordreader would have to implement an additional api to skip to the next record boundary (to skip the bad record on map re-try) - so looks like we would need similar code - albeit under a different api. that said - i am not sure i agree with the design of 153. it's not clear to me why it doesn't suffice to let the recordreaders skip bad records (as they must be able to even with 153's additional apis). but that's a separate discussion .. what's the status of 153? seems like depending on where it goes - these changes may conflict or overlap .. I don't think
Fixed the problems mentioned in Chris's comment.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381276/3144-4.patch against trunk revision 645773. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 3 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2362/testReport/ This message is automatically generated. Chris, will you have time to take a look at this today?
This looks good, but I don't understand this change:
@@ -166,7 +207,7 @@
boolean skipFirstLine = false;
if (codec != null) {
in = new LineReader(codec.createInputStream(fileIn), job);
- end = Long.MAX_VALUE;
+ end = Integer.MAX_VALUE;
} else {
if (start != 0) {
skipFirstLine = true;
Is this to avoid the overflow for the cast to int in next()? Instead, in the call to readLine in next: + int newSize = in.readLine(value, maxLineLength, + (int)Math.max(end-pos, (long)maxLineLength)); It might be better to use (with appropriate casts): + int newSize = in.readLine(value, maxLineLength, + Math.max(Math.min(end - pos, Integer.MAX_VALUE), maxLineLength)); Which makes Long.MAX_VALUE correct while avoiding the overflow, right? Fixed the 2 problems in Chris's comments:
1. Restored Long.MAX_VALUE; +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381447/3144-5.patch against trunk revision 653607. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2399/testReport/ This message is automatically generated. Chris, got time to take a look again?
The actual calculated for maxBytesToConsume in next() can still cause the overflow error you mentioned above; it's not sufficient to cast it to int, unfortunately. The purpose of this patch is to prevent LineRecordReader from reading absurdly far into the next split, right? Is there something wrong with max(min(end-pos, MAX_INT), maxLineLen) ?
You are right. I was looking at line 221 and fixed a bug there in the last patch.
The new patch is coming. I didn't see that bytesConsumed in readLine had changed from long to int in the last patch. If maxBytesToConsume is set to Integer.MAX_VALUE, as many cases would have it, then the overflow might still be missed (in the unlikely event this is hit before an OOM exception). Either bytesConsumed should be a long and min(bytesConsumed, Integer.MAX_VALUE) returned, or the overflow should be detected.
The indentation at LineRecordReader:157 is still funky. Has this been tested against the corrupted data? +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381468/3144-6.patch against trunk revision 653638. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2402/testReport/ This message is automatically generated. changed bytesConsumed back to long.
The code is tested with a corrupted file of 1GB (contains all ^A), with a streaming job of -mapper cat.
It correctly ignored all the data. +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381485/3144-7.patch against trunk revision 653749. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2410/testReport/ This message is automatically generated. I just committed this. Thanks, Zheng
Integrated in Hadoop-trunk #483 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/483/
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||