|
The location where checksumOk is called moved to a different place in 0.17 compared to 0.16.
In 0.16 checksumOk is called in In 0.17 it is called in Yes, location has changed, but I am still trying to see how that gets triggered. I see two related but different issues :
As a work around of this problem we apply a patch to every hadoop release, but we would prefer to use hadoop releases as-is.
The patch should not affect the pipelining of blocks during writing, as it only makes sure that the checksumOk is sent only once. I am attaching the patch now for review. Though the patch seems harmless and useful, we neither know why or how the bug gets triggered inside DFSClient nor how it causes eventual application failure. I am a bit uneasy about that, but would not '-1' the patch.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12390147/patch.HADOOP-3914 against trunk revision 696040. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3278/testReport/ This message is automatically generated. -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12390147/patch.HADOOP-3914 against trunk revision 705691. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3487/testReport/ This message is automatically generated. Given that Christian has been running this in production for several weeks, I propose we commit this to 0.19.
Hi Christian, when checksumOk is called a second time, could we log it and the stack trace? As a result, we can investigate the cause of the problem if it happens again.
This patch logs the error when checksumOk is sent more than once.
Looks good. Could you mention it is a debug log either in the comment or in the message for exception?
Hairong, thank you for doing this – I did not get around to it so quickly.
Thanks Raghu for the comment. This patch adds "for the debugging purpose" in the logging comment.
Christian, you are welcome! We want your patch to be committed so you have less pain running HADOOP! It looks that Hudson does not pick up available patches again. I've run the test on my local machine.
$ ant test-core .. BUILD SUCCESSFUL Total time: 106 minutes 33 seconds $ ant test-patch [exec] +1 @author. The patch does not contain any @author tags. [exec] -1 tests included. The patch doesn't appear to include anynew or modified tests. [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] +1 findbugs. The patch does not introduce any new Findbugswarnings. [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. Since the major fix of this patch has already been running on production clusters, a unit test is intentionally exempted. I just committed this. Thanks Christian!
hmm.. I can see why this happens? Should we remove the debug print out?
I should add that because of the debug log, I noticed this in some test failures and it is quite easy to reproduce.
Integrated in Hadoop-trunk #640 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/640/
We applied this patch, on a machine that was failing to read a directory of files (reads of the individual files were fine)
hadoop dfs -text path_to_directory/'*' 08/11/04 14:08:32 [main] INFO fs.FSInputChecker: java.io.IOException: Checksum ok was sent and should not be sent again
You need Anyone who has this patch also need
Committed
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
I didn't quite follow whats happening in your case. Can you give more detail? This was invoked in 0.16 as well. I don't think it is expected to be invoked more than once.