Issue Details (XML | Word | Printable)

Key: HADOOP-3914
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Blocker Blocker
Assignee: Christian Kunz
Reporter: Christian Kunz
Votes: 0
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

checksumOk implementation in DFSClient can break applications

Created: 07/Aug/08 05:44 AM   Updated: 08/Jul/09 04:43 PM
Return to search
Component/s: None
Affects Version/s: 0.17.1
Fix Version/s: 0.18.2

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works checksumOk.patch 2008-10-21 06:50 PM Hairong Kuang 1 kB
Text File Licensed for inclusion in ASF works checksumOk1-br18.patch 2008-10-21 10:26 PM Hairong Kuang 1 kB
Text File Licensed for inclusion in ASF works checksumOk1.patch 2008-10-21 07:14 PM Hairong Kuang 1 kB
File Licensed for inclusion in ASF works patch.HADOOP-3914 2008-09-15 11:29 PM Christian Kunz 1 kB
Issue Links:
Reference
 

Hadoop Flags: Reviewed
Resolution Date: 21/Oct/08 10:27 PM


 Description  « Hide
One of our non-map-reduce applications (written in C and using libhdfs to access dfs) stopped working after switch from 0.16 to 0.17.
The problem was finally traced down to failures in checksumOk.

I would assume, the purpose of checksumOk is for a DfsClient to indicate to a sending Datanode that the checksum of the received block is okay. This must be useful in the replication pipeline.
How checksumOk is implemented is that any IOException is caught and ignored, probably because it is not essential for the client that the message is successful.

But it proved fatal for our application because this application links in a 3rd-party library which seems to catch socket exceptions before libhdfs.

Why was there an Exception? In our case the application reads a file into the local buffer of the DFSInputStream large enough to hold all data, the application reads to the end and the checksumOK is sent successfully at that time. But then the application does some other stuff and comes back to re-read the file (still open). It is then when it calls checksumOk again and crashes.

It can easily be avoided by adding a Boolean making sure that checksumOk is called exactly once when EOS is encountered. Redundant calls to checksumOk do not seem to make sense anyhow.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Raghu Angadi added a comment - 07/Aug/08 03:11 PM
Christian,
I didn't quite follow whats happening in your case. Can you give more detail? This was invoked in 0.16 as well. I don't think it is expected to be invoked more than once.

Christian Kunz added a comment - 07/Aug/08 05:18 PM
The location where checksumOk is called moved to a different place in 0.17 compared to 0.16.

In 0.16 checksumOk is called in
DFSInputStream.read
directly:
if ( pos > blockEnd ) { blockReader.checksumOk(s); }

In 0.17 it is called in
BlockReader.read:
if (nRead >= 0 && gotEOS && needChecksum()) { //checksum is verified and there are no errors. checksumOk(dnSock); }
That condition, once true, stays true even when you seek backwards in the local buffer and read some data again.


Raghu Angadi added a comment - 07/Aug/08 05:32 PM
Yes, location has changed, but I am still trying to see how that gets triggered. I see two related but different issues :
  1. DFSClient is triggering an IOException that it could ignore. It should not trigger these exceptions unnecessarily, of course.
    • we could fix this as part of this jira.
  2. An exception that it expected to be completely internal to DFSClient is breaking some other application.
    • I am surprised that this happens, would surely like to more. we may not able to fix this.
    • More importantly legitimate exceptions could break these applications.

Christian Kunz added a comment - 15/Sep/08 11:28 PM
As a work around of this problem we apply a patch to every hadoop release, but we would prefer to use hadoop releases as-is.

The patch should not affect the pipelining of blocks during writing, as it only makes sure that the checksumOk is sent only once.

I am attaching the patch now for review.


Raghu Angadi added a comment - 15/Sep/08 11:49 PM
Though the patch seems harmless and useful, we neither know why or how the bug gets triggered inside DFSClient nor how it causes eventual application failure. I am a bit uneasy about that, but would not '-1' the patch.

Hadoop QA added a comment - 17/Sep/08 12:55 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12390147/patch.HADOOP-3914
against trunk revision 696040.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3278/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3278/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3278/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3278/console

This message is automatically generated.


Robert Chansler added a comment - 16/Oct/08 11:44 PM
Run the tests again.

Hadoop QA added a comment - 17/Oct/08 09:50 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12390147/patch.HADOOP-3914
against trunk revision 705691.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3487/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3487/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3487/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3487/console

This message is automatically generated.


Robert Chansler added a comment - 20/Oct/08 06:00 PM
Given that Christian has been running this in production for several weeks, I propose we commit this to 0.19.

Hairong Kuang added a comment - 20/Oct/08 10:15 PM
Hi Christian, when checksumOk is called a second time, could we log it and the stack trace? As a result, we can investigate the cause of the problem if it happens again.

Hairong Kuang added a comment - 21/Oct/08 06:50 PM
This patch logs the error when checksumOk is sent more than once.

Raghu Angadi added a comment - 21/Oct/08 06:58 PM
Looks good. Could you mention it is a debug log either in the comment or in the message for exception?

Christian Kunz added a comment - 21/Oct/08 07:06 PM
Hairong, thank you for doing this – I did not get around to it so quickly.

Hairong Kuang added a comment - 21/Oct/08 07:14 PM
Thanks Raghu for the comment. This patch adds "for the debugging purpose" in the logging comment.

Christian, you are welcome! We want your patch to be committed so you have less pain running HADOOP!


Hairong Kuang added a comment - 21/Oct/08 09:58 PM
It looks that Hudson does not pick up available patches again. I've run the test on my local machine.
$ ant test-core
..
BUILD SUCCESSFUL
Total time: 106 minutes 33 seconds

$ ant test-patch
[exec] -1 overall.

[exec] +1 @author. The patch does not contain any @author tags.

[exec] -1 tests included. The patch doesn't appear to include anynew or modified tests.
[exec] Please justify why no tests are needed for this patch.

[exec] +1 javadoc. The javadoc tool did not generate any warning messages.

[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

[exec] +1 findbugs. The patch does not introduce any new Findbugswarnings.

[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

Since the major fix of this patch has already been running on production clusters, a unit test is intentionally exempted.


Hairong Kuang added a comment - 21/Oct/08 10:26 PM
Patch for the 0.18 branch.

Hairong Kuang added a comment - 21/Oct/08 10:27 PM
I just committed this. Thanks Christian!

Raghu Angadi added a comment - 22/Oct/08 11:20 PM
hmm.. I can see why this happens? Should we remove the debug print out?

Raghu Angadi added a comment - 22/Oct/08 11:27 PM
I should add that because of the debug log, I noticed this in some test failures and it is quite easy to reproduce.

Hudson added a comment - 23/Oct/08 09:56 PM

Jason added a comment - 04/Nov/08 10:16 PM
We applied this patch, on a machine that was failing to read a directory of files (reads of the individual files were fine)
hadoop dfs -text path_to_directory/'*'

08/11/04 14:08:32 [main] INFO fs.FSInputChecker: java.io.IOException: Checksum ok was sent and should not be sent again
at org.apache.hadoop.dfs.DFSClient$BlockReader.read(DFSClient.java:863)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1392)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1428)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1377)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at org.apache.hadoop.io.SequenceFile$Metadata.readFields(SequenceFile.java:725)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1511)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
at org.apache.hadoop.fs.FsShell$TextRecordInputStream.<init>(FsShell.java:365)
at org.apache.hadoop.fs.FsShell.forMagic(FsShell.java:403)
at org.apache.hadoop.fs.FsShell.access$200(FsShell.java:49)
at org.apache.hadoop.fs.FsShell$2.process(FsShell.java:419)
at org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1865)
at org.apache.hadoop.fs.FsShell.text(FsShell.java:413)
at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1532)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:1730)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847)


Raghu Angadi added a comment - 05/Nov/08 06:45 PM

We applied this patch, on a machine that was failing to read a directory of files (reads of the individual files were fine) hadoop dfs -text path_to_directory/'*'

You need HADOOP-4499.


Raghu Angadi added a comment - 05/Nov/08 06:47 PM
Anyone who has this patch also need HADOOP-4499.

Raghu Angadi added a comment - 05/Nov/08 10:13 PM
Committed HADOOP-4499 to 0.18.3.