Hadoop Common
  1. Hadoop Common
  2. HADOOP-3914

checksumOk implementation in DFSClient can break applications

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.17.1
    • Fix Version/s: 0.18.2
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      One of our non-map-reduce applications (written in C and using libhdfs to access dfs) stopped working after switch from 0.16 to 0.17.
      The problem was finally traced down to failures in checksumOk.

      I would assume, the purpose of checksumOk is for a DfsClient to indicate to a sending Datanode that the checksum of the received block is okay. This must be useful in the replication pipeline.
      How checksumOk is implemented is that any IOException is caught and ignored, probably because it is not essential for the client that the message is successful.

      But it proved fatal for our application because this application links in a 3rd-party library which seems to catch socket exceptions before libhdfs.

      Why was there an Exception? In our case the application reads a file into the local buffer of the DFSInputStream large enough to hold all data, the application reads to the end and the checksumOK is sent successfully at that time. But then the application does some other stuff and comes back to re-read the file (still open). It is then when it calls checksumOk again and crashes.

      It can easily be avoided by adding a Boolean making sure that checksumOk is called exactly once when EOS is encountered. Redundant calls to checksumOk do not seem to make sense anyhow.

      1. patch.HADOOP-3914
        1 kB
        Christian Kunz
      2. checksumOk.patch
        1 kB
        Hairong Kuang
      3. checksumOk1.patch
        1 kB
        Hairong Kuang
      4. checksumOk1-br18.patch
        1 kB
        Hairong Kuang

        Issue Links

          Activity

          Christian Kunz created issue -
          Hide
          Raghu Angadi added a comment -

          Christian,
          I didn't quite follow whats happening in your case. Can you give more detail? This was invoked in 0.16 as well. I don't think it is expected to be invoked more than once.

          Show
          Raghu Angadi added a comment - Christian, I didn't quite follow whats happening in your case. Can you give more detail? This was invoked in 0.16 as well. I don't think it is expected to be invoked more than once.
          Hide
          Christian Kunz added a comment -

          The location where checksumOk is called moved to a different place in 0.17 compared to 0.16.

          In 0.16 checksumOk is called in
          DFSInputStream.read
          directly:
          if ( pos > blockEnd )

          { blockReader.checksumOk(s); }

          In 0.17 it is called in
          BlockReader.read:
          if (nRead >= 0 && gotEOS && needChecksum())

          { //checksum is verified and there are no errors. checksumOk(dnSock); }

          That condition, once true, stays true even when you seek backwards in the local buffer and read some data again.

          Show
          Christian Kunz added a comment - The location where checksumOk is called moved to a different place in 0.17 compared to 0.16. In 0.16 checksumOk is called in DFSInputStream.read directly: if ( pos > blockEnd ) { blockReader.checksumOk(s); } In 0.17 it is called in BlockReader.read: if (nRead >= 0 && gotEOS && needChecksum()) { //checksum is verified and there are no errors. checksumOk(dnSock); } That condition, once true, stays true even when you seek backwards in the local buffer and read some data again.
          Hide
          Raghu Angadi added a comment -

          Yes, location has changed, but I am still trying to see how that gets triggered. I see two related but different issues :

          1. DFSClient is triggering an IOException that it could ignore. It should not trigger these exceptions unnecessarily, of course.
            • we could fix this as part of this jira.
          2. An exception that it expected to be completely internal to DFSClient is breaking some other application.
            • I am surprised that this happens, would surely like to more. we may not able to fix this.
            • More importantly legitimate exceptions could break these applications.
          Show
          Raghu Angadi added a comment - Yes, location has changed, but I am still trying to see how that gets triggered. I see two related but different issues : DFSClient is triggering an IOException that it could ignore. It should not trigger these exceptions unnecessarily, of course. we could fix this as part of this jira. An exception that it expected to be completely internal to DFSClient is breaking some other application. I am surprised that this happens, would surely like to more. we may not able to fix this. More importantly legitimate exceptions could break these applications.
          Hide
          Christian Kunz added a comment -

          As a work around of this problem we apply a patch to every hadoop release, but we would prefer to use hadoop releases as-is.

          The patch should not affect the pipelining of blocks during writing, as it only makes sure that the checksumOk is sent only once.

          I am attaching the patch now for review.

          Show
          Christian Kunz added a comment - As a work around of this problem we apply a patch to every hadoop release, but we would prefer to use hadoop releases as-is. The patch should not affect the pipelining of blocks during writing, as it only makes sure that the checksumOk is sent only once. I am attaching the patch now for review.
          Christian Kunz made changes -
          Field Original Value New Value
          Attachment patch.HADOOP-3914 [ 12390147 ]
          Christian Kunz made changes -
          Fix Version/s 0.19.0 [ 12313211 ]
          Christian Kunz made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Raghu Angadi added a comment -

          Though the patch seems harmless and useful, we neither know why or how the bug gets triggered inside DFSClient nor how it causes eventual application failure. I am a bit uneasy about that, but would not '-1' the patch.

          Show
          Raghu Angadi added a comment - Though the patch seems harmless and useful, we neither know why or how the bug gets triggered inside DFSClient nor how it causes eventual application failure. I am a bit uneasy about that, but would not '-1' the patch.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12390147/patch.HADOOP-3914
          against trunk revision 696040.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3278/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3278/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3278/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3278/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12390147/patch.HADOOP-3914 against trunk revision 696040. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3278/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3278/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3278/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3278/console This message is automatically generated.
          Chris Douglas made changes -
          Assignee Christian Kunz [ ckunz ]
          Christian Kunz made changes -
          Fix Version/s 0.18.2 [ 12313424 ]
          Devaraj Das made changes -
          Fix Version/s 0.19.0 [ 12313211 ]
          Fix Version/s 0.18.2 [ 12313424 ]
          Robert Chansler made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Robert Chansler added a comment -

          Run the tests again.

          Show
          Robert Chansler added a comment - Run the tests again.
          Robert Chansler made changes -
          Fix Version/s 0.19.0 [ 12313211 ]
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12390147/patch.HADOOP-3914
          against trunk revision 705691.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3487/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3487/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3487/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3487/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12390147/patch.HADOOP-3914 against trunk revision 705691. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3487/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3487/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3487/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3487/console This message is automatically generated.
          Hide
          Robert Chansler added a comment -

          Given that Christian has been running this in production for several weeks, I propose we commit this to 0.19.

          Show
          Robert Chansler added a comment - Given that Christian has been running this in production for several weeks, I propose we commit this to 0.19.
          Robert Chansler made changes -
          Priority Major [ 3 ] Blocker [ 1 ]
          Hide
          Hairong Kuang added a comment -

          Hi Christian, when checksumOk is called a second time, could we log it and the stack trace? As a result, we can investigate the cause of the problem if it happens again.

          Show
          Hairong Kuang added a comment - Hi Christian, when checksumOk is called a second time, could we log it and the stack trace? As a result, we can investigate the cause of the problem if it happens again.
          Hide
          Hairong Kuang added a comment -

          This patch logs the error when checksumOk is sent more than once.

          Show
          Hairong Kuang added a comment - This patch logs the error when checksumOk is sent more than once.
          Hairong Kuang made changes -
          Attachment checksumOk.patch [ 12392597 ]
          Hide
          Raghu Angadi added a comment -

          Looks good. Could you mention it is a debug log either in the comment or in the message for exception?

          Show
          Raghu Angadi added a comment - Looks good. Could you mention it is a debug log either in the comment or in the message for exception?
          Hide
          Christian Kunz added a comment -

          Hairong, thank you for doing this – I did not get around to it so quickly.

          Show
          Christian Kunz added a comment - Hairong, thank you for doing this – I did not get around to it so quickly.
          Hide
          Hairong Kuang added a comment -

          Thanks Raghu for the comment. This patch adds "for the debugging purpose" in the logging comment.

          Christian, you are welcome! We want your patch to be committed so you have less pain running HADOOP!

          Show
          Hairong Kuang added a comment - Thanks Raghu for the comment. This patch adds "for the debugging purpose" in the logging comment. Christian, you are welcome! We want your patch to be committed so you have less pain running HADOOP!
          Hairong Kuang made changes -
          Attachment checksumOk1.patch [ 12392600 ]
          Hairong Kuang made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hairong Kuang made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Hairong Kuang added a comment -

          It looks that Hudson does not pick up available patches again. I've run the test on my local machine.
          $ ant test-core
          ..
          BUILD SUCCESSFUL
          Total time: 106 minutes 33 seconds

          $ ant test-patch
          [exec] -1 overall.

          [exec] +1 @author. The patch does not contain any @author tags.

          [exec] -1 tests included. The patch doesn't appear to include anynew or modified tests.
          [exec] Please justify why no tests are needed for this patch.

          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.

          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          [exec] +1 findbugs. The patch does not introduce any new Findbugswarnings.

          [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          Since the major fix of this patch has already been running on production clusters, a unit test is intentionally exempted.

          Show
          Hairong Kuang added a comment - It looks that Hudson does not pick up available patches again. I've run the test on my local machine. $ ant test-core .. BUILD SUCCESSFUL Total time: 106 minutes 33 seconds $ ant test-patch [exec] -1 overall. [exec] +1 @author. The patch does not contain any @author tags. [exec] -1 tests included. The patch doesn't appear to include anynew or modified tests. [exec] Please justify why no tests are needed for this patch. [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] +1 findbugs. The patch does not introduce any new Findbugswarnings. [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. Since the major fix of this patch has already been running on production clusters, a unit test is intentionally exempted.
          Hide
          Hairong Kuang added a comment -

          Patch for the 0.18 branch.

          Show
          Hairong Kuang added a comment - Patch for the 0.18 branch.
          Hairong Kuang made changes -
          Attachment checksumOk1-br18.patch [ 12392609 ]
          Hide
          Hairong Kuang added a comment -

          I just committed this. Thanks Christian!

          Show
          Hairong Kuang added a comment - I just committed this. Thanks Christian!
          Hairong Kuang made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s 0.19.0 [ 12313211 ]
          Resolution Fixed [ 1 ]
          Hadoop Flags [Reviewed]
          Fix Version/s 0.18.2 [ 12313424 ]
          Hide
          Raghu Angadi added a comment -

          hmm.. I can see why this happens? Should we remove the debug print out?

          Show
          Raghu Angadi added a comment - hmm.. I can see why this happens? Should we remove the debug print out?
          Hide
          Raghu Angadi added a comment -

          I should add that because of the debug log, I noticed this in some test failures and it is quite easy to reproduce.

          Show
          Raghu Angadi added a comment - I should add that because of the debug log, I noticed this in some test failures and it is quite easy to reproduce.
          Hide
          Hudson added a comment -
          Show
          Hudson added a comment - Integrated in Hadoop-trunk #640 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/640/ )
          Nigel Daley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Hide
          Jason added a comment -

          We applied this patch, on a machine that was failing to read a directory of files (reads of the individual files were fine)
          hadoop dfs -text path_to_directory/'*'

          08/11/04 14:08:32 [main] INFO fs.FSInputChecker: java.io.IOException: Checksum ok was sent and should not be sent again
          at org.apache.hadoop.dfs.DFSClient$BlockReader.read(DFSClient.java:863)
          at org.apache.hadoop.dfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1392)
          at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1428)
          at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1377)
          at java.io.DataInputStream.readInt(DataInputStream.java:370)
          at org.apache.hadoop.io.SequenceFile$Metadata.readFields(SequenceFile.java:725)
          at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1511)
          at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
          at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
          at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
          at org.apache.hadoop.fs.FsShell$TextRecordInputStream.<init>(FsShell.java:365)
          at org.apache.hadoop.fs.FsShell.forMagic(FsShell.java:403)
          at org.apache.hadoop.fs.FsShell.access$200(FsShell.java:49)
          at org.apache.hadoop.fs.FsShell$2.process(FsShell.java:419)
          at org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1865)
          at org.apache.hadoop.fs.FsShell.text(FsShell.java:413)
          at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1532)
          at org.apache.hadoop.fs.FsShell.run(FsShell.java:1730)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
          at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847)

          Show
          Jason added a comment - We applied this patch, on a machine that was failing to read a directory of files (reads of the individual files were fine) hadoop dfs -text path_to_directory/'*' 08/11/04 14:08:32 [main] INFO fs.FSInputChecker: java.io.IOException: Checksum ok was sent and should not be sent again at org.apache.hadoop.dfs.DFSClient$BlockReader.read(DFSClient.java:863) at org.apache.hadoop.dfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1392) at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1428) at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1377) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.hadoop.io.SequenceFile$Metadata.readFields(SequenceFile.java:725) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1511) at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431) at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420) at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415) at org.apache.hadoop.fs.FsShell$TextRecordInputStream.<init>(FsShell.java:365) at org.apache.hadoop.fs.FsShell.forMagic(FsShell.java:403) at org.apache.hadoop.fs.FsShell.access$200(FsShell.java:49) at org.apache.hadoop.fs.FsShell$2.process(FsShell.java:419) at org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1865) at org.apache.hadoop.fs.FsShell.text(FsShell.java:413) at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1532) at org.apache.hadoop.fs.FsShell.run(FsShell.java:1730) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847)
          Hide
          Raghu Angadi added a comment -

          We applied this patch, on a machine that was failing to read a directory of files (reads of the individual files were fine) hadoop dfs -text path_to_directory/'*'

          You need HADOOP-4499.

          Show
          Raghu Angadi added a comment - We applied this patch, on a machine that was failing to read a directory of files (reads of the individual files were fine) hadoop dfs -text path_to_directory/'*' You need HADOOP-4499 .
          Raghu Angadi made changes -
          Link This issue relates to HADOOP-4499 [ HADOOP-4499 ]
          Hide
          Raghu Angadi added a comment -

          Anyone who has this patch also need HADOOP-4499.

          Show
          Raghu Angadi added a comment - Anyone who has this patch also need HADOOP-4499 .
          Hide
          Raghu Angadi added a comment -

          Committed HADOOP-4499 to 0.18.3.

          Show
          Raghu Angadi added a comment - Committed HADOOP-4499 to 0.18.3.
          Owen O'Malley made changes -
          Component/s dfs [ 12310710 ]

            People

            • Assignee:
              Christian Kunz
              Reporter:
              Christian Kunz
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development