Hadoop Common
HADOOP-731

Sometimes when a dfs file is accessed and one copy has a checksum error, the I/O command fails even if another copy is alright.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.2
    • Fix Version/s: 0.11.0
    • Component/s: None
    • Labels: None

      Description

      for a particular file [alas, the file no longer exists -- I had to progress]

      $dfs -cp foo bar

      and

      $dfs -get foo local

      failed on a checksum error. The dfs browser's download function retrieved the file, so either that function doesn't check checksums or, more likely, it got a different copy.

      When a checksum fails on one copy of a file that is redundantly stored, I would prefer that dfs try a different copy, mark the bad one as not existing [which should induce a fresh copy being made from one of the good copies eventually], and make the call continue to work and deliver bytes.

      Ideally, if all copies have checksum errors but it's possible to piece together a good copy I would like that to be done.

      -dk
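
      A minimal sketch of the failover behavior the reporter asks for (plain Java; every name in it is hypothetical, not Hadoop code): on a checksum error, flag the bad copy and try the next replica instead of failing the read.

          import java.io.IOException;
          import java.util.List;

          interface ReplicaSource {
              // Reads len bytes at offset; throws ChecksumError if this copy is corrupt.
              byte[] read(long offset, int len) throws IOException;
              // Hints that this copy is bad, so a fresh replica can be made eventually.
              void markCorrupt();
          }

          class ChecksumError extends IOException {}

          class FailoverReader {
              private final List<ReplicaSource> replicas;

              FailoverReader(List<ReplicaSource> replicas) { this.replicas = replicas; }

              byte[] read(long offset, int len) throws IOException {
                  IOException last = null;
                  for (ReplicaSource r : replicas) {      // try each copy in turn
                      try {
                          return r.read(offset, len);     // good copy: deliver the bytes
                      } catch (ChecksumError e) {
                          r.markCorrupt();                // flag the bad copy for re-replication
                          last = e;                       // remember the failure, move on
                      }
                  }
                  throw last != null ? last : new IOException("no replicas available");
              }
          }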

          Activity

          Hairong Kuang added a comment -

          I feel that a patch to http://issues.apache.org/jira/browse/HADOOP-698 should also fix this problem.

          Sameer Paranjpye added a comment -

          Duplicated in HADOOP-855

          Wendy Chien added a comment -

          Attached a patch which allows us to continue reading after a checksum error by modifying Checker.read to catch the ChecksumException thrown by verifySum.

          In Checker.read, if we get a ChecksumException, we seek to a new datanode for both the data stream and the checksum stream (this applies when using dfs; it is a no-op for other filesystems). If at least one of the datanodes is different from before, we retry the read.

          In DFSInputStream, we added a new seek method which also requests a datanode other than the current one.
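
          A rough sketch of the retry shape this comment describes (plain Java; the class and method names here are illustrative, not the actual Hadoop 0.11 code): catch the checksum failure inside read, ask for a different datanode, and retry only if one was found.

              import java.io.IOException;

              abstract class CheckedReader {
                  static class ChecksumException extends IOException {}

                  // Repositions both the data and checksum streams on a different
                  // datanode; returns true only if a new node was actually chosen.
                  abstract boolean seekToNewSource(long targetPos) throws IOException;
                  abstract long getPos() throws IOException;
                  // Reads and verifies one chunk; throws ChecksumException on mismatch.
                  abstract int readAndVerify(byte[] buf, int off, int len) throws IOException;

                  int read(byte[] buf, int off, int len) throws IOException {
                      long pos = getPos();
                      try {
                          return readAndVerify(buf, off, len);
                      } catch (ChecksumException ce) {
                          if (seekToNewSource(pos)) {
                              return readAndVerify(buf, off, len); // retry against the new copy
                          }
                          throw ce; // no other datanode to fall back to
                      }
                  }
              }

          On a filesystem without redundant copies, seekToNewSource would simply return false and the original ChecksumException propagates, matching the no-op behavior noted above.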

          Hadoop QA added a comment -

          +1, because http://issues.apache.org/jira/secure/attachment/12349420/hadoop-731-7.patch applied and successfully tested against trunk revision r499156.

          Doug Cutting added a comment -

          I just committed this. Thanks, Wendy!


            People

            • Assignee: Wendy Chien
            • Reporter: Dick King
            • Votes: 0
            • Watchers: 1
