Hadoop HDFS / HDFS-2422

The NN should tolerate the same number of low-resource volumes as failed volumes

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: namenode
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      We encountered a situation where the namenode dropped into safe mode after a temporary outage of an NFS mount.

      At 12:10, the NFS server went offline:

      Oct 8 12:10:05 <namenode> kernel: nfs: server <nfs host> not responding, timed out

      This caused the namenode to conclude that it was short on resources:

      2011-10-08 12:10:34,848 WARN org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker: Space available on volume '<nfs host>' is 0, which is below the configured reserved amount 104857600

      A temporary loss of an NFS mount shouldn't push the namenode into safe mode.
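
      For reference, the 104857600 figure in the warning is the namenode's reserved-space threshold (100 MB). A minimal sketch of raising it, assuming the 0.23-era property name dfs.namenode.resource.du.reserved read by NameNodeResourceChecker:

      import org.apache.hadoop.conf.Configuration;

      public class ReservedSpaceConfig {
        public static void main(String[] args) {
          Configuration conf = new Configuration();
          // Assumed property name: the resource checker treats a volume with less
          // free space than this threshold (default 100 MB) as low on resources.
          conf.setLong("dfs.namenode.resource.du.reserved", 512L * 1024 * 1024);
          System.out.println(conf.getLong("dfs.namenode.resource.du.reserved", 0));
        }
      }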

      Attachments

      1. HDFS-2422.patch (7 kB, Aaron T. Myers)


          Activity

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #858 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/858/)
          Move line for HDFS-2422 under 0.23 instead of 0.24.

          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1182220
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-0.23-Build #46 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/46/)
          svn merge -c 1181316 from trunk to fix HDFS-2422

          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1182219
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeResourceChecker.java
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeResourceChecker.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Build #37 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/37/)
          svn merge -c 1181316 from trunk to fix HDFS-2422

          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1182219
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeResourceChecker.java
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeResourceChecker.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #828 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/828/)
          Move line for HDFS-2422 under 0.23 instead of 0.24.

          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1182220
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #1081 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1081/)
          Move line for HDFS-2422 under 0.23 instead of 0.24.

          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1182220
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #1061 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1061/)
          Move line for HDFS-2422 under 0.23 instead of 0.24.

          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1182220
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #1139 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1139/)
          Move line for HDFS-2422 under 0.23 instead of 0.24.

          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1182220
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          Aaron T. Myers added a comment -

          Thanks for the suggestion, guys. I've just back-ported this to 0.23.

          Todd Lipcon added a comment -

          Yep, all of that should be fine. Once we fail on a mount, we mark that log as corrupt by rolling the other logs. On startup, the NN will use a finalized log in preference to one that was chopped in the middle. If they're all chopped in the middle, we perform validation using checksums, etc. So I don't think there's any issue here.

          M. C. Srivas added a comment -

          @Todd:

          With soft mounts, if the server goes down, I'd expect that the fsync would fail. However, you wouldn't have any guarantee about what happened to the writes issued between the last successful fsync and the failed one. Some of them might have succeeded and some might get lost. Conceivably, some of them might get performed again when the server recovers. So I'd recommend that once you switch from one log to another, you unlink the previous one when you get the chance, before using it again, just to make sure you don't get any ghost writes showing up later.

          Todd Lipcon added a comment -

          Can a soft mount cause missing pages even if you are fsyncing?

          ATM: mind committing this to 0.23 as well? I agree it should be in our next release.

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #857 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/857/)
          HDFS-2422. The NN should tolerate the same number of low-resource volumes as failed volumes (atm)

          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1181316
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeResourceChecker.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeResourceChecker.java
          Steve Loughran added a comment -

          I'm with Kos here, this should be a 0.23 patch too, unless anyone says no

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #827 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/827/)
          HDFS-2422. The NN should tolerate the same number of low-resource volumes as failed volumes (atm)

          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1181316
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeResourceChecker.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeResourceChecker.java
          Konstantin Shvachko added a comment -

          Sounds like soft NFS mounts should be avoided, since integrity is the priority for the edits and image. CRCs would help detect corruption, but we need to guarantee that each record is written, flushed, and synced. Thanks Srivas.

          M. C. Srivas added a comment -

          Konstantin and Todd, should the timeout be short, or long?

          From the NFS FAQ ... http://nfs.sourceforge.net/#faq_e4 ... soft mounts can cause silent data corruption, even in the middle of a file, when a brief outage occurs. Thus, during recovery, even though the edits-log looks up-to-date, it might contain bad pages in the middle.

          If you wish to use soft-mounts, then the recovery process should verify all the logs before picking one of them to use for replay. (I am not sure if there are CRCs on every record of the edits-log .. are there?)

          Otherwise, with soft-mounts, you will hit issues like HDFS-1382.

          Konstantin Shvachko added a comment -

          M.C., as far as I know that is exactly the case: the NFS drive had been soft mounted. So the solution is either to hard mount the drive or to set a large enough timeout for the soft mount.
          The patch, though, fixes another bug, which brings the NameNode into safe mode if a single drive goes low on disk space even though there are other drives that can be used for journaling and saving images.
          That bug was introduced by HDFS-1594, so I'd recommend this fix for inclusion in 0.23.

          Todd Lipcon added a comment -

          We can handle a mount going away - recommended config is to soft mount with a reasonably short timeout.

          M. C. Srivas added a comment -

          This patch does not really help. If one is using an NFS server, then one must hard-mount it in order for the data writes to be reliable. But when hard-mounted, the NFS client (i.e., the NN machine) will hang until the NFS server recovers.

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #1072 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1072/)
          HDFS-2422. The NN should tolerate the same number of low-resource volumes as failed volumes (atm)

          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1181316
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeResourceChecker.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeResourceChecker.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #1130 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1130/)
          HDFS-2422. The NN should tolerate the same number of low-resource volumes as failed volumes (atm)

          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1181316
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeResourceChecker.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeResourceChecker.java
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #1052 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1052/)
          HDFS-2422. The NN should tolerate the same number of low-resource volumes as failed volumes (atm)

          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1181316
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeResourceChecker.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNameNodeResourceChecker.java
          Aaron T. Myers added a comment -

          Thanks a lot for the review, Todd. I've just committed this to trunk.

          Todd Lipcon added a comment -

          +1 on this patch, though. It makes sense that we only need to go to safemode if all of the volumes are low.

          Milind Bhandarkar added a comment -

          @Todd, of course! I had a temporary brain freeze.

          Todd Lipcon added a comment -

          Can't we check whether available space == 0 && used space == 0? If so, it's probably a dead mount, which is different from a full mount (where used >>> 0).
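
          A rough sketch of that heuristic, assuming o.a.h.fs.DF is used for the probe (illustrative only; this is not what the committed patch does):

          import java.io.File;
          import java.io.IOException;
          import org.apache.hadoop.fs.DF;

          class DeadMountHeuristic {
            // Zero available AND zero used suggests a dead mount; a merely full
            // volume would still report a large used figure.
            static boolean looksLikeDeadMount(File dir) throws IOException {
              DF df = new DF(dir, 3000L); // refresh df stats at most every 3 seconds
              return df.getAvailable() == 0 && df.getUsed() == 0;
            }
          }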

          Aaron T. Myers added a comment -

          The transient loss of connectivity to an NFS mount currently shows up as the NFS mount being low on space (in fact, as having 0 space left). This is unfortunate. If there were a way to distinguish between the two (I cannot think of any, but others may have an answer), it would be ideal to have the namenode come out of safe mode automatically when the transient error goes away.

          I'm afraid I also can't think of a way to reliably distinguish between the two. We could, for example, check that the directory actually exists (which it would not, in the case where the NFS mount disappears and the configured dfs.name.dir is a subdirectory of the mount), but that could obviously be conflated with other issues besides NFS mount failure.

          Even if there were a way to distinguish between the two, I would probably argue for not entering SM in the first place, but that's a separate issue.

          Milind Bhandarkar added a comment -

          I agree with the "low on space" argument by Eli.

          The transient loss of connectivity to an NFS mount currently shows up as the NFS mount being low on space (in fact, as having 0 space left). This is unfortunate. If there were a way to distinguish between the two (I cannot think of any, but others may have an answer), it would be ideal to have the namenode come out of safe mode automatically when the transient error goes away.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12498482/HDFS-2422.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/1358//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1358//console

          This message is automatically generated.

          Aaron T. Myers added a comment -

          Thanks for the comments Milind.

          Aaron, the failed volume policy should ensure that at least two volumes are up when writing edit logs. If it were only writing to one volume and staying writable, then there is a time period during which there is a single up-to-date replica of the edit logs, which can fail and lose modifications (that is why I said the window for losing data opens, not that it will definitely lose data).

          Ah, I misunderstood your earlier comment. That seems reasonable to me. I've filed HDFS-2430 to address this issue.

          re: automatically coming out of safemode, I think transient unavailability of a volume and a volume being low on disk space should be treated differently. While the second case requires admin intervention, the first case does not.

          Do you disagree with the reasoning Eli posted in the comment I linked to earlier? I found his argument quite compelling. If so, we should probably file a separate JIRA for that, along the lines of "The NN should automatically leave SM if sufficient resources become available again after they were previously low" and continue the discussion there.

          Milind Bhandarkar added a comment -

          Aaron, the failed volume policy should ensure that at least two volumes are up when writing edit logs. If it were only writing to one volume and staying writable, then there is a time period during which there is a single up-to-date replica of the edit logs, which can fail and lose modifications (that is why I said the window for losing data opens, not that it will definitely lose data).

          re: automatically coming out of safemode, I think transient unavailability of a volume and a volume being low on disk space should be treated differently. While the second case requires admin intervention, the first case does not.

          Aaron T. Myers added a comment -

          Here's a patch that addresses the issue. I also took the opportunity to clean up some of the other tests in TestNameNodeResourceChecker.java while I was in there.

          Aaron T. Myers added a comment -

          Thanks a lot for the comments, Milind. Answers inline.

          I think it is a "good thing" (tm) that NN makes HDFS readonly when nfs is not accessible.

          I can see arguments for both. In fact, I originally argued in favor of the behavior you're describing. Upon further reflection, I think I've changed my opinion, however. At least, whatever policy is being used for the number of failed volumes that can be tolerated when syncing edit logs should also be used when checking for available resources in the NameNodeResourceChecker, for the purpose of consistency.

          HDFS is getting public criticism about "losing" data, and if HDFS modifications are allowed while writing to only a single destination, then it opens up a window for losing data.

          The purpose of configuring multiple dfs.name.dir directories is exactly so that the NN can tolerate multiple failures and keep on humming. It's not going to lose any data just because one goes offline - it will just write to the other directories.

          The right thing to do is to return from safemode when the NFS volume becomes available again.

          Please see this comment for the reasoning as to why the NameNodeResourceChecker doesn't automatically take the NN out of SM when it detects a volume being low on space.
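
          As a small illustration of the multiple dfs.name.dir point above (hypothetical paths; dfs.name.dir is the pre-2.x property name for the NN metadata directories):

          import org.apache.hadoop.conf.Configuration;

          public class MultipleNameDirsExample {
            public static void main(String[] args) {
              Configuration conf = new Configuration();
              // Hypothetical layout: two local disks plus one remote NFS mount.
              // The NN writes its image and edits to all of them and keeps running
              // as long as enough of them remain usable.
              conf.set("dfs.name.dir", "/data/1/dfs/nn,/data/2/dfs/nn,/mnt/nn-nfs/dfs/nn");
              System.out.println(conf.get("dfs.name.dir"));
            }
          }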

          Milind Bhandarkar added a comment -

          I think it is a "good thing" (tm) that NN makes HDFS readonly when nfs is not accessible. HDFS is getting public criticism about "losing" data, and if hdfs modifications are allowed by modifying a single destination, then it open up a window for losing data.

          The right thing to do is to return from safemode when the NFS volume becomes available again.

          Aaron T. Myers added a comment -

          Looks like this is happening because o.a.h.fs.DF will return 0 for "space available" on a directory which doesn't exist:

          [01:29:11] atm@simon:~$ hadoop org.apache.hadoop.fs.DF /
          df -k null
          null	72718632	49480712	19543996	73%	null
          [01:29:23] atm@simon:~$ hadoop org.apache.hadoop.fs.DF /foo/bar/baz
          df -k null
          null	0	0	0	0%	null
          

          I'm guessing the particular dfs.name.dir the NN was writing to was in fact a subdirectory of the mount directory, so when the NFS mount went away so did the subdirectory, causing DF to return 0.

          I think this is indicative of a more basic issue with the NNResourceChecker policy, though. When syncing edit logs, the NN is designed to tolerate failure of up to N-1 dfs.name.dirs, but the NNResourceChecker will put the NN into safemode if only a single dfs.name.dir is low on space. The appropriate solution, then, seems to me to be to change the NNResourceChecker to also tolerate up to N-1 directories being low on space.

          I'll create a patch to do this and upload it shortly.
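
          A minimal sketch of that proposed policy, assuming the same DF-based probe (illustrative only; the real change is in the attached NameNodeResourceChecker.java patch): resources count as available while at least one name directory still has free space above the reserved threshold, and safe mode is warranted only when all of them are low.

          import java.io.File;
          import java.io.IOException;
          import java.util.Collection;
          import org.apache.hadoop.fs.DF;

          class LowResourcePolicySketch {
            private final Collection<File> nameDirs; // configured dfs.name.dir entries
            private final long duReserved;           // e.g. 104857600 bytes (100 MB)

            LowResourcePolicySketch(Collection<File> nameDirs, long duReserved) {
              this.nameDirs = nameDirs;
              this.duReserved = duReserved;
            }

            // True while at least one volume has space above the threshold;
            // returning false is what would send the NN into safe mode.
            boolean hasAvailableDiskSpace() throws IOException {
              for (File dir : nameDirs) {
                DF df = new DF(dir, 30000L); // refresh df stats at most every 30 seconds
                if (df.getAvailable() > duReserved) {
                  return true;
                }
              }
              return false;
            }
          }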


            People

            • Assignee: Aaron T. Myers
            • Reporter: Jeff Bean
            • Votes: 0
            • Watchers: 14
