Hadoop HDFS
  HDFS-2781

Add client protocol and DFSadmin for command to restore failed storage

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.24.0
    • Fix Version/s: None
    • Component/s: hdfs-client, namenode
    • Labels:
      None

      Description

      Per HDFS-2769, it's important that an admin be able to ask the NN to try to restore failed storage since we may drop into SM until the shared edits dir is restored (w/o having to wait for the next checkpoint). There's currently an API (and usage in DFSAdmin) to flip the flag indicating whether the NN should try to restore failed storage but not that it should actually attempt to do so. This jira is to add one. This is useful outside HA but doing as an HDFS-1623 sub-task since it's motivated by HA.
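
      A minimal sketch of the shape of the change, not the committed design. The first method below already exists on ClientProtocol (and is exposed via dfsadmin -restoreFailedStorage true|false|check); the second is the kind of explicit "attempt restoration now" RPC this JIRA proposes, and its name is hypothetical.

      import java.io.IOException;

      public interface RestoreFailedStorageSketch {
        // Exists today: flips (or, with "check", queries) the flag that tells the
        // NN whether to try restoring failed storage directories when it next can.
        boolean restoreFailedStorage(String arg) throws IOException;

        // Hypothetical addition: ask the NN to attempt the restoration immediately,
        // rather than waiting for the next checkpoint or log roll.
        void restoreFailedStorageNow() throws IOException;
      }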


          Activity

          Todd Lipcon added a comment -

          Currently, if the shared edits goes away, we don't drop into safe mode, but rather abort the NN completely. So we probably need a different task (non-HA-specific) to allow the NN to drop to safemode instead of aborting.

          Bikas Saha added a comment -

          I renamed the shared edits dir. The following happened:
          1) Active moved to safe mode, so it seems the above observation has already been fixed.
          2) Standby crashed with an NPE (HDFS-2905).

          Also, when the shared edits dir is brought back online (by renaming it back) and the active is moved out of safe mode, it starts re-using that directory when the standby rolls the edits.

          Bikas Saha added a comment -

          Actually the active goes into safe mode on my machine because it thinks there is not enough space.

          12/02/06 15:47:19 WARN namenode.FSNamesystem: NameNode low on available disk space. Already in safe mode.
          12/02/06 15:47:19 INFO hdfs.StateChange: STATE* Safe mode is ON. Resources are low on NN. Safe mode must be turned off manually.
          12/02/06 15:47:24 WARN namenode.NameNodeResourceChecker: Space available on volume '/dev/disk0s2' is 0, which is below the configured reserved amount 104857600

          So if the space constraint were removed, it might abort differently.

          Aaron T. Myers added a comment -

          Actually the active goes into safe mode on my machine because it thinks there is not enough space.

          That's very likely a parsing error. As I recall, DF (as in, the Java class), which NameNodeResourceChecker uses, just quietly returns 0 for the free space of a directory that doesn't exist:

          $ hadoop org.apache.hadoop.fs.DF /does/not/exist
          df -k null
          null	0	0	0	0%	null
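
          An illustration of the quiet-zero failure mode being discussed (not necessarily Hadoop's exact code path, which may differ by version): java.io.File reports 0 usable bytes for a path that does not exist rather than raising an error, so a naive free-space check cannot tell "disk full" from "directory missing".

          import java.io.File;

          public class QuietZeroSpace {
            public static void main(String[] args) {
              File missing = new File("/does/not/exist");
              System.out.println("exists       = " + missing.exists());         // false
              System.out.println("usable space = " + missing.getUsableSpace()); // 0
            }
          }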
          
          Bikas Saha added a comment -

          When I break on that line of code, the transition to safe mode is being triggered by the NameNodeResourceChecker returning false for resources available. So DF returning 0 is what causes the safe mode transition.

          What do you mean by parse error? Are you suggesting that the check for available space be replaced by something else when the available space == 0? Something that will actually check whether the directory exists or not?

          Aaron T. Myers added a comment -

          What do you mean by parse error?

          Sorry, calling it a "parse error" isn't quite accurate. I've just always found it a little suspect that DF returns 0 for space available if the given path doesn't exist. It should probably throw an error or something along those lines.

          Also, note that the NameNodeResourceChecker can't really be considered helpful in this case, since it runs asynchronously; i.e. the NN might continue outside SM for a while (a minute by default, I think) before the NNResourceChecker runs and moves the NN into SM.
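
          A sketch of the stricter check being suggested (hypothetical, not existing Hadoop code): fail loudly when the directory is missing instead of silently reporting 0 bytes free and waiting for the asynchronous resource checker to notice and move the NN into SM.

          import java.io.File;
          import java.io.IOException;

          public class StrictSpaceCheck {
            /** Returns the usable space of an existing directory, or throws if it is missing. */
            static long usableSpaceOrThrow(File dir) throws IOException {
              if (!dir.isDirectory()) {
                throw new IOException("Required directory does not exist: " + dir);
              }
              return dir.getUsableSpace();
            }
          }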

          Todd Lipcon added a comment -

          I think if you were continuously writing to the active NN when the disk went offline, you'd see it abort. Doing a deletion of the directory allows the logs to still fsync (since the vnode still exists in memory despite not having any file system links to it anymore). On the next roll you'd probably see it abort with a FATAL message, rather than go into safe mode, so long as the roll happened before the periodic resource check interval.
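
          A small illustration of why an already-open edits log keeps fsyncing after its directory entry disappears (assuming a local POSIX filesystem; this is not NN code). The inode stays alive while a descriptor is open, so writes and syncs succeed; only a later attempt to create a new file in the missing directory, e.g. on a log roll, fails.

          import java.io.File;
          import java.io.FileOutputStream;
          import java.io.IOException;

          public class WriteAfterUnlink {
            public static void main(String[] args) throws IOException {
              File dir = new File("/tmp/edits-demo");
              dir.mkdirs();
              File log = new File(dir, "edits_inprogress");
              try (FileOutputStream out = new FileOutputStream(log)) {
                log.delete();
                dir.delete();                        // name is gone, inode is still open
                out.write("still works".getBytes()); // keeps succeeding
                out.getFD().sync();                  // fsync on the open descriptor succeeds
              }
              // new FileOutputStream(new File(dir, "edits_next")) would now throw,
              // which is roughly what a log roll runs into.
            }
          }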

          Bikas Saha added a comment -

          Is this JIRA still valid? If I understand right, the premise was that the NN would fall into standby mode when the shared edits dir fails. After the shared edits dir is restored, the admin could use the command proposed in this JIRA to refresh the dirs.
          But the current policy is for the NN to shut down on shared edits dir failure. When the dir is brought back online, the NN will pick it up on being restarted.
          When the NN moves to the active or standby state, the FSEditLog.journalSet is refreshed and will refresh the storage dirs upon the next log roll (if the restore flag is set). Perhaps we are better off restoring directories as part of transitioning to the active/standby states (when we re-init the JournalSet) instead of as an explicit command. It seems more natural and one less thing for the admin to do.

          Bikas Saha added a comment -

          Or perhaps storage dirs could be restored when the dfsadmin -restoreFailedStorage command sets the option to true (as part of the command).
          This would handle the non-HA cases.
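
          For reference, a usage sketch of the existing flag-only API that this would extend, assuming DistributedFileSystem.restoreFailedStorage(String), the client-side call behind dfsadmin -restoreFailedStorage true|false|check; the cluster URI is a placeholder.

          import java.net.URI;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.hdfs.DistributedFileSystem;

          public class RestoreFailedStorageFlag {
            public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              DistributedFileSystem dfs = (DistributedFileSystem)
                  FileSystem.get(URI.create("hdfs://namenode.example.com:8020"), conf);

              // Today this only sets (or, with "check", queries) the restore flag; the
              // suggestion above is that setting it to "true" could also trigger an
              // immediate attempt to restore any failed storage directories.
              boolean flag = dfs.restoreFailedStorage("true");
              System.out.println("restoreFailedStorage flag is now: " + flag);
            }
          }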

          Eli Collins added a comment -

          If we change the behavior such that the NN drops into SM if it can't access the shared edits dir (we decided that's the desired behavior, right?), then we'll still need this. We could make restoreFailedStorage (which flips the flag) also have the side effect of trying to restore shared storage, though I'm not sure that's user friendly; e.g. if storage restoration is already enabled, you might not think that you should try to enable it again to get this side effect.

          Todd Lipcon added a comment -

          There's some interaction with fencing here, though... one likely reason that the NN will lose touch with the shared storage is that another node has requested that the NAS device fence the host. Then, after the failover, the administrator might unfence the host from the NAS, and we don't want the NN to automatically "come back to life" at this point.

          Eli Collins added a comment -

          Can we define this away? E.g. if a standby loses its connection to shared storage, it should probably shut down gracefully rather than keep running, in which case we only restore failed storage on an active; and if the active has lost its connection to shared storage, it will be in SM (or not running), in which case restoring shared storage should cause it to "come back to life."

          Bikas Saha added a comment -

          friendly; e.g. if storage restoration is already enabled, you might not think that you should try to enable it again to get this side effect.

          In that case, rolling the logs will restore the directories, just as it works now.
          HA imposes stricter requirements than what works now, so we might need to do something special for HA only, which might be trying to restore failed directories in the process of transitioning to active (maybe also standby).
          From what I read of the code, the standby doesn't seem to bother with setting failed directories, since its operations are all read-only. So there might be no need for the standby to shut down gracefully.
          If the active moves to SM because of a bad required directory, then it should restore all required directories when it goes out of safe mode, or else complain and stay in safe mode. All this should happen after the admin has done the necessary prerequisites and issued a -safemode leave command.

          There's some interaction with fencing, here, though... one likely reason that the NN will lose touch with the shared storage is that another node has requested that the NAS device fence the host. Then, after the failover, the administrator might unfence the host from the NAS, and we don't want the NN to automatically "come back to life" at this point.

          Does the NN come back out of safemode automatically or only after an admin command?

          Todd Lipcon added a comment -

          We can move this to be a non-HA ticket, right?

          Aaron T. Myers added a comment -

          Converted this to a top-level issue per Todd's comment.


            People

            • Assignee: Brandon Li
            • Reporter: Aaron T. Myers
            • Votes: 0
            • Watchers: 5
