Hadoop HDFS / HDFS-3042 Automatic failover support for NN HA / HDFS-3217

ZKFC should restart NN when HealthMonitor gets a SERVICE_NOT_RESPONDING exception

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: auto-failover, ha
    • Labels: None

      Activity

      Todd Lipcon added a comment -

      We had some offline discussion about this and it seems like folks are on board with the current design (i.e. that the failover controller doesn't directly launch or restart the namenode). We can revisit in the future should the current design prove problematic. Resolving as wontfix for now.

      Hari Mankude added a comment -

      Quoting Todd's comment below:

      > I disagree. It is an explicit decision to not have the ZKFC act as a service supervisor, because it adds a lot of complexity. There already exist lots of solutions for service management - we assume that the user is already using something like puppet, daemontools, supervisord, cron, etc, to make sure the daemon restarts eventually.

      I did not find a reference to an external monitoring tool in the HA design docs, so apologies there. If the scanning interval of the external tools is significant, it might still make sense for the FC to restart the NN directly. With one of the NN processes down, the cluster is functioning in a degraded state, and the longer it takes to restart the standby NN process, the longer the recovery time is going to be.

      Todd Lipcon added a comment -

      I disagree. It is an explicit decision to not have the ZKFC act as a service supervisor, because it adds a lot of complexity. There already exist lots of solutions for service management - we assume that the user is already using something like puppet, daemontools, supervisord, cron, etc, to make sure the daemon restarts eventually.
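
      Todd's point assumes an external supervisor owns the restart loop. As an illustration only - the program name, user, and paths below are assumptions, not anything from this issue - a minimal supervisord entry that keeps a NameNode process running could look like:

      {code}
      ; Hypothetical supervisord entry: run the NameNode in the foreground
      ; and restart it automatically whenever the process exits.
      [program:namenode]
      ; foreground NameNode process (path is an example)
      command=/usr/lib/hadoop/bin/hdfs namenode
      user=hdfs
      autostart=true
      ; restart on any exit, clean or not
      autorestart=true
      ; must stay up this long (seconds) to count as successfully started
      startsecs=10
      {code}

      cron, daemontools, or puppet can fill the same role; the design point is only that the restart loop lives outside the ZKFC.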

      Hari Mankude added a comment -

      ZKFC should restart the NN when it sees a SERVICE_NOT_RESPONDING exception. The NN might have aborted due to loss of quorum, and unless there is manual intervention it will not be restarted.
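
      For context, the ZKFC's HealthMonitor reports state transitions (including SERVICE_NOT_RESPONDING) through a callback interface in org.apache.hadoop.ha. A rough sketch of what the proposal could have looked like - the class below and its restart command are hypothetical, not actual ZKFC code - is:

      {code:java}
      import org.apache.hadoop.ha.HealthMonitor;

      // Hypothetical sketch of the proposal: a HealthMonitor callback that
      // restarts the local NN when it stops responding. The init-script
      // path is illustrative; nothing like this exists in the real ZKFC.
      public class RestartOnNotResponding implements HealthMonitor.Callback {
        @Override
        public void enteredState(HealthMonitor.State newState) {
          if (newState == HealthMonitor.State.SERVICE_NOT_RESPONDING) {
            try {
              // Shell out to an external restart command.
              new ProcessBuilder("/etc/init.d/hadoop-hdfs-namenode", "restart")
                  .inheritIO()
                  .start()
                  .waitFor();
            } catch (Exception e) {
              // A real supervisor would log and retry with backoff here.
              throw new RuntimeException("NameNode restart failed", e);
            }
          }
        }
      }
      {code}

      As the comments above explain, this responsibility was ultimately left to external tooling rather than built into the ZKFC.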


        People

        • Assignee: Hari Mankude
        • Reporter: Hari Mankude
        • Votes: 0
        • Watchers: 3
