Hadoop HDFS / HDFS-6706

ZKFailoverController failed to recognize the quorum is not met

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Thanks Kenny Zhang for finding this problem.
      zkfc cannot start because the quorum configured in ha.zookeeper.quorum is not met. "hdfs zkfc -formatZK" does not log the real problem, so when starting zkfc the user sees the following error messages instead of the real issue:
      2014-07-01 17:08:17,528 FATAL ha.ZKFailoverController (ZKFailoverController.java:doRun(213)) - Unable to start failover controller. Parent znode does not exist.
      Run with -formatZK flag to initialize ZooKeeper.

      2014-07-01 16:00:48,678 FATAL ha.ZKFailoverController (ZKFailoverController.java:fatalError(365)) - Fatal error occurred:Received create error from Zookeeper. code:NONODE for path /hadoop-ha/prodcluster/ActiveStandbyElectorLock
      2014-07-01 17:24:44,202 - INFO ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@627 - Got user-level KeeperException when processing sessionid:0x346f36191250005 type:create cxid:0x2 zxid:0xf00000033 txntype:-1 reqpath:n/a Error Path:/hadoop-ha/prodcluster/ActiveStandbyElectorLock Error:KeeperErrorCode = NodeExists for /hadoop-ha/prodcluster/ActiveStandbyElectorLock

      To reproduce the problem:
      1. Use an HDFS cluster with automatic HA enabled and list three ZooKeeper servers in ha.zookeeper.quorum.
      2. Start two of the ZooKeeper servers.
      3. Run "hdfs zkfc -formatZK", and then "hdfs zkfc".
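As a rough illustration of step 1, the sketch below reads ha.zookeeper.quorum from a core-site.xml document and computes how many servers a majority would require. The function name `read_zk_quorum`, the constant `SAMPLE_CORE_SITE`, and the example.com host names are hypothetical and do not come from the report; falling back to port 2181 when no port is given is also an assumption.

```python
import xml.etree.ElementTree as ET


def read_zk_quorum(core_site_xml):
    """Return the (host, port) pairs listed in ha.zookeeper.quorum.

    Returns an empty list if the property is absent.
    """
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() != "ha.zookeeper.quorum":
            continue
        servers = []
        for entry in prop.findtext("value", "").split(","):
            entry = entry.strip()
            if not entry:
                continue
            host, _, port = entry.partition(":")
            # Assumption: use ZooKeeper's usual client port if none is given.
            servers.append((host, int(port) if port else 2181))
        return servers
    return []


# Hypothetical configuration matching step 1: three servers are listed.
SAMPLE_CORE_SITE = """<configuration>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>
</configuration>"""

servers = read_zk_quorum(SAMPLE_CORE_SITE)
# A ZooKeeper ensemble needs a strict majority of its members to be running.
majority = len(servers) // 2 + 1
```

With three servers listed, `majority` is 2. If the three servers really formed one ensemble, the two started in step 2 would be enough for a quorum, so a failure at this point suggests looking at how the servers themselves are configured.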

        Activity

        Yongjun Zhang added a comment -

        Hi Brandon Li, thanks for the explanation!

        Brandon Li added a comment -

        Yongjun Zhang, I should have added more explanation.
        In this case, the three ZooKeeper servers were not correctly configured as an ensemble; in fact, none of them was part of an ensemble. However, all of them were listed in "ha.zookeeper.quorum" in core-site.xml.
        When zkfc was started, it talked to a different ZooKeeper server from the one that had previously been formatted.
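One way to spot this kind of misconfiguration is to ask each server listed in ha.zookeeper.quorum for its mode using ZooKeeper's `srvr` four-letter command: a server that is not part of an ensemble reports `Mode: standalone`, whereas ensemble members report `leader` or `follower`. The following is a minimal sketch, not part of the original investigation. The function names are hypothetical, port 2181 is assumed when none is given, and on newer ZooKeeper releases the four-letter commands must be enabled on the server.

```python
import socket


def parse_mode(srvr_output):
    """Extract the value of the 'Mode:' line from srvr output, or None."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def query_mode(host, port=2181, timeout=5.0):
    """Send the 'srvr' four-letter command and return the reported mode."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return parse_mode(b"".join(chunks).decode("utf-8", errors="replace"))


def report_modes(quorum):
    """Map each 'host:port' entry of a quorum string to its reported mode."""
    modes = {}
    for entry in quorum.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.partition(":")
        try:
            modes[entry] = query_mode(host, int(port) if port else 2181)
        except OSError as exc:  # refused, timed out, unresolvable host, ...
            modes[entry] = "unreachable: %s" % exc
    return modes
```

Calling `report_modes` with the value of ha.zookeeper.quorum returns one entry per server. If every reachable server reports `standalone`, the servers are independent instances, and a znode created on one of them by the format step will not be visible on the others.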

        Yongjun Zhang added a comment -

        Hi Brandon Li,

        Thanks for reporting and addressing the issue. I have some questions here. The original report suggests that the error message does not reflect the real cause of the failure. My questions are:
        1. In the case reported initially, the real problem was said to be "The zkfc cannot be startup due to ha.zookeeper.quorum is not met". With your last update, can we say the real problem is a misconfiguration?
        2. What kind of misconfiguration caused the symptom?
        3. When the cluster is misconfigured, the user will still see the reported error message. Should the error message say that the symptom may be caused by a misconfiguration?

        Thanks.

        Brandon Li added a comment -

        With further investigation, we found this is due to a misconfiguration. Closing as invalid.


          People

          • Assignee:
            Brandon Li
            Reporter:
            Brandon Li
          • Votes:
            0
            Watchers:
            3
