Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0-alpha
    • Fix Version/s: None
    • Component/s: namenode
    • Labels:

      Description

      Setting both NameNodes active and then trying to transition one to standby results in a NullPointerException, and the NameNode process is killed.

        Activity

        Eli Collins added a comment -
        [root@c0405 ~]# sudo -u hdfs hdfs haadmin -getServiceState nn1
        active
        [root@c0405 ~]# sudo -u hdfs hdfs haadmin -getServiceState nn2
        standby
        [root@c0405 ~]# sudo -u hdfs hdfs haadmin -transitionToActive nn2
        [root@c0405 ~]# sudo -u hdfs hdfs haadmin -getServiceState nn2
        active
        [root@c0405 ~]# sudo -u hdfs hdfs haadmin -getServiceState nn1
        active
        [root@c0405 ~]# sudo -u hdfs hdfs haadmin -transitionToStandby nn1
        12/06/14 11:04:29 WARN ipc.Client: Unexpected error reading responses on connection Thread[IPC Client (44937684) connection to c0405.hal.cloudera.com/172.29.81.122:17020 from hdfs,5,main]
        java.lang.NullPointerException
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:852)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:781)
        Operation failed: Failed on local exception: java.io.IOException: Error reading responses; Host Details : local host is: "c0405.hal.cloudera.com/172.29.81.122"; destination host is: "c0405.hal.cloudera.com":17020;
        
        Todd Lipcon added a comment -

        So, you're inducing split brain on purpose?

        I think it's crashing because it tries to finalize its edit logs, which the other node has already finalized. This causes it to abort.

        The NPE you're seeing is just a bug in the IPC layer which throws NPE if the server disappears in the middle of RPC processing. This was fixed by HDFS-3504.

        Either way, I think the behavior during split brain is expected to be something like this.
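        As an illustration of the split-brain condition discussed here, the following minimal sketch flags the case where more than one NameNode reports itself active. It assumes the caller has already collected the text printed by `hdfs haadmin -getServiceState` for each NameNode; the function names `active_namenodes` and `is_split_brain` are hypothetical and are not part of Hadoop.

```python
def active_namenodes(states):
    """Return the sorted IDs of NameNodes whose reported state is 'active'.

    `states` maps a NameNode ID (e.g. "nn1") to the text printed by
    `hdfs haadmin -getServiceState <id>`. Collecting that text is left
    to the caller (assumption: it is passed in unmodified).
    """
    return sorted(
        nn for nn, state in states.items() if state.strip().lower() == "active"
    )


def is_split_brain(states):
    """True when more than one NameNode reports itself active."""
    return len(active_namenodes(states)) > 1


# States as reported in the transcript after `-transitionToActive nn2`.
reported = {"nn1": "active\n", "nn2": "active\n"}
if is_split_brain(reported):
    print("multiple active NameNodes:", ", ".join(active_namenodes(reported)))
```

        A check like this only observes the state after the fact; it does not prevent the second NameNode from becoming active.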

        Eli Collins added a comment -

        Thanks, didn't realize the NPE was cleared up. Agree, we can punt on this.

        How about making manual transitionToActive and transitionToStandby log a WARNING about the ramifications (split brain) and indicate that failover should be used instead?

        Thanks to Stephen Chu for identifying this originally btw.

        Todd Lipcon added a comment -

        Didn't we already remove them from the documentation?

        Eli Collins added a comment -

        Yup, thinking we should warn as well if they're used for those who don't rtfm.
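        The warning proposed in this thread could look roughly like the sketch below. This is not Hadoop code: it is a hypothetical wrapper-side check, and the names `manual_transition_warning`, `check`, and the logger name are assumptions made for illustration. It only inspects the arguments that would be passed to `hdfs haadmin` and suggests the `-failover` subcommand instead.

```python
import logging

log = logging.getLogger("haadmin_wrapper")  # hypothetical logger name

# Manual transition flags that appear in this issue.
MANUAL_TRANSITIONS = ("-transitionToActive", "-transitionToStandby")


def manual_transition_warning(argv):
    """Return a warning string if argv requests a manual transition, else None."""
    for flag in MANUAL_TRANSITIONS:
        if flag in argv:
            return (
                f"{flag} performs no fencing and can leave two active "
                "NameNodes (split brain); consider 'hdfs haadmin -failover' instead."
            )
    return None


def check(argv):
    """Log the warning, if any, and return it so the caller can decide what to do."""
    message = manual_transition_warning(argv)
    if message:
        log.warning(message)
    return message
```

        Returning the message instead of aborting leaves the decision with the operator, which matches the intent here of warning rather than removing the commands.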


          People

          • Assignee: Unassigned
          • Reporter: Eli Collins
          • Votes: 0
          • Watchers: 4
