Hadoop Common
  1. Hadoop Common
  2. HADOOP-8220

ZKFailoverController doesn't handle failure to become active correctly

    Details

      Description

      The ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly.

      1. hadoop-8220.txt
        15 kB
        Todd Lipcon
      2. hadoop-8220.txt
        15 kB
        Todd Lipcon
      3. hadoop-8220.txt
        12 kB
        Todd Lipcon
      4. hadoop-8220.txt
        12 kB
        Todd Lipcon
      5. hadoop-8220.txt
        9 kB
        Todd Lipcon

        Activity

        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12520062/hadoop-8220.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.ha.TestFailoverController
        org.apache.hadoop.ha.TestZKFailoverController

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/789//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/789//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12520062/hadoop-8220.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.ha.TestFailoverController org.apache.hadoop.ha.TestZKFailoverController +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/789//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/789//console This message is automatically generated.
        Hide
        Todd Lipcon added a comment -

        Tests failing due to HADOOP-8218

        Show
        Todd Lipcon added a comment - Tests failing due to HADOOP-8218
        Hide
        Bikas Saha added a comment -

        Test missing for changes in ActiveStandbyElector.

        Do you really want to commit the logs added to ActiveStandbyTestUtil?

        The ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly.

        How does NPE occur when the elector makes sure the client is recreated upon rejoining the election? Which zkClient are you talking about?
        What is the purpose of adding the sleep? Could you please elaborate?

        Show
        Bikas Saha added a comment - Test missing for changes in ActiveStandbyElector. Do you really want to commit the logs added to ActiveStandbyTestUtil? The ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly. How does NPE occur when the elector makes sure the client is recreated upon rejoining the election? Which zkClient are you talking about? What is the purpose of adding the sleep? Could you please elaborate?
        Hide
        Todd Lipcon added a comment -

        I'll add a new test to the ActiveStandbyElector-specific code for this. I was testing it via the "integration test", but you're right that adding to the unit tests makes sense too.

        How does NPE occur when the elector makes sure the client is recreated upon rejoining the election? Which zkClient are you talking about?

        The NPE occurred in the previous code because we had the following sequence:

        • createNode succeeded
        • called ZKFC becomeActive() callback
          • becomeActive() throws exception
          • ZKFC had a catch() clause which called quitElection () (it turned out this wasn't the right behavior)
            • quitElection() nulled out zkClient
        • ActiveStandbyElector called monitorNode(), which tried to use zkClient, which had just been nulled out.

        The new behavior avoids this, since the error handling patch is in ActiveStandbyElector itself. This makes it easier to get the right semantics.

        What is the purpose of adding the sleep? Could you please elaborate?

        Without the sleep, it will do a tight loop retrying to become active. This generates a lot of log spew and has little actual benefit. If instead we retry only once a second, then (a) the logs are more readable, and (b) if there is another StandbyNode in the cluster, it will get a chance to try to become active.

        I will add a comment to this effect in the code.

        Show
        Todd Lipcon added a comment - I'll add a new test to the ActiveStandbyElector-specific code for this. I was testing it via the "integration test", but you're right that adding to the unit tests makes sense too. How does NPE occur when the elector makes sure the client is recreated upon rejoining the election? Which zkClient are you talking about? The NPE occurred in the previous code because we had the following sequence: createNode succeeded called ZKFC becomeActive() callback becomeActive() throws exception ZKFC had a catch() clause which called quitElection () (it turned out this wasn't the right behavior) quitElection() nulled out zkClient ActiveStandbyElector called monitorNode(), which tried to use zkClient, which had just been nulled out. The new behavior avoids this, since the error handling patch is in ActiveStandbyElector itself. This makes it easier to get the right semantics. What is the purpose of adding the sleep? Could you please elaborate? Without the sleep, it will do a tight loop retrying to become active. This generates a lot of log spew and has little actual benefit. If instead we retry only once a second, then (a) the logs are more readable, and (b) if there is another StandbyNode in the cluster, it will get a chance to try to become active. I will add a comment to this effect in the code.
        Hide
        Hari Mankude added a comment -

        Todd,

        Couple of comments on the design rather than the code changes

        1. becomeActive() could have two return values at the appClient (true or false). For example, NN might decide that it cannot become active since it does not have access to the resources to become active. We would need to handle these two return values. If the return value is false, FC will give up the ephemeral znode for one "timeout" iteration to allow other NN's FC to take over the znode.

        2. becomeActive() should be protected by a timeout also. If NN is taking far too long to return, FC should declare failure and give up the lock. Otherwise, it is a deadlock.

        As you commented on the other jira, it might be useful to create a seperate branch for automatic failover. There will be lots of corner cases to deal with.

        Show
        Hari Mankude added a comment - Todd, Couple of comments on the design rather than the code changes 1. becomeActive() could have two return values at the appClient (true or false). For example, NN might decide that it cannot become active since it does not have access to the resources to become active. We would need to handle these two return values. If the return value is false, FC will give up the ephemeral znode for one "timeout" iteration to allow other NN's FC to take over the znode. 2. becomeActive() should be protected by a timeout also. If NN is taking far too long to return, FC should declare failure and give up the lock. Otherwise, it is a deadlock. As you commented on the other jira, it might be useful to create a seperate branch for automatic failover. There will be lots of corner cases to deal with.
        Hide
        Bikas Saha added a comment -

        The new behavior avoids this, since the error handling patch is in ActiveStandbyElector itself. This makes it easier to get the right semantics.

        Ah. Now I get it. The elector should be robust against client code (ZKFC in this case). I like Hari's proposal of using a return value to inform about fail/success of becoming active. I am not that familiar with standard practices in Java - are return values preferred or exceptions?

        This generates a lot of log spew and has little actual benefit. If instead we retry only once a second, then (a) the logs are more readable, and (b) if there is another StandbyNode in the cluster, it will get a chance to try to become active.

        I did not understand where the tight loop is? Do you mean (Elector gets lock<->ZKFC fails to becomes active)?
        I do not have any data on the trade off between 1) letting the last active become active again with log spew 2) letting another standby become active by making the last active sleep. But for arguments sake I would prefer 1). IMO having continuity in the active node would reduce the overheads of client/datanode failover etc.

        becomeActive() should be protected by a timeout also. If NN is taking far too long to return, FC should declare failure and give up the lock. Otherwise, it is a deadlock.

        Hari, this seems similar to the alternative proposed in HADOOP-8205 about trying to make sure that the transition to active is short.

        Show
        Bikas Saha added a comment - The new behavior avoids this, since the error handling patch is in ActiveStandbyElector itself. This makes it easier to get the right semantics. Ah. Now I get it. The elector should be robust against client code (ZKFC in this case). I like Hari's proposal of using a return value to inform about fail/success of becoming active. I am not that familiar with standard practices in Java - are return values preferred or exceptions? This generates a lot of log spew and has little actual benefit. If instead we retry only once a second, then (a) the logs are more readable, and (b) if there is another StandbyNode in the cluster, it will get a chance to try to become active. I did not understand where the tight loop is? Do you mean (Elector gets lock<->ZKFC fails to becomes active)? I do not have any data on the trade off between 1) letting the last active become active again with log spew 2) letting another standby become active by making the last active sleep. But for arguments sake I would prefer 1). IMO having continuity in the active node would reduce the overheads of client/datanode failover etc. becomeActive() should be protected by a timeout also. If NN is taking far too long to return, FC should declare failure and give up the lock. Otherwise, it is a deadlock. Hari, this seems similar to the alternative proposed in HADOOP-8205 about trying to make sure that the transition to active is short.
        Hide
        Todd Lipcon added a comment -

        Ah. Now I get it. The elector should be robust against client code (ZKFC in this case). I like Hari's proposal of using a return value to inform about fail/success of becoming active. I am not that familiar with standard practices in Java - are return values preferred or exceptions?

        You got it. Exceptions are generally preferred for cases like this – since we have to handle the error condition regardless of whether it's a usual error or whether it was something like a NPE or other truly exceptional condition. So even with a boolean return type, we'd need a try/catch clause. Does that make sense? (I also had originally made it return boolean but then changed it to an exception)

        I did not understand where the tight loop is? Do you mean (Elector gets lock<->ZKFC fails to becomes active)?

        Yep. In my test I saw that the standby would retry in a tight loop like that:

        1. Succeed in getting lock
        2. Call becomeActive()
        3. drop ZK session (lock disappears)
        4. reconnect to ZK
        5. Goto 1

        I simply inserted a sleep between dropping the connection and reconnecting. This gives the old active a better chance to become active again (or if there is a third node in the future, it would have a chance to take the lock). In the future we may want to add some random jitter and exponential backoff, but at this point let's keep it simple.

        Show
        Todd Lipcon added a comment - Ah. Now I get it. The elector should be robust against client code (ZKFC in this case). I like Hari's proposal of using a return value to inform about fail/success of becoming active. I am not that familiar with standard practices in Java - are return values preferred or exceptions? You got it. Exceptions are generally preferred for cases like this – since we have to handle the error condition regardless of whether it's a usual error or whether it was something like a NPE or other truly exceptional condition. So even with a boolean return type, we'd need a try/catch clause. Does that make sense? (I also had originally made it return boolean but then changed it to an exception) I did not understand where the tight loop is? Do you mean (Elector gets lock<->ZKFC fails to becomes active)? Yep. In my test I saw that the standby would retry in a tight loop like that: Succeed in getting lock Call becomeActive() drop ZK session (lock disappears) reconnect to ZK Goto 1 I simply inserted a sleep between dropping the connection and reconnecting. This gives the old active a better chance to become active again (or if there is a third node in the future, it would have a chance to take the lock). In the future we may want to add some random jitter and exponential backoff, but at this point let's keep it simple.
        Hide
        Bikas Saha added a comment -

        Do you mean?
        1. Succeed in getting lock
        2. Call becomeActive()
        3. ZKFC fails to become active. Call quitElection.
        4. drop ZK session (lock disappears)
        5. reconnect to ZK
        6. Goto 1

        Otherwise I dont see why lock disappears.
        If yes, then this might be ok, since by design we are deciding to sleep and let someone else take a shot at becoming active because we are having trouble doing so. Could you please add this in comments so that the sleeping is explained.

        Now that I am looking at the patch in a less sleep state I think the following might be a better flow because the actions on failure and success are in one place

        if (becomeActive())

        { monitorActiveStatus(); }

        else

        { LOG.warn("Failed to become active. Rejoin election to try again but sleep before that to let someone else try."); reJoinElection(SLEEP_AFTER_FAILURE_TO_BECOME_ACTIVE); }

        This puts the fail/success handling of becomeActive() in 1 place and avoids the calling of becomeActive() have a side-effect of also calling rejoinElection() on failure.
        What do you think?

        Show
        Bikas Saha added a comment - Do you mean? 1. Succeed in getting lock 2. Call becomeActive() 3. ZKFC fails to become active. Call quitElection. 4. drop ZK session (lock disappears) 5. reconnect to ZK 6. Goto 1 Otherwise I dont see why lock disappears. If yes, then this might be ok, since by design we are deciding to sleep and let someone else take a shot at becoming active because we are having trouble doing so. Could you please add this in comments so that the sleeping is explained. Now that I am looking at the patch in a less sleep state I think the following might be a better flow because the actions on failure and success are in one place if (becomeActive()) { monitorActiveStatus(); } else { LOG.warn("Failed to become active. Rejoin election to try again but sleep before that to let someone else try."); reJoinElection(SLEEP_AFTER_FAILURE_TO_BECOME_ACTIVE); } This puts the fail/success handling of becomeActive() in 1 place and avoids the calling of becomeActive() have a side-effect of also calling rejoinElection() on failure. What do you think?
        Hide
        Todd Lipcon added a comment -

        Yep, your updated description of the "tight loop" is exactly right. Sorry, I didn't note the fact that becomeActive() throws an exception in this scenario.

        New draft of the patch attached.

        • Added a true unit test for the new changes, in addition to the functional test from the prior revision (TestActiveStandbyElector#testFailToBecomeActive)
        • Change the control flow so that the success and error cases are kept near each other (suggested by Bikas above)
        • Changed the sleep calls to be wrapped in a sleepFor(ms) function, so it's easy to disable the sleeping behavior in the unit tests. Otherwise the tests ran longer for no good reason.

        In response to a couple comments above that got lost in the discussion:

        2. becomeActive() should be protected by a timeout also. If NN is taking far too long to return, FC should declare failure and give up the lock. Otherwise, it is a deadlock.

        This is really difficult to do reliably, since there's no good way to 'cancel' the callback. The transitionToActive RPC itself should have a timeout attached – it's much more straightforward to do that than to try to make ActiveStandbyElector guard against arbitrary code running too long in the callback. I added a note to the javadoc indicating this.

        Do you really want to commit the logs added to ActiveStandbyTestUtil?

        Yes, I found that when I had a test failure due to timeout, it was difficult to debug, since I couldn't easily tell which node had the lock at the time the test timed out. I rate-limited the logging to only two per second, so it shouldn't make the logs too noisy, while retaining the advantage of seeing what's going on better if there is a timeout.

        Show
        Todd Lipcon added a comment - Yep, your updated description of the "tight loop" is exactly right. Sorry, I didn't note the fact that becomeActive() throws an exception in this scenario. New draft of the patch attached. Added a true unit test for the new changes, in addition to the functional test from the prior revision (TestActiveStandbyElector#testFailToBecomeActive) Change the control flow so that the success and error cases are kept near each other (suggested by Bikas above) Changed the sleep calls to be wrapped in a sleepFor(ms) function, so it's easy to disable the sleeping behavior in the unit tests. Otherwise the tests ran longer for no good reason. In response to a couple comments above that got lost in the discussion: 2. becomeActive() should be protected by a timeout also. If NN is taking far too long to return, FC should declare failure and give up the lock. Otherwise, it is a deadlock. This is really difficult to do reliably, since there's no good way to 'cancel' the callback. The transitionToActive RPC itself should have a timeout attached – it's much more straightforward to do that than to try to make ActiveStandbyElector guard against arbitrary code running too long in the callback. I added a note to the javadoc indicating this. Do you really want to commit the logs added to ActiveStandbyTestUtil? Yes, I found that when I had a test failure due to timeout, it was difficult to debug, since I couldn't easily tell which node had the lock at the time the test timed out. I rate-limited the logging to only two per second, so it shouldn't make the logs too noisy, while retaining the advantage of seeing what's going on better if there is a timeout.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12520334/hadoop-8220.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 9 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.fs.viewfs.TestViewFsTrash

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/793//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/793//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12520334/hadoop-8220.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.fs.viewfs.TestViewFsTrash +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/793//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/793//console This message is automatically generated.
        Hide
        Bikas Saha added a comment -

        I dont think this comment is correct

        // The failure to become active is already logged above.

        There is no log above. Actually you might just want to move the log above and remove the comment altogether.

        +1

        Show
        Bikas Saha added a comment - I dont think this comment is correct // The failure to become active is already logged above. There is no log above. Actually you might just want to move the log above and remove the comment altogether. +1
        Hide
        Todd Lipcon added a comment -

        The "above" there refers to the fact that becomeActive() itself logs the error message.
        I changed the comment to:

        +        // The failure to become active is already logged inside
        +        // becomeActive().
        
        Show
        Todd Lipcon added a comment - The "above" there refers to the fact that becomeActive() itself logs the error message. I changed the comment to: + // The failure to become active is already logged inside + // becomeActive().
        Hide
        Todd Lipcon added a comment -

        New patch incorporating above comment change.

        Show
        Todd Lipcon added a comment - New patch incorporating above comment change.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12520520/hadoop-8220.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 9 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.fs.viewfs.TestViewFsTrash

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/798//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/798//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12520520/hadoop-8220.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.fs.viewfs.TestViewFsTrash +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/798//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/798//console This message is automatically generated.
        Hide
        Todd Lipcon added a comment -

        Actually moving the error handling code to the call site (instead of inside becomeActive()) introduced a bug, since we call becomeActive() from another spot as well, in the StatCallback. So we need to have similar code there, or move the error handling back up into becomeActive()

        Show
        Todd Lipcon added a comment - Actually moving the error handling code to the call site (instead of inside becomeActive()) introduced a bug, since we call becomeActive() from another spot as well, in the StatCallback. So we need to have similar code there, or move the error handling back up into becomeActive()
        Hide
        Todd Lipcon added a comment -

        New version addresses the above issue, handling failure to become active in the StatCallback as well.

        I also added a new unit test for this case.

        As part of the stress tests in HADOOP-8228, I also have added a test case which covers this codepath with a real ZK server (verified with Clover)

        Show
        Todd Lipcon added a comment - New version addresses the above issue, handling failure to become active in the StatCallback as well. I also added a new unit test for this case. As part of the stress tests in HADOOP-8228 , I also have added a test case which covers this codepath with a real ZK server (verified with Clover)
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12520563/hadoop-8220.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 9 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/801//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12520563/hadoop-8220.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/801//console This message is automatically generated.
        Hide
        Bikas Saha added a comment -

        lgtm

        Show
        Bikas Saha added a comment - lgtm
        Hide
        Aaron T. Myers added a comment -

        Patch looks pretty good to me, Todd. A few little comments. +1 once these are addressed:

        1. Any reason we shouldn't make SLEEP_AFTER_FAILURE_TO_BECOME_ACTIVE configurable?
        2. There's some inconsistency in capitalization between "reJoinElection" and "rejoinElectionAfterFailureToBecomeActive"
        3. Might want to do a s/System.currentTimeMillis/Util.now/g
        4. Any reason we shouldn't make LOG_INTERVAL_MS configurable?
        5. Add @VisibleForTesting to sleepFor, since it would be private (and probably static) otherwise. Maybe even add a comment saying why it's not static.
        6. Considering the comment says "after sleeping for a short period" in TestActiveStandbyElector#testFailToBecomeActive, maybe also verify that sleepFor was called? Likewise in testFailToBecomeActiveAfterZKDisconnect.
        Show
        Aaron T. Myers added a comment - Patch looks pretty good to me, Todd. A few little comments. +1 once these are addressed: Any reason we shouldn't make SLEEP_AFTER_FAILURE_TO_BECOME_ACTIVE configurable? There's some inconsistency in capitalization between "reJoinElection" and "rejoinElectionAfterFailureToBecomeActive" Might want to do a s/System.currentTimeMillis/Util.now/g Any reason we shouldn't make LOG_INTERVAL_MS configurable? Add @VisibleForTesting to sleepFor, since it would be private (and probably static) otherwise. Maybe even add a comment saying why it's not static. Considering the comment says "after sleeping for a short period" in TestActiveStandbyElector#testFailToBecomeActive, maybe also verify that sleepFor was called? Likewise in testFailToBecomeActiveAfterZKDisconnect.
        Hide
        Todd Lipcon added a comment -

        Any reason we shouldn't make SLEEP_AFTER_FAILURE_TO_BECOME_ACTIVE configurable?

        Currently, ActiveStandbyElector doesn't take a Configuration object. I think many of the parameters should be changed to be configured via Configuration, but I didn't want to make this into a bigger scoped change.

        There's some inconsistency in capitalization between "reJoinElection" and "rejoinElectionAfterFailureToBecomeActive"

        Changed to consistently use "reJoin" to match the previously existing code.

        Might want to do a s/System.currentTimeMillis/Util.now/g

        The Util class is in HDFS, but this code is in common. We don't seem to have an equivalent in common.

        Any reason we shouldn't make LOG_INTERVAL_MS configurable?

        It's just test code, so seemed unnecessary.

        Add @VisibleForTesting to sleepFor, since it would be private (and probably static) otherwise. Maybe even add a comment saying why it's not static.

        Considering the comment says "after sleeping for a short period" in TestActiveStandbyElector#testFailToBecomeActive, maybe also verify that sleepFor was called? Likewise in testFailToBecomeActiveAfterZKDisconnect.

        Done. I made the overridden method keep a tally of number of slept millis, and added asserts to the tests to make sure it slept for some time when rejoining.

        Show
        Todd Lipcon added a comment - Any reason we shouldn't make SLEEP_AFTER_FAILURE_TO_BECOME_ACTIVE configurable? Currently, ActiveStandbyElector doesn't take a Configuration object. I think many of the parameters should be changed to be configured via Configuration, but I didn't want to make this into a bigger scoped change. There's some inconsistency in capitalization between "reJoinElection" and "rejoinElectionAfterFailureToBecomeActive" Changed to consistently use "reJoin" to match the previously existing code. Might want to do a s/System.currentTimeMillis/Util.now/g The Util class is in HDFS, but this code is in common. We don't seem to have an equivalent in common. Any reason we shouldn't make LOG_INTERVAL_MS configurable? It's just test code, so seemed unnecessary. Add @VisibleForTesting to sleepFor, since it would be private (and probably static) otherwise. Maybe even add a comment saying why it's not static. Considering the comment says "after sleeping for a short period" in TestActiveStandbyElector#testFailToBecomeActive, maybe also verify that sleepFor was called? Likewise in testFailToBecomeActiveAfterZKDisconnect. Done. I made the overridden method keep a tally of number of slept millis, and added asserts to the tests to make sure it slept for some time when rejoining.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12520652/hadoop-8220.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 9 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/805//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12520652/hadoop-8220.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/805//console This message is automatically generated.
        Hide
        Aaron T. Myers added a comment -

        Thanks for addressing my comments, Todd.

        +1, the latest patch looks good to me.

        Show
        Aaron T. Myers added a comment - Thanks for addressing my comments, Todd. +1, the latest patch looks good to me.
        Hide
        Todd Lipcon added a comment -

        Committed to HDFS-3042 branch. Thanks for the reviews, Bikas and Aaron.

        Show
        Todd Lipcon added a comment - Committed to HDFS-3042 branch. Thanks for the reviews, Bikas and Aaron.

          People

          • Assignee:
            Todd Lipcon
            Reporter:
            Todd Lipcon
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development