Qpid
  1. Qpid
  2. QPID-3532

Fix the blocking of JMS operations when failover happens

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13
    • Component/s: Java Client
    • Labels:
      None

      Description

      When connection is lost and failover is started the Qpid Client should block on invocation of JMS operations which require sending or receiving data over the network.
      With current implementation the performing of certain operations during failover can lead to unexpected behaviour.
      For example, closing QueueBrowsers during failover has been observed to cause issues because it is possible to send the old subscription destination in a cancel command to the new broker as the close and failover are allowed to progress concurrently. As result of it the broker might close the session with a NOT_FOUND execution exception because failover has not finished queue re-creation on a new broker

        Activity

        Hide
        Robbie Gemmell added a comment -

        Closing out, patch was applied and included in the last release.

        Show
        Robbie Gemmell added a comment - Closing out, patch was applied and included in the last release.
        Hide
        Robbie Gemmell added a comment -

        Committed for the alpha build as agreed, would appreciate you giving it a once over with a clustered setup to see whether it stays in or not.

        Show
        Robbie Gemmell added a comment - Committed for the alpha build as agreed, would appreciate you giving it a once over with a clustered setup to see whether it stays in or not.
        Hide
        Rajith Attapattu added a comment -

        Cool, go for it !

        Show
        Rajith Attapattu added a comment - Cool, go for it !
        Hide
        Robbie Gemmell added a comment -

        I do feel this is a reasonable improvement over what we have, considering failover was previously allowed to proceed concurrently with the client in any old state and as per the QueueBrowser tests could often allow completely incorrect behaviour as a result. Most of the operations the client does already do hold the failover mutex (ironically enough its mainly only 0-10 failover that didnt), so there shouldnt be much performance difference in these changes as there are basically only a few if statements in there now which werent there already.

        As I say, I'm not overly bothered whether this makes into 0.14 or not. That said, since this is only the alpha deadline, its at best 6 weeks until 0.14 ships (assuming we arent a month late like the last couple releases were), a month before an RC is due, and these changes are somewhat tiny and easily revertable, I wouldnt feel at all concerned about putting them in now even supposing we did have to take them out again in the next few weeks.

        Show
        Robbie Gemmell added a comment - I do feel this is a reasonable improvement over what we have, considering failover was previously allowed to proceed concurrently with the client in any old state and as per the QueueBrowser tests could often allow completely incorrect behaviour as a result. Most of the operations the client does already do hold the failover mutex (ironically enough its mainly only 0-10 failover that didnt), so there shouldnt be much performance difference in these changes as there are basically only a few if statements in there now which werent there already. As I say, I'm not overly bothered whether this makes into 0.14 or not. That said, since this is only the alpha deadline, its at best 6 weeks until 0.14 ships (assuming we arent a month late like the last couple releases were), a month before an RC is due, and these changes are somewhat tiny and easily revertable, I wouldnt feel at all concerned about putting them in now even supposing we did have to take them out again in the next few weeks.
        Hide
        Rajith Attapattu added a comment -

        Robbie, thanks for the quick response.

        I can try to test this out with a clustered setup, however we have our f2f meeting next week so if I don't get it done by tomorrow then it will be delayed by a week. Hence one reason for my reluctance to see this go through for 0.14.

        While it is nice to have automated tests, through experience I've come to realize that they only provide very little confidence for a feature like failover. All most all of the previous failover issues were identified by manual testing with more real life scenarios or in production environments.

        Most issues tend to happen when failover happens in the middle of a client going full steam (with operations like producing/consuming/creating/querying ..etc). It is almost impossible to replicate this kind of scenario with the automated testing. The fact that there was no testing resembling production scenarios on your end makes a me a bit concerned about the changes.

        I've tried to semi automate tests like this using the python testkit, but never got around to really get it to a reasonable state as it kept failing behind with changes made on the clustering side and brokertest.py .

        Another hidden danger of such a change could be an impact on performance. Something that cannot be verified by merely running automated tests. Again if we have an agreed upon framework where we can benchmark before and after fixes we suspect of being perf sensitive that would have been great. We could also run them on a per release basis to ensure that we have either improved or regressed on the perf front.

        Anyways, If you feel confident about these changes and strongly feel they should be included in 0.14 then I'd accept your word on it and will not make a fuss about it. But I have to note that I have very little confidence in the changes or the testing being done here so far. Please note it's not a reflection of the work you've done for this particular patch, rather more on the inadequate testing strategy we have with the client in general.

        Regards,

        Rajith

        Show
        Rajith Attapattu added a comment - Robbie, thanks for the quick response. I can try to test this out with a clustered setup, however we have our f2f meeting next week so if I don't get it done by tomorrow then it will be delayed by a week. Hence one reason for my reluctance to see this go through for 0.14. While it is nice to have automated tests, through experience I've come to realize that they only provide very little confidence for a feature like failover. All most all of the previous failover issues were identified by manual testing with more real life scenarios or in production environments. Most issues tend to happen when failover happens in the middle of a client going full steam (with operations like producing/consuming/creating/querying ..etc). It is almost impossible to replicate this kind of scenario with the automated testing. The fact that there was no testing resembling production scenarios on your end makes a me a bit concerned about the changes. I've tried to semi automate tests like this using the python testkit, but never got around to really get it to a reasonable state as it kept failing behind with changes made on the clustering side and brokertest.py . Another hidden danger of such a change could be an impact on performance. Something that cannot be verified by merely running automated tests. Again if we have an agreed upon framework where we can benchmark before and after fixes we suspect of being perf sensitive that would have been great. We could also run them on a per release basis to ensure that we have either improved or regressed on the perf front. Anyways, If you feel confident about these changes and strongly feel they should be included in 0.14 then I'd accept your word on it and will not make a fuss about it. But I have to note that I have very little confidence in the changes or the testing being done here so far. Please note it's not a reflection of the work you've done for this particular patch, rather more on the inadequate testing strategy we have with the client in general. Regards, Rajith
        Hide
        Robbie Gemmell added a comment -

        The testing done is the failover behaviour tests added both before this patch, and in this patch. There has been no testing in a clustered environment, only the standalone Java and C++ broker profiles. We are not in a position to test the cluster, which is why this is up for review and was not simply committed.

        What change backouts are you referring to? No changes have been backed out that I can think of other than 1 trivial revert I did earlier today of some completely irrelevant code change, which was simply restored to its previous state because it looked nicer. That did not need reverted, as there was no functional difference between the two versions, but it looked cleaner the way it was.

        The summary of changes for the patch is as given in the review posting. There is not a lot to say about it since it is just a small amount of change and leaves little to really describe.

        More tests could be useful, but I am at least happy we do have signficantly more (i.e > 0) automated tests of the shipping behaviour of the client for these failover+addressing changes than most of the others made in the past 18months have.

        I'm not particularly bothered whether this makes into 0.14 or not, we have just been sitting on an incomplete version of it for long enough now that it was time to get it finished and posted

        Show
        Robbie Gemmell added a comment - The testing done is the failover behaviour tests added both before this patch, and in this patch. There has been no testing in a clustered environment, only the standalone Java and C++ broker profiles. We are not in a position to test the cluster, which is why this is up for review and was not simply committed. What change backouts are you referring to? No changes have been backed out that I can think of other than 1 trivial revert I did earlier today of some completely irrelevant code change, which was simply restored to its previous state because it looked nicer. That did not need reverted, as there was no functional difference between the two versions, but it looked cleaner the way it was. The summary of changes for the patch is as given in the review posting. There is not a lot to say about it since it is just a small amount of change and leaves little to really describe. More tests could be useful, but I am at least happy we do have signficantly more (i.e > 0) automated tests of the shipping behaviour of the client for these failover+addressing changes than most of the others made in the past 18months have. I'm not particularly bothered whether this makes into 0.14 or not, we have just been sitting on an incomplete version of it for long enough now that it was time to get it finished and posted
        Hide
        Rajith Attapattu added a comment -

        What kind of testing has been done on this patch?
        Has there been any testing done in a clustered setup?

        This change obviously impacts any failover operation. Given that failover is our Achilles heel and that the Failover mutex being a source of many race conditions and deadlocks, I'm quite concerned about making this change just before the deadline.

        I am also concerned that certain changes made in the last week have been backed out, giving the impression that we've been hurried by the looming deadline. Perhaps we will benefit by taking more time to think through and verify these changes and target these fixes for our next release.

        Also what would be beneficial is a summary of changes being done on the client. It's true we discussed the issues and possible solutions. But I think it's equally important to have an outline of the code changes being made especially given that they were made very close to the release. This makes it easy for anybody reviewing those changes.

        Again given the fact that we don't have quality tests to verify these changes with confidence, I'm of the opinion that we make these changes after the 0.14 release.

        Regards,

        Rajith

        Show
        Rajith Attapattu added a comment - What kind of testing has been done on this patch? Has there been any testing done in a clustered setup? This change obviously impacts any failover operation. Given that failover is our Achilles heel and that the Failover mutex being a source of many race conditions and deadlocks, I'm quite concerned about making this change just before the deadline. I am also concerned that certain changes made in the last week have been backed out, giving the impression that we've been hurried by the looming deadline. Perhaps we will benefit by taking more time to think through and verify these changes and target these fixes for our next release. Also what would be beneficial is a summary of changes being done on the client. It's true we discussed the issues and possible solutions. But I think it's equally important to have an outline of the code changes being made especially given that they were made very close to the release. This makes it easy for anybody reviewing those changes. Again given the fact that we don't have quality tests to verify these changes with confidence, I'm of the opinion that we make these changes after the 0.14 release. Regards, Rajith
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/2469/
        -----------------------------------------------------------

        Review request for qpid and rajith attapattu.

        Summary
        -------

        https://issues.apache.org/jira/browse/QPID-3532 describes some issues with non-blocking behaviour of 0-10 client failover.

        This patch makes the 0-10 client hold the failover mutex during the failover. It also alters the Address resolution code to allow resolving addresses after failover and adds some more failover tests (inc ADDR based ones). Additionally, it makes the failover process notify any waiters in the session to abort and let failover proceed as they would otherwise prevent the failover from occurring until they timed out after 60 seconds.

        This work does not go as far as we would like but still represents a functional improvement to failover. Additional improvements to the client locking model (such as an introduction a single connection lock) will require further iteration.

        The patch was worked on by myself and Robbie.

        This addresses bug QPID-3532.
        https://issues.apache.org/jira/browse/QPID-3532

        Diffs


        http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/AMQConnection.java 1186863
        http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/AMQConnectionDelegate_0_10.java 1186863
        http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/AMQDestination.java 1186863
        http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/AMQSession_0_10.java 1186863
        http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/BasicMessageProducer_0_10.java 1186863
        http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/common/src/main/java/org/apache/qpid/transport/Connection.java 1186863
        http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/common/src/main/java/org/apache/qpid/transport/Session.java 1186863
        http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/systests/src/main/java/org/apache/qpid/client/failover/FailoverBehaviourTest.java 1186863
        http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/test-profiles/JavaPre010Excludes 1186863

        Diff: https://reviews.apache.org/r/2469/diff

        Testing
        -------

        Thanks,

        Oleksandr

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/2469/ ----------------------------------------------------------- Review request for qpid and rajith attapattu. Summary ------- https://issues.apache.org/jira/browse/QPID-3532 describes some issues with non-blocking behaviour of 0-10 client failover. This patch makes the 0-10 client hold the failover mutex during the failover. It also alters the Address resolution code to allow resolving addresses after failover and adds some more failover tests (inc ADDR based ones). Additionally, it makes the failover process notify any waiters in the session to abort and let failover proceed as they would otherwise prevent the failover from occurring until they timed out after 60 seconds. This work does not go as far as we would like but still represents a functional improvement to failover. Additional improvements to the client locking model (such as an introduction a single connection lock) will require further iteration. The patch was worked on by myself and Robbie. This addresses bug QPID-3532 . https://issues.apache.org/jira/browse/QPID-3532 Diffs http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/AMQConnection.java 1186863 http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/AMQConnectionDelegate_0_10.java 1186863 http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/AMQDestination.java 1186863 http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/AMQSession_0_10.java 1186863 http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/BasicMessageProducer_0_10.java 1186863 http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/common/src/main/java/org/apache/qpid/transport/Connection.java 1186863 http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/common/src/main/java/org/apache/qpid/transport/Session.java 1186863 http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/systests/src/main/java/org/apache/qpid/client/failover/FailoverBehaviourTest.java 1186863 http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/test-profiles/JavaPre010Excludes 1186863 Diff: https://reviews.apache.org/r/2469/diff Testing ------- Thanks, Oleksandr
        Hide
        Alex Rudyy added a comment -

        Attached a patch fixing the issue

        Show
        Alex Rudyy added a comment - Attached a patch fixing the issue

          People

          • Assignee:
            Alex Rudyy
            Reporter:
            Alex Rudyy
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development