Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.10
    • Fix Version/s: 0.10
    • Component/s: None
    • Labels:
      None

      Description

      The JMS failover is not working due to a regression introduced in rev 1071631.
      More specifically the reason is the following line being removed.

      @@ -257,12 +256,14 @@
      ConnectionClose close = exc.getClose();
      if (close == null)
      {
      + _conn.getProtocolHandler().setFailoverLatch(new CountDownLatch(1));
      +
      try
      {
      if (_conn.firePreFailover(false) && _conn.attemptReconnection())

      { _conn.failoverPrep(); - _qpidConnection.resume(); ---- > This line is removed in this commit. + _conn.resubscribeSessions(); _conn.fireFailoverComplete(); return; }

      On the surface this seems like a unintended omission. However I'd like to investigate it further and find out if this was intentional.
      It if it was, then what was the reason behind it?

      The obvious fix is to re-introduce this line. I have done so and preliminary tests indicates that it will resolve the issue. I haven't seen any other issues.
      However I'd like to look at more information regarding the context of the above change before making a final decision.

        Activity

        Hide
        Rajith Attapattu added a comment -

        Committed the fix to the 0.10 release branch.

        Show
        Rajith Attapattu added a comment - Committed the fix to the 0.10 release branch.
        Hide
        Rajith Attapattu added a comment - - edited

        The failover issue was noticed before (due to test failures) and it was on my todo list to investigate.
        But there was a steady stream of issues that I had to deal with, hence the delay in getting to it.

        I have spent a lot of time working on the failover/transport code for several bugs Ex. QPID-3043, QPID-3042, QPID-2876,QPID-2861, QPID-2808, QPID-2994 ..etc
        The failover code is probably one of more fragile areas in the code base to say the least.
        Some of these issues cannot be tested with automated test cases due to various limitations, hence difficult to catch certain regressions/bugs.
        I need to get the python testkit working properly as it's a lot more flexible than the ant tests. It also seems to be able to reproduce a few of the client and broker bugs we had in the past.

        I plan to retest the failover area manually for some of the above issues. Post release I am planning to get the testkit running more frequently.
        I have tested this particular fix, but also plan more comprehensive tests around this area.

        Show
        Rajith Attapattu added a comment - - edited The failover issue was noticed before (due to test failures) and it was on my todo list to investigate. But there was a steady stream of issues that I had to deal with, hence the delay in getting to it. I have spent a lot of time working on the failover/transport code for several bugs Ex. QPID-3043 , QPID-3042 , QPID-2876 , QPID-2861 , QPID-2808 , QPID-2994 ..etc The failover code is probably one of more fragile areas in the code base to say the least. Some of these issues cannot be tested with automated test cases due to various limitations, hence difficult to catch certain regressions/bugs. I need to get the python testkit working properly as it's a lot more flexible than the ant tests. It also seems to be able to reproduce a few of the client and broker bugs we had in the past. I plan to retest the failover area manually for some of the above issues. Post release I am planning to get the testkit running more frequently. I have tested this particular fix, but also plan more comprehensive tests around this area.
        Hide
        Robbie Gemmell added a comment -

        The change looks to restore the original behaviour and so seems ok from that perspective, however I am concerned it took around 6 weeks to notice it wasn't working though; can we really be sure it is working after the change in that case?

        The JIRA needs updated with a fix-for of 0.10 if it is backported.

        Show
        Robbie Gemmell added a comment - The change looks to restore the original behaviour and so seems ok from that perspective, however I am concerned it took around 6 weeks to notice it wasn't working though; can we really be sure it is working after the change in that case? The JIRA needs updated with a fix-for of 0.10 if it is backported.
        Hide
        Andrew Kennedy added a comment -

        Can you review this fix, for potential inclusion in the 0.10 releaase, please?

        Show
        Andrew Kennedy added a comment - Can you review this fix, for potential inclusion in the 0.10 releaase, please?
        Hide
        Justin Ross added a comment -

        This is close enough to the end that I'd like to get Rob or Robbie to weigh in as well.

        Show
        Justin Ross added a comment - This is close enough to the end that I'd like to get Rob or Robbie to weigh in as well.
        Hide
        Andrew Kennedy added a comment -

        I thought Rajith was responsible for the broker component changes?

        Show
        Andrew Kennedy added a comment - I thought Rajith was responsible for the broker component changes?
        Hide
        Justin Ross added a comment -

        This issue is still awaiting review by a component maintainer. I'll go ahead and give my approval for 0.10, conditioned upon favorable said review, so that this can be committed promptly.

        Show
        Justin Ross added a comment - This issue is still awaiting review by a component maintainer. I'll go ahead and give my approval for 0.10, conditioned upon favorable said review, so that this can be committed promptly.
        Hide
        Rajith Attapattu added a comment -

        Andrew thanks for putting the missing method call back into the code.

        I am reopening the JIRA until we port the fix to 0.10 release branch.

        At this point I am treating this as a blocker for the 0.10 release as the JMS failover is a critical piece and is obviously a regression from our previous release.

        Show
        Rajith Attapattu added a comment - Andrew thanks for putting the missing method call back into the code. I am reopening the JIRA until we port the fix to 0.10 release branch. At this point I am treating this as a blocker for the 0.10 release as the JMS failover is a critical piece and is obviously a regression from our previous release.
        Hide
        Andrew Kennedy added a comment -

        Fixed on trunk. As this is a regression, it should be ported to the 0.10 branch as well.

        Show
        Andrew Kennedy added a comment - Fixed on trunk. As this is a regression, it should be ported to the 0.10 branch as well.
        Hide
        Andrew Kennedy added a comment -

        Yes, this is accidental. Thank's for spotting it. I had intended to refactor this more completely, and part of that involved moving the connection resume to resubscribeSessions, which is called indirectly. I have replaced this call.

        Show
        Andrew Kennedy added a comment - Yes, this is accidental. Thank's for spotting it. I had intended to refactor this more completely, and part of that involved moving the connection resume to resubscribeSessions, which is called indirectly. I have replaced this call.

          People

          • Assignee:
            Robbie Gemmell
            Reporter:
            Rajith Attapattu
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development