Qpid / QPID-2992

Cluster failing to resurrect durable static route depending on order of shutdown

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.8
    • Fix Version/s: None
    • Component/s: C++ Broker, C++ Clustering
    • Labels:
      None
    • Environment:

      Debian Linux Squeeze, 32-bit, kernel 2.6.36.2, Dell Poweredge 1950s. Corosync==1.3.0, Openais==1.1.4

      Description

      I've got a 2-node qpid test cluster at each of 2 datacenters, which are federated together with a single durable static route between each. Qpid is version 0.8. Corosync and openais are stock Squeeze (1.2.1-3 and 1.1.2-2, respectively). OS is Squeeze, 32-bit, on Dell Poweredge 1950s, kernel 2.6.36. The static route is durable and is set up over SSL (but I can replicate as well with non-SSL). I've tried to normalize the hostnames below to make things clearer; hopefully I didn't mess anything up.

      Given two clusters, cluster A (consisting of hosts A1 and A2) and cluster B (with B1 and B2), I've got a static exchange route from A1 to B1, as well as another from B1 to A1. Federation is working correctly, so I can send a message on A2 and have it successfully retrieved on B2. The exchange local to cluster A is walmyex1; the local exchange for B is bosmyex1.

      If I shut down the cluster in this order: B2, then B1, and start back up with B1, then B2, the static route fails to get recreated. That is, on A1/A2, looking at the bindings, exchange 'bosmyex1' does not get re-bound to cluster B; the only output for it in "qpid-config exchanges --bindings" is:

      <snip>
      Exchange 'bosmyex1' (direct)
      </snip>

      If however I shut the cluster down in this order: B1, then B2, and start B2, then B1, the static route gets re-bound. The output then is:

      <snip>
      Exchange 'bosmyex1' (direct)
          bind [unix.boston.cust] => bridge_queue_1_8870523d-2286-408e-b5b5-50d53db2fa61
      </snip>

      and I can message over the federated link with no further modification. Until a few minutes ago, I was seeing this with the stock Squeeze openais==1.1.2 and corosync==1.2.1. While debugging this, I upgraded both to the latest versions, with no change.
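      The difference between the two outputs above can be checked mechanically. The following is a minimal sketch, not part of the original report: `has_binding` is a hypothetical helper name, and it assumes the listing looks like the snippets above, i.e. an "Exchange '<name>' (<type>)" line followed by zero or more "bind" lines for that exchange.

```shell
#!/bin/sh
# has_binding EXCHANGE  (hypothetical helper, not a Qpid tool)
# Reads a "qpid-config exchanges --bindings" listing on stdin.
# Exits 0 if EXCHANGE is followed by at least one "bind" line, 1 otherwise.
# Assumes the layout shown in the report: an "Exchange '<name>' (...)" header,
# then that exchange's "bind [...] => queue" lines.
has_binding() {
    awk -v ex="$1" '
        /^[[:space:]]*Exchange / {
            # New exchange header: note whether it is the one we are looking for.
            # \047 is a single quote, so we match the quoted name exactly.
            in_ex = (index($0, "\047" ex "\047") > 0)
            next
        }
        # Count bind lines only while inside the wanted exchange.
        in_ex && /^[[:space:]]*bind / { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

      Typical use would be to pipe the listing into it, for example `qpid-config exchanges --bindings | has_binding bosmyex1` on A1 or A2, and look at the exit status.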

      I can replicate this every time I try. These are just test clusters, so I don't have any other activity going on on them, or any other exchanges/queues. My steps:

      On all boxes in cluster A and B:

      • Kill the qpidd if it's running and delete all existing store files, i.e. contents of /var/lib/qpid/

      On host A1 in cluster A (I'm leaving out the -a user/test@host stuff):

      • Start up qpid
      • qpid-config add exchange direct bosmyex1 --durable
      • qpid-config add exchange direct walmyex1 --durable
      • qpid-config add queue walmyq1 --durable
      • qpid-config bind walmyex1 walmyq1 unix.waltham.cust

      On host B1 in cluster B:

      • qpid-config add exchange direct bosmyex1 --durable
      • qpid-config add exchange direct walmyex1 --durable
      • qpid-config add queue bosmyq1 --durable
      • qpid-config bind bosmyex1 bosmyq1 unix.boston.cust

      On cluster A:

      • Start other member of cluster, A2
      • qpid-route route add amqps://user/pass@HOSTA1:5671 amqps://user/pass@HOSTB1:5671 walmyex1 unix.waltham.cust -d

      On cluster B:

      • Start other member of cluster, B2
      • qpid-route route add amqps://user/pass@HOSTB1:5671 amqps://user/pass@HOSTA1:5671 bosmyex1 unix.boston.cust -d

      On either cluster:

      • Check "qpid-config exchanges --bindings" to make sure bindings are correct for remote exchanges
      • To see correct behaviour, stop cluster in the order B1->B2, or A1->A2, start cluster back up, check bindings.
      • To see broken behaviour, stop cluster in the order B2->B1, or A2->A1, start cluster back up, check bindings.
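
      The final "check bindings" step can be scripted so that a slow reconnect is not mistaken for the bug. This is a sketch under assumptions, not something from the original report: `wait_for_binding` is a hypothetical name, the one-second retry interval is arbitrary, and the listing format is assumed to match the snippets in the description. The command that prints the listing is passed in as arguments, so any -a options can be supplied by the caller.

```shell
#!/bin/sh
# wait_for_binding EXCHANGE TRIES CMD [ARGS...]  (hypothetical helper)
# Runs CMD at least once and at most TRIES times, one second apart, until its
# output shows a "bind" line under EXCHANGE. Returns 0 if seen, 1 if not.
wait_for_binding() {
    ex=$1; tries=$2; shift 2
    i=0
    while :; do
        # The awk program exits 0 only if a bind line follows EXCHANGE's header.
        if "$@" | awk -v ex="$ex" '
            /^[[:space:]]*Exchange / { in_ex = (index($0, "\047" ex "\047") > 0); next }
            in_ex && /^[[:space:]]*bind / { found = 1 }
            END { exit (found ? 0 : 1) }
        '; then
            echo "rebound: $ex"
            return 0
        fi
        i=$((i + 1))
        [ "$i" -ge "$tries" ] && break
        sleep 1
    done
    echo "NOT rebound: $ex"
    return 1
}
```

      For example, after restarting cluster B, `wait_for_binding bosmyex1 30 qpid-config exchanges --bindings` on A1 would report whether the route came back within roughly 30 seconds.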

      This is a test cluster, so I'm free to do anything with it, debugging-wise, that would be useful.

      Attachments

      1. cluster-fed.sh (3 kB, Ken Giusti)
      2. error (18 kB, Mark Moseley)


          People

          • Assignee: michael j. goulish
          • Reporter: Mark Moseley