Qpid
QPID-2992

Cluster failing to resurrect durable static route depending on order of shutdown

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.8
    • Fix Version/s: None
    • Component/s: C++ Broker, C++ Clustering
    • Labels:
      None
    • Environment:

      Debian Linux Squeeze, 32-bit, kernel 2.6.36.2, Dell Poweredge 1950s. Corosync==1.3.0, Openais==1.1.4

      Description

      I've got a 2-node qpid test cluster at each of 2 datacenters, which are federated together with a single durable static route between each. Qpid is version 0.8. Corosync and openais are stock Squeeze (1.2.1-3 and 1.1.2-2, respectively). OS is Squeeze, 32-bit, on Dell Poweredge 1950s, kernel 2.6.36. The static route is durable and is set up over SSL (but I can replicate as well with non-SSL). I've tried to normalize the hostnames below to make things clearer; hopefully I didn't mess anything up.

      Given two clusters, cluster A (consisting of hosts A1 and A2) and cluster B (with B1 and B2), I've got a static exchange route from A1 to B1, as well as another from B1 to A1. Federation is working correctly, so I can send a message on A2 and have it successfully retrieved on B2. The exchange local to cluster A is walmyex1; the local exchange for B is bosmyex1.

      If I shut down the cluster in this order: B2, then B1, and start back up with B1, then B2, the static route fails to get recreated. That is, on A1/A2, looking at the bindings, exchange 'bosmyex1' does not get re-bound to cluster B; the only output for it in "qpid-config exchanges --bindings" is:

      <snip>
      Exchange 'bosmyex1' (direct)
      </snip>

      If however I shut the cluster down in this order: B1, then B2, and start B2, then B1, the static route gets re-bound. The output then is:

      <snip>
      Exchange 'bosmyex1' (direct)
      bind [unix.boston.cust] => bridge_queue_1_8870523d-2286-408e-b5b5-50d53db2fa61
      </snip>

      and I can message over the federated link with no further modification. Prior to a few minutes ago, I was seeing this with the Squeeze stock openais==1.1.2 and corosync==1.2.1. In debugging this, I've upgraded both to the latest versions with no change.

      I can replicate this every time I try. These are just test clusters, so I don't have any other activity going on on them, or any other exchanges/queues. My steps:

      On all boxes in cluster A and B:

      • Kill the qpidd if it's running and delete all existing store files, i.e. contents of /var/lib/qpid/

      On host A1 in cluster A (I'm leaving out the -a user/test@host stuff):

      • Start up qpid
      • qpid-config add exchange direct bosmyex1 --durable
      • qpid-config add exchange direct walmyex1 --durable
      • qpid-config add queue walmyq1 --durable
      • qpid-config bind walmyex1 walmyq1 unix.waltham.cust

      On host B1 in cluster B:

      • qpid-config add exchange direct bosmyex1 --durable
      • qpid-config add exchange direct walmyex1 --durable
      • qpid-config add queue bosmyq1 --durable
      • qpid-config bind bosmyex1 bosmyq1 unix.boston.cust

      On cluster A:

      • Start other member of cluster, A2
      • qpid-route route add amqps://user/pass@HOSTA1:5671 amqps://user/pass@HOSTB1:5671 walmyex1 unix.waltham.cust -d

      On cluster B:

      • Start other member of cluster, B2
      • qpid-route route add amqps://user/pass@HOSTB1:5671 amqps://user/pass@HOSTA1:5671 bosmyex1 unix.boston.cust -d

      On either cluster:

      • Check "qpid-config exchanges --bindings" to make sure bindings are correct for remote exchanges
      • To see correct behaviour, stop cluster in the order B1->B2, or A1->A2, start cluster back up, check bindings.
      • To see broken behaviour, stop cluster in the order B2->B1, or A2->A1, start cluster back up, check bindings.
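
      The steps above can be consolidated into one shell sketch. This is illustrative only: the host names, credentials, and store path are the placeholders already used in this report, and the wait_for polling helper is an addition of mine, not part of qpid. The qpid steps are wrapped in functions so they can be run per-node as appropriate.

```shell
#!/bin/bash
# Sketch of the reproduction steps above. Host names, credentials, and the
# store path are placeholders from the report.

STORE_DIR=/var/lib/qpid

# Poll until "$@" succeeds, or fail after $1 seconds. Useful to avoid racing
# a broker that has not finished starting.
wait_for() {
    local timeout=$1; shift
    local deadline=$(( $(date +%s) + timeout ))
    until "$@"; do
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep 1
    done
}

wipe_node() {            # run on every node in clusters A and B
    pkill qpidd || true
    rm -rf "${STORE_DIR:?}"/*
}

setup_A1() {             # on host A1 (auth options omitted, as in the report)
    qpid-config add exchange direct bosmyex1 --durable
    qpid-config add exchange direct walmyex1 --durable
    qpid-config add queue walmyq1 --durable
    qpid-config bind walmyex1 walmyq1 unix.waltham.cust
}

setup_B1() {             # on host B1
    qpid-config add exchange direct bosmyex1 --durable
    qpid-config add exchange direct walmyex1 --durable
    qpid-config add queue bosmyq1 --durable
    qpid-config bind bosmyex1 bosmyq1 unix.boston.cust
}

add_routes() {           # one durable static route in each direction
    qpid-route route add amqps://user/pass@HOSTA1:5671 \
        amqps://user/pass@HOSTB1:5671 walmyex1 unix.waltham.cust -d
    qpid-route route add amqps://user/pass@HOSTB1:5671 \
        amqps://user/pass@HOSTA1:5671 bosmyex1 unix.boston.cust -d
}

check_bindings() {       # on either cluster
    qpid-config exchanges --bindings
}
```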

      This is a test cluster, so I'm free to do anything with it, debugging-wise, that would be useful.

      Attachments

      1. cluster-fed.sh (3 kB) - Ken Giusti
      2. error (18 kB) - Mark Moseley

        Activity

        Ken Giusti added a comment -

        Hi Mark,

        I'm trying to reproduce this problem on my Fedora 14 box, without luck. Can you try the attached script - you'll have to modify it a bit to find your cluster and message store libraries - and let me know if it causes the problem for you?

        thanks,

        -K

        Mark Moseley added a comment -

        I tried reproducing this with the script on one of the nodes in question, and it seemed to work perfectly. I added authentication as well, and it continued to work OK. Your test script is pretty much exactly what I'm doing too.

        I wonder though (and I'm just trying to think of reasons why it'd act differently in the two scenarios) can you try this out on 4 separate nodes, even if virtualized? Though when I reproduce this on the physical nodes, with debug logging turned on, it doesn't mention the node on the other side of the federated link, whereas when it does work, I see this in the logs:

        2011-01-10 19:35:12 debug Known hosts for peer of inter-broker link: amqp:tcp:10.1.58.3:5672 amqp:tcp:10.1.58.4:5672

        Running through this again today, I noticed that sometimes, with a completely fresh cluster, the connection in a B2->B1->B1->B2 shutdown/startup does work. But then I do it again and it doesn't. Or if I do the opposite order it breaks as well.

        I just modified your script so that after the first round of stop/start/check-binding, it flips the order and shuts them down again and starts them up – and yes, I realize this is the opposite order from my ticket – and re-checks bindings and they're gone. I'm attaching the output of your script.

        (Just for clarification, 10.1.58.3==exp01==A1, 10.1.58.4==exp02==A2, 10.20.58.1==bosmsg01==B1, and 10.20.58.2==bosmsg02==B2. I've been trying to regex the hostnames so you guys didn't have to deal with following my hostnames, but if you guys prefer, I don't mind just using the real names.)

        Mark Moseley added a comment -

        This is the output from the script when it does another round of stop/starts with the order flipped the second time around.

        Mark Moseley added a comment -

        I also rewrote the script to do a B1->B2->B2->B1 shutdown/startup sequence first (the binding was visible after that), then do a B2->B1->B1->B2 stop/start, and the binding wasn't there. Maybe it gets a single freebie in a super-clean cluster?

        I had originally posted to the list since I figured I was probably doing something wrong, so there could be some conceptual problem on my part, i.e. maybe it's not supposed to work like I'm expecting.

        michael j. goulish added a comment -

        Using modifications of Ken's script, I have reproduced two bad
        behaviors, including the one that Mark is reporting.

        I don't think this is a bug... well, sort of. Two, actually. I will
        submit a doc bug, and probably one enhancement request.

        What's happening is this: messaging systems that include clusters
        and stores are sensitive to timing issues around events like broker
        introduction and shutdown.

        Here are the timing issues that I know of:

        1. When you shut down a cluster that is using a store, there must
        be time for the last broker standing to realize its status and mark
        its store as "clean", i.e. "my store is the one we should use at
        restart." If all brokers are killed too quickly, this will not
        happen: the cluster will not be able to restart, because it will
        not find any store that has been marked "clean".

        2. When you make a topology change, e.g. adding a route from one
        cluster to another to create a federation of clusters, and you then
        shut down the cluster soon afterwards, you may do so before that
        topology change has had a chance to propagate across the cluster.

        This can cause a problem on restart that depends on the order in
        which the brokers are killed. If you kill the broker that knew
        about the topology change before it manages to communicate that
        knowledge to the other broker, that's bad, because the other broker
        will be the last one standing, and it is its store that gets marked
        as "clean"! So its store will be reused at startup, and the cluster
        will have lost knowledge of the topology change.
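
        Both races above amount to "don't kill the next broker until the cluster has settled." As a hedged sketch (the settle delay and the ssh + pkill stop action are my illustrative assumptions, not a documented qpid procedure), a staggered shutdown might look like this; STOP_CMD lets the per-node stop action be swapped out:

```shell
# Illustrative only: the settle delay and the default ssh+pkill stop action
# are assumptions, not a documented qpid procedure.

# Default per-node stop action; override via STOP_CMD if needed.
default_stop() {
    ssh "$1" 'pkill qpidd'
}

# Stop brokers one at a time, pausing $1 seconds between nodes, so the
# surviving broker can notice each departure and the last one standing has
# time to mark its store "clean".
staggered_shutdown() {
    local settle=$1; shift
    local host
    for host in "$@"; do
        "${STOP_CMD:-default_stop}" "$host"
        sleep "$settle"
    done
}

# e.g. stop the B cluster one node at a time:
#   staggered_shutdown 10 B2 B1
```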

        By altering the timing of events in Ken's script, I was able to:

        A. Get no failures in 200 runs (original script, plus explicit
        wait-loops for brokers).

        B. Get 100% failure because of no clean store (kill both brokers in
        the B cluster too close together).

        C. Get the failure that Mark reported, about 7% of the time (place
        B1 under load, then kill it too soon after route creation).

        So, here's what I will propose:

        I. A bit of documentation (I will take a first sketch-whack at it,
        then give it to doc professionals) to centralize the description of
        this type of problem: the two I have mentioned above, plus whatever
        anyone else thinks up that is similar. This will include best
        practices on how to avoid this type of problem.

        II. A request for enhancement wherever there is no very good way to
        avoid one of these multi-broker race conditions.

        III. I'll come back and update this Jira with the numbers of any
        resultant Jiras that I open.

        michael j. goulish added a comment -

        I don't really want to say "Won't Fix" here – I really want to say "Will make two separate Jiras."

        ( please see my other comment for a complete explanation )


  People

  • Assignee: michael j. goulish
  • Reporter: Mark Moseley