QPID-2982: Inconsistent management messages in a cluster with federation routes

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.9
    • Component/s: C++ Clustering
    • Labels: None

      Description

      Management messages on different members of a cluster can be different, which can cause a cluster member to fail with an invalid-arg error.

      cluster_tests.py has a test_management test that exercises the management code and collects detailed trace logging; the logs from the cluster members are then compared for consistency.

      Currently this test fails. All inconsistencies revealed by the test need to be corrected.
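      For illustration only, here is a minimal sketch of the kind of consistency check the test performs (this is not the actual cluster_tests.py code): strip the timestamp/level prefix from each member's trace log, keep the management-related lines, and diff them. The log file arguments and the "mgmt" filter pattern are assumptions.

      # Hypothetical log-consistency check in the spirit of cluster_tests.py
      # test_management; the "mgmt" filter and the prefix format are assumptions.
      import re
      import sys

      def management_lines(path):
          """Return management-related trace lines with the timestamp/level prefix stripped."""
          lines = []
          with open(path) as f:
              for line in f:
                  if "mgmt" not in line.lower():
                      continue
                  # Drop the leading "YYYY-MM-DD HH:MM:SS <level> " prefix so only
                  # the message content is compared across brokers.
                  lines.append(re.sub(r"^\S+ \S+ \S+ ", "", line).rstrip())
          return lines

      def compare(log_a, log_b):
          a, b = management_lines(log_a), management_lines(log_b)
          for i, (x, y) in enumerate(zip(a, b)):
              if x != y:
                  print("first difference at management line %d:\n  %s\n  %s" % (i, x, y))
                  return False
          if len(a) != len(b):
              print("logs differ in length: %d vs %d lines" % (len(a), len(b)))
              return False
          return True

      if __name__ == "__main__":
          # Usage (hypothetical file names): python check_logs.py broker1.log broker2.log
          sys.exit(0 if compare(sys.argv[1], sys.argv[2]) else 1)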

        Activity

        Alan Conway added a comment -

        Fix on trunk r1060568

        Alan Conway added a comment -

        The problem is that the link and bridge objects created by 'qpid-route route add' are not replicated to a new member when it joins the cluster. As a result, the first broker has management objects for the link and bridge that the second broker does not, which produces inconsistent management messages when a console such as qpid-tool is connected.
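        For illustration, a hedged sketch of how the discrepancy can be observed from a console: list the link and bridge management objects on each cluster member via the Python QMF console and compare them. The broker URLs and credentials are placeholders, and the property names (host, port, src, dest, key) assume the 0.8-era broker management schema.

        # Hypothetical check: compare link/bridge management objects on two
        # cluster members; URLs are placeholders, schema names are assumptions.
        from qmf.console import Session

        def links_and_bridges(url):
            sess = Session()
            broker = sess.addBroker(url)
            try:
                links = sess.getObjects(_class="link", _package="org.apache.qpid.broker")
                bridges = sess.getObjects(_class="bridge", _package="org.apache.qpid.broker")
                return (sorted("%s:%s" % (l.host, l.port) for l in links),
                        sorted("%s->%s [%s]" % (b.src, b.dest, b.key) for b in bridges))
            finally:
                sess.delBroker(broker)

        if __name__ == "__main__":
            a = links_and_bridges("amqp://user/pass@HOSTA1:5672")
            b = links_and_bridges("amqp://user/pass@HOSTA2:5672")
            print("consistent" if a == b else "INCONSISTENT:\n  A1: %r\n  A2: %r" % (a, b))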

        Alan Conway added a comment -

        I am able to reproduce this, thanks. I'm investigating now.

        Mark Moseley added a comment -

        Commenting per Alan's email. Same setup as JIRA 2992. I was getting this with Squeeze corosync (1.2.1) and openais (1.1.2); I upgraded them today (to 1.3.0 and 1.1.4, respectively) and still get this error.

        As in 2992, there are two clusters, A (with nodes A1 and A2) and B (with nodes B1 and B2).

        I'm doing the same setup as in 2992:

        On all boxes in cluster A and B:

        • Kill qpidd if it's running and delete all existing store files, i.e. the contents of /var/lib/qpid/

        On host A1 in cluster A (I'm leaving out the -a user/test@host stuff):

        • Start up qpid
        • qpid-config add exchange direct bosmyex1 --durable
        • qpid-config add exchange direct walmyex1 --durable
        • qpid-config add queue walmyq1 --durable
        • qpid-config bind walmyex1 walmyq1 unix.waltham.cust

        On host B1 in cluster B:

        • qpid-config add exchange direct bosmyex1 --durable
        • qpid-config add exchange direct walmyex1 --durable
        • qpid-config add queue bosmyq1 --durable
        • qpid-config bind bosmyex1 bosmyq1 unix.boston.cust

        On cluster A:

        • Start other member of cluster, A2
        • qpid-route route add amqps://user/pass@HOSTA1:5671 amqps://user/pass@HOSTB1:5671 walmyex1 unix.waltham.cust -d

        On cluster B:

        • Start other member of cluster, B2
        • qpid-route route add amqps://user/pass@HOSTB1:5671 amqps://user/pass@HOSTA1:5671 bosmyex1 unix.boston.cust -d

        On either cluster:

        • Check "qpid-config exchanges --bindings" to make sure bindings are correct for remote exchanges
        • To see correct behaviour, stop cluster in the order B1->B2, or A1->A2, start cluster back up, check bindings.
        • To see broken behaviour, stop cluster in the order B2->B1, or A2->A1, start cluster back up, check bindings.
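        A hedged alternative to "qpid-config exchanges --bindings" for cross-checking the bindings: the sketch below resolves binding objects to exchange/queue names through the Python QMF console. The broker URL and credentials are placeholders, and the property names assume the 0.8-era broker management schema.

        # Hypothetical helper: list bindings via the Python QMF console.
        from qmf.console import Session

        def show_bindings(url):
            sess = Session()
            broker = sess.addBroker(url)
            try:
                # Index exchanges and queues by the string form of their object ids
                # so that binding references can be resolved to names.
                exchanges = dict((str(e.getObjectId()), e.name) for e in
                                 sess.getObjects(_class="exchange", _package="org.apache.qpid.broker"))
                queues = dict((str(q.getObjectId()), q.name) for q in
                              sess.getObjects(_class="queue", _package="org.apache.qpid.broker"))
                for b in sess.getObjects(_class="binding", _package="org.apache.qpid.broker"):
                    print("%s -> %s [key=%s]" % (exchanges.get(str(b.exchangeRef), "?"),
                                                 queues.get(str(b.queueRef), "?"),
                                                 b.bindingKey))
            finally:
                sess.delBroker(broker)

        if __name__ == "__main__":
            show_bindings("amqp://user/pass@HOSTA1:5672")   # plain-TCP port, placeholder credentials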

        If, on A1, I run qpid-tool and do something like 'list' enough times that it returns data (instead of the first few times, where it returns nothing), then on node A2 I very often get the following. This is the log on A2 all the way from startup, which is only 10-15 secs earlier:

        2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/watchdog.so
        2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/acl.so
        2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/xml.so
        2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/replicating_listener.so
        2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/replication_exchange.so
        2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/ssl.so
        2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/msgstore.so
        2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/cluster.so
        2011-01-07 19:35:30 info Management enabled
        2011-01-07 19:35:30 info Registered xml exchange
        2011-01-07 19:35:30 info Registered replication exchange
        2011-01-07 19:35:31 notice Journal "TplStore": Created
        2011-01-07 19:35:31 notice Store module initialized; store-dir=/var/lib/qpid
        2011-01-07 19:35:31 info > Default files per journal: 8
        2011-01-07 19:35:31 info > Default journal file size: 24 (wpgs)
        2011-01-07 19:35:31 info > Default write cache page size: 32 (Kib)
        2011-01-07 19:35:31 info > Default number of write cache pages: 32
        2011-01-07 19:35:31 info > TPL files per journal: 8
        2011-01-07 19:35:31 info > TPL journal file size: 24 (wpgs)
        2011-01-07 19:35:31 info > TPL write cache page size: 4 (Kib)
        2011-01-07 19:35:31 info > TPL number of write cache pages: 64
        2011-01-07 19:35:31 notice Initializing CPG
        2011-01-07 19:35:31 notice Cluster store state: dirty cluster-id=54ef8a97-b8e8-47a2-935d-e619e4d7c895
        2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 PRE_INIT) configuration change: 101.0.0.0:10552 102.0.0.0:13904
        2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 PRE_INIT) Members joined: 102.0.0.0:13904
        2011-01-07 19:35:31 notice Cluster recovery: recovered journal data discarded and journal files pushed down
        2011-01-07 19:35:31 notice Store directory /var/lib/qpid/rhm was pushed down (saved) into directory /var/lib/qpid/_cluster.bak.0002.
        2011-01-07 19:35:31 notice Journal "TplStore": Created
        2011-01-07 19:35:31 info SASL enabled
        2011-01-07 19:35:31 notice Listening on TCP port 5672
        2011-01-07 19:35:31 notice Read ACL file "/etc/qpid/acl.conf"
        2011-01-07 19:35:31 notice Listening for SSL connections on TCP port 5671
        2011-01-07 19:35:31 info cluster(102.0.0.0:13904 INIT) not active for links.
        2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 JOINER) joining cluster walclust
        2011-01-07 19:35:31 notice Broker running
        2011-01-07 19:35:31 info cluster(102.0.0.0:13904 JOINER) stall for update offer from 101.0.0.0:10552 to 102.0.0.0:13904
        2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 UPDATEE) receiving update from 101.0.0.0:10552
        2011-01-07 19:35:31 notice Journal "walmyq1": Created
        2011-01-07 19:35:31 info Queue "walmyq1": Policy created: type=reject; maxCount=0; maxSize=104857600
        2011-01-07 19:35:31 info Queue "qpid.cluster-update": Policy created: type=reject; maxCount=0; maxSize=104857600
        2011-01-07 19:35:31 info cluster(102.0.0.0:13904 UPDATEE) new shadow connection 101.0.0.0:10552-6
        2011-01-07 19:35:31 info Queue "bridge_queue_1_d0638d51-8b76-48d0-a232-10e8bbed772d": Policy created: type=reject; maxCount=0; maxSize=104857600
        2011-01-07 19:35:31 info cluster(102.0.0.0:13904 UPDATEE) new shadow connection 101.0.0.0:10552-7
        2011-01-07 19:35:31 info Queue "reply-HOSTA1.10654.1": Policy created: type=reject; maxCount=0; maxSize=104857600
        2011-01-07 19:35:31 info Queue "topic-HOSTA1.10654.1": Policy created: type=ring; maxCount=0; maxSize=104857600
        2011-01-07 19:35:31 info Queue "qmfc-v2-HOSTA1.10654.1": Policy created: type=reject; maxCount=0; maxSize=104857600
        2011-01-07 19:35:31 info Queue "qmfc-v2-ui-HOSTA1.10654.1": Policy created: type=ring; maxCount=0; maxSize=104857600
        2011-01-07 19:35:31 info Queue "qmfc-v2-hb-HOSTA1.10654.1": Policy created: type=ring; maxCount=0; maxSize=104857600
        2011-01-07 19:35:32 info cluster(102.0.0.0:13904 UPDATEE) new shadow connection 101.0.0.0:10552-9
        2011-01-07 19:35:32 info cluster(102.0.0.0:13904 CATCHUP) member update: 101.0.0.0:10552(member) 102.0.0.0:13904(unknown) frameSeq=2427
        2011-01-07 19:35:32 notice cluster(102.0.0.0:13904 CATCHUP) update complete, starting catch-up.
        2011-01-07 19:35:32 info cluster(102.0.0.0:13904 CATCHUP) member update: 101.0.0.0:10552(member) 102.0.0.0:13904(member) frameSeq=2427
        2011-01-07 19:35:32 notice cluster(102.0.0.0:13904 READY) caught up.
        2011-01-07 19:35:32 info Passivating links
        2011-01-07 19:35:39 info Queue "reply-HOSTA1.10670.1": Policy created: type=reject; maxCount=0; maxSize=104857600
        2011-01-07 19:35:40 info Queue "topic-HOSTA1.10670.1": Policy created: type=ring; maxCount=0; maxSize=104857600
        2011-01-07 19:35:40 info Queue "qmfc-v2-HOSTA1.10670.1": Policy created: type=reject; maxCount=0; maxSize=104857600
        2011-01-07 19:35:40 info Queue "qmfc-v2-ui-HOSTA1.10670.1": Policy created: type=ring; maxCount=0; maxSize=104857600
        2011-01-07 19:35:40 info Queue "qmfc-v2-hb-HOSTA1.10670.1": Policy created: type=ring; maxCount=0; maxSize=104857600
        2011-01-07 19:35:51 error Execution exception: invalid-argument: mark@QPID.HOSTA1.10670.1: confirmed < (109+0) but only sent < (108+0) (qpid/SessionState.cpp:154)
        2011-01-07 19:35:51 critical cluster(102.0.0.0:13904 READY/error) local error 2770 did not occur on member 101.0.0.0:10552: invalid-argument: mark@QPID.HOSTA1.10670.1: confirmed < (109+0) but only sent < (108+0) (qpid/SessionState.cpp:154)
        2011-01-07 19:35:51 critical Error delivering frames: local error did not occur on all cluster members : invalid-argument: mark@QPID.HOSTA1.10670.1: confirmed < (109+0) but only sent < (108+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)
        2011-01-07 19:35:51 notice cluster(102.0.0.0:13904 LEFT/error) leaving cluster walclust
        2011-01-07 19:35:51 notice Shut down

        In the logs above:

        • 'mark' is the qpid SASL username I'm using. It's definitely valid on both nodes, as I can use it in qpid-config just fine in A1 and A2. I also use it in qpid-tool successfully.
        • Those "Policy created" messages show up when I initially did 'list' in qpid-tool on A1.
        • When 'list' returned data on A1, A2 shut down.

        I don't get the above error 100% of the time. Sometimes it'll work consistently, and other times it won't, and won't for a while (and A2 will shut back down almost immediately after rejoining the cluster). The only time I can consistently get it to crash is if I replicate the broken scenario in JIRA 2992. After failing over B2->B1, A2 will crash as soon as qpid-tool starts returning data (but not before), even if I restart A1 and A2 repeatedly.

        These are test clusters, so there's nothing else going on on them. As well, I can perform whatever testing would be useful to the Qpid team.


          People

          • Assignee: Alan Conway
          • Reporter: Alan Conway
          • Votes: 0
          • Watchers: 0
