Qpid
  1. Qpid
  2. QPID-3129

cluster_tests.LongTests.test_failover hangs

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: 0.11
    • Component/s: C++ Clustering
    • Labels:
      None

      Description

      The test cluster_tests.LongTests.test_failover hangs if run with duration of a minute or more.

      To reproduce:
      make check TESTS=run_cluster_tests CLUSTER_TESTS='cluster_tests.LongTests.test_failover -DDURATION=1'

      Should pass after about 1 minute, but hangs indefinitely.

      This is probably a test bug.

        Activity

        Hide
        Alan Conway added a comment -

        Fixed in r1080786

        Show
        Alan Conway added a comment - Fixed in r1080786
        Hide
        Alan Conway added a comment -

        Still seeing hangs, run in a loop:
        run_cluster_tests *.test_failover -DDURATION=4
        normally hangs the first time.

        Show
        Alan Conway added a comment - Still seeing hangs, run in a loop: run_cluster_tests *.test_failover -DDURATION=4 normally hangs the first time.
        Hide
        Alan Conway added a comment - - edited

        Running with trace logging on, in all cases one of the clients (usually sender but sometimes receiver) hangs with the last thing it says being something like:

        2011-03-23 14:47:17 debug RECV [41935 10.16.44.237:41935] INIT(0-10)

        Note that local and remote port on the connection are the same. This test is running on a single host so clearly that's not a valid TCP connection. So far I have no idea why this is happening but it seems to be the cause of the hang.

        With trace logging on it takes more iterations to reproduce with -DDURATION=4, I'm seeing up to 5.

        Show
        Alan Conway added a comment - - edited Running with trace logging on, in all cases one of the clients (usually sender but sometimes receiver) hangs with the last thing it says being something like: 2011-03-23 14:47:17 debug RECV [41935 10.16.44.237:41935] INIT(0-10) Note that local and remote port on the connection are the same . This test is running on a single host so clearly that's not a valid TCP connection. So far I have no idea why this is happening but it seems to be the cause of the hang. With trace logging on it takes more iterations to reproduce with -DDURATION=4, I'm seeing up to 5.
        Hide
        Alan Conway added a comment -

        Revision 1087052 fixes a race condition in posix/Socket.cpp Socket::connect that was causing the hangs with the same local/remote port. However there is another issue causing a hang (less frequent). In this case either sender or receiver hangs with a stack trace that seems to indicate an exception thrown in the FailoverUpdates thread (below). The main thread is hung in join() waiting for the FailoverUpdates thread to exit. It appears that the thread is hung or looping in gcc exception handling code. The traces are not always identical but they always have some exception-related functions being called in the FailoverUpdate thread.

        Example:

        #0 0x0000003507c0aa77 in ?? () from /lib64/libgcc_s.so.1
        #1 0x0000003505108a07 in dl_iterate_phdr () from /lib64/libc.so.6
        #2 0x0000003507c0a6b6 in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
        #3 0x0000003507c07625 in ?? () from /lib64/libgcc_s.so.1
        #4 0x0000003507c08f79 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
        #5 0x00000035094bcf1d in __cxa_throw () from /usr/lib64/libstdc++.so.6
        #6 0x00002ba6b5b306cd in qpid::sys::ExceptionHolder::Wrapper<qpid::TransportFailure>::raise (this=<value optimized out>) at ../../qpid/cpp/include/qpid/sys/ExceptionHolder.h:62
        #7 0x00002ba6b5b53381 in raise (this=<value optimized out>) at ../../qpid/cpp/include/qpid/sys/ExceptionHolder.h:53
        #8 qpid::client::SessionImpl::check (this=<value optimized out>) at ../../qpid/cpp/src/qpid/client/SessionImpl.cpp:775
        #9 0x00002ba6b5b57ed7 in qpid::client::SessionImpl::checkOpen (this=0x257) at ../../qpid/cpp/src/qpid/client/SessionImpl.cpp:780
        #10 0x00002ba6b5b5af18 in qpid::client::SessionImpl::sendCommand (this=0x19f91710, command=..., content=0x0) at ../../qpid/cpp/src/qpid/client/SessionImpl.cpp:383
        #11 0x00002ba6b5b5b1ea in qpid::client::SessionImpl::send (this=0x274, command=...) at ../../qpid/cpp/src/qpid/client/SessionImpl.cpp:301
        #12 0x00002ba6b5b0ae49 in qpid::client::no_keyword::AsyncSession_0_10::messageSetFlowMode (this=<value optimized out>, destination=<value optimized out>, flowMode=0 '\000',
        sync=false) at ./qpid/client/no_keyword/AsyncSession_0_10.cpp:111
        #13 0x00002ba6b588bd6c in messageSetFlowMode_with_named_params<boost::parameter::aux::arg_list<boost::parameter::aux::tagged_argument<qpid::client::arg::keyword_tags::destination, std\
        ::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, boost::parameter::aux::arg_list<boost::parameter::aux::tagged_argument<qpid::client::arg::keyword_tags::flo\
        wMode, qpid::framing::message::FlowMode const>, boost::parameter::aux::empty_arg_list> > > (this=0x19ee4220, message=..., timeout=...)
        at ../include/qpid/client/AsyncSession_0_10.h:366
        #14 messageSetFlowMode<std::string, qpid::framing::message::FlowMode> (this=0x19ee4220, message=..., timeout=...) at ../include/qpid/client/AsyncSession_0_10.h:362
        #15 qpid::client::amqp0_10::ReceiverImpl::fetchImpl (this=0x19ee4220, message=..., timeout=...) at ../../qpid/cpp/src/qpid/client/amqp0_10/ReceiverImpl.cpp:168
        #16 0x00002ba6b588df14 in operator() (this=0x19f2af00, f=...) at ../../qpid/cpp/src/qpid/client/amqp0_10/ReceiverImpl.h:118
        #17 qpid::client::amqp0_10::SessionImpl::execute<qpid::client::amqp0_10::ReceiverImpl::Fetch> (this=0x19f2af00, f=...) at ../../qpid/cpp/src/qpid/client/amqp0_10/SessionImpl.h:99
        #18 0x00002ba6b588c3e8 in qpid::client::amqp0_10::ReceiverImpl::fetch (this=0x19ee4220, message=<value optimized out>, timeout=...)
        at ../../qpid/cpp/src/qpid/client/amqp0_10/ReceiverImpl.cpp:71
        #19 0x00002ba6b5866225 in qpid::messaging::Receiver::fetch (this=<value optimized out>, message=..., timeout=...) at ../../qpid/cpp/src/qpid/messaging/Receiver.cpp:38
        #20 0x00002ba6b5867937 in qpid::messaging::FailoverUpdatesImpl::run (this=0x19f2a6d0) at ../../qpid/cpp/src/qpid/messaging/FailoverUpdates.cpp:63
        #21 0x00002ba6b5ecac2a in qpid::sys::(anonymous namespace)::runRunnable (p=0x257) at ../../qpid/cpp/src/qpid/sys/posix/Thread.cpp:35
        #22 0x0000003505c0673d in start_thread () from /lib64/libpthread.so.0
        #23 0x00000035050d40cd in clone () from /lib64/libc.so.6

        Show
        Alan Conway added a comment - Revision 1087052 fixes a race condition in posix/Socket.cpp Socket::connect that was causing the hangs with the same local/remote port. However there is another issue causing a hang (less frequent). In this case either sender or receiver hangs with a stack trace that seems to indicate an exception thrown in the FailoverUpdates thread (below). The main thread is hung in join() waiting for the FailoverUpdates thread to exit. It appears that the thread is hung or looping in gcc exception handling code. The traces are not always identical but they always have some exception-related functions being called in the FailoverUpdate thread. Example: #0 0x0000003507c0aa77 in ?? () from /lib64/libgcc_s.so.1 #1 0x0000003505108a07 in dl_iterate_phdr () from /lib64/libc.so.6 #2 0x0000003507c0a6b6 in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1 #3 0x0000003507c07625 in ?? () from /lib64/libgcc_s.so.1 #4 0x0000003507c08f79 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1 #5 0x00000035094bcf1d in __cxa_throw () from /usr/lib64/libstdc++.so.6 #6 0x00002ba6b5b306cd in qpid::sys::ExceptionHolder::Wrapper<qpid::TransportFailure>::raise (this=<value optimized out>) at ../../qpid/cpp/include/qpid/sys/ExceptionHolder.h:62 #7 0x00002ba6b5b53381 in raise (this=<value optimized out>) at ../../qpid/cpp/include/qpid/sys/ExceptionHolder.h:53 #8 qpid::client::SessionImpl::check (this=<value optimized out>) at ../../qpid/cpp/src/qpid/client/SessionImpl.cpp:775 #9 0x00002ba6b5b57ed7 in qpid::client::SessionImpl::checkOpen (this=0x257) at ../../qpid/cpp/src/qpid/client/SessionImpl.cpp:780 #10 0x00002ba6b5b5af18 in qpid::client::SessionImpl::sendCommand (this=0x19f91710, command=..., content=0x0) at ../../qpid/cpp/src/qpid/client/SessionImpl.cpp:383 #11 0x00002ba6b5b5b1ea in qpid::client::SessionImpl::send (this=0x274, command=...) at ../../qpid/cpp/src/qpid/client/SessionImpl.cpp:301 #12 0x00002ba6b5b0ae49 in qpid::client::no_keyword::AsyncSession_0_10::messageSetFlowMode (this=<value optimized out>, destination=<value optimized out>, flowMode=0 '\000', sync=false) at ./qpid/client/no_keyword/AsyncSession_0_10.cpp:111 #13 0x00002ba6b588bd6c in messageSetFlowMode_with_named_params<boost::parameter::aux::arg_list<boost::parameter::aux::tagged_argument<qpid::client::arg::keyword_tags::destination, std\ ::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, boost::parameter::aux::arg_list<boost::parameter::aux::tagged_argument<qpid::client::arg::keyword_tags::flo\ wMode, qpid::framing::message::FlowMode const>, boost::parameter::aux::empty_arg_list> > > (this=0x19ee4220, message=..., timeout=...) at ../include/qpid/client/AsyncSession_0_10.h:366 #14 messageSetFlowMode<std::string, qpid::framing::message::FlowMode> (this=0x19ee4220, message=..., timeout=...) at ../include/qpid/client/AsyncSession_0_10.h:362 #15 qpid::client::amqp0_10::ReceiverImpl::fetchImpl (this=0x19ee4220, message=..., timeout=...) at ../../qpid/cpp/src/qpid/client/amqp0_10/ReceiverImpl.cpp:168 #16 0x00002ba6b588df14 in operator() (this=0x19f2af00, f=...) at ../../qpid/cpp/src/qpid/client/amqp0_10/ReceiverImpl.h:118 #17 qpid::client::amqp0_10::SessionImpl::execute<qpid::client::amqp0_10::ReceiverImpl::Fetch> (this=0x19f2af00, f=...) at ../../qpid/cpp/src/qpid/client/amqp0_10/SessionImpl.h:99 #18 0x00002ba6b588c3e8 in qpid::client::amqp0_10::ReceiverImpl::fetch (this=0x19ee4220, message=<value optimized out>, timeout=...) at ../../qpid/cpp/src/qpid/client/amqp0_10/ReceiverImpl.cpp:71 #19 0x00002ba6b5866225 in qpid::messaging::Receiver::fetch (this=<value optimized out>, message=..., timeout=...) at ../../qpid/cpp/src/qpid/messaging/Receiver.cpp:38 #20 0x00002ba6b5867937 in qpid::messaging::FailoverUpdatesImpl::run (this=0x19f2a6d0) at ../../qpid/cpp/src/qpid/messaging/FailoverUpdates.cpp:63 #21 0x00002ba6b5ecac2a in qpid::sys::(anonymous namespace)::runRunnable (p=0x257) at ../../qpid/cpp/src/qpid/sys/posix/Thread.cpp:35 #22 0x0000003505c0673d in start_thread () from /lib64/libpthread.so.0 #23 0x00000035050d40cd in clone () from /lib64/libc.so.6
        Hide
        Alan Conway added a comment -

        Fixed on trunk r1137657

        Show
        Alan Conway added a comment - Fixed on trunk r1137657

          People

          • Assignee:
            Alan Conway
            Reporter:
            Alan Conway
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development