Project: ActiveMQ C++ Client
Issue: AMQCPP-465

Periodic access violation originating from Openwire::unmarshal

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6.0
    • Fix Version/s: 3.6.0
    • Component/s: Openwire
    • Labels: None
    • Environment: Windows XP SP3, VS2005

    Description

      A sustained test run produced 5 separate access violations, all within Openwire::unmarshal. Only one appears to have occurred during termination. Attaching all 5 dumps.

      Attachments

        1. CrashHang_Report__CMStressD.exe__02222013114744533.mht
          234 kB
          Scott Weaver
        2. CrashHang_Report__PID_10568__02212013121958303.mht
          1.65 MB
          Scott Weaver
        3. CrashHang_Report__PID_12904__02212013122442266.mht
          579 kB
          Scott Weaver
        4. CrashHang_Report__PID_13840__02212013122742886.mht
          1.67 MB
          Scott Weaver
        5. CrashHang_Report__PID_3816__02212013122319960.mht
          1.04 MB
          Scott Weaver
        6. CrashHang_Report__PID_8784__02212013122829943.mht
          418 kB
          Scott Weaver
        7. TcpSocket.cpp
          25 kB
          Timothy A. Bish
        8. TcpSocket.h
          5 kB
          Timothy A. Bish

        Activity

          sw124773 Scott Weaver added a comment -

          Automated testing latest SNAPSHOT for almost 24 hours now with no issues to report. Looks good!


          tabish Timothy A. Bish added a comment -

          Cut the trunk code over to a 3.6.x branch today; the next release will come from there next week. The code is the same as the latest SNAPSHOT src bundle. Speak up if you find anything wrong with the SNAPSHOT before Monday, as I will start the vote then otherwise.

          tabish Timothy A. Bish added a comment -

          SNAPSHOT was updated last night with some additional changes to get a better exception propagated to the client when things like security exceptions are sent back from the broker. You might want to pull that down and sanity test. I don't envision any further changes unless new critical bugs get reported in the next few days. I plan on rolling a release candidate next week time permitting.
          sw124773 Scott Weaver added a comment -

          48 hours and still going strong. We really like this version. Can we get a 3.6 release/freeze for final validation? Thanks!


          tabish Timothy A. Bish added a comment -

          Thanks for closing the loop on this. Appreciate all the testing you've done, it's a big help.
          sw124773 Scott Weaver added a comment -

          Ran over the weekend for 39 hours straight until a hang was encountered. More than just the test was hung, so we cannot attribute it to this code. Restarted the test yesterday without an attached debugger and have been running for over 26 hours straight with no issues. We are confident this access violation has been fixed. I will mark this JIRA resolved and let you close it once you have a release containing the fixes. Thanks again for hardening this component. We have gone from not being able to run extreme tests for more than ten minutes to at least 39 hours now.

          tabish Timothy A. Bish added a comment -

          Pushed a new SNAPSHOT that includes the TcpSocket changes. From my testing this seems to be working fine. Keeping ServerSocket working requires some hoop jumping, as we get into some tangles because of the APR memory pools and the general limitations of APR, but no issues in client code should be linked to that, as the ServerSocket code doesn't come into play there.

          http://people.apache.org/~tabish/cms-3.6.x/

          tabish Timothy A. Bish added a comment -

          If you'd like to try out an experimental fix, you can replace your versions of TcpSocket.cpp and TcpSocket.h with the attached ones and see how it goes. Haven't tried this on Windows, so YMMV.
          sw124773 Scott Weaver added a comment -

          Your assumption appears to be absolutely correct. Adding another dump with thread 2 closing the IOTransport and then joining thread 7, which is in TcpSocket::read. Line 611 checked the socket and it was not closed there, but it was closed by the time the thread reached line 649, where socketHandle was NULL. Note that it took 10.5 hours to recreate. The window looks small, about 9 LOC. The write probably suffers from the same issue, though probably less frequently.
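
The window described in this comment is a check-then-use race: the reader tests whether the socket is closed, and a closing thread invalidates the handle before the reader uses it. The following is a minimal sketch of one common way to narrow such a window, not the TcpSocket change that went into the SNAPSHOT (the thread does not show that code). `SketchSocket` and `FakeHandle` are hypothetical names, the handle is an in-memory stand-in rather than an APR socket, and the sketch assumes C++11 facilities that VS2005 does not provide.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the native handle (an APR socket in the real client).
struct FakeHandle {
    std::string pending;  // bytes waiting to be read
};

// Illustrative only: not the actual TcpSocket implementation.
class SketchSocket {
public:
    explicit SketchSocket(const std::string& data)
        : handle_(std::make_shared<FakeHandle>()) {
        handle_->pending = data;
    }

    // Racy shape this sketch avoids:
    //   if (closed) throw ...;   // check
    //   ...                      // close() runs here on another thread
    //   recv(handle, ...);       // use: handle is now NULL
    std::size_t read(char* buffer, std::size_t size) {
        assert(buffer != nullptr);
        std::shared_ptr<FakeHandle> local;
        {
            // Check and capture in one step, under the same lock close() takes.
            std::lock_guard<std::mutex> lock(mutex_);
            local = handle_;
        }
        if (!local) {
            throw std::runtime_error("socket closed");
        }
        // A concurrent close() can reset handle_, but 'local' keeps the
        // object alive until this call returns.
        const std::size_t n = std::min(size, local->pending.size());
        local->pending.copy(buffer, n);
        local->pending.erase(0, n);
        return n;
    }

    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        handle_.reset();
    }

    bool isClosed() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return !handle_;
    }

private:
    mutable std::mutex mutex_;
    std::shared_ptr<FakeHandle> handle_;
};
```

The sketch assumes a single reader thread, as with the IOTransport read loop. It does not model waking a reader that is already blocked in a receive call, which a real socket close also has to handle.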

          tabish Timothy A. Bish added a comment -

          My guess is that the issue is in TcpSocket. I'm thinking that the socket is closed right before an attempt by IOTransport to enter another blocking read, and since the code closes the socketHandle in an attempt to unblock the readers, the recv call ends up using a bad value in apr_socket_recv.
          sw124773 Scott Weaver added a comment -

          Really difficult to recreate. Switched to interactive debugger so when it happens again we can look around. What should we look at first?


          tabish Timothy A. Bish added a comment -

          Not sure on this one. Probably going to be a tough one to track down. I have theories, but without a way to reproduce it's hard to say.
          sw124773 Scott Weaver added a comment -

          Everything is the latest:
          APR 1.4.6, APR-util 1.5.1, APR-iconv 1.2.1


          tabish Timothy A. Bish added a comment -

          I'd make sure you are using whatever the latest APR library is as a first step.
          sw124773 Scott Weaver added a comment -

          All dumps appear to be different views of the same issue.

          People

            Assignee: tabish Timothy A. Bish
            Reporter: sw124773 Scott Weaver