Recently, we observed an issue in our production environment where the BalancedProviderFuture.sync method gets stuck during connection recovery and never returns. We have seen this on 2 hosts in the last week, and the only workaround is to restart the server.
I am attaching a thread dump (thread-dump.txt) that shows the issue and how it blocks other threads; it contains details of all the threads.
- The issue occurs during connection recovery, when failing over from one server to another.
- From debugging, I can see that BalancedProviderFuture.sync waits for the future's state to be updated, and that state is updated by the AmqpProvider thread. The thread dump shows no AmqpProvider thread in a stuck state, which suggests that AmqpProvider has finished its work and yet the state of the given BalancedProviderFuture object was never updated.
- In the successful case, the state of the BalancedProviderFuture object is updated in the following sequence:
  - JmsSession.onConnectionRecovery creates a BalancedProviderFuture object and then calls provider.create.
  - provider.create (i.e. AmqpProvider.create) starts a thread using the serializer. This create method has proper handling: it either calls pumpToProtonTransport or calls request.onFailure (which updates the state of the BalancedProviderFuture if an exception occurs).
  - Once that thread finishes (basically after pumpToProtonTransport), the serializer calls AmqpProvider.onData, which updates the state of the BalancedProviderFuture object.
- I have observed that if an exception is thrown in AmqpProvider.onData, the state of the BalancedProviderFuture is not updated and BalancedProviderFuture.sync gets stuck forever. The exception can occur when the protonTransport tail is already closed (probably because of an idle timeout or some other transport-related issue).
- I have also observed that in some cases (idle timeout or transport errors), after the thread started by provider.create (i.e. AmqpProvider.create) completes, the serializer does not call AmqpProvider.onData but instead calls AmqpProvider.onTransportError or AmqpProvider.onTransportClosed. I cannot see any handling in onTransportError or onTransportClosed that updates the state of the BalancedProviderFuture object.
- I am attaching logs.txt, which shows some errors. These errors appeared when the state of the BalancedProviderFuture was not updated and the sync method got stuck forever.
- Please note that we are using the URL failover:(amqp://localhost:5672,amqp://localhost:5682)?jms.sendTimeout=5000 and qpid version 0.42.0.
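
To make the suspected failure mode easier to discuss, here is a minimal, self-contained Java sketch of the pattern described above. It does not use any Qpid classes: SimpleFuture, runScenario and the single-threaded "serializer" executor are hypothetical stand-ins of my own, and the real client may well behave differently. It only illustrates that a waiter never wakes up if the callback that should complete the future throws (or a different callback runs) and nothing calls onFailure on the pending request.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class StuckSyncSketch {

    /** Hypothetical stand-in for a provider future: sync() waits until someone completes it. */
    static final class SimpleFuture {
        private final CountDownLatch done = new CountDownLatch(1);
        private volatile Throwable failure;

        void onSuccess() {
            done.countDown();
        }

        void onFailure(Throwable cause) {
            failure = cause;
            done.countDown();
        }

        /** Bounded wait so this demo terminates; returns false if nobody completed the future. */
        boolean sync(long timeout, TimeUnit unit) {
            try {
                return done.await(timeout, unit);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }

        Throwable failure() {
            return failure;
        }
    }

    /**
     * Runs one simulated request on a single-threaded executor (standing in for the serializer).
     * The callback always fails, as if the transport were already closed.
     * Returns true if the waiting caller was released, false if it timed out.
     */
    static boolean runScenario(boolean failPendingOnError) {
        ExecutorService serializer = Executors.newSingleThreadExecutor();
        SimpleFuture request = new SimpleFuture();
        try {
            serializer.execute(() -> {
                try {
                    // Simulated data callback that fails before it can complete the request.
                    throw new IllegalStateException("transport tail closed");
                } catch (RuntimeException e) {
                    if (failPendingOnError) {
                        // Propagate the error to the pending request so the waiter is released.
                        request.onFailure(e);
                    }
                    // Otherwise the error is dropped and the request is never completed.
                }
            });
            return request.sync(500, TimeUnit.MILLISECONDS);
        } finally {
            serializer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("released without failure propagation: " + runScenario(false));
        System.out.println("released with failure propagation:    " + runScenario(true));
    }
}
```

In this sketch the first scenario times out and the second is released immediately. With an unbounded wait, as sync appears to do in our case, the first scenario would block forever, which matches what we see in the thread dump.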
Could someone please take a look at this? It has become a critical issue in our production environment, and we have no option other than restarting our services.