Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
I was trying out Rajeshbabu's new changes in RATIS-541 using the docker automation, but I gave invalid options the first time, which caused the workers to exit (divide by zero).
When I tried to rerun the VerificationTool, it got stuck waiting for logs to be created. A thread dump from the active leader of the metadata quorum showed 150+ threads all stuck waiting to acquire a write lock. However, no thread holds the lock that everyone is waiting on, which looks to me like a deadlock.
It seems like we have some kind of bug where we orphan a lock that's still held. This doesn't happen normally, which makes me wonder whether it can happen when the leader changes. I'll attach the logs of the metadata quorum nodes from my local test. However, I bet this could be reproduced with adequate load.
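To illustrate what I mean by an orphaned lock, here is a minimal standalone sketch in plain Java. It is not taken from the Ratis code and I am not claiming this is the actual cause. The class and method names (`OrphanedLockDemo`, `canAcquireAfterOwnerDied`, `failingWork`) are made up for the example. It shows that a `ReentrantReadWriteLock` write lock stays held if the owning thread dies from an exception (here a divide by zero) before calling `unlock()`, so later callers wait on a lock that no live thread owns.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class OrphanedLockDemo {

    /** Hypothetical unit of work that fails the same way the workers did. */
    public static int failingWork(int divisor) {
        return 1 / divisor; // ArithmeticException when divisor == 0
    }

    /**
     * Starts a worker that takes the write lock and then fails.
     * Returns true if another thread can still acquire the write lock
     * after the worker has terminated.
     */
    public static boolean canAcquireAfterOwnerDied(boolean unlockInFinally) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        Thread worker = new Thread(() -> {
            lock.writeLock().lock();
            if (unlockInFinally) {
                try {
                    failingWork(0);
                } finally {
                    lock.writeLock().unlock(); // runs even though the work throws
                }
            } else {
                failingWork(0);            // throws, so the next line never runs
                lock.writeLock().unlock();
            }
        });
        // Swallow the expected ArithmeticException to keep the output clean.
        worker.setUncaughtExceptionHandler((t, e) -> { });
        worker.start();

        try {
            worker.join(); // the worker thread is gone after this point
            boolean acquired = lock.writeLock().tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                lock.writeLock().unlock();
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("unlock in finally: acquired=" + canAcquireAfterOwnerDied(true));
        System.out.println("unlock skipped:    acquired=" + canAcquireAfterOwnerDied(false));
    }
}
```

In the second case the lock is still marked as write-locked after the worker exits, and a thread dump would list only waiters, since the owner no longer exists. If something similar is happening here, the place to look would be any code path that takes the write lock without releasing it in a `finally` block.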
Can you take a look at this, Vlad?