Details
-
Bug
-
Status: Resolved
-
Low
-
Resolution: Fixed
-
None
-
Low
Description
I found this new issue during my testing: whenever a node is removed, a hint file for that node gets written and stays inside the hints directory forever. I debugged the code and found that this is due to a race condition between HintsWriteExecutor.java::flush and HintsWriteExecutor.java::closeWriter.
Time t1: The node is down; as a result, hints are being written by HintsWriteExecutor.java::flush.
Time t2: The node is removed from the cluster; as a result, HintsService.java::exciseStore is called, which removes the hint files for the node being removed.
Time t3: The mutation stage keeps pumping hints through HintsService.java::write, which again calls HintsWriteExecutor.java::flush, and a new orphan file gets created.
I was writing a new dtest for CASSANDRA-13562 and CASSANDRA-13308, and that helped me reproduce this new bug. I will submit a patch for this new dtest later.
I also tried the following to check how this orphan hint file responds:
1. I tried nodetool truncatehints <node>, but it fails because the node is no longer part of the ring.
2. I then tried nodetool truncatehints, but that still doesn't remove the hint file because it is not yet included in the dispatchDequeue.
Reproducible steps:
Please find the attached dtest Python file gossip_hang_test.py, which reproduces this bug.
Solution:
This is due to the race condition described above. Since HintsWriteExecutor.java creates a thread pool with only one worker, the solution is fairly simple: whenever we excise a host through HintsService.java::excise, store it in memory, and check for already evicted hosts inside HintsWriteExecutor.java::flush. If the host has already been evicted, ignore its hints.
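The idea can be sketched roughly as follows. This is a minimal, simplified model and not the actual patch: the class ExcisedHostGuardSketch, its fields, and its methods are hypothetical stand-ins for the hints service and the write executor, and the "hint files" are modelled as an in-memory set rather than files on disk. It assumes, as described above, that all flushes and the file removal run on the same single worker thread.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical, simplified model of the proposed fix; not Cassandra's real classes.
class ExcisedHostGuardSketch {
    // In-memory record of hosts that have been excised (assumed name).
    private final Set<UUID> excisedHosts = ConcurrentHashMap.newKeySet();
    // Stand-in for the hints directory: hosts that currently have a hint "file".
    private final Set<UUID> hintFiles = ConcurrentHashMap.newKeySet();
    // One worker only, mirroring the single-threaded write executor described above.
    private final ExecutorService writeExecutor = Executors.newSingleThreadExecutor();

    // Models the mutation stage handing a hint to the write executor.
    Future<?> write(UUID hostId) {
        return writeExecutor.submit(() -> flush(hostId));
    }

    // Runs on the single worker thread; drops hints for hosts already excised.
    private void flush(UUID hostId) {
        if (excisedHosts.contains(hostId)) {
            return; // host was removed: do not create a new (orphan) hint file
        }
        hintFiles.add(hostId);
    }

    // Marks the host as excised first, then removes its hint file on the worker.
    // A flush that runs before the removal is cleaned up by it; a flush that
    // runs after the removal sees the marker and writes nothing.
    Future<?> excise(UUID hostId) {
        excisedHosts.add(hostId);
        return writeExecutor.submit(() -> { hintFiles.remove(hostId); });
    }

    boolean hasHintFile(UUID hostId) {
        return hintFiles.contains(hostId);
    }

    void shutdown() {
        writeExecutor.shutdown();
    }

    // Replays the t1/t2/t3 timeline and reports whether an orphan file remains.
    static boolean orphanLeftAfterExcise() {
        ExcisedHostGuardSketch sketch = new ExcisedHostGuardSketch();
        UUID host = UUID.randomUUID();
        try {
            sketch.write(host).get();   // t1: node is down, hints are written
            sketch.excise(host).get();  // t2: node is removed, hint file deleted
            sketch.write(host).get();   // t3: a late hint arrives for the removed node
            return sketch.hasHintFile(host);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            sketch.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("orphan hint file left behind: " + orphanLeftAfterExcise());
    }
}
```

In this model the late write at t3 is ignored, so no orphan file is left. Without the excisedHosts check in flush, the same timeline would re-create the file after it had been removed.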
Jaydeep
Attachments