CASSANDRA-13740

Orphan hint file gets created while node is being removed from cluster


    Details

    • Severity:
      Low

      Description

      I found this issue during testing: whenever a node is being removed, a hint file for that node gets written and stays in the hints directory forever. I debugged the code and found that this is due to a race condition between HintsWriteExecutor.java::flush and HintsWriteExecutor.java::closeWriter.

      Time t1: The node is down, so hints are being written by HintsWriteExecutor.java::flush.
      Time t2: The node is removed from the cluster, which calls HintsService.java::exciseStore and removes the hint files for the node being removed.
      Time t3: The mutation stage keeps pumping hints through HintsService.java::write, which again calls HintsWriteExecutor.java::flush, and a new orphan file gets created.
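
      The ordering above can be replayed with a small in-memory model. This is only a sketch, not Cassandra code: the class OrphanHintRace, the map standing in for the hints directory, and the simplified flush/excise methods are all hypothetical names chosen for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical model of the t1/t2/t3 timeline; not the real Cassandra classes.
public class OrphanHintRace {
    // Stand-in for the hints directory: host id -> hints written "on disk".
    static final Map<UUID, List<String>> hintFiles = new ConcurrentHashMap<>();

    // Simplified flush: creates the host's hint file if it does not exist yet.
    static void flush(UUID host, String hint) {
        hintFiles.computeIfAbsent(host, h -> new CopyOnWriteArrayList<>()).add(hint);
    }

    // Simplified excise: deletes the host's hint file.
    static void excise(UUID host) {
        hintFiles.remove(host);
    }

    // Replays t1, t2, t3 in order on a single worker thread and reports
    // whether a hint file for the removed host exists afterwards.
    static boolean replay() throws Exception {
        hintFiles.clear();
        UUID host = UUID.randomUUID();
        ExecutorService writer = Executors.newSingleThreadExecutor();
        try {
            writer.submit(() -> flush(host, "hint-1")).get(); // t1: node down, hints written
            writer.submit(() -> excise(host)).get();          // t2: node removed, file deleted
            writer.submit(() -> flush(host, "hint-2")).get(); // t3: late hint recreates the file
        } finally {
            writer.shutdown();
        }
        return hintFiles.containsKey(host);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("orphan file present: " + replay()); // prints "orphan file present: true"
    }
}
```

      In this model nothing ever deletes the file created at t3, which mirrors the orphan file left in the hints directory.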

      I was writing a new dtest for CASSANDRA-13562 and CASSANDRA-13308, and that helped me reproduce this new bug. I will submit a patch for this new dtest later.

      I also tried the following to see how this orphan hint file responds:
      1. I tried nodetool truncatehints <node>, but it fails because the node is no longer part of the ring.
      2. I then tried nodetool truncatehints, but that still doesn’t remove the hint file because it is not yet included in the dispatchDequeue.

      Steps to reproduce:
      Please find the attached dtest Python file gossip_hang_test.py, which reproduces this bug.

      Solution:
      This is due to the race condition described above. Since HintsWriteExecutor.java creates a thread pool with only 1 worker, the solution is fairly simple. Whenever we excise a host via HintsService.java::excise, just store it in memory, and check for an already evicted host inside HintsWriteExecutor.java::flush. If the host has already been evicted, ignore its hints.
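
      A minimal sketch of that idea follows. It is not the attached patch: ExcisedHostGuard, excisedHosts, and the in-memory map standing in for hint files are hypothetical names, and the real HintsService and HintsWriteExecutor APIs differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of "remember excised hosts, skip them in flush".
public class ExcisedHostGuard {
    // Stand-in for the hints directory: host id -> hints written "on disk".
    private final Map<UUID, List<String>> hintFiles = new ConcurrentHashMap<>();
    // Hosts that have been excised; consulted by every flush.
    private final Set<UUID> excisedHosts = ConcurrentHashMap.newKeySet();
    // Single worker, matching the one-thread pool described in the ticket.
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    Future<?> flush(UUID host, String hint) {
        return writer.submit(() -> {
            if (excisedHosts.contains(host)) {
                return; // host already removed: drop the hint, create no file
            }
            hintFiles.computeIfAbsent(host, h -> new ArrayList<>()).add(hint);
        });
    }

    Future<?> excise(UUID host) {
        excisedHosts.add(host); // record first, so later flushes see it
        // Deletion runs on the same single worker, so it is ordered after any
        // flush that was already queued or running when excise was called.
        return writer.submit(() -> { hintFiles.remove(host); });
    }

    boolean hasHintFile(UUID host) {
        return hintFiles.containsKey(host);
    }

    void shutdown() {
        writer.shutdown();
    }

    // Replays the t1/t2/t3 ordering; returns true if an orphan file remains.
    static boolean demo() throws Exception {
        ExcisedHostGuard guard = new ExcisedHostGuard();
        UUID host = UUID.randomUUID();
        try {
            guard.flush(host, "hint-1").get(); // t1
            guard.excise(host).get();          // t2
            guard.flush(host, "hint-2").get(); // t3: ignored by the guard
            return guard.hasHintFile(host);
        } finally {
            guard.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("orphan file present: " + demo()); // prints "orphan file present: false"
    }
}
```

      Because there is only one writer thread, the check in flush and the deletion in excise never interleave, so no extra locking is needed in this sketch. It does not cover a host with the same id rejoining later.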

      Jaydeep

        Attachments

        1. 13740-3.0.15.txt
          10 kB
          Jaydeepkumar Chovatia
        2. gossip_hang_test.py
          3 kB
          Jaydeepkumar Chovatia

            People

            • Assignee:
              chovatia.jaydeep@gmail.com Jaydeepkumar Chovatia
              Reporter:
              chovatia.jaydeep@gmail.com Jaydeepkumar Chovatia
              Authors:
              Jaydeepkumar Chovatia
              Reviewers:
              Aleksey Yeschenko
