[CASSANDRA-8815] Race in sstable ref counting during streaming failures - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 2.0.13
Component/s: None
Labels: None
Severity: Normal

Description

We have seen a machine in production whose read threads are all blocked (spinning) while trying to acquire a reference on sstables. Some stream sessions are stuck in the same way.
The heap dump shows a live sstable that is part of the View and has a ref count of 0. This sstable is not compacting, nor is it part of any failed compaction.

Looking through the code, we could see that if the ref count goes to zero while the sstable is still part of the View, all reader threads will spin forever.
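The spin described above can be modelled with a small, self-contained sketch. This is not the Cassandra source: `RefSpinSketch`, `Table`, `acquireAll` and `tryRead` are hypothetical names, and the retry loop is bounded here only so that the example terminates. It assumes the read path retries until it holds a reference on every sstable in the View, and that a reference can be acquired only while the count is still positive.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RefSpinSketch {
    /** Simplified stand-in for an sstable's reference counter (hypothetical). */
    public static final class Table {
        final AtomicInteger references;

        public Table(int initial) { references = new AtomicInteger(initial); }

        /** Succeeds only while the count is still positive. */
        public boolean acquire() {
            while (true) {
                int n = references.get();
                if (n <= 0) return false;               // already released: never succeeds again
                if (references.compareAndSet(n, n + 1)) return true;
            }
        }

        public void release() { references.decrementAndGet(); }
    }

    /** Acquires a reference on every table, or on none of them. */
    public static boolean acquireAll(List<Table> view) {
        int acquired = 0;
        for (Table t : view) {
            if (!t.acquire()) break;
            acquired++;
        }
        if (acquired == view.size()) return true;
        // Roll back the references taken before the failure.
        for (int i = 0; i < acquired; i++) view.get(i).release();
        return false;
    }

    /** Bounded version of the retry loop: returns the attempt that succeeded, or -1. */
    public static int tryRead(List<Table> view, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (acquireAll(view)) return attempt;
        }
        return -1; // an unbounded loop would spin here forever
    }

    public static void main(String[] args) {
        List<Table> healthy = Arrays.asList(new Table(1), new Table(1));
        List<Table> broken = Arrays.asList(new Table(1), new Table(0));
        System.out.println(tryRead(healthy, 1000)); // prints 1
        System.out.println(tryRead(broken, 1000));  // prints -1
    }
}
```

In this model, once one table in the View has a count of 0, no number of retries helps, because nothing ever swaps that table out of the View.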

Looking further through the streaming code, we could see that if StreamTransferTask.complete is called after closeSession has already been called due to an error in OutgoingMessageHandler, the ref count of an sstable is decremented twice.

This race can happen, and the exceptions in our logs show that closeSession was triggered by OutgoingMessageHandler.

The fix for this is very simple, I think. In StreamTransferTask.abort, we can remove a file from "files" before decrementing its ref count. This avoids the race.
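A minimal sketch of the proposed idea follows. It is not the attached patch: `TransferTaskSketch`, the `files` map keyed by sequence number, and the use of a bare `AtomicInteger` as the ref count are all assumptions made for illustration. The point it shows is that if both `complete` and `abort` remove the entry from `files` first and release only when the removal returned something, each file is released exactly once regardless of which call gets there first.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class TransferTaskSketch {
    // Hypothetical model: sequence number -> ref count of the sstable being streamed.
    private final Map<Integer, AtomicInteger> files = new ConcurrentHashMap<>();

    /** Registers a file for transfer and takes one reference on it. */
    public void addFile(int sequenceNumber, AtomicInteger refs) {
        refs.incrementAndGet();
        files.put(sequenceNumber, refs);
    }

    /** Called when the peer acknowledges a file. */
    public void complete(int sequenceNumber) {
        // remove() is atomic: only one caller ever gets the non-null entry.
        AtomicInteger refs = files.remove(sequenceNumber);
        if (refs != null) refs.decrementAndGet();
    }

    /** Called when the session is closed because of an error. */
    public void abort() {
        for (Integer sequenceNumber : files.keySet()) {
            // Remove first, then release, so a late complete() finds nothing to release.
            AtomicInteger refs = files.remove(sequenceNumber);
            if (refs != null) refs.decrementAndGet();
        }
    }

    public static void main(String[] args) {
        AtomicInteger refs = new AtomicInteger(1); // the reference held by the View
        TransferTaskSketch task = new TransferTaskSketch();
        task.addFile(7, refs);   // streaming takes a second reference
        task.abort();            // session fails, streaming reference released
        task.complete(7);        // late completion is now a no-op
        System.out.println(refs.get()); // prints 1
    }
}
```

If `abort` instead decremented every entry while leaving it in the map, the late `complete(7)` would decrement a second time and take the count to 0 while the sstable is still in the View.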

Attachments

8815.txt (1 kB), attached 17/Feb/15 17:34 by Benedict Elliott Smith

Issue Links

relates to

CASSANDRA-8829 Add extra checks to catch SSTable ref counting bugs

Resolved

People

Assignee: Benedict Elliott Smith

Reporter: Sankalp Kohli

Authors: Benedict Elliott Smith

Reviewers: Sankalp Kohli

Votes: 0

Watchers: 5

Dates

Created: 16/Feb/15 22:27

Updated: 16/Apr/19 09:31

Resolved: 19/Feb/15 11:55