We open handles to the index file and data file within an SSTableScanner and aren't participating in SSTR refcounting. SSTR.releaseReference attempts to delete these two files within the SSTableDeletingTask, giving rise to the error messages we're seeing when there's still SSTableScanners with those files open. This doesn't present on trunk in windows as FILE_SHARE_DELETE w/nio.2 allows files to be deleted w/other handles open to them transparently as in linux.
w/regards to this particular patch introducing (more) errors related to compaction and sstable deletion, it's a timing issue and there's a few ways I've experimented with to address this. All of them work but have different compromises. In order of least to most invasive:
- roll back the scoped-resource access in CompactionTask and target a scanners.close() right before
cfs.getDataTracker().markCompactedSSTablesReplaced(oldSStables, newSStables, compactionType);
This partially defeats the purpose of the earlier patch.
- move the call to markCompactedSSTablesReplaced to the end of CompactionTask.runWith after closing scope on the try block that opened them. This potentially delays said marking until stat calculation and logging is completed but scopes the resource access.
- have SSTableScanners acquireReference() and releaseReference() on their parent sstables on ctor and close() calls respectively
I have mixed feelings about all three options. I feel like option #3 is the most "correct" approach but it's also the most heavyweight by far and most prone to introduction of subtle bugs as it's imposing a new constraint on ref counting w/regards to SSTableScanner lifetimes. With regard to us trying to tighten up and reduce manual ref-counting this approach also feels like a significant step back.
For now (in 2.1.X land), I'm comfortable with rolling back the try w/resource usage on CompactionTask.runWith and then we think about this further, perhaps in a separate ticket if we feel it warranted. Digging into the unit tests on the 2.1 branch, it's apparent that this type of problem is pretty widespread on Windows as I get the same failures in many other unit tests (as far back as pre
CASSANDRA-6916 days). We may need to file these errors under "Cassandra-2.1 on Windows is beta, this is fixed on trunk", keeping in mind that these issues resolve once the scanners are closed. Might be worth considering dropping the logging severity on deletion failure from error to warn as it's going to be common on 2.1 currently.
Having said that - option 3 has a pretty profound impact on the 2.1 branch w/regards to Windows:
$ grep 'Unable to delete' 2.1_head.testresults | wc -l
$ grep 'Unable to delete' opt3_heavy.testresults | wc -l
We go from 9 actual test failures to 8 w/this change on Windows, and all currently passing unit tests continue to pass as expected on linux. I put in some logic to check if we use an SSTableScanner after the corresponding SSTR goes through the tidy process and, at least in all our unit tests, we don't appear to try and use a scanner after deleting the underlying SSTR.
Attaching both a conservative (option 1) and invasive (option 3) patch. My vote's for 1 but think there's merit in considering the implications of 3 and our current refcounting (and the meta purpose it serves).