Can anyone comment on the risk of a user (such as myself) backporting this fix and patching locally? The code that was changed in the patch looks identical in 1.1.11.
We have a column family with lots of deletes running under leveled compaction. The validation doesn't take too long, but afterwards we get about 2k compaction tasks that take several hours to run, when really there shouldn't be any inconsistency. What I suspect is happening is that, as tombstones pass gc_grace, they have been compacted away on some nodes but not on others at the time repair runs, so repair sees the replicas as out of sync. I suspect the majority of the 2k compactions are just those expired tombstones being streamed back in sync.
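To make the suspicion concrete, here is a toy sketch of the effect. It is not Cassandra's validation code: the data layout, the `range_digest` helper, and the `skip_gcable` switch are all hypothetical, and the gc_grace value is just an example. It only shows that two replicas holding the same live data can hash differently if one of them still carries a tombstone that is past gc_grace, and that the mismatch disappears if such tombstones are ignored when hashing.

```python
import hashlib

GC_GRACE_SECONDS = 864000  # example value (10 days); an assumption for this sketch


def is_gcable(cell, now):
    """A tombstone is gcable once gc_grace has elapsed since the delete."""
    deleted_at = cell["deleted_at"]
    return deleted_at is not None and deleted_at + GC_GRACE_SECONDS <= now


def range_digest(replica, now, skip_gcable):
    """Hash every cell in a (toy) replica; optionally ignore gcable tombstones."""
    h = hashlib.md5()
    for key in sorted(replica):
        cell = replica[key]
        if skip_gcable and is_gcable(cell, now):
            continue
        h.update(repr((key, cell["value"], cell["deleted_at"])).encode())
    return h.hexdigest()


now = 2_000_000
live = {"value": "v1", "deleted_at": None}
expired_tombstone = {"value": None, "deleted_at": now - GC_GRACE_SECONDS - 1}

# Node A has already compacted the expired tombstone away; node B has not.
node_a = {"k1": live}
node_b = {"k1": live, "k2": expired_tombstone}

naive_mismatch = range_digest(node_a, now, False) != range_digest(node_b, now, False)
filtered_mismatch = range_digest(node_a, now, True) != range_digest(node_b, now, True)
print(naive_mismatch, filtered_mismatch)  # True False
```

If this model matches what is happening, the streaming and follow-up compactions would be caused purely by compaction timing differences between nodes, not by any real data inconsistency.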
I'm setting up a test environment with baseline data. The plan is to reproduce the repair, reset to baseline, and re-run the repair with this patch applied to see whether this is indeed the issue. This might take a few days to set up and run.
Cassandra is mission and business critical for us. Moving to 1.2 will take some time, as we need to set up a test environment, practice migrations, and test. We also use the ByteOrderedPartitioner, which concerns me in general: it's not the most popular way to run Cassandra, and it may be a source of issues since it's pounded on less by the general user community.