Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: 4.0
    • Component/s: None
    • Labels:
      None

      Description

      At the top of CompactionManager#performAntiCompaction, the parent repair session is loaded, if the session can't be found, a RuntimeException is thrown. This can happen if a participant is evicted after the IR prepare message is received, but before the anticompaction starts. This exception is thrown outside of the try/finally block that guards the sstable and lifecycle transaction, causing them to leak, and preventing the sstables from ever being removed from View.compacting.

        Activity

        Hide
        bdeggleston Blake Eggleston added a comment - - edited

        Branch below. This also fixes 2 edge cases I found while diagnosing this, adds some validation to CompactionTask, and tests them

        1. If a promotion/demotion compaction fails, none of the sstables that have had their repairedAt/pendingRepair value changed prior to the failure were being moved to the correct strategy.
        2. Removal of the compaction strategy for a given repair session doesn't check that the strategy is empty before removing it. If there are sstables still in the strategy, they will be left in compaction limbo until the node is bounced (or the compaction strategy is reloaded)
        3. CompactionTask wasn't validating that repaired/unrepaired/pending repair sstables weren't being compacted together.

        trunk
        utests

        Show
        bdeggleston Blake Eggleston added a comment - - edited Branch below. This also fixes 2 edge cases I found while diagnosing this, adds some validation to CompactionTask, and tests them 1. If a promotion/demotion compaction fails, none of the sstables that have had their repairedAt/pendingRepair value changed prior to the failure were being moved to the correct strategy. 2. Removal of the compaction strategy for a given repair session doesn't check that the strategy is empty before removing it. If there are sstables still in the strategy, they will be left in compaction limbo until the node is bounced (or the compaction strategy is reloaded) 3. CompactionTask wasn't validating that repaired/unrepaired/pending repair sstables weren't being compacted together. trunk utests
        Hide
        aweisberg Ariel Weisberg added a comment -

        So there is the issue here https://github.com/apache/cassandra/compare/trunk...bdeggleston:13688?expand=1#diff-d4e3b82e9bebfd2cb466b4a30af07fa4R610 we talked about where it's using the wrong result value to decide whether to clean up.

        Maybe run the dtests, but otherwise it looks good.

        Show
        aweisberg Ariel Weisberg added a comment - So there is the issue here https://github.com/apache/cassandra/compare/trunk...bdeggleston:13688?expand=1#diff-d4e3b82e9bebfd2cb466b4a30af07fa4R610 we talked about where it's using the wrong result value to decide whether to clean up. Maybe run the dtests, but otherwise it looks good.
        Hide
        bdeggleston Blake Eggleston added a comment -

        Pushed up fix for that here.

        Just for posterity, the problem being fixed here is that submitIfRunning will return a cancelled future if the executor is shutdown. However, we weren't checking the returned future, we were checking the submitted task, which shouldn't ever be cancelled.

        new utests
        dtests

        Show
        bdeggleston Blake Eggleston added a comment - Pushed up fix for that here . Just for posterity, the problem being fixed here is that submitIfRunning will return a cancelled future if the executor is shutdown. However, we weren't checking the returned future, we were checking the submitted task, which shouldn't ever be cancelled. new utests dtests
        Hide
        aweisberg Ariel Weisberg added a comment -

        First dtest run timed out. dtests ran here https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/127/
        Does that materialized view test look like anything to you? I don't recall seeing it fail recently. Don't see it failing in the last 30 builds on trunk. I kicked off the dtests again.

        Show
        aweisberg Ariel Weisberg added a comment - First dtest run timed out. dtests ran here https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/127/ Does that materialized view test look like anything to you? I don't recall seeing it fail recently. Don't see it failing in the last 30 builds on trunk. I kicked off the dtests again.
        Show
        aweisberg Ariel Weisberg added a comment - Will we ever get a clean run of this? https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/142/

          People

          • Assignee:
            bdeggleston Blake Eggleston
            Reporter:
            bdeggleston Blake Eggleston
            Reviewer:
            Ariel Weisberg
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:

              Development