Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-16668

Intermittent failure of SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race condition when shrinking maximum pool size to zero

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Normal
    • Resolution: Fixed
    • 4.1, 4.0-rc2, 4.0
    • Local/Other
    • None

    Description

      A difficult-to-hit race condition exists in changingMaxWorkersMeetsConcurrencyGoalsTest when changing the maximum pool size from 0 -> 4 which results in the test failing like so:

      junit.framework.AssertionFailedError: Test tasks did not hit max concurrency goal expected:<true> but was:<false>junit.framework.AssertionFailedError: Test tasks did not hit max concurrency goal expected:<true> but was:<false> at org.apache.cassandra.concurrent.SEPExecutorTest.assertMaxTaskConcurrency(SEPExecutorTest.java:198) at org.apache.cassandra.concurrent.SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest(SEPExecutorTest.java:132)

      I can hit this issue maybe 2/3 times for every 100 invocations of the unit test.

      The issue that causes the failure is that if tasks are still enqueued when the maximum pool size is set to zero and if all of the SEPWorker threads enter the STOP state before the pool size is bumped to 4, then no SEPWorker threads will be spun up to service the task queue. This causes the above error.

      Why don't we spin up SEPWorker threads when enqueing tasks? Because of the guard logic in addTask: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPExecutor.java#L113,L121

      In this scenario taskPermits will not be zero (because we have tasks on the queue) so we never call maybeStartSpinningWorker().

      A trick to make this issue much easier to hit is to insert a Thread.sleep(500) immediately after setting the pool size to zero. This has the effect of guaranteeing that all SEPWorker threads will be STOP'd before enqueueing more work.

      Here's a fix that attempts to spin up an SEPWorker whenever we grow the number of work permits: https://github.com/mfleming/cassandra/commit/071516d29e41da9924af24e8002822d3c6af0e01

      Attachments

        Issue Links

          Activity

            People

              mfleming Matt Fleming
              mfleming Matt Fleming
              Matt Fleming
              Andres de la Peña, Ekaterina Dimitrova, Jon Meredith
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: