[CASSANDRA-16668] Intermittent failure of SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race condition when shrinking maximum pool size to zero - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 4.0-rc2, 4.0, 4.1-alpha1, 4.1
Component/s: Local/Other
Labels:
None

Bug Category:
Code - Bug - Unclear Impact
Severity:
Low
Complexity:
Normal
Discovered By:
Unit Test
Platform:

All
Impacts:

None
Source Control Link:

https://github.com/apache/cassandra/commit/8cd02afce972ecaf0e0cf0fe09c610d67d9af9c5
Test and Documentation Plan:

Hide

https://github.com/mfleming/cassandra/commit/071516d29e41da9924af24e8002822d3c6af0e01

Show
https://github.com/mfleming/cassandra/commit/071516d29e41da9924af24e8002822d3c6af0e01

Description

A difficult-to-hit race condition exists in changingMaxWorkersMeetsConcurrencyGoalsTest when changing the maximum pool size from 0 -> 4 which results in the test failing like so:

junit.framework.AssertionFailedError: Test tasks did not hit max concurrency goal expected:<true> but was:<false>junit.framework.AssertionFailedError: Test tasks did not hit max concurrency goal expected:<true> but was:<false> at org.apache.cassandra.concurrent.SEPExecutorTest.assertMaxTaskConcurrency(SEPExecutorTest.java:198) at org.apache.cassandra.concurrent.SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest(SEPExecutorTest.java:132)

I can hit this issue maybe 2/3 times for every 100 invocations of the unit test.

The issue that causes the failure is that if tasks are still enqueued when the maximum pool size is set to zero and if all of the SEPWorker threads enter the STOP state before the pool size is bumped to 4, then no SEPWorker threads will be spun up to service the task queue. This causes the above error.

Why don't we spin up SEPWorker threads when enqueing tasks? Because of the guard logic in addTask: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPExecutor.java#L113,L121

In this scenario taskPermits will not be zero (because we have tasks on the queue) so we never call maybeStartSpinningWorker().

A trick to make this issue much easier to hit is to insert a Thread.sleep(500) immediately after setting the pool size to zero. This has the effect of guaranteeing that all SEPWorker threads will be STOP'd before enqueueing more work.

Here's a fix that attempts to spin up an SEPWorker whenever we grow the number of work permits: https://github.com/mfleming/cassandra/commit/071516d29e41da9924af24e8002822d3c6af0e01

Attachments

Issue Links

is related to

CASSANDRA-15709 SEPExecutorTest changingMaxWorkersMeetsConcurrencyGoalsTest failure

Resolved

Activity

People

Assignee:: Matt Fleming

Reporter:: Matt Fleming

Authors:: Matt Fleming

Reviewers:: Andres de la Peña, Ekaterina Dimitrova, Jon Meredith

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 13/May/21 21:13

Updated:: 27/May/22 19:25

Resolved:: 21/May/21 18:34