After upgrading surefire in our project (dCache) from 2.19.1 to 3.0.0-M3, unit tests started to fail with the message "ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called?"
For reference, the command I am using to verify this problem is "mvn -am -pl modules/common clean package" and the surefire configuration is:
<!-- dCache uses the singleton anti-pattern in way
too many places. That unfortunately means we have
to accept the overhead of forking each test run. -->
[The complete pom.xml is attached.]
This problem is not always present. On our build machine, I've seen the problem appear 6 out of 10 times when running the above mvn command. I have found nothing that obviously influences whether a given build will succeed or fail.
[I've attached the complete output from running the above mvn command, both the normal output and including the -e -X options.]
The problem seems to appear only on machines with a "large" number of cores. Our build machine has 24 cores, and I've seen a report of a similar problem when building dCache on a 48 core machine. On the other hand, I have been unable to reproduce the problem on my desktop machine (8 cores) or on my laptop (4 cores).
What seems to matter is the number of actually running JVM instances.
I have not been able to reproduce the problem by increasing the forkCount on a machine with a small number of cores. However, I've noticed that, on an 8 core machine, increasing the forkCount does not actually result in that many more JVM instances running.
Similarly, experience shows that reducing the number of concurrent JVM instances "fixes" the problem. A forkCount of 6 seems to bring the likelihood of a problem below 10% (0 failures in 10 builds) on our build machine. On this machine, the default configuration would try to run 24 JVM instances concurrently (forkCount of "1C" on a 24 core machine).
The problem appears to have been introduced in surefire v2.20. When building with surefire v2.19.1, the above mvn command is always successful on our build machine. Building with surefire v2.20 results in intermittent failures (~60% failure rate).
Using git bisection (and with the criterion for "good" as zero failures in 10 build attempts), I was able to determine that commit da7ff6aa2 "SUREFIRE-1342 Acknowledge normal exit of JVM and drain shared memory between processes" is the first commit where surefire has this intermittent failure behaviour.
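As a rough sanity check on that bisection criterion: if a "bad" commit fails at roughly the 60% rate seen with v2.20, and builds are independent (both are assumptions, not measurements for each bisected commit), then the chance of a bad commit passing all 10 builds and being mislabelled "good" is small:

```latex
P(\text{10 consecutive successes}) = (1 - 0.6)^{10} = 0.4^{10} \approx 1.0 \times 10^{-4}
```

A commit with a much lower failure rate would be easier to mislabel, so this only supports the bisection result under the stated assumptions.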
From a casual scan through the patch, my guess is that the BYE_ACK support it introduces is somehow racy (for example, reading or updating a member field outside of a monitor) and that problems are triggered when a large number of JVMs exit concurrently. So, with an increased number of concurrent JVMs, there is an increased risk of a thread losing the race and so triggering this error.
Such a problem would be consistent with observed behaviour. However, I don't have any strong evidence that this is what is happening.
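To make concrete the kind of race I have in mind, here is a minimal Java sketch. It is not taken from the surefire patch: the class ByeAckSketch, its field and its methods are hypothetical names invented for illustration, and I am only assuming that the real code has some flag that one thread sets and another thread waits on. The sketch shows the safe pattern, where the flag is only read and written while holding the same monitor. If the waiter checked the flag outside the monitor, an acknowledgement arriving between the check and the wait could be missed.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration only; this is NOT surefire's code. */
public class ByeAckSketch {
    private final Object lock = new Object();
    private boolean acked; // guarded by lock

    /** Called by the thread that sees the acknowledgement arrive. */
    public void acknowledge() {
        synchronized (lock) {
            acked = true;
            lock.notifyAll();
        }
    }

    /** Waits up to timeoutMillis for the acknowledgement; returns whether it arrived. */
    public boolean awaitAck(long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        synchronized (lock) {
            // Re-check the flag in a loop: guards against spurious wake-ups
            // and against an acknowledgement that arrived before we waited.
            while (!acked) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    return false;
                }
                // wait(0) would wait forever, so round up to at least 1 ms.
                lock.wait(Math.max(1, TimeUnit.NANOSECONDS.toMillis(remainingNanos)));
            }
            return true;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ByeAckSketch sketch = new ByeAckSketch();
        Thread acker = new Thread(sketch::acknowledge);
        acker.start();
        System.out.println("ack seen: " + sketch.awaitAck(5000));
        acker.join();
    }
}
```

Whether the real BYE_ACK handling deviates from this pattern is exactly what I have not verified.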