Maven Surefire / SUREFIRE-1719

Race condition results in "VM crash or System.exit called?" failure

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.20, 2.20.1, 2.21.0, 2.22.0, 2.22.1, 2.22.2, 3.0.0-M1, 3.0.0-M2, 3.0.0-M3
    • Fix Version/s: 3.0.0-M5
    • Component/s: Maven Surefire Plugin
    • Labels:
      None

      Description

      After upgrading surefire in our project (dCache) from 2.19.1 to 3.0.0-M3, unit tests started to fail with the message "ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called?"

      For reference, the command I am using to verify this problem is "mvn -am -pl modules/common clean package" and the surefire configuration is:

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <configuration>
          <includes>
            <include>**/*Test.class</include>
            <include>**/*Tests.class</include>
          </includes>
          <!-- dCache uses the singleton anti-pattern in way
          too many places. That unfortunately means we have
          to accept the overhead of forking each test run. -->
          <forkCount>1C</forkCount>
          <reuseForks>false</reuseForks>
        </configuration>
      </plugin>

      [The complete pom.xml is attached.]

      This problem is not always present. On our build machine, I've seen the problem appear 6 out of 10 times when running the above mvn command. Nothing obvious seems to influence whether the build succeeds or fails.

      [I've attached the complete output from running the above mvn command, both the normal output and including the -e -X options.]

      The problem seems to appear only on machines with a "large" number of cores. Our build machine has 24 cores, and I've seen a report of a similar problem when building dCache on a 48-core machine. On the other hand, I have been unable to reproduce the problem on my desktop machine (8 cores) or on my laptop (4 cores).

      What seems to matter is the number of actually running JVM instances.

      I have not been able to reproduce the problem by increasing the forkCount on a machine with a small number of cores. However, I've noticed that, on an 8 core machine, increasing the forkCount does not actually result in that many more JVM instances running.

      Similarly, experience shows that reducing the number of concurrent JVM instances "fixes" the problem. A forkCount of 6 seems to bring the likelihood of a problem below 10% (0 failures with 10 builds) on our build machine. On this machine, the default configuration would try to run 24 JVM instances concurrently (forkCount of "1C" on a 24 core machine).
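
      To make the fork arithmetic above concrete, here is a minimal Java sketch of how a forkCount value could be resolved to a number of forks. It assumes the documented convention that a trailing "C" multiplies the value by the number of available cores. The class and method names are hypothetical, this is not Surefire's actual code, and the way fractional results are rounded is an assumption.

```java
// Hypothetical helper for illustration only; not Surefire's implementation.
final class ForkCountSketch {

    /** Resolves a forkCount string such as "6" or "1C" to a number of forks. */
    static int effectiveForks(String forkCount, int availableCores) {
        String value = forkCount.trim();
        if (value.endsWith("C")) {
            // A trailing "C" means "multiply by the number of cores".
            String multiplierText = value.substring(0, value.length() - 1);
            double multiplier = Double.parseDouble(multiplierText);
            // Assumption: truncate any fractional result.
            return (int) (multiplier * availableCores);
        }
        // A plain number is used as-is.
        return Integer.parseInt(value);
    }

    public static void main(String[] args) {
        // The 24-core build machine with the default configuration from this report.
        System.out.println(effectiveForks("1C", 24));
        // The reduced setting that avoided the failure in 10 builds.
        System.out.println(effectiveForks("6", 24));
        // The value the same "1C" setting would give on the current machine.
        System.out.println(effectiveForks("1C", Runtime.getRuntime().availableProcessors()));
    }
}
```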

      The problem appears to have been introduced in surefire v2.20. When building with surefire v2.19.1, the above mvn command is always successful on our build machine. Building with surefire v2.20 results in intermittent failures (~60% failure rate).

      Using git bisection (and with the criterion for "good" as zero failures in 10 build attempts), I was able to determine that commit da7ff6aa2 "SUREFIRE-1342 Acknowledge normal exit of JVM and drain shared memory between processes" is the first commit where surefire has this intermittent failure behaviour.
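
      As a rough check on the "zero failures in 10 build attempts" criterion, the sketch below estimates how likely an affected commit is to be misclassified as "good". It assumes that builds are independent and that every affected commit fails at the ~60% rate observed with v2.20. Both are assumptions rather than measurements, and the class and method names are hypothetical.

```java
// Hypothetical helper for illustration only.
final class BisectConfidenceSketch {

    /** Probability that all builds pass, given an independent per-build failure rate. */
    static double probabilityAllPass(double failureRate, int builds) {
        return Math.pow(1.0 - failureRate, builds);
    }

    public static void main(String[] args) {
        // Chance that a commit failing ~60% of the time passes 10 builds in a row.
        double falseGood = probabilityAllPass(0.60, 10);
        System.out.printf("P(affected commit passes 10 builds) = %.6f%n", falseGood);
    }
}
```

      Under those assumptions the chance of a false "good" is roughly 0.0001 (about 0.01%) per bisection step, so ten attempts per commit are enough to trust the bisection result.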

      From a casual scan through the patch, my guess is that the BYE_ACK support it introduces is somehow racy (for example, reading or updating a member field outside of a monitor) and that problems are triggered if a large number of JVMs exit concurrently. So, with an increased number of concurrent JVMs there is an increased risk of a thread losing the race, and so triggering this error.

      Such a problem would be consistent with observed behaviour. However, I don't have any strong evidence that this is what is happening.
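
      For readers unfamiliar with this class of bug, the sketch below shows the general pattern the guess above refers to: a flag that one thread sets and another thread waits on, with every read and write done while holding the same monitor. It is a generic illustration only. The class, field, and method names are hypothetical, and it is not taken from the Surefire patch, whose actual BYE_ACK handling may look quite different.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of a monitor-guarded acknowledgement flag.
final class ByeAckSketch {
    private final Object lock = new Object();
    private boolean acknowledged; // only read or written while holding 'lock'

    /** Called by the thread that receives the acknowledgement. */
    void acknowledge() {
        synchronized (lock) {
            acknowledged = true;
            lock.notifyAll(); // wake any thread blocked in awaitAck
        }
    }

    /** Waits for the acknowledgement; returns false on timeout or interruption. */
    boolean awaitAck(long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        synchronized (lock) {
            try {
                // Loop to tolerate spurious wake-ups.
                while (!acknowledged) {
                    long remainingNanos = deadline - System.nanoTime();
                    if (remainingNanos <= 0) {
                        return false;
                    }
                    TimeUnit.NANOSECONDS.timedWait(lock, remainingNanos);
                }
                return true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                return false;
            }
        }
    }

    public static void main(String[] args) {
        ByeAckSketch ack = new ByeAckSketch();
        new Thread(ack::acknowledge).start();
        System.out.println("acknowledged in time: " + ack.awaitAck(1000));
    }
}
```

      If the flag were instead read without holding the monitor, a waiting thread could miss an update made by another thread, which is the kind of lost race that would become more likely as more JVMs exit at the same time.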

        Attachments

        1. pom.xml (59 kB, Paul Millar)
        2. build-error-debug.out (566 kB, Paul Millar)
        3. build.out (18 kB, Paul Millar)

            People

            • Assignee: Tibor Digana (tibordigana)
            • Reporter: Paul Millar (paulmillar)