Lucene - Core
  1. Lucene - Core
  2. LUCENE-5786

Unflushed/ truncated events file (hung testing subprocess)

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, Trunk
    • Component/s: general/test
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This has happened several times on Jenkins, typically on SSLMigrationTest.testDistribSearch, but probably on other tests as well.

      The symptom is: the test framework never terminates, it also reports an incorrect hung test.

      The problem is that the actual forked JVM is hung on reading stdin, waiting for the next test suite (no test thread is present); the master process is hung on receiving data from the forked jvm (both the events file and stdout spill is truncated in the middle of a test). The last output is:

      [
        "APPEND_STDERR",
        {
          "chunk": "612639 T30203 oasu.DefaultSolrCoreState.doRecovery Running recovery - first canceling any ongoing recovery%0A"
        }
      ]
      
      [
        "APPEND_STDERR"
      

      Overall, it looks insane – there are flushes after each test completes (normally or not), there are tests following the one that last reported output and before dynamic suites on stdin.

      I have no idea. The best explanation is insane – looks like the test thread just died in the middle of executing Java code...

        Issue Links

          Activity

          Hide
          Dawid Weiss added a comment -

          I think this may actually be a bug in RR code. Here's why – writing is done via RandomAccessFile and it used to have flush() method but at some point it was commented out:

             @Override
             public void flush() throws IOException {
          -    raf.getChannel().force(true);
          +    // This was causing intermittent channel invalidations on Windows for
          +    // no apparent reason. Also, it shouldn't be a problem if we don't sync
          +    // with the disk (and use OS cache only)?
          +    // raf.getChannel().force(true);
             }
          

          So it either randomly fails on Windows or it will not sync properly on FreeBSD. Nice. I'll revert to simple FileOutputStream and see if this improves things.

          Show
          Dawid Weiss added a comment - I think this may actually be a bug in RR code. Here's why – writing is done via RandomAccessFile and it used to have flush() method but at some point it was commented out: @Override public void flush() throws IOException { - raf.getChannel().force( true ); + // This was causing intermittent channel invalidations on Windows for + // no apparent reason. Also, it shouldn't be a problem if we don't sync + // with the disk (and use OS cache only)? + // raf.getChannel().force( true ); } So it either randomly fails on Windows or it will not sync properly on FreeBSD. Nice. I'll revert to simple FileOutputStream and see if this improves things.
          Hide
          Dawid Weiss added a comment -

          I reverted to simple output stream instead of using RAF. Let's see how it goes. Will update to RR 2.1.4 in a second.

          https://github.com/carrotsearch/randomizedtesting/issues/170

          Show
          Dawid Weiss added a comment - I reverted to simple output stream instead of using RAF. Let's see how it goes. Will update to RR 2.1.4 in a second. https://github.com/carrotsearch/randomizedtesting/issues/170
          Hide
          ASF subversion and git services added a comment -

          Commit 1605024 from Dawid Weiss in branch 'dev/trunk'
          [ https://svn.apache.org/r1605024 ]

          LUCENE-5786: Unflushed/ truncated events file (hung testing subprocess)

          Show
          ASF subversion and git services added a comment - Commit 1605024 from Dawid Weiss in branch 'dev/trunk' [ https://svn.apache.org/r1605024 ] LUCENE-5786 : Unflushed/ truncated events file (hung testing subprocess)
          Hide
          ASF subversion and git services added a comment -

          Commit 1605025 from Dawid Weiss in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1605025 ]

          LUCENE-5786: Unflushed/ truncated events file (hung testing subprocess). Point at Sonatype's https repo directly.

          Show
          ASF subversion and git services added a comment - Commit 1605025 from Dawid Weiss in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1605025 ] LUCENE-5786 : Unflushed/ truncated events file (hung testing subprocess). Point at Sonatype's https repo directly.
          Hide
          Dawid Weiss added a comment -

          Switching to FileOutputStream doesn't do anything – the next build still hung. I'm running out of ideas. The event file is truncated at an impossible place, the test thread gone.

          Show
          Dawid Weiss added a comment - Switching to FileOutputStream doesn't do anything – the next build still hung. I'm running out of ideas. The event file is truncated at an impossible place, the test thread gone.
          Hide
          Dawid Weiss added a comment -

          I think the problem underlying that weird behavior may really be related to terminating methods inside native code and their unpredictable behavior on FreeBSD – see SOLR-6204.

          Show
          Dawid Weiss added a comment - I think the problem underlying that weird behavior may really be related to terminating methods inside native code and their unpredictable behavior on FreeBSD – see SOLR-6204 .
          Hide
          ASF subversion and git services added a comment -

          Commit 1606002 from Dawid Weiss in branch 'dev/trunk'
          [ https://svn.apache.org/r1606002 ]

          LUCENE-5786: adding more debugging to the test framework.

          Show
          ASF subversion and git services added a comment - Commit 1606002 from Dawid Weiss in branch 'dev/trunk' [ https://svn.apache.org/r1606002 ] LUCENE-5786 : adding more debugging to the test framework.
          Hide
          ASF subversion and git services added a comment -

          Commit 1606003 from Dawid Weiss in branch 'dev/trunk'
          [ https://svn.apache.org/r1606003 ]

          LUCENE-5786: adding more debugging to the test framework.

          Show
          ASF subversion and git services added a comment - Commit 1606003 from Dawid Weiss in branch 'dev/trunk' [ https://svn.apache.org/r1606003 ] LUCENE-5786 : adding more debugging to the test framework.
          Hide
          ASF subversion and git services added a comment -

          Commit 1606005 from Dawid Weiss in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1606005 ]

          LUCENE-5786: adding more debugging to the test framework.

          Show
          ASF subversion and git services added a comment - Commit 1606005 from Dawid Weiss in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1606005 ] LUCENE-5786 : adding more debugging to the test framework.
          Hide
          Dawid Weiss added a comment -

          I've temporarily disabled ALL active configurations on freebsd jenkins. That is, the following:

          • Lucene-Solr-NightlyTests-trunk
          • Lucene-Artifacts-trunk
          • Lucene-Solr-Clover-4.x
          • Lucene-Solr-Clover-trunk
          • Lucene-Solr-Maven-4.x
          • Lucene-Solr-Maven-trunk
          • Lucene-Solr-NightlyTests-4.x
          • Lucene-Artifacts-4.x
          • Lucene-Solr-SmokeRelease-4.x
          • Lucene-Solr-SmokeRelease-trunk
          • Lucene-Solr-Tests-4.x-Java7
          • Lucene-Solr-Tests-trunk-Java7
          • Solr-Artifacts-4.x
          • Solr-Artifacts-trunk

          Most of these didn't have a successful run in what seems like months, so I don't think it's a big problem. Still digging what's causing the death of the test thread (apart from socket interrupt issue we know about there's still something odd about those tests on freebsd).

          Show
          Dawid Weiss added a comment - I've temporarily disabled ALL active configurations on freebsd jenkins. That is, the following: Lucene-Solr-NightlyTests-trunk Lucene-Artifacts-trunk Lucene-Solr-Clover-4.x Lucene-Solr-Clover-trunk Lucene-Solr-Maven-4.x Lucene-Solr-Maven-trunk Lucene-Solr-NightlyTests-4.x Lucene-Artifacts-4.x Lucene-Solr-SmokeRelease-4.x Lucene-Solr-SmokeRelease-trunk Lucene-Solr-Tests-4.x-Java7 Lucene-Solr-Tests-trunk-Java7 Solr-Artifacts-4.x Solr-Artifacts-trunk Most of these didn't have a successful run in what seems like months, so I don't think it's a big problem. Still digging what's causing the death of the test thread (apart from socket interrupt issue we know about there's still something odd about those tests on freebsd).
          Hide
          Uwe Schindler added a comment -

          Would it be possible to keep the artifacts tasks alive? Those do not run tests, but they provide (Maven-) artifacts to our users.

          Show
          Uwe Schindler added a comment - Would it be possible to keep the artifacts tasks alive? Those do not run tests, but they provide (Maven-) artifacts to our users.
          Hide
          Dawid Weiss added a comment -

          I will re-enable all test plans I disabled once I'm done testing. I wanted a faster build cycle. It shouldn't take too long (if I don't find it until Monday I'll probably give up).

          Show
          Dawid Weiss added a comment - I will re-enable all test plans I disabled once I'm done testing. I wanted a faster build cycle. It shouldn't take too long (if I don't find it until Monday I'll probably give up).
          Hide
          Dawid Weiss added a comment -

          The problem is caused by stack overflow in Solr code (SOLR-6213) which in turn breaks the forked JVM's communication layer with the master test controller. Socket interrupt issue is probably triggering the bug.

          Unfixable at test framework layer I think. I'll try to improve logging and make the forked JVM at least quit in such a scenario.

          Show
          Dawid Weiss added a comment - The problem is caused by stack overflow in Solr code ( SOLR-6213 ) which in turn breaks the forked JVM's communication layer with the master test controller. Socket interrupt issue is probably triggering the bug. Unfixable at test framework layer I think. I'll try to improve logging and make the forked JVM at least quit in such a scenario.
          Hide
          Uwe Schindler added a comment -

          I updated the OpenJDK port on lucene-zones.apache.org.

          Show
          Uwe Schindler added a comment - I updated the OpenJDK port on lucene-zones.apache.org.
          Hide
          ASF subversion and git services added a comment -

          Commit 1607058 from Dawid Weiss in branch 'dev/trunk'
          [ https://svn.apache.org/r1607058 ]

          LUCENE-5786: Unflushed/ truncated events file (hung testing subprocess). Updating RR to 2.1.6

          Show
          ASF subversion and git services added a comment - Commit 1607058 from Dawid Weiss in branch 'dev/trunk' [ https://svn.apache.org/r1607058 ] LUCENE-5786 : Unflushed/ truncated events file (hung testing subprocess). Updating RR to 2.1.6
          Hide
          ASF subversion and git services added a comment -

          Commit 1607060 from Dawid Weiss in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1607060 ]

          LUCENE-5786: Unflushed/ truncated events file (hung testing subprocess). Updating RR to 2.1.6

          Show
          ASF subversion and git services added a comment - Commit 1607060 from Dawid Weiss in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1607060 ] LUCENE-5786 : Unflushed/ truncated events file (hung testing subprocess). Updating RR to 2.1.6
          Hide
          Dawid Weiss added a comment -

          I upgraded to RR 2.1.6; it should halt the forked JVM on any communication with the master now so no more hangs, hopefully.

          Show
          Dawid Weiss added a comment - I upgraded to RR 2.1.6; it should halt the forked JVM on any communication with the master now so no more hangs, hopefully.

            People

            • Assignee:
              Dawid Weiss
              Reporter:
              Dawid Weiss
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development