Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.15.0
    • Fix Version/s: hudson
    • Component/s: build
    • Labels:
      None

      Description

      Hudson should kill long-running tests. (I believe it is supposed to, but it doesn't seem to do the job if the test is really hung.)

      It would be nice if, when the timer goes off, Hudson did a

      kill -QUIT

      (to try to get a thread dump) and then followed that with a

      kill -9

      (See the section "Killing a hung test" at http://wiki.apache.org/lucene-hadoop/HudsonBuildServer )
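
      The two-step kill requested above could be sketched roughly as follows. This is only an illustration, assuming a POSIX system and that the process ID of the hung test is already known; the function name `dump_then_kill` and the grace period are hypothetical and not part of Hudson.

      ```python
      import os
      import signal
      import time


      def dump_then_kill(pid, grace_seconds=5.0):
          """Ask for a thread dump, wait briefly, then force-kill the process.

          Hypothetical helper: mirrors `kill -QUIT <pid>` followed by `kill -9 <pid>`.
          """
          # A JVM that receives SIGQUIT prints a thread dump and keeps running.
          os.kill(pid, signal.SIGQUIT)
          # Give the process a moment to finish writing the dump.
          time.sleep(grace_seconds)
          try:
              os.kill(pid, signal.SIGKILL)
          except ProcessLookupError:
              # The process was already gone, so there is nothing left to kill.
              pass
      ```

      Note that a process which does not handle SIGQUIT is normally terminated by it, in which case the second signal is harmless.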

        Activity

        Jim Kellerman created issue -
        Nigel Daley added a comment -

        An RFE has been filed with Hudson:
        https://hudson.dev.java.net/issues/show_bug.cgi?id=789

        But I think the problem is with JUnit. JUnit is supposed to time out a test if it takes longer than 15 minutes. This doesn't seem to work reliably if a test gets really 'wedged'.

        Note too that having Hudson time out a patch build won't have the effect you desire. It will simply hang the patch queue, since the 'current' link on the filesystem to the patch being tested won't get removed.

        Jim Kellerman added a comment -

        On Wed, 2007-09-05 at 14:21 -0700, Nigel Daley (JIRA) wrote:
        > But I think the problem is with Junit. JUnit is supposed to timeout a test if it is
        > taking longer than 15 minutes. This doesn't seem to work reliably if a test gets really
        > 'wedged'.

        Understood. But how difficult would it be to start a subprocess from the build just prior to starting a test, and have it monitor the test and kill it if it takes too long?

        (See the section "Killing a hung test" at http://wiki.apache.org/lucene-hadoop/HudsonBuildServer )

        Once the test has been killed, or if the test exits normally, the subprocess would just exit. A task to do this would be a fairly simple piece of shell scripting.

        When I have killed just the process running the test manually, the build resumes.

        If we did this, I don't think we'd need a timeout on the whole build, because the reason builds take a long time is a hung test.

        > Note too that having Hudson timeout a patch build won't have the effect you desire.
        > It will simply hang the patch queue since the 'current' link on the filesystem to the
        > patch being tested won't get removed.

        I wasn't really suggesting killing the whole build. In my experience just doing a kill -9 on the stuck test kills the test, and the build just resumes.
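
        A rough sketch of the per-test watchdog idea, for illustration only. It assumes the build can launch each test as its own command; the name `run_with_timeout` and the timeout value are hypothetical, and here the watchdog is the launcher itself rather than a separate subprocess.

        ```python
        import subprocess


        def run_with_timeout(cmd, timeout_seconds):
            """Run cmd and kill it if it outlives timeout_seconds.

            Returns (exit_code, timed_out). Hypothetical helper, not a Hudson API.
            """
            proc = subprocess.Popen(cmd)
            try:
                # Normal case: the test exits on its own and nothing is killed.
                return proc.wait(timeout=timeout_seconds), False
            except subprocess.TimeoutExpired:
                # Hung test: force-kill it (SIGKILL on POSIX, like `kill -9`).
                proc.kill()
                # Reap the process so the caller can carry on with the build.
                return proc.wait(), True
        ```

        Because only the test process is killed, the caller gets control back and can move on to the next test instead of aborting the whole build.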

        Doug Cutting added a comment -

        > kill -9 on the stuck test kills the test

        I've never even had to use -9, but rather just 'kill', and build proceeds w/o issue, leaving no stray JVMs around.

        Doug Cutting made changes -
        Field Original Value New Value
        Fix Version/s 0.15.0 [ 12312565 ]
        Nigel Daley added a comment -

        Hmm, there seems to be a problem on Solaris with UNIXProcess.forkAndExec, which JUnit must use:
        http://hudson.gotdns.com/wiki/display/HUDSON/Solaris+Issue+6276483

        I applied the suggested workaround here:
        http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6276483
        in the jre/lib/security/java.security file and restarted Hudson last week. This has fixed the Hadoop-Patch-Admin build hanging when it was getting going. I wonder if it will fix this problem too. If no tests hang over the next week, I'll close this issue.

        Nigel Daley made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s hudson [ 12312940 ]
        Resolution Fixed [ 1 ]

          People

          • Assignee: Unassigned
          • Reporter: Jim Kellerman
          • Votes: 0
          • Watchers: 1
