Lucene - Core
LUCENE-4510

when a test's heart beats it should also throw up (dump stack of all threads)

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      We've had numerous cases where tests hung, but the "operator" of that particular Jenkins instance struggled to get a proper stack dump of all threads and, e.g., accidentally killed the process instead (it's rather awful that the same powerful tool, "kill", can be used both to get stack traces and to destroy the process...).

      Is there some way the test infra could do this for us, e.g. when it prints the HEARTBEAT message?

        Activity

        Robert Muir added a comment -

        +1, this would eliminate operator error!

        Dawid Weiss added a comment -

        I vaguely remember that people on the mailing list expressed concerns even about the heartbeat messages, so throwing stack traces at everyone would probably be a bit too much. I see a few options.

        1) Stack traces are dumped at test/suite timeout, so you could simply override the default and lower this timeout locally:

        For tests only:
        -Dtests.timeout=5000!

        For entire suites:
        -Dtests.timeoutSuite=10000!

        Note the '!' at the end, which means the property takes precedence over annotations. The values are in milliseconds.
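
        As a concrete illustration (a sketch only – the ant target name here is an assumption, and the property names are the ones mentioned above), the overrides could be passed on the command line, quoting the value so the shell doesn't treat the trailing '!' as history expansion:

```shell
# Hypothetical invocation; 'test' is a placeholder target name.
# Single quotes keep the shell from expanding the trailing '!'.
ant test -Dtests.timeout='5000!'

# Same idea for the whole-suite timeout:
ant test -Dtests.timeoutSuite='10000!'
```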

        2) The above workaround has an obvious problem: the stack trace is not dumped repeatedly, which can be extremely useful in telling which parts of the code a thread was stuck in. In fact, the randomized runner once HAD a snippet of code which did this automatically – on timeout it would probe the stack traces of all threads a few times at random intervals and then print the 'diverging' paths for each thread (along with the constant part). I removed it because it wasn't used by anybody.

        So... I could add yet another option that would dump stack traces on heartbeats, but it would be turned off by default, which I think is sensible. This option could also take a number, meaning roughly: "start dumping stack traces after the N-th heartbeat".

        Let me know if (2) sounds good to you.
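
        The core of option (2) could look roughly like the following sketch (the class and method names here are hypothetical, not the runner's actual API): snapshot every live thread via Thread.getAllStackTraces() and format it, which is what a heartbeat-triggered dump would print.

```java
import java.util.Map;

public class HeartbeatDumper {
    // Hypothetical helper: format a snapshot of all live threads,
    // similar to what a heartbeat-triggered dump would emit.
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" ")
              .append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

        Calling this repeatedly at random intervals and diffing the snapshots would recover the 'diverging paths' behavior described above.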

        Dawid Weiss added a comment -

        Oh, I forgot – both props mentioned above are from the randomized testing package, so you'd have to add them to common-build.xml and pass them to the <junit4:junit4> task. We could probably just replace all the props with a propertyset based on a regexp so that all tests.* properties are passed automatically... but then it wouldn't be so explicit, so maybe it's actually better the way it is.

        Robert Muir added a comment -

        I vaguely remember people on the mailing list expressed concerns even about the heartbeat messages so throwing stack traces at everyone would be probably a bit too much.

        But we have tests totally hanging and taking minutes of time. So I think their concerns can wait in line!

        On the other hand, if we like the test hangs and want to keep them...

        Dawid Weiss added a comment -

        Ok, I'll add it to the heartbeat as an option. Will do it over the weekend.

        Michael McCandless added a comment -

        Thanks Dawid!

        Dawid Weiss added a comment -

        This is trickier than I thought. The problem is that heartbeats are in fact generated on the master side (not on the forked JVM) based on the activity in the event stream for a particular forked JVM (that's why it was possible to generate a heartbeat from an "unknown" location in the test execution timeline). The naming is confusing here, blame me.

        To generate a stack trace of the forked JVM we'd need to do it remotely (and know its PID) or do it locally (and propagate it back to the master). This isn't a trivial refactoring; I'll need some time to do it. In the meantime I'll release the fix for the shutdown hooks issue – this should help in getting those stalled runs to complete (with a suite timeout on forked JVMs and corresponding stack traces).

        Robert Muir added a comment -

        Actually getting to the step of printing the PID in the heartbeat message would be really useful!

        Dawid Weiss added a comment -

        Funnily enough, that's the hardest part about it... there's just no API to get the PID of the forked process – I think I've seen a patch to JDK 1.8 somewhere that adds it, but for now there's just none. It'd take platform-specific trickery to get it.

        Robert Muir added a comment -

        http://jenkins.sd-datasolutions.de/job/Lucene-Solr-4.x-Linux/3089/console was another example of this.

        Alternatively, we could try something like hacking up the parent process (ant) to respond to certain signals in a way that helps prevent operator errors.

        Uwe Schindler added a comment -

        Actually getting to the step of printing the PID in the heartbeat message would be really useful!

        The PID is accessible! The parent process cannot get it (the PID of the forked process), but the child JVM can get its own PID using the javax.management API. I'll have to look up example code. So the child JVM could print it on startup.

        Uwe Schindler added a comment -

        Here it is:

        There exists no platform-independent way that is guaranteed to work in all JVM implementations. ManagementFactory.getRuntimeMXBean().getName() looks like the best (closest) solution. It's short, and probably works in every implementation in wide use.

        On Linux it returns a value like 12345@localhost (12345 being the process ID). Beware, though, that according to the docs there are no guarantees about this value:

        Returns the name representing the running Java virtual machine. The returned name string can be any arbitrary string and a Java virtual machine implementation can choose to embed platform-specific useful information in the returned name string. Each running virtual machine could have a different name.

        (from http://stackoverflow.com/questions/35842/how-can-a-java-program-get-its-own-process-id)
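
        That best-effort approach could be sketched as follows (the class and method names are hypothetical): parse the number before the '@' in the MXBean name and fall back when the format differs, since, as the quoted docs warn, the format is not guaranteed.

```java
import java.lang.management.ManagementFactory;

public class OwnPid {
    // Best-effort: on common JVMs the runtime MXBean name looks like
    // "12345@hostname", but the spec does not guarantee this format.
    static long pidOrMinusOne() {
        String name = ManagementFactory.getRuntimeMXBean().getName();
        int at = name.indexOf('@');
        if (at > 0) {
            try {
                return Long.parseLong(name.substring(0, at));
            } catch (NumberFormatException ignored) {
                // unrecognized format; fall through
            }
        }
        return -1; // unknown
    }

    public static void main(String[] args) {
        System.out.println("pid=" + pidOrMinusOne());
    }
}
```

        The child JVM could print this once on startup, which is enough for an operator to target jstack or kill -QUIT at the right process.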

        Dawid Weiss added a comment -

        Yeah, it's the one. I can definitely add it to the framework so that it's reported in the logs, even if it's not guaranteed to return the PID.

        Uwe Schindler added a comment - edited

        Maybe instead of printing "J0", "J1", ... just always print this identification string in all messages? It is platform-specific but defined to be unique, so much better than J0, J1, ...

        Dawid Weiss added a comment -

        I can probably print those identifiers once at start and when a VM doesn't respond (and is heartbeating)? We don't know how long these strings will be, and because they're variable length they can become pretty verbose.


          People

          • Assignee:
            Dawid Weiss
          • Reporter:
            Michael McCandless