HBase / HBASE-4128

Detect whether a ZooKeeper ensemble, master, or region server is left hanging from a previous build

    Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Quite often, we see unit test(s) time out after 15 minutes.
      One example was TestShell: https://builds.apache.org/view/G-L/view/HBase/job/hbase-0.90/239/console

      This may be caused by a ZooKeeper ensemble, master, or region server left hanging from a previous build.
      As the first step of each build, we should detect (and, if possible, terminate) any such leftover zk ensemble, master, or region server.
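
A minimal sketch of such a detection step, assuming the build host has `jps` on its PATH and that leftover processes show up under names like the ones listed below. The helper names (`SUSPECT_NAMES`, `find_leftovers`, `run_jps`) are hypothetical, and the name fragments are assumptions that should be checked against what `jps` actually prints on the build machine.

```python
import subprocess

# Assumed name fragments for leftover JVMs; adjust to the build host's jps output.
SUSPECT_NAMES = ("HMaster", "HRegionServer", "QuorumPeerMain", "surefirebooter")


def find_leftovers(jps_output, suspects=SUSPECT_NAMES):
    """Return (pid, name) pairs from jps output whose name contains a suspect fragment."""
    leftovers = []
    for line in jps_output.splitlines():
        parts = line.split(None, 1)
        # jps prints "<pid> <name>"; skip blank lines and entries without a name.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, name = int(parts[0]), parts[1].strip()
        if any(fragment in name for fragment in suspects):
            leftovers.append((pid, name))
    return leftovers


def run_jps():
    """Run jps and return its standard output as text (requires a JDK on the PATH)."""
    return subprocess.run(
        ["jps"], capture_output=True, text=True, check=True
    ).stdout
```

A pre-build step could call `find_leftovers(run_jps())`, log the result, and fail fast or terminate the listed PIDs, depending on how aggressive the build should be.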

        Activity

        stack added a comment -

        Our 0.90 build was using 'ubuntu' as the host to build on. This could be more than one machine, so I changed it to 'ubuntu2'. I also have the 0.90 build first run some shell commands: hostname, ulimit -a, and jps. That way we can see whether any zk is running.

        Eric Charles added a comment -

        Should we also detect hung master and region servers?

        stack added a comment -

        @Eric Yes. We'll now run jps as the first thing we do before the build. Let's see what that turns up next time we have a TestShell hang. If it's a hung master/regionserver, it should show... or I suppose it won't really. We'll see the maven surefire process running, but that should be clue enough.

        Nicolas Liochon added a comment -

        Note as well that there is a bug in surefire (http://jira.codehaus.org/browse/SUREFIRE-773): the Java processes are not always killed when there is a timeout. So there could be more processes to kill than just the zk.

        Lars Hofhansl added a comment -

        N: Is this still an issue?

        Nicolas Liochon added a comment -

        I would say yes, because the surefire issue is still there, and we still have dangling processes sometimes. As the processes are now (supposed to be) configured to run on a free port, it's less of an issue than it used to be, but still... The best/simplest way to fix this is to fix the surefire issue, so it's likely to stay open for a few more months...
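
For illustration, the "run on a free port" idea can be sketched as follows. This is a generic technique, not necessarily how the HBase tests pick their ports: the helper name `pick_free_port` is hypothetical, and the sketch simply asks the OS for an unused port by binding to port 0.

```python
import socket


def pick_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        # getsockname() returns (host, port); the port is the one the OS assigned.
        return sock.getsockname()[1]
```

The port is released when the socket closes, so another process could grab it before the test server binds to it. That race is one reason a dangling process from an earlier build can still get in the way.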


          People

          • Assignee: Unassigned
          • Reporter: Ted Yu
          • Votes: 0
          • Watchers: 2
