Uploaded image for project: 'Tajo'
  1. Tajo
  2. TAJO-294

Removing dead workers from the live worker list

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None

      Description

      Currently, even if a worker does not work properly,
      it remains as live and and participates in a plan.
      Web management console also displays the worker as live.
      It should be set as dead, and displaying last heartbeat time in web management console is probably helpful.

      1. TAJO-294.patch
        15 kB
        Keuntae Park
      2. TAJO-294.patch_2
        15 kB
        Keuntae Park
      3. TAJO-294-3.patch
        16 kB
        Jihoon Son

        Activity

        Hide
        sirpkt Keuntae Park added a comment -

        I've uploaded patch for the issue.

        • When TajoMaster(ResourceManager) does not receive the heartbeat from a worker within time specified in tajo.worker.heartbeat.timeout, it changes the status of the worker as dead.
        • If the dead worker resumes to send heartbeat again, TajoMaster restores the status as live.
        • Add heartbeat time in web management console.
        • Fix compile error of catalogview.jsp
          tableDesc.getMeta().getStat() -> tableDesc.getStats()
        Show
        sirpkt Keuntae Park added a comment - I've uploaded patch for the issue. When TajoMaster(ResourceManager) does not receive the heartbeat from a worker within time specified in tajo.worker.heartbeat.timeout, it changes the status of the worker as dead. If the dead worker resumes to send heartbeat again, TajoMaster restores the status as live. Add heartbeat time in web management console. Fix compile error of catalogview.jsp tableDesc.getMeta().getStat() -> tableDesc.getStats()
        Hide
        jihoonson Jihoon Son added a comment -

        I'll look at the patch.

        Show
        jihoonson Jihoon Son added a comment - I'll look at the patch.
        Hide
        jihoonson Jihoon Son added a comment -

        It is a great work, but there are some minor issues.

        • First of all, when TajoWorkerResourceManager.close() is called, WorkerMonitor.interrupt() should be called instead of WorkerMonitor.interrupted().
        • Also, it will be better that class names represent their properties. So, how about change the class name of WorkerMonitor to WorkerMonitorThread like WorkerResourceAllocationThread?
        • I got the following message while applying the patch. Submit clean patch files, please.
          $ patch -p0 < TAJO-294.patch
          (Stripping trailing CRs from patch.)
          patching file tajo-common/src/main/java/org/apache/tajo/conf/TajoConf.java
          (Stripping trailing CRs from patch.)
          patching file tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/TajoWorkerResourceManager.java
          (Stripping trailing CRs from patch.)
          patching file tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerResource.java
          (Stripping trailing CRs from patch.)
          patching file tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/util/JSPUtil.java
          (Stripping trailing CRs from patch.)
          patching file tajo-core/tajo-core-backend/src/main/resources/webapps/admin/catalogview.jsp
          (Stripping trailing CRs from patch.)
          patching file tajo-core/tajo-core-backend/src/main/resources/webapps/admin/cluster.jsp
          (Stripping trailing CRs from patch.)
          patching file tajo-core/tajo-core-backend/src/main/resources/webapps/worker/index.jsp
          
        Show
        jihoonson Jihoon Son added a comment - It is a great work, but there are some minor issues. First of all, when TajoWorkerResourceManager.close() is called, WorkerMonitor.interrupt() should be called instead of WorkerMonitor.interrupted(). Also, it will be better that class names represent their properties. So, how about change the class name of WorkerMonitor to WorkerMonitorThread like WorkerResourceAllocationThread? I got the following message while applying the patch. Submit clean patch files, please. $ patch -p0 < TAJO-294.patch (Stripping trailing CRs from patch.) patching file tajo-common/src/main/java/org/apache/tajo/conf/TajoConf.java (Stripping trailing CRs from patch.) patching file tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/TajoWorkerResourceManager.java (Stripping trailing CRs from patch.) patching file tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerResource.java (Stripping trailing CRs from patch.) patching file tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/util/JSPUtil.java (Stripping trailing CRs from patch.) patching file tajo-core/tajo-core-backend/src/main/resources/webapps/admin/catalogview.jsp (Stripping trailing CRs from patch.) patching file tajo-core/tajo-core-backend/src/main/resources/webapps/admin/cluster.jsp (Stripping trailing CRs from patch.) patching file tajo-core/tajo-core-backend/src/main/resources/webapps/worker/index.jsp
        Hide
        sirpkt Keuntae Park added a comment -

        Thank you for the kind review, Jihoon.

        I've uploaded the new patch, which

        • changes class name: WorkerMonitior -> WorkerMonitorThread
        • fixes the bug: interrupted() -> interrupt()
        • cleans the patch file
        Show
        sirpkt Keuntae Park added a comment - Thank you for the kind review, Jihoon. I've uploaded the new patch, which changes class name: WorkerMonitior -> WorkerMonitorThread fixes the bug: interrupted() -> interrupt() cleans the patch file
        Hide
        jihoonson Jihoon Son added a comment -

        +1
        Thanks for your patch!
        It passes 'mvn verify' and I tested the dead worker management process on my local tajo.
        I'm going to commit the patch now.

        Show
        jihoonson Jihoon Son added a comment - +1 Thanks for your patch! It passes 'mvn verify' and I tested the dead worker management process on my local tajo. I'm going to commit the patch now.
        Hide
        jihoonson Jihoon Son added a comment -

        I modified the second patch to print a log message when a master receives heartbeats from dead workers again.

        Show
        jihoonson Jihoon Son added a comment - I modified the second patch to print a log message when a master receives heartbeats from dead workers again.
        Hide
        hyunsik Hyunsik Choi added a comment -

        +1
        The looks good for me.

        Show
        hyunsik Hyunsik Choi added a comment - +1 The looks good for me.
        Hide
        jihoonson Jihoon Son added a comment -

        I've just committed it.

        Show
        jihoonson Jihoon Son added a comment - I've just committed it.
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Tajo-trunk-postcommit #543 (See https://builds.apache.org/job/Tajo-trunk-postcommit/543/)
        TAJO-294: Removing dead workers from the live worker list. (Keuntae Park via jihoon) (jihoonson: https://git-wip-us.apache.org/repos/asf?p=incubator-tajo.git&a=commit&h=8e6fec0fcb95c24017ee082df44853a222eb9db5)

        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/util/JSPUtil.java
        • tajo-common/src/main/java/org/apache/tajo/conf/TajoConf.java
        • tajo-core/tajo-core-backend/src/main/resources/webapps/admin/catalogview.jsp
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerResource.java
        • tajo-core/tajo-core-backend/src/main/resources/webapps/worker/index.jsp
        • tajo-core/tajo-core-backend/src/main/resources/webapps/admin/cluster.jsp
        • tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/TajoWorkerResourceManager.java
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Tajo-trunk-postcommit #543 (See https://builds.apache.org/job/Tajo-trunk-postcommit/543/ ) TAJO-294 : Removing dead workers from the live worker list. (Keuntae Park via jihoon) (jihoonson: https://git-wip-us.apache.org/repos/asf?p=incubator-tajo.git&a=commit&h=8e6fec0fcb95c24017ee082df44853a222eb9db5 ) tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/util/JSPUtil.java tajo-common/src/main/java/org/apache/tajo/conf/TajoConf.java tajo-core/tajo-core-backend/src/main/resources/webapps/admin/catalogview.jsp tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/WorkerResource.java tajo-core/tajo-core-backend/src/main/resources/webapps/worker/index.jsp tajo-core/tajo-core-backend/src/main/resources/webapps/admin/cluster.jsp tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/rm/TajoWorkerResourceManager.java
        Hide
        hyunsik Hyunsik Choi added a comment -

        fixing it as resolved.

        Show
        hyunsik Hyunsik Choi added a comment - fixing it as resolved.

          People

          • Assignee:
            sirpkt Keuntae Park
            Reporter:
            sirpkt Keuntae Park
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development