Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.0
    • Fix Version/s: 0.5.0
    • Component/s: bsp core
    • Labels:
    • Environment:

      GNU/ Debian, JDK 1.6.0_22-b04

      Description

      To enable a fault tolerance service, BSPMaster needs to be able to determine the status of GroomServers. This can generally be achieved with a failure detector. The attached file contains the source for such a patch.

      1. HAMA-370.patch
        26 kB
        ChiaHung Lin
      2. HAMA-370.patch
        30 kB
        ChiaHung Lin
      3. HAMA-370.patch
        117 kB
        ChiaHung Lin

        Activity

        ChiaHung Lin added a comment -

        The failure detector implementation is based on [1].

        [1]. The ϕ Accrual Failure Detector. http://ddsg.jaist.ac.jp/pub/HDY+04.pdf

        Hudson added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12474159/HAMA-370.patch
        against trunk revision 1081723.

        @author +1. The patch does not contain any @author tags.

        tests included -1. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        javadoc -1. The javadoc tool appears to have generated 1 warning messages.

        javac +1. The applied patch does not generate any new javac compiler warnings.

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs -1. The patch appears to cause Findbugs to fail.

        core tests -1. The patch failed core unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hama-Patch/315/testReport/
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hama-Patch/315/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hama-Patch/315/console

        This message is automatically generated.

        ChiaHung Lin added a comment -

        The previous attachment was missing a unit test case.

        Edward J. Yoon added a comment -

        Hi, I can't access http://ddsg.jaist.ac.jp/pub/HDY+04.pdf

        Could you upload that paper somewhere (or email it)?

        Edward J. Yoon added a comment -

        I just found it by googling.

        Edward J. Yoon added a comment - - edited

        What are the benefits of using phi accrual detection compared with heartbeat detection (for GroomServer failures)?

        Also, when a BSP task fails during processing, we can simply restart the task to provide fault tolerance.

        ChiaHung Lin added a comment -

        Indeed, the implementation in the patch also contains a heartbeat mechanism: the monitored process periodically sends heartbeats.

        The difference is that a conventional heartbeat failure detector has a fixed timeout. The phi accrual failure detector separates its functions into different components (monitoring, interpretation, etc.) and exposes a suspicion level (rather than a binary trust-or-suspect output), so that different applications, each equipped with its own interpreter, can use the output value for further decisions. For instance, a master may allocate urgent tasks to workers which have a lower suspicion level. Or the monitoring process may interpret the value according to its own business logic to determine whether the monitored process has crashed.

        Although a task failure can be solved with a restart, the difficulty lies in distinguishing between a crashed/failed process and a very slow one. In addition, in the future, if the project needs fault tolerance between BSPMasters, a failure detection service will be required.
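
To illustrate the suspicion-level idea described above, here is a minimal sketch. It is not the code from the attached patch: the class name `PhiAccrualSketch` and its methods are hypothetical, and it assumes heartbeat inter-arrival times follow an exponential distribution, which is a simplification of the normal-distribution estimate used in [1]. Under that assumption, phi = -log10(exp(-elapsed / mean)) = elapsed / (mean * ln 10).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of an accrual failure detector; not the HAMA-370 classes. */
class PhiAccrualSketch {
  // Sliding window of recent heartbeat inter-arrival times (milliseconds).
  private final Deque<Long> intervals = new ArrayDeque<Long>();
  private final int windowSize;
  private long lastHeartbeat = -1L;

  PhiAccrualSketch(int windowSize) {
    this.windowSize = windowSize;
  }

  /** Monitoring: record the arrival time of a heartbeat. */
  void heartbeat(long nowMillis) {
    if (lastHeartbeat >= 0) {
      if (intervals.size() == windowSize) {
        intervals.removeFirst(); // drop the oldest sample
      }
      intervals.addLast(nowMillis - lastHeartbeat);
    }
    lastHeartbeat = nowMillis;
  }

  /** Suspicion level: grows continuously the longer a heartbeat is overdue. */
  double phi(long nowMillis) {
    if (intervals.isEmpty()) {
      return 0.0; // not enough samples to suspect anything yet
    }
    double sum = 0.0;
    for (long interval : intervals) {
      sum += interval;
    }
    // Guard against a zero mean when heartbeats share the same timestamp.
    double mean = Math.max(sum / intervals.size(), 1.0);
    double elapsed = nowMillis - lastHeartbeat;
    // Assumed exponential model: P(later) = exp(-elapsed / mean).
    return elapsed / (mean * Math.log(10.0));
  }

  /** Interpretation: one possible binary reading of the suspicion level. */
  boolean isSuspected(long nowMillis, double threshold) {
    return phi(nowMillis) > threshold;
  }
}
```

With heartbeats arriving every 1000 ms, the suspicion level is 0 right after a heartbeat and rises in proportion to the silence that follows. Each application can pick its own threshold, or skip the binary reading altogether and rank workers by their phi value.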

        Edward J. Yoon added a comment -

        Hi

        >> For instance, a master may allocate urgent tasks to workers which have lower suspicion level.

        >> the difficulty lies in the distinguished between a crash/ failure process and a very slow one. In addition, in the future if the project needs

        +1.

        BTW, one more question: can the HAMA-363 issue be solved by using the phi accrual failure detector?

        Thanks!

        ChiaHung Lin added a comment -

        If I understand correctly, HAMA-363 seems to relate to network monitoring, whose responsibility is to monitor the network and alert administrators when events arise, whereas the purpose of a failure detector is to detect node crashes in a distributed system.

        Edward J. Yoon added a comment -

        Yes, you're right.

        I thought there was a similarity in purpose, because network status can be used to handle faults or stragglers.

        This could be a really important part of our system design. It would be nice if you could provide more detailed information and a plan. Please feel free to edit the wiki (http://wiki.apache.org/hama), and keep up the great work!

        Edward J. Yoon added a comment -

        P.S. Your own technical report would also be OK!

        ChiaHung Lin added a comment -

        This probably provides some information that can be applied to improve on Hadoop's failure detection, e.g. its static threshold value.
        http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final15.pdf

        Edward J. Yoon added a comment -

        Let's add this to trunk.

        ChiaHung Lin added a comment -

        Updated the patch to reflect recent changes in the repository.

        Hudson added a comment -

        Integrated in Hama-Nightly #511 (See https://builds.apache.org/job/Hama-Nightly/511/)
        HAMA-370 Failure detector for Hama (Revision 1310987)

        Result = SUCCESS
        chl501 :
        Files :

        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/metrics
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/monitor/fd
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/monitor/fd/Interpreter.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/monitor/fd/Node.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/monitor/fd/Sensor.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/monitor/fd/SimpleBinaryInterpreter.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/monitor/fd/Supervisor.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/monitor/fd/UDPSensor.java
        • /incubator/hama/trunk/core/src/main/java/org/apache/hama/monitor/fd/UDPSupervisor.java
        • /incubator/hama/trunk/core/src/test/java/org/apache/hama/metrics
        • /incubator/hama/trunk/core/src/test/java/org/apache/hama/monitor/fd
        • /incubator/hama/trunk/core/src/test/java/org/apache/hama/monitor/fd/TestFD.java

          People

          • Assignee:
            ChiaHung Lin
            Reporter:
            ChiaHung Lin
          • Votes:
            0
            Watchers:
            0