Details

    • Type: Umbrella
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      This umbrella collects all issues related to checkpointing and task restarting to achieve fault tolerance at the job level.

        Issue Links

          Activity

          Thomas Jungblut added a comment -

          Suraj and I had a short meeting on fault tolerance (FT) in 0.6.0; here are the results of our first iteration:

          First focus on 0.6.0

          1. Checkpointing on the receive side (HAMA-557)
            1. ZooKeeper (ZK) stores the files / paths of successfully checkpointed supersteps
          2. When a fault happens:
            1. Single task recovery (when the fault happens inside the computation)
              1. The groom detects the failure, flags the task as failed and passes a request to schedule a new task to the scheduler (HAMA-534). BSPTask#run takes care of correctly refilling the message queue in BSPPeerImpl and MessageManager.
            2. Global recovery (when the fault happens during sync or checkpointing)
              1. All tasks must fail and be rescheduled from the last successful superstep
          3. Restart the task(s) with the Superstep API (HAMA-533)
            1. Improve the Superstep API with HAMA-546
            2. Improve the Superstep API, or rather the BSP API, with the following features:
              1. deregister/close (empty the BSP slot)
              2. relieve from sync: the task keeps running but no longer syncs
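
The recovery decision in item 2 can be sketched roughly as follows. This is only an illustration under stated assumptions, not Hama's actual API: the class RecoverySketch, the Phase enum, the in-memory CheckpointRegistry (standing in for the paths kept in ZK) and the method tasksToRestart are all hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch of the recovery decision in the plan; not Hama's real classes. */
public class RecoverySketch {

    /** Phase a task was in when the fault was detected (assumed classification). */
    enum Phase { COMPUTATION, SYNC, CHECKPOINTING }

    /** In-memory stand-in for the ZK record of successful superstep checkpoints. */
    static final class CheckpointRegistry {
        private final TreeMap<Long, String> paths = new TreeMap<>();

        /** Records that the checkpoint of the given superstep was written to path. */
        void recordSuccess(long superstep, String path) {
            paths.put(superstep, path);
        }

        /** Last superstep with a completed checkpoint, or -1 if there is none. */
        long lastSuccessfulSuperstep() {
            return paths.isEmpty() ? -1L : paths.lastKey();
        }
    }

    /**
     * Returns the ids of the tasks to reschedule: only the failed task for a
     * fault inside the computation, every task for a fault during sync or
     * checkpointing (global recovery).
     */
    static List<Integer> tasksToRestart(int failedTask, Phase phase, int numTasks) {
        List<Integer> restart = new ArrayList<>();
        if (phase == Phase.COMPUTATION) {
            restart.add(failedTask);
        } else {
            for (int id = 0; id < numTasks; id++) {
                restart.add(id);
            }
        }
        return restart;
    }

    public static void main(String[] args) {
        CheckpointRegistry registry = new CheckpointRegistry();
        registry.recordSuccess(3L, "/checkpoints/3"); // made-up path for illustration
        registry.recordSuccess(5L, "/checkpoints/5"); // made-up path for illustration

        System.out.println("restart from superstep " + registry.lastSuccessfulSuperstep());
        System.out.println("single task recovery: " + tasksToRestart(2, Phase.COMPUTATION, 4));
        System.out.println("global recovery: " + tasksToRestart(2, Phase.SYNC, 4));
    }
}
```

In both cases the restarted tasks would resume from the superstep returned by lastSuccessfulSuperstep(); the sketch leaves out how the message queue is refilled, which the plan assigns to BSPTask#run.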
          Hudson added a comment -

          Integrated in Hama-Nightly #633 (See https://builds.apache.org/job/Hama-Nightly/633/)
          Committing the merge from HAMA-505-branch. Contains changes for HAMA-557 HAMA-587 HAMA-610 HAMA-611 (Revision 1369575)

          Result = FAILURE
          surajsmenon :
          Files :

          • /hama/trunk
          • /hama/trunk/conf/hama-default.xml
          • /hama/trunk/core/src/main/java/org/apache/hama/Constants.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/BSPJobClient.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/BSPMaster.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/BSPPeerImpl.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/BSPTask.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/GroomServer.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/GroomServerAction.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/JobInProgress.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/JobInProgressListener.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/JobStatus.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/LaunchTaskAction.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/LocalBSPRunner.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/RecoverTaskAction.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/SimpleTaskScheduler.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/TaskInProgress.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/TaskRunner.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/TaskStatus.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/UpdatePeerAction.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/ft
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/ft/AsyncRcvdMsgCheckpointImpl.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/ft/BSPFaultTolerantService.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/ft/FaultTolerantMasterService.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/ft/FaultTolerantPeerService.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/message/AbstractMessageManager.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/message/AvroMessageManagerImpl.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/message/HadoopMessageManager.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/message/HadoopMessageManagerImpl.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/message/MessageEventListener.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/message/MessageManager.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/sync/BSPMasterSyncClient.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/sync/BSPPeerSyncClient.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/sync/MasterSyncClient.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/sync/PeerSyncClient.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/sync/SyncClient.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/sync/SyncEvent.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/sync/SyncEventListener.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/sync/SyncServiceFactory.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/sync/ZKSyncBSPMasterClient.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/sync/ZKSyncClient.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/sync/ZKSyncEventFactory.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/sync/ZKSyncEventListener.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/sync/ZooKeeperSyncClientImpl.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/taskallocation
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/taskallocation/BSPResource.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/taskallocation/BestEffortDataLocalTaskAllocator.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/taskallocation/RawSplitResource.java
          • /hama/trunk/core/src/main/java/org/apache/hama/bsp/taskallocation/TaskAllocationStrategy.java
          • /hama/trunk/core/src/test/java/org/apache/hama/bsp/TestBSPTaskFaults.java
          • /hama/trunk/core/src/test/java/org/apache/hama/bsp/TestCheckpoint.java
          • /hama/trunk/core/src/test/java/org/apache/hama/bsp/TestTaskAllocation.java
          • /hama/trunk/core/src/test/java/org/apache/hama/bsp/TestZooKeeper.java
          • /hama/trunk/core/src/test/java/org/apache/hama/bsp/sync/TestSyncServiceFactory.java

            People

            • Assignee:
              Unassigned
              Reporter:
              Thomas Jungblut
            • Votes:
              0
              Watchers:
              4

              Dates

              • Created:
                Updated:
