Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: 0.5.0
    • Component/s: bsp core

      Description

      We should extend BSPJob to let the user set the checkpoint interval.

      job.setCheckpointInterval(5);

      This method should put the parameter into the job's configuration under a meaningful key, e.g. "bsp.checkpoint.interval".

      In BSPPeerImpl we should check whether this interval has been reached and checkpoint accordingly.
      Checkpointing is invoked from BSPPeerImpl#sync(), which already has a condition that checks whether checkpointing is enabled.

      Plus points:
      Provide an additional method in BSPJob that lets the user enable or disable checkpointing. Hint: the configuration key is "bsp.checkpoint.enabled".
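
      The two setters described above could look roughly like the sketch below. It is only an illustration: `SketchConf` and `SketchBSPJob` are hypothetical stand-ins (so the example runs without Hama or Hadoop on the classpath), and the name `setCheckpointingEnabled` is an assumption. Only the two configuration keys and `setCheckpointInterval` come from this issue.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the job configuration (not the real Hadoop class).
class SketchConf {
  private final Map<String, String> values = new HashMap<>();

  void setInt(String key, int value) {
    values.put(key, Integer.toString(value));
  }

  int getInt(String key, int defaultValue) {
    String v = values.get(key);
    return v == null ? defaultValue : Integer.parseInt(v);
  }

  void setBoolean(String key, boolean value) {
    values.put(key, Boolean.toString(value));
  }

  boolean getBoolean(String key, boolean defaultValue) {
    String v = values.get(key);
    return v == null ? defaultValue : Boolean.parseBoolean(v);
  }
}

// Hypothetical stand-in for BSPJob, showing only the checkpoint settings.
class SketchBSPJob {
  static final String CHECKPOINT_ENABLED = "bsp.checkpoint.enabled";
  static final String CHECKPOINT_INTERVAL = "bsp.checkpoint.interval";

  private final SketchConf conf = new SketchConf();

  // Number of supersteps between two checkpoints.
  void setCheckpointInterval(int supersteps) {
    if (supersteps < 1) {
      throw new IllegalArgumentException("interval must be >= 1");
    }
    conf.setInt(CHECKPOINT_INTERVAL, supersteps);
  }

  // Assumed method name; the issue only names the configuration key.
  void setCheckpointingEnabled(boolean enabled) {
    conf.setBoolean(CHECKPOINT_ENABLED, enabled);
  }

  SketchConf getConf() {
    return conf;
  }
}

class CheckpointConfigDemo {
  public static void main(String[] args) {
    SketchBSPJob job = new SketchBSPJob();
    job.setCheckpointingEnabled(true);
    job.setCheckpointInterval(5);
    System.out.println(job.getConf().getInt(SketchBSPJob.CHECKPOINT_INTERVAL, 1));
  }
}
```

      Rejecting intervals below 1 is a design choice of this sketch; it keeps a later modulo check from dividing by zero.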

        Activity

        ChiaHung Lin added a comment -

        Can we refactor the checkpoint() function so that it runs in another thread?

        If I remember correctly, the original checkpoint() uses the main thread to save the message bundle to HDFS. So if the message bundle is too large, this might delay the whole process. Even if the message bundle is not huge, during sync() the process still needs to wait until the message bundle has been saved to HDFS. Only then

        it.remove();
        messenger.transfer(addr, bundle);

        can happen.
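
        One way to picture the suggestion is the sketch below. It is an assumption-laden illustration, not Hama code: `AsyncCheckpointSketch`, its `saved` and `transferred` lists (stand-ins for HDFS and the messenger), and the choice to wait for all saves before leaving the barrier are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class AsyncCheckpointSketch {
  // Stand-ins for HDFS and the messenger: they only record what was saved or sent.
  final List<String> saved = Collections.synchronizedList(new ArrayList<String>());
  final List<String> transferred = new ArrayList<>();

  // A single background thread performs the checkpoint writes in submission order.
  private final ExecutorService checkpointer = Executors.newSingleThreadExecutor();

  void sync(List<String> bundles) {
    List<Future<?>> pending = new ArrayList<>();
    for (String bundle : bundles) {
      // Hand the potentially slow save to the background thread ...
      pending.add(checkpointer.submit(() -> { saved.add(bundle); }));
      // ... and transfer the bundle without waiting for the save to finish.
      transferred.add(bundle);
    }
    // Wait for every save before leaving the barrier, so the checkpoint
    // for this superstep is complete when sync() returns.
    for (Future<?> f : pending) {
      try {
        f.get();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while checkpointing", e);
      } catch (ExecutionException e) {
        throw new IllegalStateException("checkpoint failed", e);
      }
    }
  }

  void close() {
    checkpointer.shutdown();
  }
}

class AsyncCheckpointDemo {
  public static void main(String[] args) {
    AsyncCheckpointSketch peer = new AsyncCheckpointSketch();
    List<String> bundles = new ArrayList<>();
    bundles.add("bundle-1");
    bundles.add("bundle-2");
    peer.sync(bundles);
    System.out.println("saved=" + peer.saved + " transferred=" + peer.transferred);
    peer.close();
  }
}
```

        Waiting at the end of sync() keeps the save off the per-bundle critical path while still making it clear which superstep has been fully checkpointed.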

        Edward J. Yoon added a comment -

        Thanks Suraj!

        Suraj Menon added a comment -

        Hello. Please note that this patch contains the fix for issue HAMA-498.

        Thomas Jungblut added a comment -

        Does checkpoint interval here imply the number of supersteps before we initiate a checkpoint process?

        Yes.

        Should this be done within barrier synchronization period?

        There is already a part of the sync barrier that does the checkpointing (around line 250):

              if (conf.getBoolean("bsp.checkpoint.enabled", false)) {
                checkpoint(checkpointedPath(), bundle);
              }
        

        I guess it is enough to do some kind of modulo check:

        if(!disabled && getSuperStep() % interval == 0)
           doCheckpoint
        

        Please let me know if I have the correct understanding.

        Yes you have
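
        The modulo check can be isolated into a small, testable helper, sketched below. `CheckpointPolicySketch` and `shouldCheckpoint` are hypothetical names; whether Hama counts supersteps from 0 or 1 is not stated in this issue, so note that with this condition superstep 0 also satisfies the check.

```java
// Hypothetical helper that decides whether a superstep should be checkpointed.
class CheckpointPolicySketch {
  private final boolean enabled;
  private final int interval;

  CheckpointPolicySketch(boolean enabled, int interval) {
    if (interval < 1) {
      throw new IllegalArgumentException("interval must be >= 1");
    }
    this.enabled = enabled;
    this.interval = interval;
  }

  // True when checkpointing is enabled and the superstep is a multiple of the interval.
  boolean shouldCheckpoint(long superstep) {
    return enabled && superstep % interval == 0;
  }
}

class CheckpointPolicyDemo {
  public static void main(String[] args) {
    CheckpointPolicySketch policy = new CheckpointPolicySketch(true, 5);
    for (long superstep = 1; superstep <= 10; superstep++) {
      if (policy.shouldCheckpoint(superstep)) {
        System.out.println("checkpoint at superstep " + superstep);
      }
    }
  }
}
```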

        Suraj Menon added a comment -

        Does the checkpoint interval here mean the number of supersteps before we initiate a checkpoint? Should this be done within the barrier synchronization period, or should we have a Checkpointer daemon like the one Hadoop uses for backing up the NameNode? With the second option, we might lose determinism in knowing, at a given instant, how many supersteps (or which last superstep) have been completely checkpointed. The first approach might be slower but would give better determinism in checkpoint recovery. Please let me know if I have the correct understanding.

        Edward J. Yoon added a comment -

        Moving to 0.5.

        ChiaHung Lin added a comment -

        We can put messages into a queue for a checkpointer to pick up periodically and save to HDFS.
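
        A minimal sketch of that idea follows, under stated assumptions: `QueueCheckpointerSketch`, `enqueue`, `flush`, and `savedBatches` are hypothetical names, messages are plain strings, and "saving to HDFS" is replaced by appending a batch to an in-memory list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class QueueCheckpointerSketch {
  private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

  // Stand-in for HDFS: each flush that finds messages appends one batch.
  final List<List<String>> savedBatches =
      Collections.synchronizedList(new ArrayList<List<String>>());

  // Called by the peer; returns immediately without touching storage.
  void enqueue(String message) {
    queue.add(message);
  }

  // One periodic pass: drain whatever is queued and save it as a single batch.
  int flush() {
    List<String> batch = new ArrayList<>();
    queue.drainTo(batch);
    if (!batch.isEmpty()) {
      savedBatches.add(batch);
    }
    return batch.size();
  }
}

class QueueCheckpointerDemo {
  public static void main(String[] args) throws InterruptedException {
    QueueCheckpointerSketch checkpointer = new QueueCheckpointerSketch();
    ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    // Pick up queued messages every 50 ms on a background thread.
    timer.scheduleAtFixedRate(checkpointer::flush, 50, 50, TimeUnit.MILLISECONDS);

    for (int i = 0; i < 5; i++) {
      checkpointer.enqueue("msg-" + i);
    }

    Thread.sleep(200);
    timer.shutdown();
    timer.awaitTermination(1, TimeUnit.SECONDS);
    checkpointer.flush(); // save anything that arrived after the last pass
    System.out.println("saved batches: " + checkpointer.savedBatches);
  }
}
```

        With a periodic checkpointer like this, the set of messages saved at any instant depends on timing, so recovery has to determine which superstep was last checkpointed completely.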

        Edward J. Yoon added a comment -

        ChiaHung,

        I don't think that basic checkpoint/recovery is heavily related to HAMA-440 and HAMA-363.

        Edward J. Yoon added a comment -

        I'm scheduling this for 0.4 and taking this task.

        The checkpoint interval should also be configurable so that users can set the optimal interval for each job, as below.

          BSPJob job = ...
          job.setCheckpointInterval(5);
        
        Edward J. Yoon added a comment -

        +1

        ChiaHung Lin added a comment -

        Although this task is simple, it would be more meaningful if, before checkpointing is made configurable, the master could 1.) verify that checkpointed data is complete and 2.) obtain information from the metrics/resource system when deciding which groom new tasks will be scheduled to. So in my personal view, HAMA-440 #2 and HAMA-363 have higher priority than making checkpointing configurable.


          People

          • Assignee: Suraj Menon
          • Reporter: Edward J. Yoon