  1. Singa
  2. SINGA-12

Support Checkpoint and Restore

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • None
    • None

    Description

      With checkpoint support, we can provide the following features:

      1. Failure Recovery: when a task fails during training, we can recover it from the latest checkpoint;
      2. Continuous Training: when the user inspects the trained model and finds that more steps are needed, they can continue the training;
      3. Parameter Reuse: starting from a previously trained model, we can create a new model by adding new layers on top of it and reuse the existing parameters during training.
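
As an illustration of the Parameter Reuse case, here is a minimal Python sketch. It assumes parameters are stored as a dictionary keyed by layer name; the functions `init_params` and `reuse_params` and that dictionary layout are hypothetical and are not taken from the Singa code base. Layers whose name and size match the checkpoint are copied, and newly added layers keep their fresh initial values.

```python
import random


def init_params(layer_shapes, seed=0):
    """Randomly initialise one flat weight list per layer (hypothetical format)."""
    rng = random.Random(seed)
    return {
        name: [rng.uniform(-0.1, 0.1) for _ in range(size)]
        for name, size in layer_shapes.items()
    }


def reuse_params(new_layer_shapes, checkpoint_params, seed=0):
    """Build parameters for a new model from an older model's checkpoint.

    A layer is reused only if its name and size both match an entry in
    the checkpoint; every other layer keeps its fresh initial values.
    Returns the parameters and the list of reused layer names.
    """
    params = init_params(new_layer_shapes, seed)
    reused = []
    for name, size in new_layer_shapes.items():
        old = checkpoint_params.get(name)
        if old is not None and len(old) == size:
            params[name] = list(old)  # copy so the two models stay independent
            reused.append(name)
    return params, reused


# Usage: the old model has two layers; the new model adds "fc2" on top.
old_params = init_params({"conv1": 4, "fc1": 3}, seed=1)
new_params, reused = reuse_params({"conv1": 4, "fc1": 3, "fc2": 2}, old_params)
```

Matching by name and size is only one possible policy; a real implementation might also let the user map old layer names to new ones.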

      Checkpoints should be taken on the server side every few steps. In addition, a final checkpoint is made when the task finishes.
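
The checkpoint schedule described here could look roughly like the following Python sketch. The file naming (`stepNNNNNNNN.json`), the JSON layout, and the names `save_checkpoint`, `train`, and `checkpoint_every` are all assumptions made for illustration, not the format this issue actually introduced. The sketch writes each checkpoint to a temporary file and renames it, so a crash mid-write does not leave a truncated checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(directory, step, params):
    """Write the parameters for `step` to a JSON file (hypothetical format)."""
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, f"step{step:08d}.json")
    # Write to a temporary file first, then rename it into place.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "params": params}, f)
    os.replace(tmp_path, final_path)
    return final_path


def train(params, total_steps, checkpoint_every, directory, update):
    """Run `total_steps` updates, checkpointing every few steps and at the end."""
    saved = []
    for step in range(1, total_steps + 1):
        update(params, step)  # one training step, supplied by the caller
        if step % checkpoint_every == 0:
            saved.append(save_checkpoint(directory, step, params))
    # Final checkpoint, unless the last periodic one already covers it.
    if total_steps % checkpoint_every != 0:
        saved.append(save_checkpoint(directory, total_steps, params))
    return saved
```

With `total_steps=10` and `checkpoint_every=4`, this sketch saves checkpoints at steps 4, 8, and 10.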

      During restore, the servers and workers are first set up as normal; after that, the parameters are loaded from the checkpoint file.
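
The two-phase restore can be sketched in Python as follows. This assumes checkpoints are JSON files named `stepNNNNNNNN.json` that hold a step number and a dictionary of parameters; that layout and the names `latest_checkpoint`, `setup_params`, and `restore` are hypothetical. Parameters are first allocated exactly as in a fresh run and only then overwritten from the newest checkpoint, which also serves the Failure Recovery and Continuous Training cases.

```python
import glob
import json
import os


def latest_checkpoint(directory):
    """Return the path of the newest checkpoint, or None if there is none.

    Zero-padded step numbers make lexicographic order match step order.
    """
    paths = sorted(glob.glob(os.path.join(directory, "step*.json")))
    return paths[-1] if paths else None


def setup_params(layer_shapes):
    """Normal start-up: allocate every parameter with a default value."""
    return {name: [0.0] * size for name, size in layer_shapes.items()}


def restore(directory, layer_shapes):
    """Set up as normal, then load parameters from the latest checkpoint.

    Returns (params, step); step is 0 when no checkpoint exists, so the
    caller can resume training from step + 1 in either case.
    """
    params = setup_params(layer_shapes)
    path = latest_checkpoint(directory)
    if path is None:
        return params, 0
    with open(path) as f:
        state = json.load(f)
    for name, values in state["params"].items():
        if name in params:  # ignore parameters the current model lacks
            params[name] = values
    return params, state["step"]
```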

          People

            Assignee: wangsh Wang Sheng
            Reporter: wangsh Wang Sheng
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated: 504h
                Remaining: 504h
                Logged: Not Specified