
    Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      This sub-task introduces checkpointing so that a worker can recover from a previous failure. Specifically, once a worker is brought up, it pulls the current state of the model, which includes each worker's progress (i.e., which batch iteration and epoch is being executed). The checkpointing frequency can be configured: for example, EPOCH10 means that the state is persisted every 10 epochs to a centralized file on the server side.
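
      The following Python sketch illustrates the idea; it is not the project's implementation. The class name CheckpointingServer, its methods (pull, push, end_epoch), the JSON file format, and the use of a flat weight list are all assumptions made for illustration. The server keeps the model and each worker's last reported epoch and batch in memory, writes them to a single file every N completed epochs, and a restarted worker calls pull to find out where to resume.

```python
import json
import os


class CheckpointingServer:
    """Hypothetical server-side store for the model and per-worker progress."""

    def __init__(self, path, every_n_epochs=10):
        self.path = path                      # centralized checkpoint file
        self.every_n_epochs = every_n_epochs  # 10 corresponds to "EPOCH10"
        self.state = {"weights": [], "progress": {}}
        # On startup, restore the last persisted state if a checkpoint exists.
        if os.path.exists(path):
            with open(path) as f:
                self.state = json.load(f)

    def pull(self, worker_id):
        """Return the current weights and the progress recorded for this worker."""
        progress = self.state["progress"].get(worker_id, {"epoch": 0, "batch": 0})
        return self.state["weights"], dict(progress)

    def push(self, worker_id, weights, epoch, batch):
        """Record new weights and the worker's position (kept in memory only)."""
        self.state["weights"] = weights
        self.state["progress"][worker_id] = {"epoch": epoch, "batch": batch}

    def end_epoch(self, epoch):
        """Persist the state if `epoch` (count of completed epochs) hits the interval."""
        if epoch % self.every_n_epochs != 0:
            return False
        # Write to a temporary file first so a crash cannot leave a partial checkpoint.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.path)
        return True
```

      In this sketch, progress pushed after the most recent checkpoint is lost if the server itself restarts, so a longer interval trades less I/O for more repeated work after a failure.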

            People

            • Assignee: Guobao LI
            • Reporter: Guobao LI
