Apache Submarine / SUBMARINE-23

[Submarine] Job monitor long-running service of submarine

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix

    Description

      Job monitor long-running service of submarine

      After training completes, the monitoring program needs to shut down the PS (parameter server) service automatically. Other deep learning frameworks may also need custom handling as their tasks move through different states.

      Submarine needs to provide a long-running resident service that monitors each job.

      This monitoring service can handle training tasks differently depending on the deep learning framework they use.

      For example, when TensorFlow performs distributed training, the PS service does not stop automatically once training completes. The monitoring service then has to stop the PS actively.
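
The behaviour requested above could be sketched roughly as follows. This is only an illustration of the idea, not Submarine code: `Job`, `Task`, `TaskState`, `HANDLERS`, `stop_idle_tensorflow_ps` and `monitor_once` are all hypothetical names, and setting a task's state to `STOPPED` stands in for whatever real stop request the cluster scheduler would need.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class TaskState(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    STOPPED = "stopped"


@dataclass
class Task:
    role: str  # hypothetical role label, e.g. "worker" or "ps"
    state: TaskState = TaskState.RUNNING


@dataclass
class Job:
    name: str
    framework: str  # hypothetical framework label, e.g. "tensorflow"
    tasks: List[Task] = field(default_factory=list)


def stop_idle_tensorflow_ps(job: Job) -> None:
    """Stop PS tasks once every worker task has succeeded."""
    workers = [t for t in job.tasks if t.role == "worker"]
    if workers and all(t.state is TaskState.SUCCEEDED for t in workers):
        for task in job.tasks:
            if task.role == "ps" and task.state is TaskState.RUNNING:
                # Assumption: a real service would send a stop request here.
                task.state = TaskState.STOPPED


# Framework name -> handler, so each framework can get its own processing.
HANDLERS: Dict[str, Callable[[Job], None]] = {
    "tensorflow": stop_idle_tensorflow_ps,
}


def monitor_once(jobs: List[Job]) -> None:
    """One polling pass; a resident service would repeat this on a timer."""
    for job in jobs:
        handler = HANDLERS.get(job.framework)
        if handler is not None:
            handler(job)
```

Keeping the per-framework logic in a lookup table means support for another framework would only need a new handler entry, and the polling loop itself stays unchanged.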

          People

            liuxun323 Xun Liu
            Votes: 0
            Watchers: 2
