Apache Submarine / SUBMARINE-857

[Umbrella] Support model management SDK in distributed scenarios

Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • None
    • 0.6.0
    • None

    Description

      Submarine is a platform designed for distributed training, so its model management SDK should be easier to use in distributed scenarios.

      In a typical distributed experiment, several workers train together.

      Our model management toolkit will support:
      1. Workers in the same experiment automatically direct their logs to the same group in MLflow, so users can monitor metrics from multiple workers in one graph.
      2. When saving models, users do not need to store every worker's copy, because some copies are replicated or redundant. When users call save_model in our toolkit, it applies the most efficient saving strategy under the hood, minimizing the storage space and time required.
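
The two behaviours above can be illustrated with a minimal sketch. It is not Submarine's implementation (the linked design doc is the authority on the real API). It assumes each worker can read its role from a TF_CONFIG-style JSON environment variable, as TensorFlow's distributed strategies do, and the helper names task_info, metric_key, and should_save are hypothetical.

```python
import json
import os


def _load_config(environ=None):
    """Parse a TF_CONFIG-style JSON document (assumed to be set per worker)."""
    environ = os.environ if environ is None else environ
    return json.loads(environ.get("TF_CONFIG", "{}"))


def task_info(environ=None):
    """Return (task_type, task_index); defaults to ("worker", 0) when unset."""
    task = _load_config(environ).get("task", {})
    return task.get("type", "worker"), int(task.get("index", 0))


def metric_key(name, environ=None):
    """Prefix a metric with the worker identity so that all workers can log
    into one shared group and still be told apart in a single graph."""
    task_type, index = task_info(environ)
    return f"{task_type}-{index}/{name}"


def should_save(environ=None):
    """Let exactly one task write the model: the chief if the cluster has
    one, otherwise worker 0. Other replicas skip the redundant write."""
    cluster = _load_config(environ).get("cluster", {})
    task_type, index = task_info(environ)
    if "chief" in cluster:
        return task_type == "chief"
    return task_type == "worker" and index == 0


if __name__ == "__main__":
    # Simulated environment for the second of two workers (index 1).
    env = {"TF_CONFIG": json.dumps({
        "cluster": {"worker": ["host0:2222", "host1:2222"]},
        "task": {"type": "worker", "index": 1},
    })}
    print(metric_key("loss", env))
    print(should_save(env))
```

A real save_model would wrap the actual serialization call behind a check like should_save, and the metric key would be passed to the tracking backend's logging call. Both are omitted here so the sketch runs with the standard library alone.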

      The API design doc can be viewed here: https://hackmd.io/I6frSeZIQDaKQYK4nGCR5w?both

            People

              Assignee: Byron Hsu (byronhsu)
              Reporter: Byron Hsu (byronhsu)
              Votes: 0
              Watchers: 1