Hadoop YARN / YARN-8135

Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Goals:

      • Allow infrastructure engineers and data scientists to run unmodified TensorFlow jobs on YARN.
      • Allow jobs to easily access data and models in HDFS and other storage systems.
      • Launch services to serve TensorFlow/MXNet models.
      • Support running distributed TensorFlow jobs with simple configuration.
      • Support running user-specified Docker images.
      • Support specifying GPU and other resources.
      • Support launching TensorBoard when the user requests it.
      • Support customized DNS names for roles (e.g. tensorboard.$user.$domain:6006).
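
To illustrate the last goal, here is a minimal sketch of how a per-role DNS name following the pattern tensorboard.$user.$domain:6006 might be expanded. It is not Submarine code: the function name role_endpoint, the generalisation of "tensorboard" into a $role placeholder, and the sample user and domain values are assumptions made purely for illustration.

```python
from string import Template

# Pattern modelled on the goal above (tensorboard.$user.$domain:6006).
# Treating the role name and port as placeholders is an assumption.
ROLE_DNS_PATTERN = Template("$role.$user.$domain:$port")


def role_endpoint(role: str, user: str, domain: str, port: int) -> str:
    """Expand the DNS pattern into a host:port string for one role of a job."""
    return ROLE_DNS_PATTERN.substitute(role=role, user=user, domain=domain, port=port)


if __name__ == "__main__":
    # Hypothetical user and domain, used only to show the expansion.
    print(role_endpoint("tensorboard", "alice", "example.com", 6006))
```

A scheme like this would give each role of a job a predictable address, so a user could reach TensorBoard without looking up which host the container landed on.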

      Why this name?

      • Because a submarine is the only vehicle that lets humans explore deep places. B-)

      Please refer to the ongoing design doc and add your thoughts: https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#

          Issue Links

          1. [Submarine] Initial implementation: Training job submission and job history retrieval (Sub-task, Resolved, Wangda Tan)
          2. [Submarine] Support files/tarballs to be localized for a training job. (Sub-task, Open, Unassigned)
          3. [Submarine] Support users to specify Python/TF package/version/dependencies for training job. (Sub-task, Open, Unassigned)
          4. [Submarine] Failed to reset Hadoop home environment when submitting a submarine job (Sub-task, Resolved, Zac Zhou)
          5. [Submarine] Support create models / versions for training result. (Sub-task, Open, Unassigned)
          6. [Submarine] Support deploy model serving for existing models (Sub-task, Open, Unassigned)
          7. [Submarine] Support passing Kerberos principle tokens when launch training jobs. (Sub-task, Open, Unassigned)
          8. Submarine job staging directory has a lot of useless PRIMARY_WORKER-launch-script-***.sh scripts when submitting a job multiple times (Sub-task, Open, Zhankun Tang)
          9. [Submarine] Properly handle relative path for staging area (Sub-task, Resolved, Wangda Tan)
          10. [Submarine] Add Tensorboard component when --tensorboard is specified (Sub-task, Resolved, Wangda Tan)
          11. [Submarine] Allow user to specify customized quicklink(s) when submit Submarine job (Sub-task, Resolved, Wangda Tan)
          12. [Submarine] Support using Submarine to submit Pytorch job (Sub-task, Open, Sunil Govindan)
          13. [Submarine] Job should not be submitted if "--input_path" option is missing (Sub-task, Open, Zhankun Tang)
          14. [Submarine] Correct the default directory path in HDFS for "checkout_path" (Sub-task, Resolved, Zhankun Tang)
          15. Updated documentation of Submarine with latest examples. (Sub-task, Resolved, Wangda Tan)
          16. Enable local staging directory and clean it up when submarine job is submitted (Sub-task, Patch Available, Zac Zhou)
          17. [Submarine] In cases when user doesn't ask HDFS path while submitting job but framework requires user to set HDFS related environments (Sub-task, Resolved, Wangda Tan)
          18. [Submarine] Add documentation for submarine installation details (Sub-task, Resolved, Zac Zhou)
          19. [Submarine] Add submarine installation scripts (Sub-task, Patch Available, Xun Liu)
          20. [Submarine] Add documentation for submarine installation script details (Sub-task, Patch Available, Xun Liu)
          21. [Submarine] After training, the monitoring program need auto close PS service (Sub-task, Open, Xun Liu)

              People

              • Assignee: Wangda Tan (leftnoteasy)
              • Reporter: Wangda Tan (leftnoteasy)
              • Votes: 1
              • Watchers: 39
