  1. SystemDS
  2. SYSTEMDS-1760

Improve engine robustness of distributed SGD training

    Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Algorithms, Compiler, ParFor
    • Labels:
      None

      Description

      Currently, we have a mathematical framework in place for training with distributed SGD in a distributed MNIST LeNet example. This task aims to run it at scale to determine (1) the current behavior of the engine (i.e., whether the optimizer actually runs it in a distributed fashion), and (2) ways to improve robustness and performance for this scenario. The distributed SGD framework from this example has already been ported into Caffe2DML, so improvements made for this task will directly benefit our efforts toward distributed training of Caffe models (and Keras models in the future).
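
      For readers unfamiliar with the pattern under test, the following is a minimal sketch of synchronous data-parallel SGD. It is not the SystemDS script from the MNIST LeNet example. The 1-D linear model and the names `gradient`, `distributed_sgd_step`, and `train` are assumptions chosen for illustration. The per-shard gradient loop stands in for the kind of independent iterations a parallel loop construct could distribute.

```python
# Hypothetical illustration of synchronous data-parallel SGD.
# This is NOT the SystemDS DML script; model and names are assumptions.

def gradient(w, batch):
    """Mean squared-error gradient for the 1-D model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def distributed_sgd_step(w, batch, num_workers, lr):
    """One synchronous step: shard the batch, compute per-shard gradients, aggregate."""
    shards = [batch[i::num_workers] for i in range(num_workers)]
    shards = [s for s in shards if s]  # drop empty shards
    # Each gradient is independent of the others, so these iterations
    # could be executed in parallel by a distributed engine.
    grads = [gradient(w, s) for s in shards]
    # Weight by shard size so the aggregate equals the full-batch gradient.
    total = sum(len(s) for s in shards)
    avg_grad = sum(g * len(s) for g, s in zip(grads, shards)) / total
    return w - lr * avg_grad


def train(data, num_workers=4, lr=0.1, epochs=50, batch_size=8):
    """Run mini-batch SGD where every step is computed across num_workers shards."""
    w = 0.0
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w = distributed_sgd_step(w, batch, num_workers, lr)
    return w


# Toy data generated from y = 3 * x.
data = [(i / 10.0, 3.0 * i / 10.0) for i in range(1, 17)]
w = train(data)
```

      Because the aggregate is weighted by shard size, a step with several workers matches the single-worker step up to floating-point rounding. That makes the sketch a convenient reference point when checking whether a distributed run actually changes the results.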

        Attachments

        1. Runtime_Table.png
          42 kB
          Fei Hu

              People

              • Assignee:
                Tenma Fei Hu
                Reporter:
                dusenberrymw Mike Dusenberry
              • Votes:
                0
                Watchers:
                2
