  1. SystemML
  2. SYSTEMML-2083

Language and runtime for parameter servers

    Details

    • Type: Epic
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Epic Name:
      Language and runtime for parameter servers

      Description

      SystemML already provides a rich set of execution strategies, ranging from local operations to large-scale computation on MapReduce or Spark. In this context, we support both data-parallel computation (multi-threaded or distributed operations) and task-parallel computation (multi-threaded or distributed parfor loops). This epic aims to complement the existing execution strategies with language and runtime primitives for parameter servers, i.e., model-parallel execution. We use the terms model-parallel execution with distributed data and with distributed model to differentiate these strategies from the existing data-parallel operations. Target applications are distributed deep learning and mini-batch algorithms in general. These new abstractions will help make SystemML a unified framework for small- and large-scale machine learning that supports all three major execution strategies.

      A major challenge is the integration of stateful parameter servers and their common push/pull primitives into an otherwise functional (and thus stateless) language. We will approach this challenge via a new builtin function, paramserv, which maintains state internally but still fits into the runtime framework of stateless operations.
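
The idea of hiding server state behind a stateless call can be sketched as follows. This is only an illustration in Python, not the SystemML implementation (the runtime is written in Java); `LocalParamServer`, `paramserv_sketch`, and the `upd`/`agg` callback names are hypothetical. The caller passes in a model and gets a new model back, while the mutable state and the push/pull primitives stay inside the call.

```python
import threading
from typing import Callable, List, Sequence

Vector = List[float]


class LocalParamServer:
    """Hypothetical server: owns the mutable model, exposes only pull/push."""

    def __init__(self, model: Sequence[float],
                 agg: Callable[[Vector, Vector], Vector]) -> None:
        self._model = list(model)      # private copy, caller's model is untouched
        self._agg = agg
        self._lock = threading.Lock()

    def pull(self) -> Vector:
        """Return a snapshot of the current model."""
        with self._lock:
            return list(self._model)

    def push(self, update: Vector) -> None:
        """Fold one worker update into the model."""
        with self._lock:
            self._model = self._agg(self._model, update)


def paramserv_sketch(model: Sequence[float],
                     partitions: Sequence[Sequence[float]],
                     upd: Callable[[Vector, float], Vector],
                     agg: Callable[[Vector, Vector], Vector],
                     epochs: int = 1) -> Vector:
    """Looks stateless from outside: inputs in, new model out."""
    server = LocalParamServer(model, agg)

    def work(part: Sequence[float]) -> None:
        for _ in range(epochs):
            for example in part:
                params = server.pull()              # fetch current model
                server.push(upd(params, example))   # send local update

    threads = [threading.Thread(target=work, args=(p,)) for p in partitions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return server.pull()
```

In this sketch one thread stands in for each local worker, and the server serializes updates with a lock. A real backend would also need to choose when workers are allowed to run ahead of each other, which is left out here.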

      Furthermore, we are interested in providing (1) different runtime backends (local and distributed), (2) different parameter server modes (synchronous, asynchronous, Hogwild!, stale-synchronous), (3) different update frequencies (batch, multi-batch, epoch), as well as (4) different architectures for distributed data (1 parameter server, k workers) and distributed model (k1 parameter servers, k2 workers).
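
To make the difference between the modes and update frequencies concrete, here is a small Python sketch. It assumes the common textbook reading of the modes: synchronous (BSP) lets no worker run ahead of the slowest one, stale-synchronous (SSP) allows a bounded lead, and asynchronous (ASP) allows an unbounded lead. Hogwild! (lock-free updates) is not modeled. All function names and the default staleness of 2 are hypothetical and not taken from the SystemML code.

```python
import math
from typing import Sequence


def staleness_bound(mode: str, ssp_staleness: int = 2) -> float:
    """Maximum number of iterations a worker may lead the slowest worker."""
    bounds = {"BSP": 0, "SSP": ssp_staleness, "ASP": math.inf}
    return bounds[mode]


def may_proceed(worker_clock: int, all_clocks: Sequence[int],
                bound: float) -> bool:
    """True if a worker with this many finished iterations may start another."""
    return worker_clock - min(all_clocks) <= bound


def pushes_per_epoch(num_rows: int, batch_size: int,
                     batches_per_push: int = 1) -> int:
    """Number of pushes per worker and epoch for a given update frequency.

    batches_per_push = 1 models per-batch updates, a larger value models
    multi-batch updates, and a value >= the number of batches models
    per-epoch updates.
    """
    num_batches = math.ceil(num_rows / batch_size)
    return math.ceil(num_batches / batches_per_push)
```

For example, with 1000 rows per worker and a batch size of 64 there are 16 batches, so per-batch updates push 16 times per epoch, updates every 4 batches push 4 times, and per-epoch updates push once.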

      Note for GSoC students: This is a large project that will be broken down into subprojects, so everybody will get their share of the pie.

      Prerequisites: Java. Machine learning experience is a plus but not required.

          Issue Links

          1. Preparation of dev environment (Sub-task, Closed, LI Guobao)
          2. API design of the paramserv function (Sub-task, Resolved, LI Guobao)
          3. Implementation of a script with paramserv func (Sub-task, Closed, LI Guobao)
          4. Implementation of language and compiler extension (Technical task, Closed, LI Guobao)
          5. Implementation of language extension (Sub-task, Closed, LI Guobao)
          6. Hops, lops, instruction generation (Sub-task, Closed, LI Guobao)
          7. IPA integration (Sub-task, Closed, LI Guobao)
          8. Parfor integration (Sub-task, Closed, LI Guobao)
          9. Initial version of local backend (Technical task, Resolved, LI Guobao)
          10. Local server (Sub-task, Closed, LI Guobao)
          11. Data partition (Sub-task, Closed, LI Guobao)
          12. Push/pull service (Sub-task, Closed, LI Guobao)
          13. Local workers (Sub-task, Closed, LI Guobao)
          14. Aggregation service (Sub-task, Closed, LI Guobao)
          15. Determine the level of parallelism (Sub-task, Closed, LI Guobao)
          16. Add the mnist-lenet test (Sub-task, Closed, LI Guobao)
          17. Add some optional argument (Sub-task, Closed, LI Guobao)
          18. Synchronization (Sub-task, Closed, LI Guobao)
          19. Checkpointing (Sub-task, Open, LI Guobao)
          20. Rewrite the ps using BlockingQueue (Sub-task, Closed, LI Guobao)
          21. Error handling (Sub-task, Closed, LI Guobao)
          22. Inline the agg service thread in ps (Sub-task, Closed, LI Guobao)
          23. Passing function pointers (Sub-task, Open, LI Guobao)
          24. Extend the update strategy ASP (Sub-task, Closed, LI Guobao)
          25. Extend update per EPOCH (Sub-task, Closed, LI Guobao)
          26. Extend scheme Disjoint_Round_Robin (Sub-task, Closed, LI Guobao)
          27. Extend scheme Disjoint_Random (Sub-task, Closed, LI Guobao)
          28. Extend scheme Overlap_Reshuffle (Sub-task, Closed, LI Guobao)
          29. Extend the update strategy SSP (Sub-task, Open, LI Guobao)
          30. Reorganise the test (Sub-task, Closed, LI Guobao)
          31. Preparation of baseline experiments (Technical task, Resolved, LI Guobao)
          32. Add unit test for data partitioner (Sub-task, Closed, LI Guobao)
          33. Shutdown the thread pool of agg service (Sub-task, Closed, LI Guobao)
          34. Reorganise the api of data partitioner (Sub-task, Closed, LI Guobao)
          35. Concurrent problem about LibMatrixMult (Sub-task, Closed, LI Guobao)
          36. Forget to fetch the last iteration in local workers (Sub-task, Closed, LI Guobao)
          37. Extend statistics for paramserv func (Sub-task, Closed, LI Guobao)
          38. First evaluation (Sub-task, Closed, LI Guobao)
          39. Batch pre-fetching per workers (Sub-task, Open, LI Guobao)
          40. Rework the function block recompilation (Sub-task, Closed, LI Guobao)
          41. Use synchronized method instead of single thread pool in agg service (Sub-task, Closed, LI Guobao)
          42. Initial version of distributed spark backend (Sub-task, In Progress, LI Guobao)
          43. Determine the level of par (Sub-task, Open, LI Guobao)
          44. Spark data partitioner (Sub-task, Resolved, LI Guobao)
          45. Setup and cleanup of remote workers (Sub-task, Resolved, LI Guobao)
          46. Implementation of remote worker (Sub-task, Resolved, LI Guobao)
          47. Implementation of spark ps (Sub-task, Open, LI Guobao)
          48. Communication between ps and workers (Sub-task, In Progress, LI Guobao)
          49. Task error and preemption handles (Sub-task, Open, LI Guobao)
          50. Second evaluation (Sub-task, Closed, LI Guobao)
          51. Add experiments varied on optimizers (Sub-task, Resolved, LI Guobao)
          52. Error handling and add statistic for spark backend (Sub-task, Resolved, LI Guobao)
          53. Add experiment on spark paramserv (Sub-task, Open, LI Guobao)
          54. Second version of execution backend (Sub-task, Open, LI Guobao)
          55. Integration test, implementation of samples and documentation (Sub-task, Open, LI Guobao)
          56. Submit final product (Sub-task, Open, LI Guobao)
          57. Performance features batching and parameter transfers (Sub-task, Open, Unassigned)
          58. Extended Caffe2DML and Keras2DML script generators (Sub-task, Open, Unassigned)
          59. Documentation of language extension (Sub-task, Open, Unassigned)
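
The data partitioning schemes named in the sub-tasks above (Disjoint_Round_Robin, Disjoint_Random, Overlap_Reshuffle) can be illustrated with a short Python sketch. The behavior shown is an assumption inferred from the scheme names, not taken from the SystemML code: the two disjoint schemes give every row to exactly one worker, while the overlapping scheme gives every worker all rows in its own shuffled order. Function names and the `seed` parameter are hypothetical.

```python
import random
from typing import List, Sequence


def disjoint_round_robin(rows: Sequence[int], k: int) -> List[List[int]]:
    """Row i goes to worker i mod k."""
    return [list(rows[w::k]) for w in range(k)]


def disjoint_random(rows: Sequence[int], k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle all rows once, then cut the result into k contiguous chunks."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    size = -(-len(shuffled) // k)  # ceiling division
    return [shuffled[w * size:(w + 1) * size] for w in range(k)]


def overlap_reshuffle(rows: Sequence[int], k: int, seed: int = 0) -> List[List[int]]:
    """Every worker gets all rows, each in an independently shuffled order."""
    parts = []
    for w in range(k):
        copy = list(rows)
        random.Random(seed + w).shuffle(copy)  # per-worker seed
        parts.append(copy)
    return parts
```

Here rows are represented by their indices. A real partitioner would slice the feature and label matrices with the same indices so that examples and labels stay aligned.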

              People

              • Assignee:
                LI Guobao
                Reporter:
                Matthias Boehm
              • Votes:
                5
                Watchers:
                11
