SystemDS / SYSTEMDS-2083

Language and runtime for parameter servers


    Details

    • Type: Epic
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Epic Name: Language and runtime for parameter servers

      Description

      SystemML already provides a rich set of execution strategies, ranging from local operations to large-scale computation on MapReduce or Spark. In this context, we support both data-parallel computation (multi-threaded or distributed operations) and task-parallel computation (multi-threaded or distributed parfor loops). This epic aims to complement the existing execution strategies with language and runtime primitives for parameter servers, i.e., model-parallel execution. We use the terminology of model-parallel execution with distributed data and a distributed model to differentiate it from the existing data-parallel operations. Target applications are distributed deep learning and mini-batch algorithms in general. These new abstractions will help make SystemML a unified framework for small- and large-scale machine learning that supports all three major execution strategies in a single framework.


      A major challenge is the integration of stateful parameter servers and their common push/pull primitives into an otherwise functional (and thus, stateless) language. We will approach this challenge via a new builtin function paramserv which internally maintains state but at the same time fits into the runtime framework of stateless operations.
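The core of this challenge can be illustrated with a toy sketch (hypothetical Java, not SystemML code): the mutable model lives inside a single server object, and workers interact with it only through push/pull primitives, mirroring how a paramserv builtin could encapsulate state behind an otherwise pure function call. The BlockingQueue buffering of worker updates loosely echoes one of the sub-tasks below; all class and method names here are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of a stateful parameter server hidden behind
// push/pull primitives, aggregating synchronously (BSP-style) once
// all k workers have pushed their gradients.
class ToyParamServer {
    private final double[] model;                    // mutable server-side state
    private final int numWorkers;
    private final BlockingQueue<double[]> gradients; // buffered worker updates

    ToyParamServer(double[] init, int numWorkers) {
        this.model = init.clone();
        this.numWorkers = numWorkers;
        this.gradients = new ArrayBlockingQueue<>(numWorkers);
    }

    // Worker -> server: enqueue a gradient; once all k workers have
    // pushed, apply the averaged update (the BSP barrier).
    synchronized void push(double[] grad) {
        gradients.add(grad);
        if (gradients.size() == numWorkers)
            aggregate();
    }

    // Server -> worker: read a snapshot of the current model.
    synchronized double[] pull() {
        return model.clone();
    }

    private void aggregate() {
        List<double[]> batch = new ArrayList<>();
        gradients.drainTo(batch); // empty the queue for the next superstep
        for (double[] g : batch)
            for (int i = 0; i < model.length; i++)
                model[i] -= g[i] / batch.size(); // averaged SGD step, lr = 1
    }
}
```

From the caller's perspective, push and pull look like ordinary function calls; the state they share is invisible to the surrounding program, which is the property the paramserv builtin needs inside a stateless runtime.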

      Furthermore, we are interested in providing (1) different runtime backends (local and distributed), (2) different parameter server modes (synchronous, asynchronous, hogwild!, stale-synchronous), (3) different update frequencies (batch, multi-batch, epoch), as well as (4) different architectures for distributed data (1 parameter server, k workers) and distributed model (k1 parameter servers, k2 workers). 
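As a rough illustration of two of the data-partitioning schemes that appear in the sub-tasks below (disjoint contiguous vs. disjoint round-robin), the following hypothetical sketch splits row indices 0..n-1 across k workers. This is an assumption-laden toy, not the SystemML implementation.

```java
// Hypothetical sketch of two disjoint data-partitioning schemes:
// each worker receives a disjoint subset of the row indices 0..n-1.
class ToyPartitioner {

    // Contiguous: worker w gets the index range [w*n/k, (w+1)*n/k).
    static int[][] disjointContiguous(int n, int k) {
        int[][] parts = new int[k][];
        for (int w = 0; w < k; w++) {
            int lo = w * n / k, hi = (w + 1) * n / k;
            parts[w] = new int[hi - lo];
            for (int i = lo; i < hi; i++)
                parts[w][i - lo] = i;
        }
        return parts;
    }

    // Round-robin: row r goes to worker r % k, interleaving the data.
    static int[][] disjointRoundRobin(int n, int k) {
        int[][] parts = new int[k][];
        for (int w = 0; w < k; w++)
            parts[w] = new int[n / k + (w < n % k ? 1 : 0)];
        int[] pos = new int[k];
        for (int r = 0; r < n; r++) {
            int w = r % k;
            parts[w][pos[w]++] = r;
        }
        return parts;
    }
}
```

For n=5 rows and k=2 workers, the contiguous scheme yields {0,1} and {2,3,4}, while round-robin yields {0,2,4} and {1,3}; overlapping schemes such as overlap-reshuffle would instead allow workers to share (reshuffled) rows.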


      Note for GSoC students: This is a large project that will be broken down into sub-projects, so everybody will have their share of the pie.

      Prerequisites: Java; machine-learning experience is a plus but not required.

        Attachments

        Issue Links

        1. Preparation of dev environment (Sub-task, Closed, LI Guobao)
        2. API design of the paramserv function (Sub-task, Resolved, LI Guobao)
        3. Implementation of a script with paramserv func (Sub-task, Closed, LI Guobao)
        4. Implementation of language and compiler extension (Technical task, Closed, LI Guobao)
        5. Implementation of language extension (Sub-task, Closed, LI Guobao)
        6. Hops, lops, instruction generation (Sub-task, Closed, LI Guobao)
        7. IPA integration (Sub-task, Closed, LI Guobao)
        8. Parfor integration (Sub-task, Closed, LI Guobao)
        9. Initial version of local backend (Technical task, Resolved, LI Guobao)
        10. Local server (Sub-task, Closed, LI Guobao)
        11. Data partition (Sub-task, Closed, LI Guobao)
        12. Push/pull service (Sub-task, Closed, LI Guobao)
        13. Local workers (Sub-task, Closed, LI Guobao)
        14. Aggregation service (Sub-task, Closed, LI Guobao)
        15. Determine the level of parallelism (Sub-task, Closed, LI Guobao)
        16. Add the mnist-lenet test (Sub-task, Closed, LI Guobao)
        17. Add some optional argument (Sub-task, Closed, LI Guobao)
        18. Synchronization (Sub-task, Closed, LI Guobao)
        19. Checkpointing (Sub-task, Open, LI Guobao)
        20. Rewrite the ps using BlockingQueue (Sub-task, Closed, LI Guobao)
        21. Error handling (Sub-task, Closed, LI Guobao)
        22. Inline the agg service thread in ps (Sub-task, Closed, LI Guobao)
        23. Passing function pointers (Sub-task, Open, LI Guobao)
        24. Extend the update strategy ASP (Sub-task, Closed, LI Guobao)
        25. Extend update per EPOCH (Sub-task, Closed, LI Guobao)
        26. Extend scheme Disjoint_Round_Robin (Sub-task, Closed, LI Guobao)
        27. Extend scheme Disjoint_Random (Sub-task, Closed, LI Guobao)
        28. Extend scheme Overlap_Reshuffle (Sub-task, Closed, LI Guobao)
        29. Extend the update strategy SSP (Sub-task, Open, LI Guobao)
        30. Reorganise the test (Sub-task, Closed, LI Guobao)
        31. Preparation of baseline experiments (Technical task, Resolved, LI Guobao)
        32. Add unit test for data partitioner (Sub-task, Closed, LI Guobao)
        33. Shutdown the thread pool of agg service (Sub-task, Closed, LI Guobao)
        34. Reorganise the api of data partitioner (Sub-task, Closed, LI Guobao)
        35. Concurrent problem about LibMatrixMult (Sub-task, Closed, LI Guobao)
        36. Forget to fetch the last iteration in local workers (Sub-task, Closed, LI Guobao)
        37. Extend statistics for paramserv func (Sub-task, Closed, LI Guobao)
        38. First evaluation (Sub-task, Closed, LI Guobao)
        39. Batch pre-fetching per workers (Sub-task, Open, LI Guobao)
        40. Rework the function block recompilation (Sub-task, Closed, LI Guobao)
        41. Use synchronized method instead of single thread pool in agg service (Sub-task, Closed, LI Guobao)
        42. Initial version of distributed spark backend (Sub-task, Resolved, LI Guobao)
        43. Determine the level of par (Sub-task, Closed, LI Guobao)
        44. Spark data partitioner (Sub-task, Resolved, LI Guobao)
        45. Setup and cleanup of remote workers (Sub-task, Resolved, LI Guobao)
        46. Implementation of remote worker (Sub-task, Resolved, LI Guobao)
        47. Implementation of spark ps (Sub-task, Resolved, LI Guobao)
        48. Communication between ps and workers (Sub-task, Resolved, LI Guobao)
        49. Task error and preemption handles (Sub-task, Open, LI Guobao)
        50. Second evaluation (Sub-task, Closed, LI Guobao)
        51. Add experiments varied on optimizers (Sub-task, Resolved, LI Guobao)
        52. Error handling and add statistics for spark backend (Sub-task, Resolved, LI Guobao)
        53. Add experiment on spark paramserv (Sub-task, Resolved, LI Guobao)
        54. Keep data consistency for a pre-trained model (Sub-task, Open, LI Guobao)
        55. Add java doc (Sub-task, Resolved, LI Guobao)
        56. Documentation of language extension (Sub-task, Resolved, LI Guobao)
        57. Submit final product (Sub-task, Resolved, LI Guobao)
        58. Second version of execution backend (Sub-task, Closed, LI Guobao)
        59. Integration test, implementation of samples and documentation (Sub-task, Open, LI Guobao)
        60. Performance features batching and parameter transfers (Sub-task, Open, Unassigned)
        61. Extended Caffe2DML and Keras2DML script generators (Sub-task, Open, Unassigned)

          Activity

            People

            • Assignee: LI Guobao
            • Reporter: mboehm7 (Matthias Boehm)
