Spark / SPARK-6567

Large linear model parallelism via a join and reduceByKey


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ML, MLlib
    • Labels: None

    Description

      To train a linear model, the dot product of each training point with the model must be computed once per iteration. If the model is too large to fit in memory on a single machine, SPARK-4590 proposes using a parameter server.

      There is an easier way to achieve this without parameter servers. In particular, if the data is held as a BlockMatrix and the model as an RDD, then each block can be joined with the relevant part of the model, followed by a reduceByKey to compute the dot products.

      This obviates the need for a parameter server, at least for linear models, although it is unclear how the approach compares with a parameter server in performance.
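      A minimal sketch of the join-plus-reduceByKey idea in Spark's Scala API follows. It assumes the training data is already a BlockMatrix whose column blocks line up with the model slices, and that the model is an RDD keyed by column-block index; the function and variable names are illustrative only, not taken from the issue or the attached slides.

{code:scala}
import org.apache.spark.mllib.linalg.{DenseVector, Matrix}
import org.apache.spark.mllib.linalg.distributed.BlockMatrix
import org.apache.spark.rdd.RDD

// Illustrative sketch: per-row dot products of blocked training data against a
// model vector that is itself distributed as an RDD of column-block slices.
def blockDotProducts(
    data: BlockMatrix,                 // n x d training data, stored as blocks
    model: RDD[(Int, DenseVector)]     // column-block index -> slice of the model
  ): RDD[(Int, DenseVector)] = {       // row-block index -> dot products of its rows

  // Re-key each data block by its column-block index so the join brings every
  // block together with exactly the slice of the model it needs.
  val byColBlock = data.blocks.map { case ((rowBlockIdx, colBlockIdx), block) =>
    (colBlockIdx, (rowBlockIdx, block))
  }

  // Join with the model and compute the partial dot products contributed by
  // this column range of features (a local matrix-vector multiply per block).
  val partials = byColBlock.join(model).map {
    case (_, ((rowBlockIdx, block), modelSlice)) =>
      (rowBlockIdx, block.multiply(modelSlice))
  }

  // reduceByKey sums the partial contributions across column blocks, giving the
  // full dot product of every row in each row block with the model.
  partials.reduceByKey { (a, b) =>
    val sum = a.toArray.clone()
    var i = 0
    while (i < sum.length) { sum(i) += b(i); i += 1 }
    new DenseVector(sum)
  }
}
{code}

      The result is keyed by row-block index, with one dot product per training point, and could feed the usual per-iteration gradient update; the join plays the role of a parameter-server "pull", while reduceByKey takes the place of the aggregation step.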

      Attachments

        1. model-parallelism.pptx (243 kB, hucheng zhou)


            People

              Assignee: Unassigned
              Reporter: Reza Zadeh
              Shepherd: Xiangrui Meng
              Votes: 2
              Watchers: 25

              Dates

                Created:
                Updated:
                Resolved: