  1. Mahout
  2. MAHOUT-1273

Single Pass Algorithm for Penalized Linear Regression with Cross Validation on MapReduce


Details

    Description

      Penalized linear regression methods such as Lasso and Elastic-net are widely used in machine learning, but there is no very efficient, scalable implementation on MapReduce.

      The published distributed algorithms for solving this problem are either iterative (which is a poor fit for MapReduce; see Stephen Boyd's paper) or approximate (a problem when exact solutions are needed; see Parallelized Stochastic Gradient Descent). Another disadvantage of these algorithms is that they cannot do cross validation in the training phase, so the user must specify the penalty parameter in advance.

      My approach trains the model with cross validation in a single pass over the data. It is based on some simple observations.
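
      The description does not list the observations, so the following is only a sketch of one that is commonly used for squared-error loss, not the code in the attached patch. It assumes the fit depends on the data only through the counts and inner products X^T X, X^T y and y^T y, that these can be accumulated per fold in one pass (mapper-side accumulation, reducer-side merge), and that the training statistics for fold k are the totals minus fold k. All function names and the modulo fold assignment are hypothetical.

```python
from typing import List, Sequence, Tuple

# (row count, X^T X, X^T y, y^T y) for one block of rows
Stats = Tuple[int, List[List[float]], List[float], float]


def empty_stats(p: int) -> Stats:
    """Zero statistics for p features."""
    return (0, [[0.0] * p for _ in range(p)], [0.0] * p, 0.0)


def add_row(stats: Stats, x: Sequence[float], y: float) -> Stats:
    """Fold one observation into the statistics (what a mapper would do)."""
    n, xtx, xty, yty = stats
    p = len(x)
    for i in range(p):
        xty[i] += x[i] * y
        for j in range(p):
            xtx[i][j] += x[i] * x[j]
    return (n + 1, xtx, xty, yty + y * y)


def combine(a: Stats, b: Stats, sign: float = 1.0) -> Stats:
    """Merge (sign=+1) or subtract (sign=-1) two blocks of statistics."""
    n_a, xtx_a, xty_a, yty_a = a
    n_b, xtx_b, xty_b, yty_b = b
    p = len(xty_a)
    xtx = [[xtx_a[i][j] + sign * xtx_b[i][j] for j in range(p)] for i in range(p)]
    xty = [xty_a[i] + sign * xty_b[i] for i in range(p)]
    return (n_a + int(sign) * n_b, xtx, xty, yty_a + sign * yty_b)


def fold_statistics(rows, num_folds: int) -> List[Stats]:
    """Single pass over the data: each row updates exactly one fold."""
    p = len(rows[0][0])
    folds = [empty_stats(p) for _ in range(num_folds)]
    for idx, (x, y) in enumerate(rows):
        k = idx % num_folds  # assumed fold assignment; hashing would also work
        folds[k] = add_row(folds[k], x, y)
    return folds


def training_statistics(folds: List[Stats]) -> List[Stats]:
    """Statistics of 'all data except fold k', for every k, with no second pass."""
    total = folds[0]
    for f in folds[1:]:
        total = combine(total, f)
    return [combine(total, f, sign=-1.0) for f in folds]
```

      Under these assumptions, the held-out squared error for fold k can also be evaluated from that fold's own statistics, since the sum of (y - x^T b)^2 equals y^T y - 2 b^T X^T y + b^T X^T X b, so no row needs to be read twice.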

      The core algorithm is a modified version of coordinate descent (see J. Friedman's paper). Friedman and his co-authors implemented a very efficient R package, "glmnet", which is the de facto standard for penalized regression.
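
      As an illustration of the kind of coordinate descent referred to here, the sketch below follows the covariance-update form, which works from inner products rather than raw rows. It is not the modified algorithm from the attached patch. It assumes centered data with no intercept, the objective (1/(2n))||y - Xb||^2 + lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2), and precomputed X^T X and X^T y; the function and parameter names are hypothetical.

```python
from typing import List, Sequence


def soft_threshold(z: float, gamma: float) -> float:
    """Shrink z toward zero by gamma; exactly zero inside [-gamma, gamma]."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_cd(xtx: Sequence[Sequence[float]], xty: Sequence[float], n: int,
                   lam: float, alpha: float = 1.0,
                   max_sweeps: int = 1000, tol: float = 1e-10) -> List[float]:
    """Cyclic coordinate descent using only X^T X, X^T y and the row count n."""
    p = len(xty)
    beta = [0.0] * p
    for _ in range(max_sweeps):
        max_change = 0.0
        for j in range(p):
            # Inner product of feature j with the residual that excludes feature j.
            rho = xty[j] - sum(xtx[j][k] * beta[k] for k in range(p) if k != j)
            denom = xtx[j][j] / n + lam * (1.0 - alpha)
            new = soft_threshold(rho / n, lam * alpha) / denom if denom > 0 else 0.0
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:  # stop once a full sweep changes nothing noticeable
            break
    return beta


# Example with an orthogonal design (X^T X = n * I), where the lasso solution
# is the soft-thresholded least-squares solution.
coef = elastic_net_cd([[4.0, 0.0], [0.0, 4.0]], [8.0, 1.0], n=4, lam=0.5)
```

      Because the solver never touches individual rows, it can be rerun for many values of lam (and for each cross-validation fold) on the same small p-by-p inputs.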

      I have implemented a primitive version of this algorithm at Alpine Data Labs.

      Attachments

        1. Algorithm and Numeric Stability.pdf
          189 kB
          Kun Yang
        2. Examples.pdf
          180 kB
          Kun Yang
        3. java files.pdf
          91 kB
          Kun Yang
        4. Manual and Example.pdf
          165 kB
          Kun Yang
        5. Manual and Example.pdf
          164 kB
          Kun Yang
        6. Notes.pdf
          124 kB
          Kun Yang
        7. PenalizedLinear.pdf
          81 kB
          Kun Yang
        8. PenalizedLinearRegression.patch
          119 kB
          Kun Yang

        Activity

          People

            Assignee:
            Unassigned
            Reporter:
            Kun Yang (kunyang)
            Votes:
            0
            Watchers:
            6

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                720h
                Remaining:
                720h
                Logged:
                Not Specified