Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-1782

svd for sparse matrix using ARPACK

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.1.0
    • MLlib
    • None

    Description

      Currently the svd implementation in mllib calls the dense matrix svd in breeze, which has a limitation of fitting n^2 Gram matrix entries in memory (n is the number of rows or number of columns of the matrix, whichever is smaller). In many use cases, the original matrix is sparse but the Gram matrix might not, and we often need only the largest k singular values/vectors. To make svd really scalable, the memory usage must be propositional to the non-zero entries in the matrix.

      One solution is to call the de facto standard eigen-decomposition package ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a reverse communication interface. The user provides a function to multiply a square matrix to be decomposed with a dense vector provided by ARPACK, and return the resulting dense vector to ARPACK. Inside ARPACK it uses an Implicitly Restarted Lanczos Method for symmetric matrix. Outside what we need to provide are two matrix-vector multiplications, first M*x then M^t*x. These multiplications can be done in Spark in a distributed manner.

      The working memory used by ARPACK is O(n*k). When k (the number of desired singular values) is small, it can be easily fit into the memory of the master machine. The overall model is master machine runs ARPACK, and distribute matrix-vector multiplication onto working executors in each iteration.

      I made a PR to breeze with an ARPACK-backed svds interface (https://github.com/scalanlp/breeze/pull/240). The interface takes anything that can be multiplied by a DenseVector. On Spark/milib side, just need to implement the sparsematrix-vector multiplication.

      It might take some time to optimize and fully test this implementation, so set the workload estimate to 4 weeks.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            vrilleup Li Pu
            Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - 672h
                672h
                Remaining:
                Remaining Estimate - 672h
                672h
                Logged:
                Time Spent - Not Specified
                Not Specified

                Slack

                  Issue deployment