Spark / SPARK-6323

Large rank matrix factorization with Nonlinear loss and constraints

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ML, MLlib
    • Labels: None

      Description

      Currently ml.recommendation.ALS is optimized for Gram matrix generation, which scales to modest ranks. The problems we can solve are in the normal-equation/quadratic form: 0.5x'Hx + c'x + g(z)

      g(z) can be one of the constraints from the Breeze proximal library:
      https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
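
      As a quick illustration of the kind of g(z) terms meant here, two of the simplest proximal operators, written against plain arrays rather than the Breeze API so the sketch stays library-agnostic (proxL1 and projectNonNegative are illustrative names, not the Proximal.scala ones):

        // prox of lambda * ||x||_1: soft-thresholding, shrinks each entry toward zero
        def proxL1(x: Array[Double], lambda: Double): Array[Double] =
          x.map(v => math.signum(v) * math.max(math.abs(v) - lambda, 0.0))

        // prox of the indicator of {x : x >= 0}: projection onto the nonnegative orthant
        def projectNonNegative(x: Array[Double]): Array[Double] =
          x.map(v => math.max(v, 0.0))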

      In this PR we will re-use the ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to Xiangrui Meng's recent changes, it is straightforward to do now!

      ALM will be capable of solving problems of the form: min f(x) + g(z)

      1. The loss function f(x) can be LeastSquareLoss or LoglikelihoodLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss.

      2. The constraints g(z) supported are the same as above, except that we don't yet support affine + bounds (Aeq x = beq, lb <= x <= ub). Most likely we don't need that for ML applications.

      3. For the solver we will use breeze.optimize.proximal.NonlinearMinimizer, which in turn uses a projection-based solver (SPG) or proximal solvers (ADMM), chosen based on convergence speed.

      https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

      4. The factors will be SparseVectors so that we keep the shuffle size in check. For example, we will run with 10K ranks but force the factors to be 100-sparse.
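
      As a sketch of item 4 (illustrative only; topKSparse is a hypothetical helper, not ALM code), forcing a learned dense factor to be k-sparse by keeping its k largest-magnitude entries:

        import org.apache.spark.mllib.linalg.{Vector, Vectors}

        // Keep only the k largest-magnitude entries of a dense factor; the rest
        // become implicit zeros, which keeps the shuffled factor blocks small.
        def topKSparse(factor: Array[Double], k: Int): Vector = {
          val topK = factor.zipWithIndex
            .sortBy { case (v, _) => -math.abs(v) }
            .take(k)
            .sortBy { case (_, i) => i }            // sparse vectors want sorted indices
          Vectors.sparse(factor.length, topK.map(_._2), topK.map(_._1))
        }

        // e.g. a rank-10K factor truncated to 100 non-zeros:
        // val sparseFactor = topKSparse(denseFactor, 100)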

      This is closely related to Sparse LDA (https://issues.apache.org/jira/browse/SPARK-5564), with the difference that we are not using a graph representation here.

      As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)...

      This JIRA is intended to extend the capabilities of ml.recommendation to generalized loss functions.

        Activity

        mlnick Nick Pentreath added a comment -

        I think it is safe to say this will not be feasible to incorporate into Spark itself in the short- to medium-term.

        debasish83 Debasish Das added a comment -

        The Petuum paper released today mentions going to larger topic sizes (~10-100K): http://arxiv.org/pdf/1412.1576v1.pdf

        debasish83 Debasish Das added a comment -

        I did some more reading and realized that even for small ranks, the current least-squares approach won't be able to handle either the KL divergence loss (unless there is an approximation that I am missing; more discussion on https://issues.apache.org/jira/browse/SPARK-2426) or the PLSA formulation (KL divergence loss with additional constraints)... Even for collaborative filtering with small ranks, this code will be useful...
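
        For reference, the KL divergence loss meant here is the usual generalized KL objective from the NMF literature, restricted to the observed entries and writing the factors as W and H; it is not quadratic in either factor, so it does not fit the normal-equation form:

          minimize \sum_{ij \in \Omega} ( r_{ij} \log( r_{ij} / (WH)_{ij} ) - r_{ij} + (WH)_{ij} )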

        debasish83 Debasish Das added a comment -

        By the way, I can close the JIRA if it is not of broader interest to the community...

        debasish83 Debasish Das added a comment - edited

        There are some other interesting cases for large-rank non-convex functions, but we will come to those once PLSA is fixed using factorization. In all these formulations the problem decomposes like ALS, which is why we can distribute the solve to Spark workers... If the objective function does not decompose, as with a neural net (the natural extension of ALS), then we need parameter-server-type ideas for the solver...

        debasish83 Debasish Das added a comment - edited

        g(z) is not regularization... For now we support constraints like z >= 0; lb <= z <= ub; 1'z = s, z >= 0; and L1(z). These are the same constraints I supported through QuadraticMinimizer for SPARK-2426. I have already migrated ALS to use QuadraticMinimizer (default) and NNLS (positive) and am waiting for the next Breeze release.

        I call it z since we are using splitting algorithms here for the solve (projection-based or ADMM + proximal)...
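
        For reference, the box and simplex constraints above reduce to simple projections; a hand-written sketch in plain Scala (the real operators are the ones in Breeze's Proximal.scala; projectBox and projectSimplex below just spell out the underlying math):

          // Projection onto the box {z : lb <= z <= ub}: clamp each coordinate.
          def projectBox(z: Array[Double], lb: Double, ub: Double): Array[Double] =
            z.map(v => math.min(math.max(v, lb), ub))

          // Projection onto the scaled simplex {z : z >= 0, 1'z = s}, assuming s > 0
          // (standard sort-based algorithm, O(n log n)).
          def projectSimplex(z: Array[Double], s: Double): Array[Double] = {
            val u = z.sortBy(-_)                       // values sorted descending
            val cum = u.scanLeft(0.0)(_ + _).tail      // prefix sums of u
            val rho = (1 to u.length).filter(j => u(j - 1) - (cum(j - 1) - s) / j > 0).max
            val theta = (cum(rho - 1) - s) / rho
            z.map(v => math.max(v - theta, 0.0))
          }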

        Sure, for papers on the global objective, refer to any PLSA paper with matrix factorization. I personally like the following and am focused on them:

        1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization, Equations (2) and (3)

        2. The original PLSA paper from Hofmann.

        3. Collaborative filtering using PLSA from Hofmann: Latent Semantic Models for Collaborative Filtering

        4. Industry-specific application:
        http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818

        For large-rank matrix factorization the requirement also comes from sparse topics, which can now easily range up to ~10K...

        The idea can be implemented in the Sparse LDA JIRA (https://issues.apache.org/jira/browse/SPARK-5564) as well, and I asked Joseph K. Bradley whether he thinks we should do it in the LDA framework, but I don't think we know yet which flow will scale better as the data moves from sparse to dense.

        With the factorization flow I will start to see results next week. These are the algorithm steps:

        1. minimize f(w, h*)
           s.t. 1'w = 1, w >= 0 (row constraints)

        2. minimize f(w*, h)
           s.t. 0 <= h <= 1

        3. Normalize each column of h

        Note that steps 2 and 3 are an approximation to the original matrix formulation, but the column normalization makes the factor probabilistically well defined for the next PLSA iteration.

        f(w, h*) is the log-likelihood loss from the PLSA paper.
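
        A rough skeleton of that outer loop, written on local Breeze matrices for clarity (the distributed version would work on ALS-style blocks); solveSimplexRows and solveBox are hypothetical stand-ins for the constrained NonlinearMinimizer solves in steps 1 and 2:

          import breeze.linalg._

          // One outer iteration of the scheme above: r is the ratings matrix,
          // h the current item factor; returns the updated (w, h).
          def plsaIteration(
              r: DenseMatrix[Double],
              h: DenseMatrix[Double],
              solveSimplexRows: (DenseMatrix[Double], DenseMatrix[Double]) => DenseMatrix[Double],
              solveBox: (DenseMatrix[Double], DenseMatrix[Double]) => DenseMatrix[Double]
          ): (DenseMatrix[Double], DenseMatrix[Double]) = {
            val wNew = solveSimplexRows(r, h)   // step 1: minimize f(w, h*), rows of w on the simplex
            val hNew = solveBox(r, wNew)        // step 2: minimize f(w*, h), 0 <= h <= 1
            for (j <- 0 until hNew.cols) {      // step 3: make each column of h a distribution
              val colSum = sum(hNew(::, j))
              if (colSum > 0.0) hNew(::, j) := hNew(::, j) / colSum
            }
            (wNew, hNew)
          }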

        I will start to look into the GraphX-based flow after that, because in general that flow makes more sense for distributed nets where the objective is no longer separable as it is in ALS-WR / PLSA.

        mengxr Xiangrui Meng added a comment - edited

        Debasish Das, please help me understand some details.

        > The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z)

        You call `g(z)` "constraints" but it looks like the regularization term in the objective function, and you didn't mention what `z` is.

        > ALM will be capable of solving the following problems: min f ( x ) + g (z)

        Are they sub-problems of the matrix factorization? If yes, could you also state the global objective? For example, in ALS, the global objective is

        minimize \frac{1}{2} \sum_{ij \in \Omega} (r_{ij} - u_i^T v_j)^2

        and if we take alternating directions, the problem in each step is decoupled into many sub-problems (least squares).

        minimize \frac{1}{2} \sum_{j : ij \in \Omega} (r_{ij} - u_i^T v_j)^2    (sub-problem for u_i)

        We can add the nonnegative constraints to the global objective, and then the sub-problems receive the same constraints. I can see that other losses may work, but I cannot clearly see the benefits of using them, since they usually make the problem much harder to solve. Any papers for reference?

        Another issue is dealing with very frequent items (https://issues.apache.org/jira/browse/SPARK-3735). We plan to assemble and send partial AtA directly. But this only works if the sub-problems can be expressed via the normal equations, which I think applies only to squared loss.
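
        For concreteness (this is just the standard ALS derivation), with squared loss and the usual ridge term \lambda, the sub-problem for u_i has the closed normal-equation form

          ( \sum_{j : ij \in \Omega} v_j v_j^T + \lambda I ) u_i = \sum_{j : ij \in \Omega} r_{ij} v_j

        so the data enters only through the aggregates \sum v_j v_j^T and \sum r_{ij} v_j, which is exactly what assembling partial AtA exploits. With a non-quadratic loss the optimality condition depends on u_i nonlinearly through the loss derivative, so no such pre-aggregation is possible.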

        > As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)...

        Any papers using rank ~10K?


          People

          • Assignee: Unassigned
          • Reporter: debasish83 Debasish Das
          • Shepherd: Xiangrui Meng
          • Votes: 0
          • Watchers: 8


              Time Tracking

              • Original Estimate: 672h
              • Remaining Estimate: 672h
              • Time Spent: Not Specified
