Spark / SPARK-2426

Quadratic Minimization for MLlib ALS

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.4.0
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels: None

      Description

      Current ALS supports least squares and nonnegative least squares.

      I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems (a generic form of the per-block subproblem is sketched after this list):

      1. ALS with bounds
      2. ALS with L1 regularization
      3. ALS with Equality constraint and bounds
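
      All three reduce, per user or product block, to a small constrained quadratic program. A rough generic form (a sketch only; the exact penalty and constraint terms depend on the option chosen) is:

      minimize_x   0.5*x'Hx + c'x + lambda1*||x||_1
      subject to   Ax = b,  lb <= x <= ub

      where H = Y'Y + lambda2*I is the Gram matrix built from the fixed factor block Y, and c = -Y'r for the observed ratings r.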

      Initial runtime comparisons are presented at Spark Summit.

      http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark

      Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results.

      For integration the detailed plan is as follows:

      1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
      2. Integrate QuadraticMinimizer in mllib ALS

          Activity

          Debasish Das added a comment (edited)

          Hi Xiangrui,

          The branch is ready for an initial review. I will do a lot of clean-up this week.

          I need some advice on whether we should bring the additional ALS features first or integrate NNLS with QuadraticMinimizer so that we can handle large ranks as well.

          https://github.com/debasish83/spark/commits/qp-als

          optimization/QuadraticMinimizer.scala is the placeholder for all QuadraticMinimization.

          Right now we support 5 features:

          1. Least square
          2. Quadratic minimization with positivity
          3. Quadratic minimization with box constraints: a generalization of positivity
          4. Quadratic minimization with elastic net: the L1 weight is fixed at 0.99; elastic-net control is not exposed to users
          5. Quadratic minimization with affine constraints and bounds

          There are many regularizers in Proximal.scala that can be re-used in the mllib updaters; L1Updater in mllib is an example of a proximal algorithm.
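
          As a rough illustration of the proximal-operator abstraction (the names below are illustrative, not the actual Proximal.scala API), the L1 case is plain soft-thresholding:

          // Illustrative sketch only -- not the actual Proximal.scala interface.
          // A proximal operator maps v to argmin_x g(x) + (rho/2)*||x - v||^2.
          trait Prox {
            def prox(v: Array[Double], rho: Double): Array[Double]
          }

          // For g(x) = lambda*||x||_1 the prox is elementwise soft-thresholding,
          // which is also what an L1Updater-style step computes.
          class ProxL1(lambda: Double) extends Prox {
            def prox(v: Array[Double], rho: Double): Array[Double] = {
              val t = lambda / rho
              v.map(vi => math.signum(vi) * math.max(math.abs(vi) - t, 0.0))
            }
          }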

          QuadraticMinimizer is optimized for direct solves right now (Cholesky or LU, depending on the problem being solved).

          The CG core from Breeze will be used for iterative solves when ranks are high. I need a different variant of CG for formulation 5, so the Breeze CG is not sufficient for all the formulations this branch supports and needs to be extended.

          Right now I am experimenting with the ADMM rho and lambda values so that the NNLS iterations are on par with least squares with positivity. The ideas for rho and lambda tuning are the following (the scaled-form ADMM iterations are sketched after this list for reference):

          1. Derive an optimal value of lambda for quadratic problems, similar to idea of Nesterov's acceleration being used in algorithms like FISTA and accelerated ADMM from UCLA
          2. Derive rho from approximate min and max eigenvalues of gram matrix
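
          For reference, assuming the standard splitting minimize 0.5*x'Hx + c'x + g(z) subject to x = z (Boyd et al.), the scaled-form ADMM iterations being tuned here are:

          x^{k+1} = (H + rho*I)^{-1} * (rho*(z^k - u^k) - c)
          z^{k+1} = prox_{g/rho}(x^{k+1} + u^k)
          u^{k+1} = u^k + x^{k+1} - z^{k+1}

          so rho controls how strongly the x-update is pulled toward z, which is why eigenvalue-based estimates of the gram matrix are a natural way to pick it.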

          For Matlab-based experiments comparing PDCO, ECOS (IPM), MOSEK and ADMM variants, ADMM is faster while producing results within 1e-4 of MOSEK. I will publish the numbers and the Matlab scripts through the ECOS jnilib open source project (GPL licensed). I did not add any ECOS code here so that everything stays Apache-licensed.

          For the topic modeling use case, I expect to produce sparse-coding results (L1 on product factors, L2 on user factors).

          Example runs:

          NMF:

          ./bin/spark-submit --total-executor-cores 4 --master spark://localhost:7077 --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --class org.apache.spark.examples.mllib.MovieLensALS ./examples/target/spark-examples_2.10-1.1.0-SNAPSHOT.jar --rank 20 --numIterations 10 --userConstraint POSITIVE --lambdaUser 0.065 --productConstraint POSITIVE --lambdaProduct 0.065 --kryo hdfs://localhost:8020/sandbox/movielens/

          Sparse coding:

          ./bin/spark-submit --total-executor-cores 4 --master spark://localhost:7077 --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --class org.apache.spark.examples.mllib.MovieLensALS ./examples/target/spark-examples_2.10-1.1.0-SNAPSHOT.jar --delimiter " " --rank 20 --numIterations 10 --userConstraint SMOOTH --lambdaUser 0.065 --productConstraint SPARSE --lambdaProduct 0.065 --kryo hdfs://localhost:8020/sandbox/movielens

          Robust PLSA with least square loss:

          ./bin/spark-submit --total-executor-cores 4 --master spark://localhost:7077 --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --class org.apache.spark.examples.mllib.MovieLensALS ./examples/target/spark-examples_2.10-1.1.0-SNAPSHOT.jar --delimiter " " --rank 20 --numIterations 10 --userConstraint EQUALITY --lambdaUser 0.065 --productConstraint EQUALITY --lambdaProduct 0.065 --kryo hdfs://localhost:8020/sandbox/movielens

          With this change, users can apply user- and product-specific constraints: for example, positive factors for products (interpretability) and smooth (L2) factors for users to get better RMSE.

          Thanks.
          Deb

          Apache Spark added a comment

          User 'debasish83' has created a pull request for this issue:
          https://github.com/apache/spark/pull/2705

          Xiangrui Meng added a comment

          Debasish Das Thanks for working on this feature! This is definitely a lot of work. We need to figure out a couple of high-level questions before looking into the code:

          1. License. There are two files that require special licensing: Proximal, which ports cvxgrp/proximal (BSD), and QPMinimizer:

          ... distributed with Copyright (c) 2014, Debasish Das (Verizon), all rights reserved.
          

          Code contribution to Apache follows ICLA: http://www.apache.org/licenses/icla.txt . I'm not familiar with the terms. I saw

          Except for the license granted herein to the Foundation and recipients of
          software distributed by the Foundation, You reserve all right, title,
          and interest in and to Your Contributions.
          

          My understanding is that if you want your code distributed under the Apache License, we don't need a special notice about your rights. Please check with Verizon's legal team to make sure they are okay with it. It would be really helpful if someone can explain in more detail.

          2. Interface. I'm doing a refactoring of ALS (SPARK-3541). I hope we can decouple the solvers (LS, QP) from ALS. In

          https://github.com/mengxr/spark-als/blob/master/src/main/scala/org/apache/spark/ml/SimpleALS.scala

          The subproblem is wrapped in a NormalEquation, which stores AtA, Atb, and n. A Cholesky solver takes a NormalEquation instance, solves it, and returns the solution. We can plug in other solvers as long as NormalEquation provides all the information we need. Does it apply to your use cases?
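
          A rough sketch of that decoupling (illustrative only; the names are not the actual SPARK-3541 code):

          import breeze.linalg.{DenseMatrix, DenseVector}

          // Holds the normal equations A'A x = A'b for one user/product subproblem.
          class NormalEquation(val ata: DenseMatrix[Double],
                               val atb: DenseVector[Double],
                               val n: Int)

          // Any solver that consumes a NormalEquation can be plugged into ALS.
          trait LeastSquaresSolver {
            def solve(ne: NormalEquation, lambda: Double): DenseVector[Double]
          }

          // Baseline: regularized direct solve (Breeze's \ delegates to LAPACK).
          class DirectSolver extends LeastSquaresSolver {
            def solve(ne: NormalEquation, lambda: Double): DenseVector[Double] = {
              val h = ne.ata + DenseMatrix.eye[Double](ne.ata.rows) * lambda
              h \ ne.atb
            }
          }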

          For public APIs, we should restrict parameters to simple types. For example, constraint = "none" | "nonnegative" | "box". This is good for adding Python APIs. Those options should be sufficient for normal use cases. We can provide a developer API that allows advanced users to plug-in their own solvers. You can check the current proposal of parameters at SPARK-3530.

          3. Where to put the implementation? Including MLlib's NNLS, those solvers are for local problems. What sounds ideal to me is breeze.optimize, which already contains several optimization solvers; we use the LBFGS implemented there and maybe OWLQN soon.

          4. This PR definitely needs some time for testing. The feature-freeze deadline for v1.2 is Oct 31. I cannot promise time for code review given my current bandwidth. It would be great if you can share your MATLAB code (hopefully Octave compatible) and some performance results, so more developers can help test.

          Debasish Das added a comment

          1. Xiangrui Meng Our legal team was clear that the Stanford and Verizon copyrights should show up in the COPYRIGHT.txt file. I saw other companies' copyrights there and did not think it would be a big issue.

          2. For the new interface, we have two more requirements: convex loss functions (supporting huber loss, hinge loss, etc.) and no explicit AtA construction, since once we start scaling to 10000 factors for LSA, AtA construction will start to choke. Can I work on your branch? https://github.com/mengxr/spark-als/blob/master/src/main/scala/org/apache/spark/ml/SimpleALS.scala

          3. I agree to refactor the core solvers, including NNLS, to Breeze. That was the initial plan, but since we wanted to test out the features on our internal datasets, integrating with mllib was faster. I am testing NNLS's CG implementation since, as soon as explicit AtA construction is taken out, we need to rely on CG in place of direct solvers. But I will refactor the solvers out to Breeze, and that will take the copyright messages to Breeze as well.

          4. Let me add the Matlab scripts and point to the repository. ECOS and MOSEK will need Matlab to run. PDCO and Proximal variants will run fine on Octave. I am not sure if MOSEK is supported on Octave.

          Sean Owen added a comment

          Regarding licensing, if the code is BSD licensed then it does not require an entry in NOTICE file (it's a "Category A" license), and entries shouldn't be added to NOTICE unless required. I believe that in this case we will need to reproduce the text of the license in LICENSE since it will not be included otherwise from a Maven artifact. So I suggest: don't change NOTICE, and move the license in LICENSE up to the section where other licenses are reproduced in full. It's a complex issue but this is my best understanding of the right thing to do.

          Debasish Das added a comment

          Xiangrui Meng I thought more about it, and one of the reasons we chose ADMM is that QuadraticMinimizer is not designed to be only a local algorithm.

          If it runs on the Spark master it will take an RDD; if it runs on a Spark worker, it will take H and c from x'Hx + c'x along with proximal operators.

          I will update the API and show some POCs of how this meta-algorithm can use LBFGS/truncated Newton as the core solver for the x-solve in a scalable version of matrix factorization where we never want to create the H matrix explicitly.

          Truncated Newton is a better choice for the constraints we want to support. I am working on a variant of TRON and the linear CG that's in Breeze for the scalable version; those are the building blocks I need.

          I am sure some of the code will move to Breeze. Proximal will definitely move to Breeze, but QuadraticMinimizer will be refactored. It would really help if you can open a PR on the new ALS design you have so we can work on it.

          Debasish Das added a comment (edited)

          Xiangrui Meng The matlab comparison scripts are open sourced over here:

          https://github.com/debasish83/ecos/blob/master/matlab/admm/qprandom.m
          https://github.com/debasish83/ecos/blob/master/matlab/pdco4/code/pdcotestQP.m

          The detailed comparisons are in the README.md; please look at the section on Matlab comparisons.

          In a nutshell: for bounds, MOSEK and ADMM are similar; for elastic net, Proximal is 10X faster than MOSEK; for equality, MOSEK is 2-3X faster than Proximal, but both PDCO and ECOS produce much worse results than ADMM. Accelerated ADMM also did not work as well as default ADMM. Increasing the over-relaxation parameter helps ADMM, but I have not explored it yet.

          ADMM and PDCO are in Matlab but ECOS and MOSEK are both using mex files so they are expected to be more efficient.

          Next I will add performance results from running positivity, box, sparse coding / regularized LSI and robust PLSA on the MovieLens dataset and validate product recommendation using the MAP measure. In terms of RMSE, default < positive < sparse coding.

          What are the largest datasets the LDA PRs are running on? I would like to try that with sparse coding as well. From these papers, sparse coding/RLSI should give results on par with LDA:

          https://www.cs.cmu.edu/~xichen/images/SLSA-sdm11-final.pdf
          http://web.stanford.edu/group/mmds/slides2012/s-hli.pdf

          The same randomized matrices can be generated and run in the PR as follows:

          ./bin/spark-class org.apache.spark.mllib.optimization.QuadraticMinimizer 1000 1 1.0 0.99

          rank=1000, equality=1, lambda=1.0, beta=0.99
          L1 regularization = lambda*beta, L2 regularization = lambda*(1-beta)

          Generating randomized QPs with rank 1000 equalities 1
          sparseQp 88.423 ms iterations 45 converged true
          posQp 181.369 ms iterations 121 converged true
          boundsQp 175.733 ms iterations 121 converged true
          Qp Equality 2805.564 ms iterations 2230 converged true

          Debasish Das added a comment

          Matlab comparisons of MOSEK, ECOS, PDCO and ADMM are over here:
          https://github.com/debasish83/ecos/blob/master/README.md

          MOSEK is available for research purposes. Let me know if there are issues in running the matlab scripts.

          Debasish Das added a comment

          Refactored QuadraticMinimizer and NNLS from mllib optimization to breeze.optimize.quadratic:
          https://github.com/scalanlp/breeze/pull/321
          I will update the PR as well, but the latest Breeze depends on Scala 2.11 while Spark still uses 2.10.
          All license and copyright information has also moved to Breeze, so for Spark there are no changes to the LICENSE/NOTICE files.

          Apache Spark added a comment

          User 'debasish83' has created a pull request for this issue:
          https://github.com/apache/spark/pull/3221

          Debasish Das added a comment (edited)

          With the MAP measure added to examples.MovieLensALS through https://issues.apache.org/jira/browse/SPARK-4231, I compared the quality and runtime of the matrix completion formulations on the MovieLens 1M dataset:

          Default: userConstraint L2, productConstraint L2 lambdaUser=lambdaProduct=0.065 rank=100 iterations 10

          Test RMSE = 0.8436480113821955.
          Test users 6038 MAP 0.05860164548002782

          Solver: Cholesky decomposition followed by forward-backward solves

          Per iteration runtime for baseline (solveTime in ms)

          14/11/19 17:37:06 INFO ALS: usersOrProducts 924 slowConvergence 0 QuadraticMinimizer solveTime 362.813 Iters 0
          14/11/19 17:37:06 INFO ALS: usersOrProducts 910 slowConvergence 0 QuadraticMinimizer solveTime 314.527 Iters 0
          14/11/19 17:37:06 INFO ALS: usersOrProducts 927 slowConvergence 0 QuadraticMinimizer solveTime 265.75 Iters 0
          14/11/19 17:37:06 INFO ALS: usersOrProducts 918 slowConvergence 0 QuadraticMinimizer solveTime 271.513 Iters 0
          14/11/19 17:37:09 INFO ALS: usersOrProducts 1510 slowConvergence 0 QuadraticMinimizer solveTime 370.177 Iters 0
          14/11/19 17:37:09 INFO ALS: usersOrProducts 1512 slowConvergence 0 QuadraticMinimizer solveTime 467.994 Iters 0
          14/11/19 17:37:09 INFO ALS: usersOrProducts 1507 slowConvergence 0 QuadraticMinimizer solveTime 511.894 Iters 0
          14/11/19 17:37:09 INFO ALS: usersOrProducts 1511 slowConvergence 0 QuadraticMinimizer solveTime 481.189 Iters 0

          NMF: userConstraint POSITIVE, productConstraint POSITIVE, userLambda=productLambda=0.065 L2 regularization

          Got 1000209 ratings from 6040 users on 3706 movies.
          Training: 800670, test: 199539.
          Quadratic minimization userConstraint POSITIVE productConstraint POSITIVE
          Test RMSE = 0.8435335132641906.
          Test users 6038 MAP 0.056361816590625446

          ALS iteration 1 runtime:

          QuadraticMinimizer convergence profile:

          14/11/19 17:46:46 INFO ALS: usersOrProducts 918 slowConvergence 0 QuadraticMinimizer solveTime 1936.281 Iters 73132
          14/11/19 17:46:46 INFO ALS: usersOrProducts 927 slowConvergence 0 QuadraticMinimizer solveTime 1871.364 Iters 75219
          14/11/19 17:46:46 INFO ALS: usersOrProducts 910 slowConvergence 0 QuadraticMinimizer solveTime 2067.735 Iters 73180
          14/11/19 17:46:46 INFO ALS: usersOrProducts 924 slowConvergence 0 QuadraticMinimizer solveTime 2127.161 Iters 75546
          14/11/19 17:46:53 INFO ALS: usersOrProducts 1507 slowConvergence 0 QuadraticMinimizer solveTime 3813.923 Iters 193207
          14/11/19 17:46:54 INFO ALS: usersOrProducts 1511 slowConvergence 0 QuadraticMinimizer solveTime 3894.068 Iters 196882
          14/11/19 17:46:54 INFO ALS: usersOrProducts 1510 slowConvergence 0 QuadraticMinimizer solveTime 3875.915 Iters 193987
          14/11/19 17:46:54 INFO ALS: usersOrProducts 1512 slowConvergence 0 QuadraticMinimizer solveTime 3939.765 Iters 192471

          NNLS convergence profile:

          14/11/19 17:46:46 INFO ALS: NNLS solveTime 252.909 iters 7381
          14/11/19 17:46:46 INFO ALS: NNLS solveTime 256.803 iters 7740
          14/11/19 17:46:46 INFO ALS: NNLS solveTime 274.352 iters 7491
          14/11/19 17:46:46 INFO ALS: NNLS solveTime 272.971 iters 7664
          14/11/19 17:46:53 INFO ALS: NNLS solveTime 1487.262 iters 60338
          14/11/19 17:46:54 INFO ALS: NNLS solveTime 1472.742 iters 61321
          14/11/19 17:46:54 INFO ALS: NNLS solveTime 1489.863 iters 62228
          14/11/19 17:46:54 INFO ALS: NNLS solveTime 1494.192 iters 60489

          ALS iteration 10

          Quadratic Minimizer convergence profile:

          14/11/19 17:48:17 INFO ALS: usersOrProducts 924 slowConvergence 0 QuadraticMinimizer solveTime 1082.056 Iters 53724
          14/11/19 17:48:17 INFO ALS: usersOrProducts 910 slowConvergence 0 QuadraticMinimizer solveTime 1180.601 Iters 50593
          14/11/19 17:48:17 INFO ALS: usersOrProducts 927 slowConvergence 0 QuadraticMinimizer solveTime 1106.131 Iters 53069
          14/11/19 17:48:17 INFO ALS: usersOrProducts 918 slowConvergence 0 QuadraticMinimizer solveTime 1108.478 Iters 50895
          14/11/19 17:48:23 INFO ALS: usersOrProducts 1510 slowConvergence 0 QuadraticMinimizer solveTime 2262.193 Iters 116818
          14/11/19 17:48:23 INFO ALS: usersOrProducts 1512 slowConvergence 0 QuadraticMinimizer solveTime 2293.64 Iters 116026
          14/11/19 17:48:23 INFO ALS: usersOrProducts 1507 slowConvergence 0 QuadraticMinimizer solveTime 2241.491 Iters 116293
          14/11/19 17:48:23 INFO ALS: usersOrProducts 1511 slowConvergence 0 QuadraticMinimizer solveTime 2372.957 Iters 118391

          NNLS convergence profile:

          14/11/19 17:48:17 INFO ALS: NNLS solveTime 623.031 iters 21611
          14/11/19 17:48:17 INFO ALS: NNLS solveTime 553.493 iters 21732
          14/11/19 17:48:17 INFO ALS: NNLS solveTime 559.9 iters 22511
          14/11/19 17:48:17 INFO ALS: NNLS solveTime 556.654 iters 21330
          14/11/19 17:48:23 INFO ALS: NNLS solveTime 1672.582 iters 86006
          14/11/19 17:48:23 INFO ALS: NNLS solveTime 1703.221 iters 85824
          14/11/19 17:48:23 INFO ALS: NNLS solveTime 1826.252 iters 85403
          14/11/19 17:48:23 INFO ALS: NNLS solveTime 1753.859 iters 86559

          NNLS looks faster, but the algorithms are not running the same convergence criteria. I am following primal-dual convergence similar to an IPM-based solver, which I feel can be optimized further. ABSTOL is currently at 1e-8, which gives MOSEK-like quality for the well-conditioned Gram matrices that show up in ALS.
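
          For reference, the standard ADMM stopping test that ABSTOL and RELTOL feed into (Boyd et al.; assuming that is the criterion meant here) checks the primal residual r and the dual residual s:

          ||r^k|| = ||x^k - z^k||         <= sqrt(n)*ABSTOL + RELTOL*max(||x^k||, ||z^k||)
          ||s^k|| = rho*||z^k - z^{k-1}|| <= sqrt(n)*ABSTOL + RELTOL*||rho*u^k||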

          Sparse coding: userConstraint SMOOTH, productConstraint SPARSE, userLambda=0.065 productLambda=0.065 ElasticNet regularization

          Got 1000209 ratings from 6040 users on 3706 movies.
          Training: 800670, test: 199539.
          Quadratic minimization userConstraint SMOOTH productConstraint SPARSE
          Test RMSE = 0.886351402513496.
          Test users 6038 MAP 0.03141036089472268

          ALS iteration 1

          QuadraticMinimizer convergence profile:

          14/11/19 17:56:45 INFO ALS: usersOrProducts 910 slowConvergence 0 QuadraticMinimizer solveTime 1641.588 Iters 63077
          14/11/19 17:56:46 INFO ALS: usersOrProducts 918 slowConvergence 0 QuadraticMinimizer solveTime 1702.631 Iters 61617
          14/11/19 17:56:46 INFO ALS: usersOrProducts 924 slowConvergence 0 QuadraticMinimizer solveTime 1802.781 Iters 66567
          14/11/19 17:56:46 INFO ALS: usersOrProducts 927 slowConvergence 0 QuadraticMinimizer solveTime 1911.827 Iters 65827
          14/11/19 17:56:48 INFO ALS: usersOrProducts 1507 slowConvergence 0 QuadraticMinimizer solveTime 456.11 Iters 0
          14/11/19 17:56:48 INFO ALS: usersOrProducts 1510 slowConvergence 0 QuadraticMinimizer solveTime 404.214 Iters 0
          14/11/19 17:56:48 INFO ALS: usersOrProducts 1511 slowConvergence 0 QuadraticMinimizer solveTime 459.732 Iters 0
          14/11/19 17:56:49 INFO ALS: usersOrProducts 1512 slowConvergence 0 QuadraticMinimizer solveTime 409.18 Iters 0

          ALS iteration 10:

          QuadraticMinimizer convergence profile

          14/11/19 17:57:47 INFO ALS: usersOrProducts 910 slowConvergence 0 QuadraticMinimizer solveTime 2135.688 Iters 125344
          14/11/19 17:57:47 INFO ALS: usersOrProducts 918 slowConvergence 0 QuadraticMinimizer solveTime 2240.827 Iters 125399
          14/11/19 17:57:47 INFO ALS: usersOrProducts 924 slowConvergence 0 QuadraticMinimizer solveTime 2202.971 Iters 126542
          14/11/19 17:57:48 INFO ALS: usersOrProducts 927 slowConvergence 0 QuadraticMinimizer solveTime 2220.846 Iters 129391
          14/11/19 17:57:50 INFO ALS: usersOrProducts 1510 slowConvergence 0 QuadraticMinimizer solveTime 358.622 Iters 0
          14/11/19 17:57:50 INFO ALS: usersOrProducts 1511 slowConvergence 0 QuadraticMinimizer solveTime 352.825 Iters 0
          14/11/19 17:57:50 INFO ALS: usersOrProducts 1507 slowConvergence 0 QuadraticMinimizer solveTime 350.971 Iters 0
          14/11/19 17:57:50 INFO ALS: usersOrProducts 1512 slowConvergence 0 QuadraticMinimizer solveTime 298.721 Iters 0

          I will run the Netflix dataset just to see if sparse coding/POSITIVITY can improve MAP, but on MovieLens 1M it looks like the winner is the default formulation.

          QuadraticMinimizer supports a lot more features than what we showed for recommendation datasets. The next set of experiments will focus on the LDA datasets from https://issues.apache.org/jira/browse/SPARK-1405 for LSA comparisons.

          Debasish Das added a comment

          Actually, on the MovieLens dataset I am getting good MAP numbers with the EQUALITY constraint. The formulation is similar to PLSA but not exact:

          Avanesov Valeriy could you please help review whether my understanding is correct here?

          Here k \in {1...25} (if we run with rank 25):

          Minimize \sum_i \sum_j (r_ij - w_i*h_j) + lambda*(||w_i||^2 + ||h_j||^2)
          s.t. \sum_k w_ik = 1, w_ik >= 0
               \sum_k h_kj = 1, h_kj >= 0

          This is not quite the stochastic matrix factorization that this paper http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf talks about, as PLSA needs the following constraint (I am still reading it) along with a log-likelihood loss:

          For each k \sum_j h_kj = 1

          On the MovieLens dataset I run the EQUALITY version as follows (rank=50, 5 iterations); more iterations do not improve it further.

          ./bin/spark-submit --total-executor-cores 4 --executor-memory 4g --driver-memory 1g --master spark://TUSCA09LMLVT00C.local:7077 --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --class org.apache.spark.examples.mllib.MovieLensALS ./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --rank 50 --numIterations 5 --userConstraint EQUALITY --lambdaUser 0.065 --productConstraint EQUALITY --lambdaProduct 0.065 --kryo --validateRecommendation hdfs://localhost:8020/sandbox/movielens/

          Got 1000209 ratings from 6040 users on 3706 movies.
          Training: 800670, test: 199539.
          Quadratic minimization userConstraint EQUALITY productConstraint EQUALITY
          Test RMSE = 1.6970509086529808.
          Test users 6038 MAP 0.09333309533803603

          So basically the best MAP results come from this formulation: a 2X improvement over the default's 4.8%.

          Xiangrui Meng Sean Owen it would be great if you could review the MAP calculation in https://issues.apache.org/jira/browse/SPARK-4231 and help merge it to mllib. I am keen to understand whether there are bugs in the calculation.

          This is a bit surprising to me since I have not finished the PLSA code (I am working on the bi-concave cost the paper points out), which means results can improve further. Note the degradation in RMSE.

          I will do runs with the Netflix dataset, but on our internal dataset (2M x 20K) the trends look similar.

          Valeriy Avanesov added a comment (edited)

          I'm not sure if I understand your question...

          As far as I can see, w_i stands for a row of the matrix w and h_j stands for a column of the matrix h.

          \sum_i \sum_j (r_ij - w_i*h_j) is not a matrix norm. Probably you are missing either an abs or a square: \sum_i \sum_j |r_ij - w_i*h_j| or \sum_i \sum_j (r_ij - w_i*h_j)^2.

          It looks like l2-regularized stochastic matrix decomposition with respect to the Frobenius (or l1) norm. But I don't understand why you consider k optimization problems (do you? What does k \in {1 ... 25} stand for?).

          Anyway, the l2-regularized stochastic matrix decomposition problem is defined as follows:

          Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
          under non-negativeness and normalization constraints.

          ||.|| stands for Frobenius norm (or l1).

          By the way: is the matrix of ranks r stochastic? Stochastic matrix decomposition doesn't seem reasonable if it's not.

          Debasish Das added a comment

          I meant \sum_j (r_ij - w_i*h_j)^2. What's the normalization constraint? Each row of W should sum up to 1 and each column of H should sum up to 1, with positivity? That is similar to PLSA, right, except that PLSA will have a bi-concave loss?

          Debasish Das added a comment

          Xiangrui Meng as per our discussion, QuadraticMinimizer and NNLS are both added to Breeze and updated to use Breeze DenseMatrix and DenseVector. Inside Breeze I did some interesting comparisons, and that motivated me to port NNLS to Breeze as well. I also added all the test cases for QuadraticMinimizer and NNLS based on my experiments with the MovieLens dataset.

          Here is the PR: https://github.com/scalanlp/breeze/pull/321

          To run the quadratic programming variants in Breeze:
          runMain breeze.optimize.quadratic.QuadraticMinimizer 100 1 0.1 0.99
          Here regParam = 0.1 and beta = 0.99 is the elastic-net parameter.

          It will randomly generate quadratic problems with 100 variables, 1 equality constraint and lower/upper bounds. This format is similar to the PDCO QP generator (please look at my Matlab examples):

          0.5x'Hx + c'x
          s.t Ax = B,
          lb <= x <= ub

          1. Unconstrained minimization: breeze luSolve, cg and qp(dposv added to breeze through this PR)

          Minimize 0.5x'Hx + c'x

          qp - lu norm 4.312577233496585E-10 max-norm 1.3842793578078272E-10
          cg - lu norm 4.167925029822007E-7 max-norm 1.0053204402282745E-7
          dim 100 lu 86.007 qp 41.56 cg 102.627
          qp - lu norm 4.267891623199082E-8 max-norm 6.681460718027665E-9
          cg - lu norm 1.94497623480055E-7 max-norm 2.6288773824489908E-8
          dim 500 lu 169.993 qp 78.001 cg 443.044

          qp is faster than cg for smaller dimensions as expected. I also tried unconstrained BFGS but the results were not good. We are looking into it.

          2. Elastic Net formulation: 0.5 x'Hx + c'x + (1-beta)*L2 + beta*regParam*L1

          beta = 0.99 Strong L1 regParam=0.1

          owlqn - sparseqp norm 0.1653200701235298 inf-norm 0.051855911945906996
          sparseQp 61.948 ms iters 227 owlqn 928.11 ms

          beta = 0.5 average L1 regParam=0.1

          owlqn - sparseqp norm 0.15823773098501168 inf-norm 0.035153837685728107
          sparseQp 69.934 ms iters 353 owlqn 882.104 ms

          beta = 0.01 mostly BFGS regParam=0.1

          owlqn - sparseqp norm 0.17950035092790165 inf-norm 0.04718697692014828
          sparseQp 80.411 ms iters 580 owlqn 988.313 ms

          The ADMM-based proximal formulation is faster for smaller dimensions. Even as I scale the dimension, I notice similar behavior: OWLQN takes longer to converge and the results are not the same. Look for example at the dim = 500 case:

          owlqn - sparseqp norm 10.946326189397649 inf-norm 1.412726586317294
          sparseQp 830.593 ms iters 2417 owlqn 19848.932 ms

          I validated ADMM through Matlab scripts so there is something funky going on in OWLQN.

          3. NNLS formulation: 0.5 x'Hx + c'x s.t x >= 0

          Here I compare the ADMM-based proximal formulation with the CG-based projected gradient in NNLS. NNLS converges much more nicely, but its convergence criterion does not look the same as the Breeze CG one, although they should be the same.

          For now I ported it to Breeze, and we can call NNLS for x >= 0 and QuadraticMinimizer for the other formulations.

          dim = 100 posQp 16.367 ms iters 284 nnls 8.854 ms iters 107
          dim = 500 posQp 303.184 ms iters 950 nnls 183.543 ms iters 517

          NNLS on average looks 2X faster !
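
          For intuition, a minimal projected-gradient sketch for the x >= 0 problem (a simplification; the actual NNLS implementation uses CG directions with projection and restarts, and the step size below is a crude safe choice, not the tuned one):

          import breeze.linalg.{DenseMatrix, DenseVector, trace}

          // min 0.5*x'Hx + c'x s.t. x >= 0 via projected gradient descent.
          def projectedGradientNNLS(h: DenseMatrix[Double], c: DenseVector[Double],
                                    iters: Int): DenseVector[Double] = {
            var x = DenseVector.zeros[Double](c.length)
            // Any step below 2/lambda_max(H) descends for PSD H; 1/trace(H) is a safe bound.
            val step = 1.0 / trace(h)
            for (_ <- 0 until iters) {
              val grad = h * x + c
              x = (x - grad * step).map(math.max(_, 0.0))
            }
            x
          }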

          4. Bounds formulation: 0.5x'Hx + c'x s.t lb <= x <= ub
          Validated through Matlab scripts above. Here are the runtime numbers:

          dim = 100 boundsQp 15.654 ms iters 284 converged true
          dim= 500 boundsQp 311.613 ms iters 950 converged true

          5. Equality and positivity: 0.5 x'Hx + c'x s.t \sum_i x_i = 1, x_i >=0
          Validated through Matlab scripts above. Here are the runtime numbers:

          dim = 100 Qp Equality 13.64 ms iters 184 converged true
          dim = 500 Qp Equality 278.525 ms iters 890 converged true

          With this change all copyrights are moved to Breeze. Once it merges, I will update the Spark PR. With this change we can also move the ALS code to Breeze DenseMatrix and DenseVector.

          My next focus will be to get a truncated Newton solver running for convex costs, since convex costs are required for the PLSA, SVM and neural net formulations.

          I am still puzzled why BFGS/OWLQN is not working well for the unconstrained case / L1 optimization. If TRON works well for the unconstrained case, that's what I will use for NonlinearMinimizer. I am looking into it more.

          Hide
          debasish83 Debasish Das added a comment -

          Xiangrui Meng as per our discussion, QuadraticMinimizer and NNLS are both added to Breeze and updated to use Breeze DenseMatrix and DenseVector. Inside Breeze I did some interesting comparisons, and that motivated me to port NNLS to Breeze as well. I added all the test cases for QuadraticMinimizer and NNLS based on my experiments with the MovieLens dataset.

          Here is the PR: https://github.com/scalanlp/breeze/pull/321

          To run the quadratic programming variants in Breeze:

          runMain breeze.optimize.quadratic.QuadraticMinimizer 100 1 0.1 0.99

          regParam = 0.1, beta = 0.99 is the elastic net parameter. It will randomly generate quadratic problems with 100 variables, 1 equality constraint and lower/upper bounds. This format is similar to the PDCO QP generator (please look into my Matlab examples):

          0.5x'Hx + c'x s.t. Ax = B, lb <= x <= ub

          1. Unconstrained minimization: Breeze luSolve, cg and qp (dposv added to Breeze through this PR). Minimize 0.5x'Hx + c'x

          dim = 100
          qp - lu norm 4.312577233496585E-10 max-norm 1.3842793578078272E-10
          cg - lu norm 4.167925029822007E-7 max-norm 1.0053204402282745E-7
          lu 86.007 qp 41.56 cg 102.627

          dim = 500
          qp - lu norm 4.267891623199082E-8 max-norm 6.681460718027665E-9
          cg - lu norm 1.94497623480055E-7 max-norm 2.6288773824489908E-8
          lu 169.993 qp 78.001 cg 443.044

          qp is faster than cg for smaller dimensions, as expected. I also tried unconstrained BFGS but the results were not good. We are looking into it.

          2. Elastic net formulation: 0.5x'Hx + c'x + (1-beta)*L2 + beta*regParam*L1

          beta = 0.99 (strong L1), regParam = 0.1
          owlqn - sparseqp norm 0.1653200701235298 inf-norm 0.051855911945906996
          sparseQp 61.948 ms iters 227, owlqn 928.11 ms

          beta = 0.5 (average L1), regParam = 0.1
          owlqn - sparseqp norm 0.15823773098501168 inf-norm 0.035153837685728107
          sparseQp 69.934 ms iters 353, owlqn 882.104 ms

          beta = 0.01 (mostly BFGS), regParam = 0.1
          owlqn - sparseqp norm 0.17950035092790165 inf-norm 0.04718697692014828
          sparseQp 80.411 ms iters 580, owlqn 988.313 ms

          The ADMM-based proximal formulation is faster for smaller dimensions. Even as I scale the dimension, I see the same behavior: OWLQN takes longer to converge and the results are not the same. Look for example at the dim = 500 case:

          owlqn - sparseqp norm 10.946326189397649 inf-norm 1.412726586317294
          sparseQp 830.593 ms iters 2417, owlqn 19848.932 ms

          I validated ADMM through the Matlab scripts, so there is something funky going on in OWLQN.

          3. NNLS formulation: 0.5x'Hx + c'x s.t. x >= 0

          Here I compared the ADMM-based proximal formulation with the CG-based projected gradient in NNLS. NNLS converges much more nicely, but its convergence criterion does not look the same as Breeze CG, although they should be the same. For now I ported it to Breeze, and we can call NNLS for x >= 0 and QuadraticMinimizer for the other formulations.

          dim = 100: posQp 16.367 ms iters 284, nnls 8.854 ms iters 107
          dim = 500: posQp 303.184 ms iters 950, nnls 183.543 ms iters 517

          NNLS looks about 2X faster on average!

          4. Bounds formulation: 0.5x'Hx + c'x s.t. lb <= x <= ub

          Validated through the Matlab scripts above. Here are the runtime numbers:

          dim = 100: boundsQp 15.654 ms iters 284 converged true
          dim = 500: boundsQp 311.613 ms iters 950 converged true

          5. Equality and positivity: 0.5x'Hx + c'x s.t. \sum_i x_i = 1, x_i >= 0

          Validated through the Matlab scripts above. Here are the runtime numbers:

          dim = 100: Qp Equality 13.64 ms iters 184 converged true
          dim = 500: Qp Equality 278.525 ms iters 890 converged true

          With this change all copyrights are moved to Breeze. Once it merges, I will update the Spark PR. With this change we can also move the ALS code to Breeze DenseMatrix and DenseVector. My focus next will be to get a truncated Newton method running for convex cost, since a convex cost is required for the PLSA, SVM and neural net formulations. I am still puzzled as to why BFGS/OWLQN is not working well for the unconstrained case / L1 optimization. If TRON works well for the unconstrained case, that's what I will use for NonlinearMinimizer. I am looking more into it.
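          To make formulation 1 concrete: the unconstrained minimum of 0.5x'Hx + c'x satisfies the linear system Hx = -c, which is what the lu/qp/cg variants above are solving. Below is a minimal Breeze sketch (not code from the PR) that builds a random positive-definite QP of the kind described and recovers the minimizer with a direct solve; the object name UnconstrainedQpSketch and the 1e-3 ridge term are illustrative assumptions.

            import breeze.linalg.{DenseMatrix, DenseVector, norm}

            object UnconstrainedQpSketch {
              def main(args: Array[String]): Unit = {
                val n = 100
                // Random gram matrix H = A'A + 1e-3 * I (symmetric positive definite)
                // and a random linear term c, mimicking the random QP generator above.
                val a = DenseMatrix.rand(n, n)
                val h = a.t * a + DenseMatrix.eye[Double](n) * 1e-3
                val c = DenseVector.rand(n)

                // The unconstrained minimum of 0.5 x'Hx + c'x satisfies Hx = -c;
                // Breeze's \ does an LU-based direct solve, matching the "lu" baseline above.
                val x = h \ (-c)

                // The gradient Hx + c should be close to zero at the minimizer.
                println(s"gradient norm = ${norm(h * x + c)}")
              }
            }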
          Hide
          acopich Valeriy Avanesov added a comment -

          > What's the normalization constraint? Each row of W should sum up to 1 and each column of H should sum up to 1, with positivity?
          Yes.

          > That is similar to PLSA, right, except that PLSA will have a bi-concave loss...
          There's a completely different loss... BTW, we've used a factorisation with the loss you've described as an initial approximation for PLSA. It gave a significant speed-up.

          Hide
          debasish83 Debasish Das added a comment -

          Valeriy Avanesov I got good MAP results on recommendation datasets with the approximated PLSA formulation. I have not had time to compare that formulation with the Gibbs-sampling-based LDA PR https://issues.apache.org/jira/browse/SPARK-1405 yet. Did you compare them?

          Hide
          debasish83 Debasish Das added a comment -

          Xiangrui Meng Shuo Xiang David is out in February and I am not sure whether we can cut a Breeze release with this code. I refactored NNLS into breeze.optimize.linear due to its similarity to the CG core. The Proximal algorithms and QuadraticMinimizer are refactored into breeze.optimize.proximal. It would be great if you could also review the PR https://github.com/scalanlp/breeze/pull/321.

          With this solver added to Breeze I am ready to add the ALS modifications to Spark. The test cases for default ALS and NNLS run fine with my Spark PR. I need to add appropriate test cases for sparse coding and for the least-squares loss with LSA constraints, as explained above.

          Should I add them to ml.als or mllib.als, since we now have two codebases? My current PR will merge fine with mllib.als but not with ml.als. I see there is a CholeskySolver, but all of those features are supported in breeze.optimize.proximal.QuadraticMinimizer.
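          The constrained formulations above all reduce to proximal/projection steps: the bounds formulation needs a box projection, and the equality-plus-positivity formulation needs a projection onto the probability simplex. Below is a minimal standalone sketch of those two projections; these helpers are illustrative textbook implementations and are not the breeze.optimize.proximal API.

            import breeze.linalg.DenseVector

            object ProjectionSketch {
              // Coordinate-wise projection onto the box [lb, ub], as used in the bounds formulation.
              def projectBox(x: DenseVector[Double], lb: Double, ub: Double): DenseVector[Double] =
                x.map(v => math.min(ub, math.max(lb, v)))

              // Euclidean projection onto the probability simplex {x >= 0, sum(x) = 1},
              // via the standard sort-and-threshold algorithm; as used in equality + positivity.
              def projectSimplex(x: DenseVector[Double]): DenseVector[Double] = {
                val u = x.toArray.sorted(Ordering[Double].reverse)
                val cssv = u.scanLeft(0.0)(_ + _).tail              // cumulative sums of the sorted values
                val rho = (0 until u.length).filter(j => u(j) - (cssv(j) - 1.0) / (j + 1) > 0).last
                val theta = (cssv(rho) - 1.0) / (rho + 1)
                x.map(v => math.max(v - theta, 0.0))
              }

              def main(args: Array[String]): Unit = {
                val x = DenseVector(0.3, -0.2, 1.4, 0.8)
                println(projectBox(x, 0.0, 1.0))   // clipped into [0, 1]
                println(projectSimplex(x))         // nonnegative and sums to 1
              }
            }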

          Hide
          debasish83 Debasish Das added a comment -

          Xiangrui Meng NNLS and QuadraticMinimizer are both merged into Breeze... I will migrate ml.recommendation.ALS accordingly...

          Hide
          apachespark Apache Spark added a comment -

          User 'debasish83' has created a pull request for this issue:
          https://github.com/apache/spark/pull/5005

          Hide
          debasish83 Debasish Das added a comment -

          Valeriy Avanesov "There's a completely different loss... BTW, we've used a factorisation with the loss you've described as an initial approximation for PLSA. It gave a significant speed-up." Could you help adding some testcases and driver for the PLSA approximation ? the PR https://github.com/apache/spark/pull/3221 has now the LSA constraints and least square loss...

          Idea here is to do probability simplex on user side, bounds on the item side and normalization on item columns at each ALS iteration...The MAP loss is tracked through https://issues.apache.org/jira/browse/SPARK-6323 but the solve idea will be very similar as I mentioned before and so we can re-use the flow test-cases...We can discuss more on the PR...It will be great if you can help add examples.mllib.PLSA as well that will driver both PLSA through ALS and ALM (alternating MAP loss optimization)...
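          For the per-iteration normalization of the item columns mentioned above, here is a minimal Breeze sketch of one possible implementation: clip the item factors to bounds, then rescale each item column to sum to 1 after an ALS sweep. The function name, the [0, ub] clipping and the clip-then-rescale order are illustrative assumptions, not the PR's actual implementation.

            import breeze.linalg.{DenseMatrix, sum}

            object ItemColumnNormalizationSketch {
              // itemFactors is rank x numItems, one column per item.
              // Clip entries into [0, ub], then rescale every column to sum to 1.
              def normalizeItemColumns(itemFactors: DenseMatrix[Double], ub: Double = 1.0): DenseMatrix[Double] = {
                val clipped = itemFactors.map(v => math.min(ub, math.max(0.0, v)))
                for (j <- 0 until clipped.cols) {
                  val colSum = sum(clipped(::, j))
                  // The column slice is a view, so dividing in place updates `clipped` directly.
                  if (colSum > 0.0) clipped(::, j) :/= colSum
                }
                clipped
              }
            }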

          Hide
          debasish83 Debasish Das added a comment - - edited

          Valeriy Avanesov From your earlier comment "Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows
          Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
          under non-negativeness and normalization constraints.",
          could you please point me to a good reference with applications to collaborative filtering/topic modeling? Stochastic matrix decomposition is what we can do in this PR now: https://github.com/apache/spark/pull/3221. Isn't there a log term that multiplies with R to make it a KL-divergence loss? Maybe the log term can be removed under the non-negativity and normalization constraints? @mengxr any ideas here? If we can do that, we can target the KL-divergence loss from Lee's paper: http://hebb.mit.edu/people/seung/papers/ls-lponm-99.pdf (both objectives are written out below).

          For the MAP loss, I will open a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323. I am very curious how much slower we get compared to stochastic matrix decomposition using ALS. The MAP loss looks like a strong contender to LDA and can natively handle counts (it does not need regression-style datasets, which are difficult to get in practical setups where people normally don't give any rating and satisfaction has to be inferred from viewing time, etc.).
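          For reference, the formulation quoted above, together with the normalization constraints discussed earlier in this thread (rows of W and columns of H on the probability simplex), and the KL-divergence loss from Lee's paper can be written out as follows; using squared Frobenius norms is an assumption about which norm ||.|| denotes:

            min_{W,H}   ||R - W*H||_F^2 + \lambda ( ||W||_F^2 + ||H||_F^2 )
            s.t.        W >= 0, H >= 0, \sum_k W_{ik} = 1 for every row i, \sum_k H_{kj} = 1 for every column j

            D(R || W*H) = \sum_{ij} ( R_{ij} log( R_{ij} / (W*H)_{ij} ) - R_{ij} + (W*H)_{ij} )

          The log term in D(R || W*H) is the one referenced in the comment above; keeping it gives the KL-divergence objective from Lee's paper rather than the least-squares loss.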

          Hide
          debasish83 Debasish Das added a comment -

          Xiangrui Meng Should I add the PR to Spark Packages and close the JIRA? The main contribution was to add sparsity constraints (L1 and probability simplex) on the user and product factors in implicit and explicit feedback factorization, and interested users can use the features from Spark Packages if they need them... Later, if there is community interest, we can pull it into the master ALS?


            People

            • Assignee: debasish83 Debasish Das
            • Reporter: debasish83 Debasish Das
            • Votes: 0
            • Watchers: 12

                Time Tracking

                Original Estimate: 504h
                Remaining Estimate: 504h
                Time Spent: Not Specified
