Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.5
    • Labels: None

      Description

      As Mahout currently lacks a distributed collaborative filtering algorithm that uses matrix factorization, I spent some time reading through a couple of the Netflix papers and stumbled upon "Large-scale Parallel Collaborative Filtering for the Netflix Prize", available at http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf.

      It describes a parallel algorithm that uses "Alternating-Least-Squares with Weighted-λ-Regularization" to factorize the preference-matrix and gives some insights on how the authors distributed the computation using Matlab.
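
      For reference, here is the objective that ALS-WR minimizes, reconstructed in LaTeX from the paper (n_{u_i} and n_{m_j} denote the number of ratings of user i and item j, E is the identity matrix, and I_i is the set of items rated by user i):

          f(U, M) = \sum_{(i,j) \in I} \left( r_{ij} - \mathbf{u}_i^{\top} \mathbf{m}_j \right)^2
                    + \lambda \left( \sum_i n_{u_i} \|\mathbf{u}_i\|^2 + \sum_j n_{m_j} \|\mathbf{m}_j\|^2 \right)

          % with M fixed, each user vector solves a regularized least-squares problem:
          \mathbf{u}_i = \left( M_{I_i} M_{I_i}^{\top} + \lambda \, n_{u_i} E \right)^{-1} M_{I_i} R^{\top}(i, I_i)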

      It seemed to me that this approach could also easily be parallelized using Map/Reduce, so I sat down and created a prototype version. I'm not really sure I got the mathematical details correct (they need some optimization anyway), but I want to put up my prototype implementation here per Yonik's law of patches.

      Maybe someone has the time and motivation to work on this with me a little. It would be great if someone could validate the approach taken (I'm willing to help, as the code might not be intuitive to read), try to factorize some test data, and give feedback.

      1. logs.zip
        2.60 MB
        Danny Bickson
      2. MAHOUT-452.patch
        45 kB
        Sebastian Schelter
      3. MAHOUT-542-2.patch
        38 kB
        Sebastian Schelter
      4. MAHOUT-542-3.patch
        52 kB
        Sebastian Schelter
      5. MAHOUT-542-4.patch
        8 kB
        Danny Bickson
      6. MAHOUT-542-5.patch
        47 kB
        Sebastian Schelter
      7. MAHOUT-542-6.patch
        87 kB
        Sebastian Schelter

        Activity

        tdunning Ted Dunning added a comment -

        Sebastian,

        Nice idea. Can you say how it relates to Elkan and Menon's paper at http://arxiv.org/abs/1006.2156?

        The core of their algorithm is a matrix factorization, but they seem to get away with very few latent factors per user and are able to handle side information very nicely.

        So it seems on the surface that these methods should be related. One interesting difference is the use of a link function and the handling of ordinal target variables in Elkan and Menon's work.

        ssc Sebastian Schelter added a comment -

        Hi Ted,

        I read through that paper a while ago when we exchanged ideas for Mahout 0.5 on the mailing list, and to be honest I didn't really get the mathematical details. Nevertheless I understood that its capabilities are far superior to what we currently have and also to the approach described in the Netflix paper mentioned above, mainly because of the ability to handle side information and nominal values, as you already mentioned. I think the paper does not describe a parallelization approach for the algorithm, though I'm not sure whether that is even necessary.

        But I had the prototype code attached in the patch ready before we had that discussion, and I hope it can be finished with only a little input from someone else, so I decided to put it up here. I'd have no problem dropping this, though, if MAHOUT-525 is done and it turns out that putting work in there would give a much nicer recommender for Mahout.

        tdunning Ted Dunning added a comment -

        Sebastian,

        I very much did not want to slow your commendable momentum. Getting a parallel version in place for comparison is very valuable.

        I won't be able to look at your code due to my current constraints, but I think your general direction is very good.

        ssc Sebastian Schelter added a comment -

        An updated version of the patch. I fixed a small bug, added more tests and polished the code a little.

        The distributed matrix factorization works fine now on a toy example. The next steps will be to use real data and do some holdout tests.

        ssc Sebastian Schelter added a comment -

        Thanks for the input so far, Ted and Dimitriy.

        Here is an updated patch that does not yet address the issue of automatically learning lambda, but provides some simple tools to evaluate the prediction quality of the factorization manually.

        I ran some local tests against the MovieLens 1M dataset on my notebook:

        # download and convert the MovieLens 1M dataset to Mahout's common format for ratings
        cat /path/to/ratings.dat | sed -e 's/::/,/g' | cut -d, -f1,2,3 > /path/to/ratings.csv
        
        # create a 90% training set and a 10% probe set
        bin/mahout splitDataset --input /path/to/ratings.csv --output /tmp/dataset --trainingPercentage 0.9 --probePercentage 0.1
        
        # run distributed ALS-WR to factorize the rating matrix based on the training set 
        bin/mahout parallelALS --input /tmp/dataset/trainingSet/ --output /tmp/als/out --tempDir /tmp/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065
        
        # measure the error of the predictions against the probe set
        bin/mahout evaluateALS --probes /tmp/dataset/probeSet/ --userFeatures /tmp/als/out/U/ --itemFeatures /tmp/als/out/M/
        

        A test run gave an RMSE of 0.8564062387241173 and an MAE of 0.6791075767551951.

        Unfortunately I don't have a cluster available at the moment to test this, so I couldn't use the Netflix dataset.

        I still don't see how to automatically learn lambda without running lots of subsequent M/R jobs...

        tdunning Ted Dunning added a comment -

        I still don't see how to automatically learn lambda without running lots of subsequent M/R jobs...

        Can you compute factorizations for multiple values of lambda in one go and then evaluate all of them in one pass?

        This would require that parallelALS accept a list of lambdas and produce multiple outputs.
        It would also require that evaluateALS accept multiple models.

        It should take about the same amount of time to test against lots of models as it does to test against a single model. The distributed ALS might exhibit the same properties.
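
        To illustrate the one-pass evaluation idea, here is a minimal non-distributed sketch in Java (the names and plain-array layout are hypothetical; it assumes all factorizations fit in memory and that user and item IDs index directly into the feature arrays):

        /** Sketch: score several factorizations during a single pass over the probe set. */
        public class MultiModelEvaluation {

          /** probes are {userID, itemID, rating} triples; userFeatures[m][u] and
              itemFeatures[m][i] are the feature vectors of model m. */
          public static double[] rmsePerModel(double[][] probes,
                                              double[][][] userFeatures,
                                              double[][][] itemFeatures) {
            int numModels = userFeatures.length;
            double[] squaredErrorSum = new double[numModels];
            for (double[] probe : probes) {
              int user = (int) probe[0];
              int item = (int) probe[1];
              double rating = probe[2];
              for (int m = 0; m < numModels; m++) {
                double estimate = dot(userFeatures[m][user], itemFeatures[m][item]);
                double error = rating - estimate;
                squaredErrorSum[m] += error * error; // accumulate per model, one pass over the data
              }
            }
            double[] rmse = new double[numModels];
            for (int m = 0; m < numModels; m++) {
              rmse[m] = Math.sqrt(squaredErrorSum[m] / probes.length);
            }
            return rmse;
          }

          private static double dot(double[] a, double[] b) {
            double sum = 0.0;
            for (int k = 0; k < a.length; k++) {
              sum += a[k] * b[k];
            }
            return sum;
          }
        }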

        ssc Sebastian Schelter added a comment -

        Can you compute factorizations for multiple values of lambda in one go and then evaluate all of them in one pass?

        This would require that parallelALS accept a list of lambdas and produce multiple outputs
        It would also require that evaluateALS accept multiple models as well.

        I don't think this could work with the distributed implementation, as each iteration needs to see the feature vectors that were computed in the previous iteration with the same lambda value, so I don't see how to concurrently compute the values for several lambdas. Also, my evaluation code currently depends on the feature matrices fitting into memory, which might not always be the case and could be another bottleneck.

        In a non-distributed implementation this approach could work, however. Would it help to pick a sample from the input data, find a near-optimal lambda on that sample in a non-distributed way, and use that lambda for the distributed computation too? I'm not sure about this.

        Another issue I'm currently stuck on is how to compute the recommendations after the factorization. The rating matrix is factorized into the feature matrices U and M for users and items, and a single rating can easily be predicted by multiplying the feature vectors of the user and the item. But if we want to compute batch recommendations for all users, we need to find an intelligent way to select "candidate items" to recommend for each user, as we can't simply compute t(U) * M because those are dense matrices.

        Our non-distributed SVDRecommender uses simple cooccurrence to identify those candidate items, so one approach would be to use RowSimilarityJob to find cooccurring items and, for each user, only compute ratings for items that cooccur with preferred ones. I'm not sure whether this is the best way to do it either; a rough sketch of the idea follows.
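
        The sketch below is only an illustration of that candidate-selection idea (all names are hypothetical; the cooccurrence data would come from something like RowSimilarityJob):

        import java.util.HashMap;
        import java.util.HashSet;
        import java.util.Map;
        import java.util.Set;

        /** Sketch: score only cooccurring candidate items instead of computing t(U) * M. */
        public class CandidateItemScoring {

          public static Map<Long, Double> scoreCandidates(Set<Long> preferredItems,
                                                          Map<Long, Set<Long>> cooccurrences,
                                                          double[] userVector,
                                                          Map<Long, double[]> itemFeatures) {
            // candidate items: items that cooccur with the user's preferred items
            Set<Long> candidates = new HashSet<Long>();
            for (long itemID : preferredItems) {
              Set<Long> related = cooccurrences.get(itemID);
              if (related != null) {
                candidates.addAll(related);
              }
            }
            candidates.removeAll(preferredItems); // don't recommend already rated items

            // estimate a preference for each candidate via the dot product
            Map<Long, Double> scores = new HashMap<Long, Double>();
            for (long itemID : candidates) {
              double[] itemVector = itemFeatures.get(itemID);
              double estimate = 0.0;
              for (int k = 0; k < userVector.length; k++) {
                estimate += userVector[k] * itemVector[k];
              }
              scores.put(itemID, estimate);
            }
            return scores;
          }
        }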

        srowen Sean Owen added a comment -

        Sebastian, are you done with this? I know you added ALSWRFactorizer, for example. Is that what this issue covered?

        ssc Sebastian Schelter added a comment -

        No, unfortunately there's still a lot of open work here. ALSWRFactorizer is just the non-distributed implementation of this algorithm.

        In this issue I'm aiming for a distributed implementation. The actual matrix factorization part is working, but there are some open problems:

        • the factorization needs a regularization parameter called lambda, which heavily influences the quality of the result. I don't see how to automatically find a near-optimal lambda (which would be a key requirement for making this algorithm reasonably easy to use). I have some code in the works that can find a near-optimal lambda in a non-distributed way, but I'm not sure whether my approach is mathematically correct; I will put it up for review when I'm done.
        • Once we have the factorization we can easily estimate single preferences by computing the dot product of the user and item vectors from the factorization. However, if this job is to produce recommendations for all users, we cannot naively multiply the transpose of the user-features matrix with the item-features matrix to estimate all possible preferences, as these are dense matrices. We need to find a way to isolate a few candidate items per user, maybe by utilizing item cooccurrence. I'm not sure what the best approach is here either, as this problem is not covered in the paper.
        lancenorskog Lance Norskog added a comment -

        the factorization needs a regularization parameter called lambda, which heavily influences the quality of the result.

        If you distribute the matrix portions and run the lambda calculator in each job separately, how closely do they match? If they correlate well, would it work to have a bunch of similar lambdas?

        danny.bickson Danny Bickson added a comment - edited

        Hi!
        I tried to apply patch MAHOUT-542-3 against a clean svn checkout of the Mahout trunk.
        It seems that the file AlternateLeastSquaresSolver.java had already been committed to svn, so the code got duplicated.
        I tried to apply the changes manually and got several compilation errors. I attach a patch I created (named unifiedpatch)
        after fixing the compilation errors.

        The code compiles, but fails some tests.

        -------------------------------------------------------
        T E S T S
        -------------------------------------------------------
        Running org.apache.mahout.math.jet.random.NormalTest
        Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.437 sec
        Running org.apache.mahout.math.TestMatrixView
        Tests run: 49, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.283 sec
        Running org.apache.mahout.math.TestDenseVector
        Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.134 sec
        Running org.apache.mahout.math.TestSparseMatrix
        Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.315 sec
        Running org.apache.mahout.math.jet.stat.ProbabilityTest
        Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.162 sec
        Running org.apache.mahout.math.VectorListTest
        Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.291 sec
        Running org.apache.mahout.math.stats.LogLikelihoodTest
        Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.11 sec
        Running org.apache.mahout.math.jet.random.engine.MersenneTwisterTest
        Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.049 sec
        Running org.apache.mahout.math.decomposer.hebbian.TestHebbianSolver
        Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.071 sec
        Running org.apache.mahout.math.TestDenseMatrix
        Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.054 sec
        Running org.apache.mahout.math.jet.random.ExponentialTest
        Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.707 sec
        Running org.apache.mahout.math.jet.stat.GammaTest
        Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.193 sec
        Running org.apache.mahout.math.stats.OnlineSummarizerTest
        Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.06 sec
        Running org.apache.mahout.math.VectorTest
        Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.048 sec
        Running org.apache.mahout.common.RandomUtilsTest
        Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.022 sec
        Running org.apache.mahout.math.TestSparseColumnMatrix
        Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.104 sec
        Running org.apache.mahout.math.TestSequentialAccessSparseVector
        Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.05 sec
        Running org.apache.mahout.math.QRDecompositionTest
        Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.045 sec
        Running org.apache.mahout.math.TestSparseRowMatrix
        Tests run: 58, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.09 sec
        Running org.apache.mahout.math.TestRandomAccessSparseVector
        Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.036 sec
        Running org.apache.mahout.math.TestOrderedIntDoubleMapping
        Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
        Running org.apache.mahout.math.als.AlternateLeastSquaresSolverTest
        Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec <<< FAILURE!
        Running org.apache.mahout.math.jet.random.NegativeBinomialTest
        Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
        Running org.apache.mahout.math.TestVectorView
        Tests run: 36, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.054 sec
        Running org.apache.mahout.math.jet.random.GammaTest
        Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.72 sec
        Running org.apache.mahout.math.TestSingularValueDecomposition
        Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 sec
        Running org.apache.mahout.math.decomposer.lanczos.TestLanczosSolver
        Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 7.875 sec

        Results :

        Failed tests:

        createRiIiMaybeTransposedExceptionOnNonSequentialVector(org.apache.mahout.math.als.AlternateLeastSquaresSolverTest)

        ubuntu@ip-10-195-226-63:/usr/local/mahout-0.4$ cat math/target/surefire-reports/org.apache.mahout.math.als.AlternateLeastSquaresSolverTest.txt
        -------------------------------------------------------------------------------
        Test set: org.apache.mahout.math.als.AlternateLeastSquaresSolverTest
        -------------------------------------------------------------------------------
        Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec <<< FAILURE!
        createRiIiMaybeTransposedExceptionOnNonSequentialVector(org.apache.mahout.math.als.AlternateLeastSquaresSolverTest) Time elapsed: 0.002 sec <<< FAILURE!
        java.lang.AssertionError:
        at org.junit.Assert.fail(Assert.java:91)
        at org.junit.Assert.fail(Assert.java:98)
        at org.apache.mahout.math.als.AlternateLeastSquaresSolverTest.createRiIiMaybeTransposedExceptionOnNonSequentialVector(AlternateLeastSquaresSolverTest.java:94)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
        at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
        at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
        at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:59)
        at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:115)
        at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:102)
        at org.apache.maven.surefire.Surefire.run(Surefire.java:180)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:350)
        at org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1021)

        Can you please take a look and let me know what to do?

        Thanks,

        Danny Bickson

        ssc Sebastian Schelter added a comment -

        Hi Danny,

        You're right, the latest patch is out of sync with the current trunk, as AlternateLeastSquaresSolver was already committed for MAHOUT-572. I'm a little busy at the moment, but I'll try to look at your patch this weekend. I hope that's OK for you.

        --sebastian

        danny.bickson Danny Bickson added a comment -

        Thanks Sebastian!

        I have access to some large clusters; once you are done I will happily help you test the current patch on Netflix or other larger datasets.

        Best,

        - Danny
        danny.bickson Danny Bickson added a comment -

        Hi everyone,
        Issue solved. I've created a new patch and tested it on the MovieLens 1M dataset. On an Amazon EC2 machine (large instance) it takes 1998 seconds.
        Accuracy: RMSE 0.8546120366924382, MAE 0.6798083002225481.

        It seems that three files were included in the patch although they were already in SVN.
        Another issue was line 94 of math.als.AlternateLeastSquaresSolverTest.java,
        which calls fail(). This broke the unit testing. Commenting it out and running the rest did not seem to affect the results.

        Please take a look into this.

        Best,

        Danny Bickson

        ssc Sebastian Schelter added a comment -

        Updated the patch to work with the current trunk. I only had to remove AlternateLeastSquaresSolver, as this class was already committed.

        By the way, the call to fail() in the unit test is valid; it's used to make a test fail when an exception that you expect to be thrown is not thrown. The pattern looks like the sketch below.
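
        For readers unfamiliar with the pattern, this is the usual JUnit 4 idiom (a generic sketch, not the literal test code; the exception type and method names here are just examples):

        import static org.junit.Assert.fail;

        import org.junit.Test;

        public class ExpectedExceptionPatternTest {

          @Test
          public void failsWhenExpectedExceptionIsNotThrown() {
            try {
              operationThatShouldThrow(); // the call under test
              fail("expected an IllegalArgumentException"); // only reached if nothing was thrown
            } catch (IllegalArgumentException e) {
              // expected: the test passes
            }
          }

          private static void operationThatShouldThrow() {
            throw new IllegalArgumentException("non-sequential vector");
          }
        }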

        It would be great if you could use the current version of this patch to run the factorization on the Netflix dataset with the parameters described in the paper. I'd love to see whether we get roughly the same results.

        danny.bickson Danny Bickson added a comment -

        Hi again,
        When I ran the tests the previous time (after removing AlternateLeastSquaresSolver) I got the test failure described in my previous post (line 94 of org.apache.mahout.math.als.AlternateLeastSquaresSolverTest): fail() was called since the exception was not thrown.

        Did you make changes in the patch to address this problem?

        Thanks!

        DB

        ssc Sebastian Schelter added a comment -

        Did you apply the patch to the current trunk, without any of the other patches attached here? I did this on my local machine, ran the tests, and everything worked fine.

        danny.bickson Danny Bickson added a comment -

        Hi,
        Everything works now with the new patch (542-5). With the MovieLens 1M data everything works fine; I have tested with one, two, and four slaves.
        With Netflix data, I get the following exception:

        2011-02-04 19:42:45,613 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201102041322_0007_r_000000_0: Error: GC overhead limit exceeded
        2011-02-04 19:42:45,614 INFO org.apache.hadoop.mapred.JobTracker: Adding task (cleanup)'attempt_201102041322_0007_r_000000_0' to tip task_201102041322_0007_r_000000, for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

        2011-02-04 19:42:48,617 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_201102041322_0007_r_000000_1' to tip task_201102041322_0007_r_000000, for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

        2011-02-04 19:42:48,618 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_201102041322_0007_r_000000_0' from 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

        2011-02-04 21:10:48,014 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201102041322_0007_r_000000_1: Error: GC overhead limit exceeded
        2011-02-04 21:10:48,030 INFO org.apache.hadoop.mapred.JobTracker: Adding task (cleanup)'attempt_201102041322_0007_r_000000_1' to tip task_201102041322_0007_r_000000, for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

        2011-02-04 21:10:54,036 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_201102041322_0007_r_000000_2' to tip task_201102041322_0007_r_000000, for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

        2011-02-04 21:10:54,036 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_201102041322_0007_r_000000_1' from 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

        2011-02-04 22:36:46,339 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201102041322_0007_r_000000_2: Error: GC overhead limit exceeded
        2011-02-04 22:36:46,339 INFO org.apache.hadoop.mapred.JobTracker: Adding task (cleanup)'attempt_201102041322_0007_r_000000_2' to tip task_201102041322_0007_r_000000, for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

        2011-02-04 22:36:49,342 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_201102041322_0007_r_000000_3' to tip task_201102041322_0007_r_000000, for tracker 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

        2011-02-04 22:36:49,355 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_201102041322_0007_r_000000_2' from 'tracker_ip-10-202-161-172.ec2.internal:localhost/127.0.0.1:49339'

        Any ideas about how to fix this?

        Thanks!!

        Danny Bickson

        ssc Sebastian Schelter added a comment -

        Can you share some more details about the machines you ran this on?

        lancenorskog Lance Norskog added a comment -

        GC overhead limit exceeded

        This means that the JVM spends 98% of its time doing garbage collection. It is usually a sign of "not quite enough memory". There are other JVM parameters around garbage collection that you can tune: change the algorithm, change specific parameters. 'jvisualvm' and 'visualgc' are free tools included with Java 1.6; you can attach to the running Map/Reduce app and watch the garbage collection and memory allocation levels. A sketch of where the relevant Hadoop settings live follows.
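
        For reference, on Hadoop clusters of that era the per-task JVM heap and garbage collector are typically set via mapred.child.java.opts in mapred-site.xml (the values below are illustrative only, not a recommendation):

        <property>
          <name>mapred.child.java.opts</name>
          <value>-Xmx2048m -XX:+UseParallelGC</value>
        </property>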

        ssc Sebastian Schelter added a comment -

        I also think it's a memory problem. When recomputing user features, the algorithm needs to look at the feature vectors of all movies the user has rated, and when recomputing movie features it has to look at the feature vectors of all users who have rated the movie. I'm not very familiar with the Netflix dataset, but I think there might be some very popular movies that have been rated by lots of users, and there might also be some "power users" who have rated lots of movies. So the memory consumption might get very high in some steps.

        What kind of EC2 instances did you run this on? Did you use small instances? In my experience those are not very helpful; maybe you could retry with large or c1.medium instances.

        danny.bickson Danny Bickson added a comment - edited

        Hi!
        I used two m2.xlarge EC2 machines.
        I attach the program output (see the attachment logs.zip above), including all the Hadoop configuration files from the two machines and all the Hadoop logs. It seems that I got the GC exception again. Let me know if you can identify a reason why I am running out of memory.

        Thanks,

        Danny Bickson

        danny.bickson Danny Bickson added a comment -

        Problem solved:
        1) I increased the heap size to 4 GB.
        2) I moved to a larger instance: m2.2xlarge.
        3) I increased the child mappers' memory to 2 GB.

        One or more of those changes fixed the memory error.

        One iteration using the full Netflix data takes around 75 minutes.
        The RMSE looks good: 1.04 after one iteration.

        - Danny
        ssc Sebastian Schelter added a comment -

        Wow, that's really awesome news. Do you plan on doing more experiments?

        danny.bickson Danny Bickson added a comment - edited

        Yes, I am doing some more experiments, because we are comparing our system, GraphLab
        (http://www.graphlab.ml.cmu.edu/), to Mahout.

        I am not sure whether this is the right forum for performance updates, but I will gladly
        update anyone interested - email me.

        ssc Sebastian Schelter added a comment -

        Attached a new version of the patch. I'd like to commit this one in the next few days, if there are no objections (and no errors found). This patch removes some parts of the code that were highly memory-intensive and hopefully enables tests with a higher number of features. It introduces a set of tools that might enable a first real-world usage of this algorithm:

        • DatasetSplitter: split a rating dataset into training and probe parts
        • ParallelALSFactorizationJob: parallel ALS-WR factorization of a rating matrix
        • PredictionJob: predict preferences using the factorization of a rating matrix
        • InMemoryFactorizationEvaluator: compute RMSE of a rating matrix factorization against probes in memory
        • ParallelFactorizationEvaluator: compute RMSE of a rating matrix factorization against probes

        There are still open points, in particular how to find a good regularization parameter automatically and efficiently, and how to create an automated recommender pipeline similar to that of RecommenderJob using these tools. But I think these issues can be tackled in the future.

        Here's how to play with the code:

        # convert the MovieLens 1M dataset to Mahout's common format for ratings
        cat /path/to/ratings.dat | sed -e 's/::/,/g' | cut -d, -f1,2,3 > /path/to/ratings.csv
        
        # create a 90% training set and a 10% probe set
        bin/mahout splitDataset --input /path/to/ratings.csv --output /tmp/dataset --trainingPercentage 0.9 --probePercentage 0.1
        
        # run distributed ALS-WR to factorize the rating matrix based on the training set
        bin/mahout parallelALS --input /tmp/dataset/trainingSet/ --output /tmp/als/out --tempDir /tmp/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065
        
        # compute predictions against the probe set, measure the error
        bin/mahout evaluateFactorizationParallel --output /tmp/als/rmse --pairs /tmp/dataset/probeSet/ --userFeatures /tmp/als/out/U/ --itemFeatures /tmp/als/out/M/
        
        # print the error
        cat /tmp/als/rmse/rmse.txt 
        0.8531723318490103
        
        # alternatively you can use the factorization to predict unknown ratings
        bin/mahout predictFromFactorization --output /tmp/als/predict --pairs /tmp/dataset/probeSet/ --userFeatures /tmp/als/out/U/ --itemFeatures /tmp/als/out/M/ --tempDir /tmp/als/predictTmp
        
        # look at the predictions
        cat /tmp/als/predict/part-r-*
        1,150,4.0842405867880975
        1,1029,4.163510579205656
        1,745,3.7759166479388777
        1,2294,3.495085673991081
        1,938,3.6820865362790594
        2,2067,3.8303249557251644
        2,1090,3.954322089979675
        2,1196,3.912089186677311
        2,498,2.820740198815573
        2,593,4.090550572202017
        ...
        
        tdunning Ted Dunning added a comment -

        Sebastian,

        I don't want to derail your commit, but your question about regularization suggested a thought to me.

        One of the great advantages of the random projection methods over power iteration methods comes from the fact that iteration is so evil in Hadoop-based map-reduce, especially when you are simply reading the same input over and over.

        With ALS-WR, you can run the program again for each value of the regularization parameter, but there is really nothing except possibly memory size preventing you from running all of these optimizations at the same time.

        How hard do you think it would be to interleave the computations for multiple values of the regularization parameter into a single run of ALS-WR?

        ssc Sebastian Schelter added a comment -

        Hi Ted,

        I have already spent some time thinking about that; unfortunately, no easy way to do it came to mind. I'll share my line of thought, maybe there's an error in how I understand the algorithm:

        The goal of ALS-WR is to factorize the rating matrix R into the user feature matrix U and the item feature matrix M. It tries to minimize the regularized squared error between R and U^T M via an iterative algorithm. We start with a randomly initialized M, use it to compute an optimized U, then fix U to compute an optimized M. We alternate between a fixed U and a fixed M until the error converges (or a maximum number of iterations is reached).

        The question now is whether we can modify this process to use multiple lambda values (the regularization parameter used for the internal computations). You stated that we are "reading the same input over and over", which I don't see, as all versions of U and M (with the exception of the first randomly initialized M) depend on the lambda that was used to produce them. So as I understand the algorithm, n values of lambda would not only require n times the memory for the computations, but also n versions of each U and M would need to be moved around. I think I could find a way to make the code do that (maybe by using something like a multi-version vector that carries different results for different lambdas), but I'm not sure whether that's what you had in mind.

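        For reference, a sketch of the objective that ALS-WR minimizes, as given in the Zhou et al. paper (notation added here: I is the set of observed ratings, and n_{u_i} and n_{m_j} are the numbers of ratings of user i and item j):

        \[
        f(U, M) = \sum_{(i,j) \in I} \bigl( r_{ij} - u_i^\top m_j \bigr)^2
                + \lambda \Bigl( \sum_i n_{u_i} \lVert u_i \rVert^2 + \sum_j n_{m_j} \lVert m_j \rVert^2 \Bigr)
        \]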
        tdunning Ted Dunning added a comment -

        Yes.

        Keeping multiple versions in the computation is exactly what I had in mind. Whether this is useful or not depends on whether each version of U and M is smaller than the original. They certainly will be more efficient to read. As such, reading them multiple times might be a win.

        Or not. I haven't worked out the details and that is what really, really matters here.

        ssc Sebastian Schelter added a comment -

        I have found a way to make the code do the factorization for multiple lambdas at once by making the intermediate writables carry an extra feature vector for each extra lambda. How do we proceed now: do we try to automatically find a near-optimal lambda, or do we leave it up to the user to experiment a little?

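        A minimal sketch of what such an intermediate writable could look like (hypothetical class name, not the actual patch code; assumes Mahout's VectorWritable helpers):

        import java.io.DataInput;
        import java.io.DataOutput;
        import java.io.IOException;

        import org.apache.hadoop.io.Writable;
        import org.apache.mahout.math.Vector;
        import org.apache.mahout.math.VectorWritable;

        /** Carries one feature vector per lambda value, so that a single ALS-WR
         *  run can solve for several regularization parameters at once. */
        public class MultiLambdaFeaturesWritable implements Writable {

          private Vector[] featuresPerLambda;

          public MultiLambdaFeaturesWritable() {}

          public MultiLambdaFeaturesWritable(Vector[] featuresPerLambda) {
            this.featuresPerLambda = featuresPerLambda;
          }

          public Vector featuresForLambda(int lambdaIndex) {
            return featuresPerLambda[lambdaIndex];
          }

          @Override
          public void write(DataOutput out) throws IOException {
            // serialize the number of lambdas first, then one vector per lambda
            out.writeInt(featuresPerLambda.length);
            for (Vector features : featuresPerLambda) {
              VectorWritable.writeVector(out, features);
            }
          }

          @Override
          public void readFields(DataInput in) throws IOException {
            featuresPerLambda = new Vector[in.readInt()];
            for (int n = 0; n < featuresPerLambda.length; n++) {
              featuresPerLambda[n] = VectorWritable.readVector(in);
            }
          }
        }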
        srowen Sean Owen added a comment -

        In terms of this issue, it sounds like you've made enough progress that it's worth committing so people can start to use it. Sounds like it works and you're just thinking about future optimizations.

        ssc Sebastian Schelter added a comment -

        I think so too; we should commit this now and iterate later on automatic lambda detection and optimization. Are you OK with this, Ted?

        tdunning Ted Dunning added a comment -

        Progress always sounds good to me. It is even better when there is a plan for further progress.

        I say ship it!

        hudson Hudson added a comment -

        Integrated in Mahout-Quality #690 (See https://hudson.apache.org/hudson/job/Mahout-Quality/690/)
        MAHOUT-542 MapReduce implementation of ALS-WR

        hudson Hudson added a comment -

        Integrated in Mahout-Quality #742 (See https://hudson.apache.org/hudson/job/Mahout-Quality/742/)
        MAHOUT-542 missing evaluation classes

        camclive Clive Cox added a comment -

        An earlier comment said:
        "...if this job here should produce recommendations for all users, we cannot naively multiply the transpose of the user features matrix with the item features matrix to estimate all possible preferences as these are dense matrices."

        Can people explain further the problems with the naive approach?

        How are people generally deriving recommendations from matrix factorization techniques? Get a neighbourhood from some other CF algorithm and then score only that neighbourhood using the derived matrices?

        ssc Sebastian Schelter added a comment -

        The problem with this naive approach is that the resulting matrix is going to be huge (millions of users times hundreds of thousands of items) and dense, which makes it computationally infeasible: for example, 10 million users times 100,000 items would already be 10^12 predicted preferences, roughly 8 terabytes of double values.

        I'm not aware of a general approach to computing recommendations from matrix decompositions; in the scientific literature, they are only used for measuring the prediction error on held-out data (as far as I know).

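        As a minimal sketch of the candidate-set workaround Clive describes (hypothetical helper, not part of the patch; assumes the candidates' feature vectors fit in memory):

        import org.apache.mahout.math.Vector;

        /** Estimates preferences only for a small candidate set of items per user,
         *  instead of materializing the full dense product of U^T and M. */
        public final class CandidateScorer {

          private CandidateScorer() {}

          /** The estimated preference is the dot product of the user's and the
           *  item's feature vectors from the ALS-WR factorization. */
          public static double[] scoreCandidates(Vector userFeatures, Vector[] candidateItemFeatures) {
            double[] estimates = new double[candidateItemFeatures.length];
            for (int i = 0; i < candidateItemFeatures.length; i++) {
              estimates[i] = userFeatures.dot(candidateItemFeatures[i]);
            }
            return estimates;
          }
        }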
        pragatimeena pragati meena added a comment -

        Hi Sebastian,

        I am trying to run the example on Windows using Hadoop on Cygwin, but I keep getting the following error, even though the history file exists at the same directory location:

        Exception in thread "main" java.lang.IllegalStateException: java.io.FileNotFoundException: File does not exist: /user/hadoop/temp/errors/_logs
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterator$1.apply(SequenceFileDirIterator.java:73)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterator$1.apply(SequenceFileDirIterator.java:67)
        at com.google.common.collect.Iterators$8.next(Iterators.java:730)
        at com.google.common.collect.Iterators$5.hasNext(Iterators.java:508)
        at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:40)
        at org.apache.mahout.utils.eval.ParallelFactorizationEvaluator.computeRmse(ParallelFactorizationEvaluator.java:111)

        Any ideas on how to fix this?

        regards

        Pragati Meena

        faal Fabian Alenius added a comment -

        Hi, I was thinking of rewriting the itemRatings and userRatings jobs into one job using MultipleOutputs. Based on my understanding, the 0.20.2* releases support MultipleOutputs, although via deprecated APIs. Would such a patch be accepted, or are there issues prohibiting such a change?

        What is the current target version of Hadoop?

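        A rough sketch of the combined-job idea under the deprecated mapred API (hypothetical class and output names, not committed code):

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.SequenceFileOutputFormat;
        import org.apache.hadoop.mapred.lib.MultipleOutputs;
        import org.apache.mahout.math.VectorWritable;

        public final class CombinedRatingsJobSetup {

          private CombinedRatingsJobSetup() {}

          /** Registers two named outputs so that a single pass over the ratings
           *  can emit both the per-user and the per-item rating vectors. */
          public static void configureOutputs(JobConf conf) {
            MultipleOutputs.addNamedOutput(conf, "userRatings",
                SequenceFileOutputFormat.class, IntWritable.class, VectorWritable.class);
            MultipleOutputs.addNamedOutput(conf, "itemRatings",
                SequenceFileOutputFormat.class, IntWritable.class, VectorWritable.class);
          }

          // In the reducer: create mos = new MultipleOutputs(conf) in configure(),
          // write via mos.getCollector("userRatings", reporter).collect(key, vector),
          // and call mos.close() in close().
        }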
        ssc Sebastian Schelter added a comment -

        The current Hadoop version is 0.20.204.0.

        I don't think that ALS-WR is really used at the moment, and I don't think it fits the MapReduce paradigm very well, so you have my go-ahead to play with the code. We still use the deprecated APIs for a lot of our distributed linear algebra stuff, so I'd be okay with it. But I'd like to hear other opinions first.

        faal Fabian Alenius added a comment -

        Okay. I'll wait a bit to see if anyone objects.

        I'm going to implement "Collaborative Filtering for Implicit Feedback Datasets" and my plan is to try to reuse as much of the ParallelAlsFactorization code as possible. Do you see any issues with this?

        tdunning Ted Dunning added a comment -

        Go for it. The Hadoop API is very confused at the moment in any case.

        Use whatever you like from 0.20.204

        alvina Alvin AuYoung added a comment -

        Hi Sebastian,

        First of all, many thanks for contributing this ALS implementation. It's very useful. Like others on this list, I'm trying to run some experiments on it using the Netflix data, but I'm seeing an error I am having trouble diagnosing. After completing the first 4 jobs, reduce copiers are failing for the 5th job (Mapper-SolvingReducer). I'm running on hadoop-0.20.2 and checked out mahout from the trunk, so I believe any patches you've mentioned should be incorporated.

        Here is a description of the job I'm running: MAHOUT-JOB: /home/auyoung/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
        11/09/30 01:20:56 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --input=training_all_triplets_norm, --lambda=0.065, --numFeatures=25, --numIterations=5, --output=als.out, --startPhase=0, --tempDir=temp}

        Do you have any ideas what might be wrong? I'm running it on a physical cluster of 20 slaves, each with 2 mappers and reducers, and there is > 8 GB memory (per jvm), > 2 GB HADOOP_HEAPSIZE, and the maximum allowable io.sort.mb of 2047. Also, there is plenty of disk space remaining. Here is a transcript of one of the several failures on the ParallelALSFactorizationJob-Mapper-SolvingReducer:

        2011-09-30 02:05:37,115 INFO org.apache.hadoop.mapred.Merger: Merging 16 sorted segments
        2011-09-30 02:05:37,115 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 16 segments left of total size: 1039493457 bytes
        2011-09-30 02:05:37,116 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201109300120_0005_r_000000_0 Merge of the inmemory files threw an exception: java.io.IOException: Intermediate merge failed
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2576)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)
        Caused by: java.lang.RuntimeException: java.io.EOFException
        at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
        at org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:373)
        at org.apache.hadoop.util.PriorityQueue.downHeap(PriorityQueue.java:136)
        at org.apache.hadoop.util.PriorityQueue.adjustTop(PriorityQueue.java:103)
        at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:335)
        at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
        at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:156)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2560)
        ... 1 more
        Caused by: java.io.EOFException
        at java.io.DataInputStream.readByte(DataInputStream.java:250)
        at org.apache.mahout.math.Varint.readUnsignedVarInt(Varint.java:159)
        at org.apache.mahout.math.Varint.readSignedVarInt(Varint.java:140)
        at org.apache.mahout.cf.taste.hadoop.als.IndexedVarIntWritable.readFields(IndexedVarIntWritable.java:64)
        at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:97)
        ... 8 more

        2011-09-30 02:05:37,116 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201109300120_0005_r_000000_0 Merging of the local FS files threw an exception: java.io.IOException: java.lang.RuntimeException: java.io.EOFException
        at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
        at org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:373)
        at org.apache.hadoop.util.PriorityQueue.downHeap(PriorityQueue.java:139)
        at org.apache.hadoop.util.PriorityQueue.adjustTop(PriorityQueue.java:103)
        at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:335)
        at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
        at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:156)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2454)
        Caused by: java.io.EOFException
        at java.io.DataInputStream.readByte(DataInputStream.java:250)
        at org.apache.mahout.math.Varint.readUnsignedVarInt(Varint.java:159)
        at org.apache.mahout.math.Varint.readSignedVarInt(Varint.java:140)
        at org.apache.mahout.cf.taste.hadoop.als.IndexedVarIntWritable.readFields(IndexedVarIntWritable.java:64)
        at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:100)
        ... 7 more

        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2458)

        Thanks,

        Alvin

        cendrillon Raphael Cendrillon added a comment -

        I'd like to get more involved in contributing to Mahout. If there's any area where you need support, regarding ALS-WR or other topics as well, I'd be very happy to lend a hand.

        In particular, I was quite interested in your comments on automatically finding a good setting for lambda. I'm wondering whether something more sophisticated than exhaustive search could be done; for example, if the loss function evaluated on the hold-out dataset is a convex function of lambda, then gradient descent (or quasi-Newton methods) could be used.

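        A sketch of one derivative-free option along those lines, assuming the hold-out RMSE is unimodal in lambda (hypothetical code; rmseForLambda would wrap a full factorization plus an evaluation run against the probe set):

        import java.util.function.DoubleUnaryOperator;

        /** Golden-section search for a near-optimal lambda. */
        public final class LambdaSearch {

          private static final double INV_GOLDEN_RATIO = (Math.sqrt(5) - 1) / 2;

          private LambdaSearch() {}

          public static double findLambda(DoubleUnaryOperator rmseForLambda,
                                          double low, double high, double tolerance) {
            double x1 = high - INV_GOLDEN_RATIO * (high - low);
            double x2 = low + INV_GOLDEN_RATIO * (high - low);
            double f1 = rmseForLambda.applyAsDouble(x1);
            double f2 = rmseForLambda.applyAsDouble(x2);
            while (high - low > tolerance) {
              if (f1 < f2) {
                // the minimum lies in [low, x2]; reuse x1 as the new upper probe
                high = x2;
                x2 = x1;
                f2 = f1;
                x1 = high - INV_GOLDEN_RATIO * (high - low);
                f1 = rmseForLambda.applyAsDouble(x1);
              } else {
                // the minimum lies in [x1, high]; reuse x2 as the new lower probe
                low = x1;
                x1 = x2;
                f1 = f2;
                x2 = low + INV_GOLDEN_RATIO * (high - low);
                f2 = rmseForLambda.applyAsDouble(x2);
              }
            }
            return (low + high) / 2;
          }
        }

        Each probe of rmseForLambda costs a full ALS-WR run, so this only pays off against exhaustive search when the unimodality assumption holds.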
        ssc Sebastian Schelter added a comment -

        It's an interesting idea. Please open a new JIRA issue for it, as this one is already closed and was only meant to cover the initial implementation.


          People

          • Assignee:
            ssc Sebastian Schelter
          • Reporter:
            ssc Sebastian Schelter
          • Votes:
            0
          • Watchers:
            3
