MAHOUT-1272: Parallel SGD matrix factorizer for SVDRecommender


      Description

      A parallel factorizer based on MAHOUT-1089 may achieve better performance on multicore processors.

      The existing code is single-threaded, and may still be outperformed by the default ALS-WR.

      In addition, its hardcoded online-to-batch conversion prevents it from being used by an online recommender. An online SGD implementation could help build a high-performance online recommender to replace the outdated slope-one.

      The new factorizer can implement either DSGD (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or Hogwild! (http://www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).

      Related discussion has been going on for a while but remains inconclusive:
      http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl
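
      For orientation, here is a minimal single-machine sketch of the Hogwild! idea: worker threads apply SGD updates to shared factor matrices with no locking at all, relying on the sparsity of ratings to keep conflicting writes rare. All names below are illustrative, not Mahout's API.

      import java.util.Random;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.TimeUnit;

      /** Hogwild!-style SGD sketch: each rating touches only one user row and
       *  one item row, so unsynchronized updates rarely collide, and the noise
       *  from the collisions that do happen does not prevent convergence. */
      final class HogwildSketch {

        static void factorize(final int[][] ratings,       // {userIndex, itemIndex, rating}
                              final double[][] userFeatures,
                              final double[][] itemFeatures,
                              final double mu,              // learning rate
                              final double lambda,          // regularization
                              final int epochs,
                              int numThreads) throws InterruptedException {
          ExecutorService pool = Executors.newFixedThreadPool(numThreads);
          final long stepsPerThread = (long) epochs * ratings.length / numThreads;
          for (int t = 0; t < numThreads; t++) {
            pool.execute(new Runnable() {
              @Override public void run() {
                Random random = new Random();
                for (long n = 0; n < stepsPerThread; n++) {
                  int[] r = ratings[random.nextInt(ratings.length)];
                  double[] u = userFeatures[r[0]];
                  double[] v = itemFeatures[r[1]];
                  double err = r[2] - dot(u, v);
                  for (int k = 0; k < u.length; k++) {      // unsynchronized on purpose
                    double uk = u[k];
                    u[k] += mu * (err * v[k] - lambda * uk);
                    v[k] += mu * (err * uk - lambda * v[k]);
                  }
                }
              }
            });
          }
          pool.shutdown();
          pool.awaitTermination(1, TimeUnit.DAYS);
        }

        private static double dot(double[] a, double[] b) {
          double sum = 0;
          for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
          }
          return sum;
        }
      }

      DSGD differs in that it partitions the rating matrix into blocks and schedules non-overlapping blocks onto workers, so updates never conflict by construction.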

      Attachments

      1. GroupLensSVDRecomenderEvaluatorRunner.java
        4 kB
        Peng Cheng
      2. libimsetiSVDRecomenderEvaluatorRunner.java
        5 kB
        Peng Cheng
      3. mahout.patch
        23 kB
        Peng Cheng
      4. NetflixRecomenderEvaluatorRunner.java
        5 kB
        Peng Cheng
      5. ParallelSGDFactorizer.java
        14 kB
        Peng Cheng
      6. ParallelSGDFactorizer.java
        12 kB
        Peng Cheng
      7. ParallelSGDFactorizerTest.java
        11 kB
        Peng Cheng
      8. ParallelSGDFactorizerTest.java
        10 kB
        Peng Cheng

        Activity

        Sebastian Schelter added a comment -

        Are you referring to a single-machine multi-core implementation or a MapReduce implementation?

        Peng Cheng added a comment -

        I presume it to be single-machine multi-core? Many people in the discussion have voted against iterative MR. Not sure though...

        Peng Cheng added a comment -

        I'm reading the source code of ALS-WR; apparently it uses an ExecutorService to distribute ALS to each core.
        There is no MR here. I just started using it a few days ago. Please correct me if I'm wrong.

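        For illustration, the general pattern looks like this (a sketch, not ALSWRFactorizer's actual code; solveForUser is a hypothetical stand-in for the regularized least-squares solve):

        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.TimeUnit;

        final class AlsParallelismSketch {

          /** One ALS half-step: item features are held fixed while the pool's
           *  worker threads recompute the user feature vectors in parallel. */
          static void recomputeUserFeatures(final double[][] userFeatures,
                                            final double[][] itemFeatures,
                                            int numCores) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(numCores);
            for (int u = 0; u < userFeatures.length; u++) {
              final int userIndex = u;
              pool.execute(new Runnable() {
                @Override public void run() {
                  // hypothetical helper: solves the k-by-k regularized
                  // least-squares system from this user's observed ratings
                  userFeatures[userIndex] = solveForUser(userIndex, itemFeatures);
                }
              });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
          }

          static double[] solveForUser(int userIndex, double[][] itemFeatures) {
            throw new UnsupportedOperationException("stand-in for the normal-equation solve");
          }
        }

        Each task here owns the single row it writes, so no two threads ever touch the same memory, which is why ALS parallelizes so cleanly.
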
        Sebastian Schelter added a comment -

        There is also an MR version of ALS. But I agree that it would be better to start with a single-machine implementation of DSGD or Hogwild. If it's faster than ALS-WR, it would be a good replacement for RatingSGDFactorizer and ALSWRFactorizer. What do you think?

        Peng Cheng added a comment -

        Thanks a lot for the hint! Is it in org.apache.mahout.math.als? I can't find any other implementation in core-0.7.
        Yeah, I think this would be good practice to start with, regardless of whether it has any performance edge.
        I'll try to do something this weekend.

        Sebastian Schelter added a comment -

        You should use trunk; lots of things have been improved. Take your time working on the code, there is no need to hurry.

        Peng Cheng added a comment -

        The learning rate/step size is set to be identical to the ~.classifier.sgd package. The old learning rate is exponential with a constant decay factor; this setting seems to work only for smooth functions (proved by Nesterov?), and I'm not sure that holds in CF. Otherwise, use either 1/sqrt(n) for convex f or 1/n for strongly convex f.

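        For reference, the three step-size schedules mentioned above, written out (generic parameter names, not those of ~.classifier.sgd):

        /** Common SGD step-size schedules; n counts the update steps taken so far. */
        final class StepSizes {
          // exponential decay with a constant factor d < 1
          static double exponential(double mu0, double d, long n) {
            return mu0 * Math.pow(d, n);
          }
          // O(1/sqrt(n)) schedule, the standard choice for (merely) convex f
          static double convex(double mu0, long n) {
            return mu0 / Math.sqrt(n + 1);
          }
          // O(1/n) schedule for strongly convex f
          static double stronglyConvex(double mu0, long n) {
            return mu0 / (n + 1);
          }
        }
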
        Peng Cheng added a comment -

        Looks like the 1/n learning rate doesn't work at all on the SGD factorizer; maybe the convergence results from stochastic optimization can't be applied to the non-convex MF problem. Can someone show me a paper discussing convergence bounds for this problem? Much appreciated.

        Peng Cheng added a comment - edited

        Hey, I have finished the class and test for a parallel SGD factorizer for matrix-completion-based recommenders (not MapReduce, just single-machine multi-threaded); it is loosely based on vanilla SGD and Hogwild!. I have only tested it on toy and synthetic data (2000 users × 1000 items), but it is pretty fast: 3-5x faster than vanilla SGD with 8 cores (never exceeding 6x; apparently the executor incurs a high allocation overhead), and definitely faster than single-machine ALS-WR.

        I'm submitting my java files and patch for review.

        Peng Cheng added a comment -

        java file

        Peng Cheng added a comment -

        patch

        Peng Cheng added a comment -

        The next step would be to create an online version of this (and of the recommender).
        SGD is an online algorithm, but right now it works only for batch recommenders.
        In the meantime, the only online recommender in Mahout is slope-one, which is kind of a shame.
        I will create a new JIRA ticket tomorrow.

        Sebastian Schelter added a comment -

        Hello Peng,

        the code looks very good at first glimpse. I'd like you to work on it a little more, though. Can you format the files according to our code conventions (e.g. no tabs, 2-space indent, no braces on the next line, etc.)? The code conventions are basically Oracle's standard conventions with 120 chars per line instead of 80.

        Furthermore, could you benchmark your code via a holdout test on a known dataset, maybe movielens1M or movielens10M? That would be awesome. I think this is going to be a great contribution.

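        For reference, a holdout benchmark with the Taste evaluator could look roughly like this (a sketch: the FileDataModel input must be comma- or tab-separated userID,itemID,rating triples, and the ParallelSGDFactorizer constructor arguments shown are assumptions, so check the class for its actual signature):

        import java.io.File;
        import org.apache.mahout.cf.taste.common.TasteException;
        import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
        import org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator;
        import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
        import org.apache.mahout.cf.taste.impl.recommender.svd.ParallelSGDFactorizer;
        import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
        import org.apache.mahout.cf.taste.model.DataModel;
        import org.apache.mahout.cf.taste.recommender.Recommender;

        public class HoldoutBenchmark {
          public static void main(String[] args) throws Exception {
            DataModel model = new FileDataModel(new File("movielens-ratings.csv"));
            RecommenderBuilder builder = new RecommenderBuilder() {
              @Override
              public Recommender buildRecommender(DataModel dataModel) throws TasteException {
                // assumed constructor: (dataModel, numFeatures, lambda, numEpochs)
                return new SVDRecommender(dataModel,
                    new ParallelSGDFactorizer(dataModel, 50, 1e-10, 2));
              }
            };
            // train on 90% of each user's ratings, compute RMSE on the held-out 10%
            double rmse = new RMSRecommenderEvaluator().evaluate(builder, null, model, 0.9, 1.0);
            System.out.println("RMSE: " + rmse);
          }
        }
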
        Peng Cheng added a comment -

        Aye aye, more tests on the way. Much obliged for the quick suggestion.

        Peng Cheng added a comment -

        Hey honoured contributors, I've got some crude test results for the new parallel SGD factorizer for CF:

        1. parameters:
        lambda = 1e-10
        rank of the rating matrix / number of features per user/item vector = 50
        number of biases: 3 (average rating + user bias + item bias)
        number of iterations/epochs = 2 (for all factorizers including ALSWR, ratingSGD and the proposed parallelSGD)
        initial mu/learning rate = 0.01 (for ratingSGD and proposed parallelSGD)
        decay rate of mu = 1 (does not decay) (for ratingSGD and proposed parallelSGD)
        other parameters are set to default.

        2. result on movielens-10m (I don't know what the hell happened to ALSWR, the default hyperparameters must be screwing it up badly, but my point is the speed edge):
        a. RMSE

        Jul 07, 2013 5:20:23 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: ==================Recommender With ALSWRFactorizer: 3.7709163950800665E21 time spent: 6.179s===================
        Jul 07, 2013 5:20:23 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: ==================Recommender With RatingSGDFactorizer: 0.8847393972529887 time spent: 6.179s===================
        Jul 07, 2013 5:20:23 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: ==================Recommender With ParallelSGDFactorizer: 0.8805947464818478 time spent: 3.084s====================

        b. Absolute Average

        INFO: ==================Recommender With ALSWRFactorizer: 1.2085420449917682E19 time spent: 7.444s===================
        Jul 07, 2013 5:22:39 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: ==================Recommender With RatingSGDFactorizer: 0.6757777685274206 time spent: 7.444s===================
        Jul 07, 2013 5:22:39 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: ==================Recommender With ParallelSGDFactorizer: 0.6775774766740665 time spent: 2.365s====================

        3. result on movielens-1m (on average SGD works worse here than on movielens-10m; perhaps I should use more iterations/epochs)

        a. RMSE

        Jul 07, 2013 5:26:04 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: ==================Recommender With ALSWRFactorizer: 1.3514189134383086E20 time spent: 0.637s===================
        Jul 07, 2013 5:26:04 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: ==================Recommender With RatingSGDFactorizer: 0.9312989913558529 time spent: 0.637s===================
        Jul 07, 2013 5:26:04 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: ==================Recommender With ParallelSGDFactorizer: 0.9529995632658007 time spent: 0.305s====================

        b. Absolute Average

        Jul 07, 2013 5:25:29 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: ==================Recommender With ALSWRFactorizer: 1.58934499216789965E18 time spent: 0.626s===================
        Jul 07, 2013 5:25:29 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: ==================Recommender With RatingSGDFactorizer: 0.7459565635961599 time spent: 0.626s===================
        Jul 07, 2013 5:25:29 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: ==================Recommender With ParallelSGDFactorizer: 0.7420818642753416 time spent: 0.297s====================

        Many thanks to Sebastian for his guidance. I'll upload the EvaluatorRunner class as a mahout-examples component and the formatted code shortly.

        Peng Cheng added a comment -

        My laptop is an HP Pavilion with an Intel® Core™ i7-3610QM CPU @ 2.30GHz × 8 and 8 GB of memory.

        Peng Cheng added a comment -

        Hi Sebastian, may I ask a question? I dug up some old posts and found that the best result should be RMSE ≈ 0.85; do you know the parameters that were used?

        Peng Cheng added a comment - edited

        New parameters:
        lambda = 0.001
        rank of the rating matrix / number of features per user/item vector = 5
        number of iterations/epochs = 20

        result on movielens-10m; all evaluations use RMSE:
        Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: ==================Recommender With RatingSGDFactorizer: 0.8119081937625745 time spent: 36.509s===================
        Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: ==================Recommender With ParallelSGDFactorizer: 0.8115207244832938 time spent: 8.747s====================

        This is fast and accurate enough; I'm moving on to the Netflix Prize dataset.

        Sebastian Schelter added a comment -

        Hi Peng,

        I also played with your code and the results look very good; it's blazingly fast compared to ALS (which has to solve lots of linear systems). The formatting is not completely correct, but I can take over that part. Not sure if the patch will make it into the current release (0.8), but we will definitely include it in 0.9. Thank you for this contribution.

        Peng Cheng added a comment -

        Hi Sebastian,

        Really? I would break my fingers to squeeze into the 0.8 release. (Not RC1 of course, but there is still RC2 :->) A few guys I work with are also pushing me for the online recommender, so I can work hard and undistracted. Just tell me what to do next and I'll be thrilled to oblige.

        Sebastian Schelter added a comment -

        Let's see what we can do to get this into 0.8. The online recommender will definitely be out of scope for 0.8, but it's an interesting project for 0.9!

        Hudson added a comment -

        Integrated in Mahout-Quality #2135 (See https://builds.apache.org/job/Mahout-Quality/2135/)
        MAHOUT-1272 Parallel SGD matrix factorizer for SVDrecommender (Revision 1500553)

        Result = SUCCESS
        ssc :
        Files :

        • /mahout/trunk/CHANGELOG
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/svd/ParallelSGDFactorizer.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/impl/recommender/svd/ParallelSGDFactorizerTest.java

        Peng Cheng added a comment - edited

        Hey Sebastian, Hudson, thank you so much for pushing things that hard. I owe you one.
        I'll test more GroupLens data. Since Sebastian has taken over the code, new test cases will only be posted as code snippets.

        Peng Cheng added a comment - edited

        Test on the libimseti dataset (http://www.occamslab.com/petricek/data/); libimseti is a Czech dating website.
        This dataset was used in a live example described in the book 'Mahout in Action', page 71, written by a few guys hanging around this site.

        parameters:
        private final static double lambda = 0.1;
        private final static int rank = 16;

        private static int numALSIterations=5;
        private static int numEpochs=20;

        (for ratingSGD)
        double randomNoise=0.02;
        double learningRate=0.01;
        double learningDecayRate=1;

        (for parallelSGD)
        double mu0=1;
        double decayFactor=1;
        int stepOffset=100;
        double forgettingExponent=-1;

        result (using average absolute difference; ratings are on a 1-10 scale):

        INFO: ==================Recommender With ALSWRFactorizer: 1.5623366369454739 time spent: 41.24s=================== (note that the number of ALS iterations is much smaller than for the others, which leads to a suboptimal result, but that is not the point of this test)
        Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: ==================Recommender With RatingSGDFactorizer: 1.28022379922957 time spent: 118.188s===================
        Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
        INFO: ==================Recommender With ParallelSGDFactorizer: 1.2798905733917445 time spent: 21.806s====================

        This is the best result I can get; the book claims a best result of 1.12 on this dataset, which I have never achieved. If you have also experimented and found a better parameter set, please post it here.

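        One plausible reading of how the four parallelSGD parameters above combine into a per-step learning rate, inferred from their names and the earlier 1/n discussion rather than from the patch itself:

        final class ScheduleSketch {
          // Hypothetical schedule inferred from the parameter names; with
          // decayFactor = 1 and forgettingExponent = -1 it reduces to
          // mu0 / (stepOffset + n), i.e. the 1/n rate with a warm-up offset.
          static double mu(double mu0, double decayFactor, int stepOffset,
                           double forgettingExponent, long n) {
            return mu0 * Math.pow(decayFactor, n) * Math.pow(stepOffset + n, forgettingExponent);
          }
        }
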
        Peng Cheng added a comment -

        Here is the component for testing on the libimseti dataset.

        Peng Cheng added a comment -

        Runnable component for testing ParallelSGDFactorizer on the Netflix training dataset (yeah, only the trainingSet generated by NetflixDatasetConverter; I cannot get judging.txt for validation, but my purpose is just to test its efficiency at extreme scale, so whatever).

        Warning! To run it without danger you need to allocate at least 12 GB of heap space to the JVM using the following VM parameters:

        -Xms12288M -Xmx12288M

        In addition, 16 GB+ of RAM is MANDATORY; otherwise either garbage collection or swap will kill you (or both). I almost burned my laptop on this (it has only 8 GB of RAM). As a result, I won't be able to post any results until I get a better machine. But since its number of ratings is about 6 times that of the movielens-10m or libimseti datasets, and SGD scales linearly in this number, I estimate the running time to be between 2.5 and 3 minutes.

        I would be much obliged to anybody who can try it and post the result here (if your machine can handle it, of course). But as Sebastian has pointed out, our FileDataModel needs some serious optimization to handle such scale.

        Hey Sebastian, can you try this out in your lab? That would be most helpful.

        Sebastian Schelter added a comment -

        I think we should rework the DataModel first. It makes no sense to have to allocate 12 GB of heap for a 1 GB dataset.


          People

          • Assignee: Sean Owen
          • Reporter: Peng Cheng
          • Votes: 0
          • Watchers: 4


              Time Tracking

              Estimated: 336h
              Remaining: 336h
              Logged: Not Specified
