Mahout / MAHOUT-974

org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob uses integer as userId and itemId

    Details

      Description

      org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob uses integers for userId and itemId, but org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob and org.apache.mahout.cf.taste.hadoop.item.RecommenderJob use longs for userId and itemId.

      It would be best if ParallelALSFactorizationJob also used longs for userId and itemId, so that the same dataset can be used with all the recommendation algorithms.

      1. MAHOUT-974.patch
        33 kB
        Sebastian Schelter

        Activity

        Sebastian Schelter added a comment -

        You are right. The item ID indexing (from longs to ints) that already exists in org.apache.mahout.cf.taste.hadoop.preparation.PreparePreferenceMatrixJob should be built into ParallelALSFactorizationJob too.

        Han Hui Wen added a comment -

        org.apache.mahout.cf.taste.hadoop.preparation.PreparePreferenceMatrixJob only indexes itemId, not userId. It also converts user preferences into a vector per user and builds the rating matrix.

        Saikat Kanjilal added a comment -

        Sebastian,
        Is this something I can help with? I don't see a patch, so I'm not sure where you are with the fix.
        Let me know.
        Regards

        Sebastian Schelter added a comment -

        I didn't do any work on it, but it could be a good starter project. You basically have to create a mapping for both user and item IDs, which must also be used by related jobs such as the RecommenderJob for ALS and the one that evaluates the error of a factorization.

        Saikat Kanjilal added a comment -

        I am reading through PreparePreferenceMatrixJob and was wondering whether, by mapping from longs to ints, you are referring to the following lines of code:
        // convert items to an internal index
        Job itemIDIndex = prepareJob(getInputPath(), getOutputPath(ITEMID_INDEX), TextInputFormat.class,
            ItemIDIndexMapper.class, VarIntWritable.class, VarLongWritable.class, ItemIDIndexReducer.class,
            VarIntWritable.class, VarLongWritable.class, SequenceFileOutputFormat.class);
        itemIDIndex.setCombinerClass(ItemIDIndexReducer.class);
        boolean succeeded = itemIDIndex.waitForCompletion(true);
        if (!succeeded) {
          return -1;
        }

        // convert user preferences into a vector per user
        Job toUserVectors = prepareJob(getInputPath(),
            getOutputPath(USER_VECTORS),
            TextInputFormat.class,
            ToItemPrefsMapper.class,
            VarLongWritable.class,
            booleanData ? VarLongWritable.class : EntityPrefWritable.class,
            ToUserVectorsReducer.class,
            VarLongWritable.class,
            VectorWritable.class,
            SequenceFileOutputFormat.class);

        Pardon my ignorance, as this is my first time looking at this code; I don't see any other parts of this class resembling a mapping. Also, Sebastian, I'm wondering whether the mapping itself needs to live in mahout-core so that multiple jobs can leverage it.

        Angel Martinez Gonzalez added a comment -

        Hi Saikat,

        I think the mapping is done in ItemIDIndexMapper, which in turn calls TasteHadoopUtils.idToIndex.
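
        For readers unfamiliar with this kind of long-to-int indexing, here is a minimal sketch of how a 64-bit ID can be folded into a non-negative 32-bit index. The class name IdIndexSketch is hypothetical and the body is an assumption about the general technique, not a copy of TasteHadoopUtils.idToIndex, whose actual implementation may differ.

```java
// Hypothetical stand-in for the long-to-int indexing discussed in this thread;
// not copied from Mahout's TasteHadoopUtils.
final class IdIndexSketch {

    private IdIndexSketch() {}

    /** Folds a 64-bit ID into a non-negative 32-bit index. */
    static int idToIndex(long id) {
        // Long.hashCode XORs the upper and lower 32 bits of the value;
        // the mask clears the sign bit so the index is never negative.
        return 0x7FFFFFFF & Long.hashCode(id);
    }

    public static void main(String[] args) {
        System.out.println(idToIndex(42L));       // small non-negative IDs keep their value
        System.out.println(idToIndex(1L << 32));  // same index as ID 1: distinct IDs can collide
    }
}
```

        Because the index space is smaller than the ID space, distinct long IDs can share an index, which is one reason the mapping has to be kept if the original IDs are needed later.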

        Saikat Kanjilal added a comment -

        Thanks for the update, I'll look into this. I'm guessing the fix needs to be made inside the TasteHadoopUtils class.

        Sebastian Schelter added a comment -

        Saikat, are you still on this?

        Saikat Kanjilal added a comment -

        Yes, although I could use some general guidance, being a newbie on this codebase. I've not had time to research this further. Can you respond to my comments above?

        Thanks

        Sebastian Schelter added a comment -

        Saikat,

        In the preprocessing code of the ALS job (the first two MapReduce jobs), you would need to hash the long IDs to ints, ideally using the MultipleOutputs API so that we don't need additional jobs. The mapping needs to be stored together with the factorization and must be used in the PredictionJob, which uses the factorization to predict interactions. It has to map the ints back to longs.
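
        The round trip described here (hash long IDs to ints, keep the mapping, map back at prediction time) can be sketched in plain Java. This is an in-memory illustration only: IdMappingSketch and its methods are hypothetical names, the hash is an assumed stand-in for whatever the job uses, and a real job would write the mapping to job output rather than hold it in a HashMap. Failing on a collision is a choice made for this sketch, not something the thread specifies.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory illustration of "hash to int, store the mapping, map back".
final class IdMappingSketch {

    // int index -> original long ID; stands in for the stored mapping
    private final Map<Integer, Long> indexToId = new HashMap<>();

    /** Hashes a long ID to a non-negative int and records the reverse mapping. */
    int index(long id) {
        int index = 0x7FFFFFFF & Long.hashCode(id);
        Long previous = indexToId.putIfAbsent(index, id);
        if (previous != null && previous != id) {
            // Sketch-only policy: refuse to silently merge two different IDs.
            throw new IllegalStateException(
                "IDs " + previous + " and " + id + " share index " + index);
        }
        return index;
    }

    /** Maps an int index back to the long ID it was created from. */
    long originalId(int index) {
        Long id = indexToId.get(index);
        if (id == null) {
            throw new IllegalArgumentException("unknown index " + index);
        }
        return id;
    }

    public static void main(String[] args) {
        IdMappingSketch mapping = new IdMappingSketch();
        int index = mapping.index(123456789012L);
        System.out.println(index + " -> " + mapping.originalId(index));
    }
}
```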

        Saikat Kanjilal added a comment -

        Sebastian,
        I finally had a chance to dig into this further tonight. Looking at the first two map-reduces, I see the ItemRatingVectorsMapper class. Two ideas here: 1) get rid of this class, use ItemIDIndexMapper instead, and try to make that class work for ALS; or 2) make ItemRatingVectorsMapper handle the mapping. Unlike ItemIDIndexMapper, this class doesn't really handle an index; it deals with the rating matrix, which itself would need to be modified.

        Any thoughts on the simplest solution? My vote would be 2, but I need to read through the code some more to get a deeper understanding. Also, please pardon me if I'm way off base here; there is a lot of code to read and understand.

        Sebastian Schelter added a comment -

        Hi Saikat,

        The first two jobs create two versions of the ratings matrix, one partitioned by items, the other partitioned by users. The most elegant solution for this issue would be to make these jobs write out the mapping of ints to long IDs via an emulation of MultipleOutputs, such as the one used in org.apache.mahout.math.hadoop.stochasticsvd.ABtJob.

        I suggest we add an argument "usesLongIDs" to the job that the user can set to trigger the mapping.
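
        One way such a flag could drive ID parsing is sketched below. ReadIdSketch and readId are hypothetical names for illustration, and the hash is an assumed stand-in; how the option is registered with the job and threaded through to the mappers is not shown.

```java
// Hypothetical sketch: a "usesLongIDs" flag selects how an ID token is read.
final class ReadIdSketch {

    private ReadIdSketch() {}

    /**
     * Reads an ID token as an int. With usesLongIDs set, the token is parsed
     * as a long and hashed to a non-negative int; otherwise it must already
     * fit into an int.
     */
    static int readId(String token, boolean usesLongIDs) {
        return usesLongIDs
            ? 0x7FFFFFFF & Long.hashCode(Long.parseLong(token))
            : Integer.parseInt(token);
    }

    public static void main(String[] args) {
        System.out.println(readId("17", false));          // plain int ID, unchanged
        System.out.println(readId("4294967297", true));   // long ID, hashed to an int
    }
}
```

        Keeping the default path as a plain integer parse means existing int-keyed datasets behave as before, and only users who opt in pay for the hashing and the extra mapping output.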

        Saikat Kanjilal added a comment -

        Sebastian,
        Looking at ABtJob, I see MultipleOutputs commented out. I tried searching for this class and it doesn't exist. Is this more of a concept than an actual class?

        Sebastian Schelter added a comment -

        Saikat,

        I've had a deeper look and I think I'll take this issue. It's an ugly thing and lots of small places in the code need to be updated... Nothing fancy for someone not deeply familiar with the codebase.

        If you still wanna work on the ALS code, we should discuss this on the mailing list. I have a few ideas about what could be added for upcoming releases.

        Sebastian Schelter added a comment -

        Patch that adds the functionality. For some strange reason the tests don't work in parallel mode, though everything works in single/IDE execution. Will probably look into this tonight or tomorrow.

        Sebastian Schelter added a comment -

        The test issue is fixed. The problem was that two test classes used the same class with a static final variable.

        Hudson added a comment -

        Integrated in Mahout-Quality #2056 (See https://builds.apache.org/job/Mahout-Quality/2056/)
        MAHOUT-974 org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer as userId and itemId (Revision 1490930)

        Result = FAILURE
        ssc :
        Files :

        • /mahout/trunk/CHANGELOG
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/MutableRecommendedItem.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/TasteHadoopUtils.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/als/FactorizationEvaluator.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/als/ParallelALSFactorizationJob.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/als/PredictionMapper.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/als/RecommenderJob.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/AggregateAndRecommendReducer.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/hadoop/als/ParallelALSFactorizationJobTest.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/MathHelper.java

          People

          • Assignee: Sebastian Schelter
          • Reporter: Han Hui Wen
          • Votes: 0
          • Watchers: 5

          Time Tracking

          • Original Estimate: 2h
          • Remaining Estimate: 2h
          • Time Spent: Not Specified