Mahout / MAHOUT-340

org.apache.mahout.cf.taste.hadoop.cooccurence cannot support long as user_id and item_id

    Details

      Description

      My preference data uses long values for user_id and item_id,
      but the Hadoop cooccurence algorithm cannot handle them.

        Activity

        Sean Owen added a comment -

        This is exactly the input format that the entire library uses. What problem are you having? You would need to provide a lot more detail.

        Han Hui Wen added a comment -

        I used long values for user_id and item_id,
        and the input cannot be parsed when running the ItemBigramGenerator job.

        Sean Owen added a comment -

        Try the implementation in org.apache.mahout.cf.taste.hadoop.item instead, which reads longs and matches the rest of the framework a little more. The two implementations are being merged into the .item implementation.

        I'm marking this as a sort of 'duplicate' of that task, to merge the implementations, since I don't think this implementation will otherwise be updated.

        Han Hui Wen added a comment -

        Thanks for your quick response. I have written a new version modeled on cooccurence that supports long IDs for our project.

        Which one do you think performs better, cooccurence or item?

        Sean Owen added a comment -

        That is a good question – the two implement the same algorithm, but 'cooccurrence' tries to distribute the multiplication of the matrix by the user vector, while 'item' does not. It's not yet clear which is better. You could adapt either one's approach to completing this multiplication.
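
        As a rough illustration of the multiplication in question (a minimal single-machine sketch, not code from either implementation), a co-occurrence recommender scores items for one user by multiplying the item-item co-occurrence matrix by that user's preference vector. The class name `CooccurrenceSketch` and method `score` are hypothetical. Note that the rows and columns are addressed by int indices.

```java
/** Hypothetical sketch, not part of Mahout. */
public class CooccurrenceSketch {

    /**
     * Multiplies an item-item co-occurrence matrix by one user's
     * preference vector. Entry i of the result is the score of item i.
     */
    static double[] score(int[][] cooccurrence, double[] userPrefs) {
        double[] scores = new double[cooccurrence.length];
        for (int i = 0; i < cooccurrence.length; i++) {
            for (int j = 0; j < userPrefs.length; j++) {
                scores[i] += cooccurrence[i][j] * userPrefs[j];
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Item 0 co-occurs twice with item 1 and once with item 2.
        int[][] cooccurrence = {
            {0, 2, 1},
            {2, 0, 0},
            {1, 0, 0},
        };
        // The user has expressed a preference for item 0 only.
        double[] userPrefs = {1.0, 0.0, 0.0};
        double[] scores = score(cooccurrence, userPrefs);
        for (int i = 0; i < scores.length; i++) {
            System.out.println("item " + i + " -> " + scores[i]);
        }
    }
}
```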

        The 'item' implementation handles long IDs as inputs. To do this, it needs a long <-> int mapping between the original long IDs and the dimensions in the vector or matrix they map to – which must be ints. It collects this information and reverses the transformation later. For this reason, if you need long IDs, you may find it more natural to adapt 'item', since it already handles this issue.
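
        One way to picture such a long <-> int mapping is the in-memory sketch below. It is only an assumption about the general idea, not the code used by 'item': the class `LongIdIndex` and its methods are hypothetical, and a Hadoop job would have to build and persist the mapping in a job step rather than hold it in a single map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Mahout: assigns dense int indices to long IDs. */
public class LongIdIndex {
    private final Map<Long, Integer> idToIndex = new HashMap<>();
    private final List<Long> indexToId = new ArrayList<>();

    /** Returns the int index for a long ID, assigning the next free index on first sight. */
    public int indexOf(long id) {
        Integer existing = idToIndex.get(id);
        if (existing != null) {
            return existing;
        }
        int index = indexToId.size();
        idToIndex.put(id, index);
        indexToId.add(id);
        return index;
    }

    /** Reverses the transformation: recovers the original long ID from its index. */
    public long idOf(int index) {
        return indexToId.get(index);
    }

    public static void main(String[] args) {
        LongIdIndex items = new LongIdIndex();
        int index = items.indexOf(5_000_000_000L); // does not fit in an int
        System.out.println(index + " <-> " + items.idOf(index));
    }
}
```

        The int index is what would be used as the vector or matrix dimension; the saved mapping is consulted at the end to turn indices back into the original IDs.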

        Han Hui Wen added a comment -

        I just replaced all the related int types with long; it works fine and the performance is very good.

        I have another question:
        the original data has 21,545 users in total and about 640,000 items,
        but the job generates only 153,942 recommendations for 2,694 users,
        so many users have no recommendations generated.

        Sean Owen added a comment -

        I'm not sure what your final implementation looks like, but be careful about moving to longs from ints. It's not a one-line change. If you're just parsing longs, then casting them down to ints to use as dimensions in a vector or matrix, it won't work correctly at all. You'll be truncating long IDs to ints, and then trying to interpret them as long IDs later, but they won't be valid IDs.

        Is that what you did? If so, I could imagine the recommendations being all wrong.

        If you have long IDs, you will need the steps you see in the 'item' implementation. In particular you need the step that generates and saves the long <-> int mappings.
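
        The truncation problem can be seen in a few lines. This is a minimal sketch with made-up ID values and a hypothetical class name `TruncationDemo`: a narrowing cast keeps only the low 32 bits, so two distinct long IDs can land on the same int, and the original ID cannot be recovered afterwards.

```java
/** Hypothetical demo, not part of Mahout. */
public class TruncationDemo {

    /** Narrowing cast: keeps only the low 32 bits of the ID. */
    static int truncate(long id) {
        return (int) id;
    }

    public static void main(String[] args) {
        long idA = 5L;
        long idB = 5L + (1L << 32); // differs from idA only in the high bits

        // Two different IDs collide on the same int dimension.
        System.out.println(truncate(idA) == truncate(idB)); // true

        // Widening the int back to long does not restore the original ID.
        System.out.println((long) truncate(idB) == idB); // false
    }
}
```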

        Han Hui Wen added a comment -

        Thanks for your advice. I see what you mean, and you are right.
        I replaced all the related int types with long.
        I will compare the results from 'item' with those from the modified version.


          People

          • Assignee:
            Unassigned
            Reporter:
            Han Hui Wen