MAHOUT-759

Improve the output of ItemSimilarityJob

    Details

      Description

      Currently the output of ItemSimilarityJob looks like the following:

      -7757148334301255842 8179634876330318523 0.003430531732418525
      -7748456450926673883 -4835531939219667484 0.2
      -7748456450926673883 -4314955996498817413 0.5
      -7748456450926673883 2808714190706572296 0.16666666666666666
      -7748456450926673883 6553837338030757853 0.14285714285714285
      -7748456450926673883 8751415108300656176 0.25
      -7747582778903926086 -7015341798833970389 0.05
      -7745456649800833279 -4355275072474512298 4.2444821731748726E-4
      -7743453627722079138 -3667977661496669483 0.0625
      -7743453627722079138 5506208171850960507 0.0625
      -7743453627722079138 7221367701058721462 0.0625
      -7721326863046534787 4345458182369739840 0.1111111111111111

      It's hard to store and view the similar items of a single item in this form. Could the output instead be grouped per item, in the same style as RecommenderJob's output, like the following:

      -9220680374247203656 [1352180348488328600:2.5,-7757148334301255842:2.5,-7490490145790861630:2.5,-2522983126042570313:2.5,-6799281597153282746:2.5,2068144185705723774:2.5,-6007350693723349387:2.5,-6926986971196173463:2.5,5406899818760113425:2.5,-1490410533166829581:2.5,-27094582027403342:2.5,5665136340246000627:2.5]
      -9218599019595753787 [7535853797920985421:2.5,6375444791143058470:2.5,-6278686364859964742:2.5,4842183991621375854:2.5,-5371123101058190798:2.5,8606934083257321678:2.5,8043580185091202137:2.5,5264973095582397115:2.5,1990532764981555035:2.5,5406899818760113425:2.5,-5208048021997301514:2.5,-5565838412826072017:2.5]
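To make the requested per-item format concrete, here is a minimal sketch of a parser for one such record. It assumes the layout shown in the sample above (an item ID, whitespace, then a bracketed comma-separated list of `otherID:value` entries); the class and method names are hypothetical and not part of Mahout.

```java
import java.util.AbstractMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not a Mahout class.
class GroupedLineParser {

  /** Parses "itemID [otherID:value,otherID:value,...]" into (itemID, {otherID -> value}). */
  static Map.Entry<Long, Map<Long, Double>> parse(String line) {
    int open = line.indexOf('[');
    int close = line.lastIndexOf(']');
    if (open < 0 || close < open) {
      throw new IllegalArgumentException("Not a grouped record: " + line);
    }
    long itemID = Long.parseLong(line.substring(0, open).trim());
    // LinkedHashMap keeps the entries in the order they appear in the record.
    Map<Long, Double> values = new LinkedHashMap<>();
    String body = line.substring(open + 1, close).trim();
    if (!body.isEmpty()) {
      for (String entry : body.split(",")) {
        int colon = entry.indexOf(':');
        long otherID = Long.parseLong(entry.substring(0, colon).trim());
        double value = Double.parseDouble(entry.substring(colon + 1).trim());
        values.put(otherID, value);
      }
    }
    return new AbstractMap.SimpleEntry<Long, Map<Long, Double>>(itemID, values);
  }

  public static void main(String[] args) {
    Map.Entry<Long, Map<Long, Double>> record =
        parse("-9220680374247203656 [1352180348488328600:2.5,-7757148334301255842:2.5]");
    System.out.println(record.getKey() + " has " + record.getValue().size() + " entries");
  }
}
```

With a record in this shape, everything known about one item arrives as a single key and a single value, which is what makes it convenient for a key-value store.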

        Activity

        Sebastian Schelter added a comment -

        The output is not meant to be viewed directly. It is formatted to be usable with org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity, and it is also compatible with the way the similarities are stored in a database (one pair per record).

        How would you like the output to be and why do you need it that way?

        Sean Owen added a comment -

        I think the suggestion is to output one record per item with a bunch of item-similarity data points as one big value, instead of one per item-item pair.

        Since the output is sorted, these are pretty equivalent. Han, the output is sorted by item ID, so you will read all similarities for one item together – does that help?

        I can imagine small arguments for breaking it up by item-item pair or not. Without a strong reason to change I'd leave it – unless I've missed some good argument why the other format is better.
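The idea of reading a sorted pairwise file one item at a time can be sketched as follows. This is only an illustration under stated assumptions: fields are whitespace-separated as in the sample in the description, lines with the same first item ID are adjacent, and the class, interface, and method names are hypothetical. It groups only the records that appear under a given first ID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical reader, not a Mahout class.
class SortedPairReader {

  /** Called once per run of consecutive lines that share the same first item ID. */
  interface GroupHandler {
    void handle(long itemID, List<String> neighbors);
  }

  /** Streams over the lines and hands each completed group to the handler. */
  static void readGroups(Iterable<String> lines, GroupHandler handler) {
    Long currentID = null;
    List<String> neighbors = new ArrayList<>();
    for (String raw : lines) {
      String line = raw.trim();
      if (line.isEmpty()) {
        continue;
      }
      String[] fields = line.split("\\s+");
      long itemID = Long.parseLong(fields[0]);
      // A change of first ID closes the previous group.
      if (currentID != null && currentID != itemID) {
        handler.handle(currentID, neighbors);
        neighbors = new ArrayList<>();
      }
      currentID = itemID;
      neighbors.add(fields[1] + ":" + fields[2]);
    }
    if (currentID != null) {
      handler.handle(currentID, neighbors);
    }
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList(
        "-7748456450926673883 -4835531939219667484 0.2",
        "-7748456450926673883 -4314955996498817413 0.5",
        "-7747582778903926086 -7015341798833970389 0.05");
    readGroups(lines, (itemID, neighbors) ->
        System.out.println(itemID + " [" + String.join(",", neighbors) + "]"));
  }
}
```

Only one group is held in memory at a time, which is the sense in which the two formats look equivalent to a sequential reader.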

        Han Hui Wen added a comment -

        I store each item's similar items in a key-value data store.

        It's hard to update one item's similar items, because with the current output it is hard to tell whether a given similar item is new.

        If we use the second output style, we can easily replace one item's similar items, and the output is also easy to load into other kinds of storage.

        Han Hui Wen added a comment -

        The first output style causes a problem when updating one item's similar items.

        For example, item A has 10 old similar items. Now I run ItemSimilarityJob again, and it generates 10 new similar items.

        I need to update item A's similar items. With the first output style, it's hard to tell whether a given similar item of item A in the data store is new (generated by this run, so we need to insert it) or old (generated by the previous run, so we need to remove it).

        With the second output style this is easy: we can simply delete all of item A's old similar items and then insert the new ones in one step.
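The difference between the two update strategies can be illustrated with a small in-memory stand-in for a key-value store. This is a sketch only: the `SimilarItemStore` class and its methods are hypothetical, and a real store would have its own API and consistency rules.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a key-value store: item ID -> (similar item ID -> similarity).
class SimilarItemStore {

  private final Map<Long, Map<Long, Double>> store = new HashMap<>();

  /** Pairwise update: writes one (item, other) pair; entries from earlier runs are left in place. */
  void upsertPair(long itemID, long otherID, double similarity) {
    store.computeIfAbsent(itemID, k -> new LinkedHashMap<>()).put(otherID, similarity);
  }

  /** Grouped update: replaces the whole value for one item in a single put. */
  void replace(long itemID, Map<Long, Double> freshSimilarItems) {
    store.put(itemID, new LinkedHashMap<>(freshSimilarItems));
  }

  Map<Long, Double> get(long itemID) {
    return store.getOrDefault(itemID, Collections.<Long, Double>emptyMap());
  }

  public static void main(String[] args) {
    SimilarItemStore s = new SimilarItemStore();
    // First run: item 10 is similar to items 1 and 2.
    s.upsertPair(10L, 1L, 0.5);
    s.upsertPair(10L, 2L, 0.4);
    // Second run, applied pair by pair: item 1 is no longer similar, but its entry survives.
    s.upsertPair(10L, 2L, 0.3);
    s.upsertPair(10L, 3L, 0.2);
    System.out.println("pairwise: " + s.get(10L).keySet());

    // Second run, applied as one grouped record: stale entries disappear.
    Map<Long, Double> fresh = new LinkedHashMap<>();
    fresh.put(2L, 0.3);
    fresh.put(3L, 0.2);
    s.replace(10L, fresh);
    System.out.println("grouped:  " + s.get(10L).keySet());
  }
}
```

The pairwise path needs extra bookkeeping to find and delete stale neighbors, while the grouped path overwrites the value and is done.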

        Sean Owen added a comment -

        That seems the same though, you can read all item-item similarities for one item either way?

        Han Hui Wen added a comment - edited

        Yep.

        1) The end user can add another M/R job after ItemSimilarityJob to convert the output to the second style, if ItemSimilarityJob cannot do this itself.

        2) Or change http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/similarity/item/MostSimilarItemPairsMapper.java?view=markup

        Change:

            long itemID = indexItemIDMap.get(itemIDIndex);
            for (SimilarItem similarItem : topKMostSimilarItems.retrieve()) {
              long otherItemID = similarItem.getItemID();
              // Each pair is written with the smaller item ID first.
              if (itemID < otherItemID) {
                ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));
              } else {
                ctx.write(new EntityEntityWritable(otherItemID, itemID), new DoubleWritable(similarItem.getSimilarity()));
              }
            }

        to:

            long itemID = indexItemIDMap.get(itemIDIndex);
            context.write(itemID, new RecommendedItemsWritable(topKMostSimilarItems.retrieve()));

        3) Or write another mapper for ItemSimilarityJob.

        Sean Owen added a comment -

        Why would another M/R job be needed? A similarly simple change to the reader code works. There's really little difference between these output choices.

        Han Hui Wen added a comment -

        Yep, the change is small.

        But if Mahout does not support it, the end user needs another M/R job to convert the output.

        Sean Owen added a comment -

        This is not so. Whatever process reads the output you imagine can read the current output. The same data is presented sequentially in the file. It does not need any additional M/R.

        Han Hui Wen added a comment - edited

        In http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/similarity/item/MostSimilarItemPairsMapper.java?view=markup

            long itemID = indexItemIDMap.get(itemIDIndex);
            for (SimilarItem similarItem : topKMostSimilarItems.retrieve()) {
              long otherItemID = similarItem.getItemID();
              // Each pair is written with the smaller item ID first.
              if (itemID < otherItemID) {
                ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));
              } else {
                ctx.write(new EntityEntityWritable(otherItemID, itemID), new DoubleWritable(similarItem.getSimilarity()));
              }
            }

        Because of this, reading the output sequentially only yields, for a given item, the similar items whose IDs are greater than that item's ID.

        So if we need all of an item's similar items (both those with IDs greater than the item's and those with IDs less than the item's), we have to hold them in memory, and with a huge data set that requires a lot of memory.
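A small sketch can make this point concrete. It mirrors the smaller-ID-first rule from the quoted mapper code and compares what a reader finds under an item's own key with what it finds when both positions are checked. The class and method names are hypothetical, and the item IDs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not Mahout code.
class CanonicalPairs {

  /** Mirrors the if/else in the quoted mapper: the smaller ID always goes first. */
  static long[] canonical(long itemID, long otherItemID) {
    return itemID < otherItemID
        ? new long[] {itemID, otherItemID}
        : new long[] {otherItemID, itemID};
  }

  /** Neighbors found by looking only at records whose first ID is itemID. */
  static List<Long> neighborsUnderOwnKey(List<long[]> pairs, long itemID) {
    List<Long> result = new ArrayList<>();
    for (long[] pair : pairs) {
      if (pair[0] == itemID) {
        result.add(pair[1]);
      }
    }
    return result;
  }

  /** Neighbors found by checking both positions of every record. */
  static List<Long> allNeighbors(List<long[]> pairs, long itemID) {
    List<Long> result = new ArrayList<>();
    for (long[] pair : pairs) {
      if (pair[0] == itemID) {
        result.add(pair[1]);
      } else if (pair[1] == itemID) {
        result.add(pair[0]);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<long[]> pairs = Arrays.asList(canonical(5, 2), canonical(5, 9), canonical(9, 2));
    System.out.println(neighborsUnderOwnKey(pairs, 5)); // [9]
    System.out.println(allNeighbors(pairs, 5));         // [2, 9]
  }
}
```

Item 5 is similar to both 2 and 9, but the pair with item 2 is filed under key 2, so a reader that groups by the first ID alone sees only item 9.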

        Sean Owen added a comment -

        Ahh.. I understand your point now. You are correct; I apologize. This is not quite the same.

        This change would also cause this to output similarity between B and A, as well as A and B. This does double the output size, with redundant information. I imagine we'd like to avoid that, but I see why you are saying it is better for your use case.

        Hmm Sebastian do you have a view? I'm trying to work out whether this has other consequences.

        Sebastian Schelter added a comment -

        Hmm, I'd like to avoid changing the standard output format, as it is nicely integrated with FileItemSimilarity and takes the symmetry of the output into account. However, since I think ease of use is one of the most important things Mahout currently needs, I'd propose adding an option to make ItemSimilarityJob produce the output Han wishes.

        Han Hui Wen added a comment -

        Thanks, Sebastian and Sean.

        Han Hui Wen added a comment -

        Hi, Sebastian,

        Do you have a plan to make the change?

        Sebastian Schelter added a comment -

        After reading through this again, I think the best option is to have the user add another M/R job after ItemSimilarityJob. Bringing the similarities into the desired format is not difficult, and in this case the format is tied to the integration with a particular production system (atomically updating an entry in a key-value store).
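The logic of such a follow-up job can be sketched in plain Java without the Hadoop API. This is a local simulation under stated assumptions, not an actual M/R job: the "map" step emits every pair under both item IDs, a sorted map plays the role of the shuffle, and the "reduce" step formats one record per item. All names are hypothetical, and the input is assumed to be whitespace-separated as in the sample in the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local simulation of a regrouping job, not Hadoop or Mahout code.
class RegroupSimilarities {

  /** "Map" step: emit the pair under both item IDs so each item sees all of its neighbors. */
  static void map(String line, Map<Long, List<String>> shuffle) {
    String[] fields = line.trim().split("\\s+");
    long a = Long.parseLong(fields[0]);
    long b = Long.parseLong(fields[1]);
    String similarity = fields[2];
    shuffle.computeIfAbsent(a, k -> new ArrayList<>()).add(b + ":" + similarity);
    shuffle.computeIfAbsent(b, k -> new ArrayList<>()).add(a + ":" + similarity);
  }

  /** "Reduce" step: one output record per item, in the grouped style. */
  static String reduce(long itemID, List<String> values) {
    return itemID + " [" + String.join(",", values) + "]";
  }

  /** Runs map, an in-memory stand-in for the shuffle, and reduce over the given lines. */
  static List<String> run(List<String> lines) {
    Map<Long, List<String>> shuffle = new TreeMap<>();
    for (String line : lines) {
      map(line, shuffle);
    }
    List<String> output = new ArrayList<>();
    for (Map.Entry<Long, List<String>> entry : shuffle.entrySet()) {
      output.add(reduce(entry.getKey(), entry.getValue()));
    }
    return output;
  }

  public static void main(String[] args) {
    for (String record : run(Arrays.asList("1 2 0.5", "1 3 0.25", "2 3 0.1"))) {
      System.out.println(record);
    }
  }
}
```

Emitting each pair twice doubles the intermediate data, which matches the trade-off raised earlier in the thread. In a real job the framework's shuffle would do the grouping, so no single process would need to hold all pairs in memory.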


          People

          • Assignee: Unassigned
          • Reporter: Han Hui Wen
          • Votes: 0
          • Watchers: 2
