Mahout
MAHOUT-359

org.apache.mahout.cf.taste.hadoop.item.RecommenderJob for Boolean recommendation

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4
    • Fix Version/s: 0.4
    • Labels:
      None

      Description

      In some cases there is no preference value in the input data, so the preference value is set to zero. Then, in RecommenderMapper.class:

      @Override
      public void map(LongWritable userID,
                      VectorWritable vectorWritable,
                      OutputCollector<LongWritable,RecommendedItemsWritable> output,
                      Reporter reporter) throws IOException {

        if ((usersToRecommendFor != null) && !usersToRecommendFor.contains(userID.get())) {
          return;
        }

        Vector userVector = vectorWritable.get();
        Iterator<Vector.Element> userVectorIterator = userVector.iterateNonZero();
        Vector recommendationVector = new RandomAccessSparseVector(Integer.MAX_VALUE, 1000);
        while (userVectorIterator.hasNext()) {
          Vector.Element element = userVectorIterator.next();
          int index = element.index();
          double value = element.get(); // here will get 0.0 for Boolean recommendation
          Vector columnVector;
          try {
            columnVector = cooccurrenceColumnCache.get(new IntWritable(index));
          } catch (TasteException te) {
            if (te.getCause() instanceof IOException) {
              throw (IOException) te.getCause();
            } else {
              throw new IOException(te.getCause());
            }
          }
          if (columnVector != null) {
            // here will set all score values to zero for Boolean recommendation
            columnVector.times(value).addTo(recommendationVector);
          }
        }

        Queue<RecommendedItem> topItems = new PriorityQueue<RecommendedItem>(recommendationsPerUser + 1,
            Collections.reverseOrder());

        Iterator<Vector.Element> recommendationVectorIterator = recommendationVector.iterateNonZero();
        LongWritable itemID = new LongWritable();
        while (recommendationVectorIterator.hasNext()) {
          Vector.Element element = recommendationVectorIterator.next();
          int index = element.index();
          if (userVector.get(index) == 0.0) {
            if (topItems.size() < recommendationsPerUser) {
              indexItemIDMap.get(new IntWritable(index), itemID);
              topItems.add(new GenericRecommendedItem(itemID.get(), (float) element.get()));
            } else if (element.get() > topItems.peek().getValue()) {
              indexItemIDMap.get(new IntWritable(index), itemID);
              topItems.add(new GenericRecommendedItem(itemID.get(), (float) element.get()));
              topItems.poll();
            }
          }
        }

        List<RecommendedItem> recommendations = new ArrayList<RecommendedItem>(topItems.size());
        recommendations.addAll(topItems);
        Collections.sort(recommendations);
        output.collect(userID, new RecommendedItemsWritable(recommendations));
      }

      So maybe we need an option to distinguish boolean recommendation from slope-one recommendation.

      In ToUserVectorReducer.class there is no need for findTopNPrefsCutoff; maybe we can take all items.

      It's just my thinking; maybe it is used for slope one only.
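
      To make the effect concrete, here is a small illustrative snippet (not part of any patch; the vector size, indices, and values are arbitrary) showing how a 0.0 preference value wipes out the co-occurrence column's contribution to the recommendation vector:

      import org.apache.mahout.math.RandomAccessSparseVector;
      import org.apache.mahout.math.Vector;

      public class ZeroPreferenceDemo {
        public static void main(String[] args) {
          // a co-occurrence column with two non-zero counts
          Vector columnVector = new RandomAccessSparseVector(10);
          columnVector.set(3, 2.0);
          columnVector.set(7, 5.0);

          Vector recommendationVector = new RandomAccessSparseVector(10);

          double value = 0.0; // boolean input data without explicit preference values
          columnVector.times(value).addTo(recommendationVector);

          // every accumulated score stays 0.0, so no item can be ranked
          System.out.println(recommendationVector.get(3)); // 0.0
          System.out.println(recommendationVector.get(7)); // 0.0
        }
      }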

        Activity

        Han Hui Wen added a comment -

        This works fine, thanks Sean.

        Sean Owen added a comment -

        I think I understand your ideas. I committed a change that better optimized for 'boolean' data. Unfortunately I need to add a command line flag for this: "--booleanData true". But I believe it should work more efficiently. I'd appreciate it if you can try it out. This is very helpful.
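
        For reference, a hypothetical way to run the job with that flag from Java, assuming RecommenderJob can be driven through Hadoop's ToolRunner (the input/output paths are placeholders):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.util.ToolRunner;
        import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

        public class RunBooleanRecommenderJob {
          public static void main(String[] args) throws Exception {
            String[] jobArgs = {
                "--input", "/path/to/prefs.csv",        // placeholder input path
                "--output", "/path/to/recommendations", // placeholder output path
                "--booleanData", "true"                 // the new flag mentioned above
            };
            ToolRunner.run(new Configuration(), new RecommenderJob(), jobArgs);
          }
        }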

        Han Hui Wen added a comment -

        If it can be optimized to avoid the multiplication and to avoid findTopNPrefsCutoff() for non-existent preferences, it will improve the performance.

        I also found another issue:

        UserVectorToCooccurrenceMapper can have at most 2 map tasks (I need to test more here).

        Sean Owen added a comment -

        Yes, most of the recommender code makes this distinction. When user preferences are translated to a vector, the 'null' values are necessarily mapped to zero. This makes sense for vectors. It has some implications for the algorithms written to use vectors, but I believe they are fine.

        RandomAccessSparseVector stores only non-zero entries in the 'values' field. iterateNonZero iterates only over entries in 'values'. This is how it skips zero entries.
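
        A minimal sketch of that behaviour (illustrative only; the indices and values are arbitrary):

        import java.util.Iterator;
        import org.apache.mahout.math.RandomAccessSparseVector;
        import org.apache.mahout.math.Vector;

        public class IterateNonZeroDemo {
          public static void main(String[] args) {
            Vector v = new RandomAccessSparseVector(100);
            v.set(5, 1.0);
            v.set(17, 3.0);
            // indices that were never set have an implicit value of 0.0 and are not stored in 'values'

            Iterator<Vector.Element> it = v.iterateNonZero();
            while (it.hasNext()) {
              Vector.Element e = it.next();
              System.out.println(e.index() + " -> " + e.get()); // prints only indices 5 and 17
            }
          }
        }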

        Is the issue performance – what part in particular seems slow?

        Han Hui Wen added a comment -

        My problem is how to handle the case of non-existent preferences with high performance and more clearly.

        How about the following solution for non-existent preferences?

        We add an option to distinguish non-existent preferences from existing preferences, so there is no need to consider setting the preference to 1 or 0, and the end user can use this implementation clearly.

        The loop in RecommenderMapper loops only over non-zero values; maybe I need to read the code more deeply.

        Iterator<Vector.Element> userVectorIterator = userVector.iterateNonZero();

        I checked the source of the iterateNonZero method of RandomAccessSparseVector

        http://svn.apache.org/viewvc/lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java?view=markup

        but did not find how it skips zero items.

        Sean Owen added a comment -

        No preference is represented by the absence of a preference – 'null', maybe. It's not represented by a preference of value 0, normally.

        But yes when put into a user vector, we have to give a value. Non-existence preferences are modelled as a 0. This makes preferences of 0 indistinguishable from no preference, unfortunately, in these Hadoop-based, vector-based implementations, but it doesn't usually cause an issue in practice.

        The loop in RecommenderMapper loops only over non-zero values, so 'value' is never 0 in the line you cite.

        In the case of boolean preferences, all values are 1. (I could optimize this and avoid the multiplication, I guess.) But that's not your issue is it?
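
        For what it's worth, a rough sketch of that possible optimization applied to the loop quoted in the description (the booleanData flag here is hypothetical, not committed code):

        if (columnVector != null) {
          if (booleanData) {
            // all preference values are 1.0, so the multiplication is a no-op and can be skipped
            columnVector.addTo(recommendationVector);
          } else {
            columnVector.times(value).addTo(recommendationVector);
          }
        }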

        I also agree we can optimize findTopNPrefsCutoff(). For boolean data, the cutoff is 1.0, and all preferences are kept. We might want to keep a random n items. For now, it's not broken, right? It's just keeping more data than we might desire.

        Does that resolve your issue? Maybe you can otherwise help me understand the problem you are having.

        Han Hui Wen added a comment -

        Sorry for confusing you.

        I have two questions:

        1) If an item has no preference, we normally set the preference to 0 or null in the database:

        if (columnVector != null) { columnVector.times(value).addTo(recommendationVector); } will then cause all values to be the same.

        2) If an item has no preference, we normally set the preference to 0 or null:

        In ToUserVectorReducer.class we do not need to select the top N items, because they all have the same default value (0 or null).

        We can take all items; it will decrease the calculation time and so improve performance.

        Thanks very much.

        Sean Owen added a comment -

        I am not sure I understand the issue yet. This class has nothing to do with slope one.

        Are you looking at this line?

        if (userVector.get(index) == 0.0) {

        This basically asks, did the user express no preference for the item? If so, then the item is recommendable. This ought to work fine for boolean preferences too.

        But then actually, ToItemPrefsMapper assumes you have preference values. That, I can change. There's no need for that assumption – it can default to 1.0f.
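
        A rough sketch of what that default could look like when parsing an input record (assuming the usual userID,itemID[,prefValue] comma-separated format; this is not the actual commit, and 'line' stands for one record of the input file):

        String[] tokens = line.split(",");
        long userID = Long.parseLong(tokens[0]);
        long itemID = Long.parseLong(tokens[1]);
        // default to 1.0f when no explicit preference value is present
        float prefValue = tokens.length > 2 ? Float.parseFloat(tokens[2]) : 1.0f;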

        But then I don't understand your comment about ToUserVectorReducer? I think it's still important for scalability to perhaps cap the size of vectors.


          People

          • Assignee:
            Sean Owen
          • Reporter:
            Han Hui Wen
