Solr
  1. Solr
  2. SOLR-2800

optimize RemoveDuplicatesTokenFilterFactory

    Details

      Description

      Using RemoveDuplicatesTokenFilterFactory can not remove the duplicated term.

      in http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_3_4/solr/core/src/java/org/apache/solr/analysis/RemoveDuplicatesTokenFilter.java?view=markup

      @Override
      53 public boolean incrementToken() throws IOException {
      54 while (input.incrementToken()) {
      55 final char term[] = termAttribute.buffer();
      56 final int length = termAttribute.length();
      57 final int posIncrement = posIncAttribute.getPositionIncrement();
      58
      59 if (posIncrement > 0)

      { 60 previous.clear(); 61 }

      62
      63 boolean duplicate = (posIncrement == 0 && previous.contains(term, 0, length));
      64
      65 // clone the term, and add to the set of seen terms.
      66 char saved[] = new char[length];
      67 System.arraycopy(term, 0, saved, 0, length);
      68 previous.add(saved);
      69
      70 if (!duplicate)

      { 71 return true; 72 }

      73 }
      74 return false;
      75 }

      it should be like following:
      @Override
      public boolean incrementToken() throws IOException {
      while (input.incrementToken()) {
      final char term[] = termAttribute.buffer();
      final int length = termAttribute.length();
      final int posIncrement = posIncAttribute.getPositionIncrement();

      if (posIncrement > 0)

      { previous.clear(); }

      boolean duplicate = (posIncrement == 0 && previous.contains(term, 0, length));

      if(duplicate )

      { return false; }

      else

      { // clone the term, and add to the set of seen terms. char saved[] = new char[length]; System.arraycopy(term, 0, saved, 0, length); previous.add(saved); }

      }
      return true;
      }

        Activity

        Hide
        Steve Rowe added a comment -

        Can you provide a test case? Your version seems more efficient, since duplicates are not placed in the CharArraySet, but the functionality looks the same to me as the original. Your test case should demonstrate what is failing without your changes, and that your changes fix the problem.

        Please provide code/tests in the form of a patch - it's much easier to see what you have changed/added. Also, it's very important when you provide code that you attach a patch to the issue and click where it says "Grant license to ASF for inclusion in ASF works (as per the Apache License §5)" - unless you do this, we can't use your code.

        Show
        Steve Rowe added a comment - Can you provide a test case? Your version seems more efficient, since duplicates are not placed in the CharArraySet, but the functionality looks the same to me as the original. Your test case should demonstrate what is failing without your changes, and that your changes fix the problem. Please provide code/tests in the form of a patch - it's much easier to see what you have changed/added. Also, it's very important when you provide code that you attach a patch to the issue and click where it says "Grant license to ASF for inclusion in ASF works (as per the Apache License §5)" - unless you do this, we can't use your code.
        Hide
        Robert Muir added a comment -

        Your version seems more efficient, since duplicates are not placed in the CharArraySet, but the functionality looks the same to me as the original.

        Well if you have Joe (Posinc=1), Pizza (posinc=0), Joe (posinc=0), Beer (posinc=0), Pizza (posinc=0), will you end up with 1 Joe, 1 Pizza, and 1 Beer (This is how the filter is defined)?

        Thats why the set is needed I think.

        Show
        Robert Muir added a comment - Your version seems more efficient, since duplicates are not placed in the CharArraySet, but the functionality looks the same to me as the original. Well if you have Joe (Posinc=1), Pizza (posinc=0), Joe (posinc=0), Beer (posinc=0), Pizza (posinc=0), will you end up with 1 Joe, 1 Pizza, and 1 Beer (This is how the filter is defined)? Thats why the set is needed I think.
        Hide
        Steve Rowe added a comment -

        I agree the set is needed, but the current code clones and then adds every single term to the set, regardless of whether it's already in the set.

        Show
        Steve Rowe added a comment - I agree the set is needed, but the current code clones and then adds every single term to the set, regardless of whether it's already in the set.
        Hide
        Robert Muir added a comment -

        Yeah thats stupid, we only need to clone the term if it doesnt exist... on the other hand its not too bad since this is the 'common case', but we should fix it.

        Show
        Robert Muir added a comment - Yeah thats stupid, we only need to clone the term if it doesnt exist ... on the other hand its not too bad since this is the 'common case', but we should fix it.
        Hide
        Steve Rowe added a comment -

        Wen, your changes introduce a bug: incrementToken() should return false only at end-of-stream.

        That is, instead of:

        if(duplicate )
        { return false; }
        

        you should use:

        if (duplicate) { continue; }
        
        Show
        Steve Rowe added a comment - Wen, your changes introduce a bug: incrementToken() should return false only at end-of-stream. That is, instead of: if(duplicate ) { return false; } you should use: if (duplicate) { continue; }
        Hide
        Hoss Man added a comment -

        Bulk changing fixVersion 3.6 to 4.0 for any open issues that are unassigned and have not been updated since March 19.

        Email spam suppressed for this bulk edit; search for hoss20120323nofix36 to identify all issues edited

        Show
        Hoss Man added a comment - Bulk changing fixVersion 3.6 to 4.0 for any open issues that are unassigned and have not been updated since March 19. Email spam suppressed for this bulk edit; search for hoss20120323nofix36 to identify all issues edited
        Hide
        Hoss Man added a comment -

        bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment

        Show
        Hoss Man added a comment - bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment
        Hide
        Robert Muir added a comment -

        rmuir20120906-bulk-40-change

        Show
        Robert Muir added a comment - rmuir20120906-bulk-40-change
        Hide
        Hoss Man added a comment -

        removing fixVersion=4.0 since there is no evidence that anyone is currently working on this issue.

        But also assigning to Robert Muir since if i'm understanding his comments, it seems he thinks there is an easy win here, we just need a test case.

        Show
        Hoss Man added a comment - removing fixVersion=4.0 since there is no evidence that anyone is currently working on this issue. But also assigning to Robert Muir since if i'm understanding his comments, it seems he thinks there is an easy win here, we just need a test case.
        Hide
        Robert Muir added a comment -

        here's the optimization. basically it saves a clone etc only for additional posinc=0 duplicates that we aren't going to return anyway.

        Show
        Robert Muir added a comment - here's the optimization. basically it saves a clone etc only for additional posinc=0 duplicates that we aren't going to return anyway.

          People

          • Assignee:
            Robert Muir
            Reporter:
            Han Hui Wen
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:

              Development