SOLR-1869: RemoveDuplicatesTokenFilter doesn't have expected behaviour

      Description

      The RemoveDuplicatesTokenFilter seems broken, as it initializes its map and attributes at the class level and not within its constructor.

      In addition, I would think the expected behaviour would be to remove identical terms with the same offset positions; instead it looks like it removes duplicates based on position increment, which won't work when using it after something like the EdgeNGram filter. When I posted this to the mailing list, even Erik Hatcher seemed to think that's what this filter was supposed to do...

      Attaching a patch that has the expected behaviour and initializes variables in the constructor.
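
      The attached patch is not inlined in this issue, but a minimal sketch of the behaviour it aims for might look like the following, assuming the current Lucene attribute API; the class name matches the attachment, while the HashSet keyed on term text plus offsets is an illustrative choice, not necessarily what the patch actually does:

      import java.io.IOException;
      import java.util.HashSet;
      import java.util.Set;

      import org.apache.lucene.analysis.TokenFilter;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
      import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

      /** Drops any token whose term text and start/end offsets were already seen in this field. */
      public final class RemoveDupOffsetTokenFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
        private final Set<String> seen = new HashSet<>();

        public RemoveDupOffsetTokenFilter(TokenStream input) {
          super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
          while (input.incrementToken()) {
            // key on term text plus offsets, so a second [ex,0,2] is dropped
            String key = termAtt.toString() + '\u0000'
                + offsetAtt.startOffset() + '\u0000' + offsetAtt.endOffset();
            if (seen.add(key)) {
              return true;
            }
          }
          return false;
        }

        @Override
        public void reset() throws IOException {
          super.reset();
          seen.clear(); // required so the filter works across reusable token streams
        }
      }

      Note that this sketch does no position-increment bookkeeping for the dropped tokens, which is part of what the discussion below turns on.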

      1. RemoveDupOffsetTokenFilter.java
        3 kB
        Joe Calderon
      2. RemoveDupOffsetTokenFilterFactory.java
        0.3 kB
        Joe Calderon
      3. SOLR-1869.patch
        3 kB
        Joe Calderon

        Activity

        Joe Calderon added a comment -

        I'm closing this since there is no bug in the RemoveDuplicates filter; I will post another patch in another ticket if I ever get around to patching the highlighter.

        Joe Calderon added a comment -

        Yes, I think modifying the highlighter would be a better solution here...

        Robert Muir added a comment -

        this all started because the highlighter was highlighting a term at the same offsets twice

        Perhaps we should fix this directly in DefaultSolrHighlighter? It already has this TokenStream-sorting filter that's intended to do the following:

        /** Orders Tokens in a window first by their startOffset ascending.
         * endOffset is currently ignored.
         * This is meant to work around fickleness in the highlighter only.  It
         * can mess up token positions and should not be used for indexing or querying.
         */
        

        Maybe the deduplication logic should occur here after it sorts on startOffset?
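
        Purely as an illustration of that idea (hypothetical names, and a minimal stand-in type instead of the highlighter's real token class), the dedup pass over an already-sorted window could look like:

        import java.util.Iterator;
        import java.util.List;

        /** Minimal stand-in for an entry in the highlighter's token window (illustrative only). */
        final class Tok {
          final String text;
          final int start, end;
          Tok(String text, int start, int end) { this.text = text; this.start = start; this.end = end; }
        }

        final class HighlightDedup {
          /** Collapses adjacent duplicates in a window already sorted by startOffset:
           *  same text over the same [start, end) span means the same highlight twice. */
          static void removeDuplicates(List<Tok> sortedWindow) {
            Iterator<Tok> it = sortedWindow.iterator();
            Tok prev = null;
            while (it.hasNext()) {
              Tok cur = it.next();
              if (prev != null && cur.start == prev.start && cur.end == prev.end
                  && cur.text.equals(prev.text)) {
                it.remove(); // duplicate span, e.g. the second [ex,0,2]
              } else {
                prev = cur;
              }
            }
          }
        }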

        Joe Calderon added a comment -

        "at the same position and Term text as the previous token " is ambiguous, i assumed position to mean same start and end offsets, hence i assumed there was a bug.

        i changed the filter to use CharArraySet, there was already a call to previous.clear() in reset(). Since the filter name is different i attached its accompanying factory.

        this all started because the highlighter was highlighting a term at the same offsets twice, for example if i had a word with a synonym [ex-con,0,6] and [excon,0,5] then ran it through edgengram filter i would end up with two tokens [ex, 0,2] with different position increments, the highlighted snippet was then "<em>ex</em><em>ex</em>-con", i posted this on the mailing list and RemoveDuplicatesTokenFilter was suggested.
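
        The accompanying factory attachment is likewise not inlined; a minimal sketch against the current Lucene TokenFilterFactory API (the package and base class differ in older Lucene/Solr releases, which used BaseTokenFilterFactory), assuming the filter class sketched under the description above:

        import java.util.Map;

        import org.apache.lucene.analysis.TokenFilterFactory;
        import org.apache.lucene.analysis.TokenStream;

        public class RemoveDupOffsetTokenFilterFactory extends TokenFilterFactory {
          public RemoveDupOffsetTokenFilterFactory(Map<String, String> args) {
            super(args);
            if (!args.isEmpty()) {
              throw new IllegalArgumentException("Unknown parameters: " + args);
            }
          }

          @Override
          public TokenStream create(TokenStream input) {
            // wrap the stream with the offset-based dedup filter sketched earlier
            return new RemoveDupOffsetTokenFilter(input);
          }
        }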

        Robert Muir added a comment -

        The CharArrayMap is more performant in lookup, but you are right, we may need posincr.

        We don't need it for the current implementation, as we clear() the CharArraySet when we encounter a term with posincr > 0, so the set is only a set of "seen terms" at some position.
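
        For reference, the core loop of the shipped filter works roughly like this (a reconstructed sketch, not a verbatim copy; the CharArraySet constructor and package shown are the modern ones, and the class name here is illustrative):

        import java.io.IOException;

        import org.apache.lucene.analysis.CharArraySet;
        import org.apache.lucene.analysis.TokenFilter;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
        import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

        public final class PositionDedupSketch extends TokenFilter {
          private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
          private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
          private final CharArraySet previous = new CharArraySet(8, false);

          public PositionDedupSketch(TokenStream in) {
            super(in);
          }

          @Override
          public boolean incrementToken() throws IOException {
            while (input.incrementToken()) {
              final char[] term = termAtt.buffer();
              final int length = termAtt.length();
              final int posInc = posIncAtt.getPositionIncrement();
              if (posInc > 0) {
                previous.clear(); // new position: forget terms seen at the old one
              }
              boolean duplicate = (posInc == 0 && previous.contains(term, 0, length));
              char[] saved = new char[length];
              System.arraycopy(term, 0, saved, 0, length);
              previous.add(saved); // remember this term at the current position
              if (!duplicate) {
                return true;
              }
            }
            return false;
          }

          @Override
          public void reset() throws IOException {
            super.reset();
            previous.clear();
          }
        }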

        Uwe Schindler added a comment -

        the RemoveDuplicatesTokenFilter seems broken as it initializes its map and attributes at the class level and not within its constructor

        The filter is correct. Those are final instance fields, and the constructor autogenerated by javac does the same thing, so there is no need to move them into the constructor. In Lucene/Solr all TokenStreams are written this way; that's our code style for TokenStreams.
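
        A tiny illustration of that point (hypothetical classes): a field initializer and an explicit constructor assignment compile to the same thing, since javac moves instance field initializers into every constructor.

        class A {
          private final int x = 42; // initialized at the declaration
        }

        class B {
          private final int x;
          B() { x = 42; }           // explicit constructor assignment; same effect in bytecode
        }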

        The CharArrayMap is more performant in lookup, but you are right, we may need posincr. In general the Map should really simply be a CharArraySet or HashSet<String>, and the check should use contains.

        But I don't understand the rest of the patch.

        Robert Muir added a comment -

        Joe, the initialization is the same. I simply prefer to do this right where the attribute is declared, rather than doing it in the ctor (it's the same in Java!). So this is no problem.

        As far as the behavior goes, the filter is currently correct:

        A TokenFilter which filters out Tokens at the same position and Term text as the previous token in the stream.
        

        If you want to instead create a filter that removes duplicates across an entire field, that is really a completely different filter, but it sounds like a useful completely different filter!

        Can you instead create a patch for a separate filter with a different name?

        I think you can start with this patch, but there are a number of issues with it:

        • the map/set is never cleared, so it won't work across reusable tokenstreams; the map/set should be cleared in reset()
        • I would use CharArraySet instead of this map, like the current RemoveDuplicatesTokenFilter does

          People

          Assignee: Unassigned
          Reporter: Joe Calderon
          Votes: 0
          Watchers: 1
