Lucene - Core
  1. Lucene - Core
  2. LUCENE-2901

KeywordMarkerFilter resets keyword attribute state to false for tokens not in protwords.txt

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      KeywordMarkerFilter sets true or false for the KeywordAttribute on all tokens. This erases previous state established further up the filter chain, for example in the case where a custom filter wants to prevent a token from being stemmed.

      If a token is already marked as a keyword (KeywordAttribute.isKeyword() == true), perhaps the KeywordMarkerFilterFactory should not re-set the state to false.

      1. LUCENE-2901.patch
        0.8 kB
        Drew Farris
      2. LUCENE-2901.patch
        0.8 kB
        Drew Farris

        Activity

        Hide
        Robert Muir added a comment -

        Hi Drew, I agree this is a bug.

        I think it should simply do what the javadoc says:

        if (keywordSet.contains(word))
          keywordAtt.setKeyword(true);
        
        Show
        Robert Muir added a comment - Hi Drew, I agree this is a bug. I think it should simply do what the javadoc says: if (keywordSet.contains(word)) keywordAtt.setKeyword(true);
        Hide
        Simon Willnauer added a comment -

        If a token is already marked as a keyword (KeywordAttribute.isKeyword() == true), perhaps the KeywordMarkerFilterFactory should not re-set the state to false.

        hey Drew, I think I see you point here but I disagree that we should consider the previous state of the KeywordAttribute in this filter. IMO this filter is very clearly defined and extremely simple. If you want it to have different behavior eg. considering the attrs state you can simply write your own filter that. This one is more or less just a simple PoC how the KeywordAttribute works / should be used etc.

        I think if we change the behavior we will have JIRA issues that say "KeywordMarkerFilter does not reset attribute state to false for tokens not in protwords.txt" - you know what I mean, this one is tricky and special use-cases should have their own kind of trivial impls.

        Show
        Simon Willnauer added a comment - If a token is already marked as a keyword (KeywordAttribute.isKeyword() == true), perhaps the KeywordMarkerFilterFactory should not re-set the state to false. hey Drew, I think I see you point here but I disagree that we should consider the previous state of the KeywordAttribute in this filter. IMO this filter is very clearly defined and extremely simple. If you want it to have different behavior eg. considering the attrs state you can simply write your own filter that. This one is more or less just a simple PoC how the KeywordAttribute works / should be used etc. I think if we change the behavior we will have JIRA issues that say "KeywordMarkerFilter does not reset attribute state to false for tokens not in protwords.txt" - you know what I mean, this one is tricky and special use-cases should have their own kind of trivial impls.
        Hide
        Robert Muir added a comment -

        Simon, i think the key point here is that the code does this:

        keywordAtt.setKeyword(keywordSet.contains(word));
        

        I am suggesting this be changed to:

        if (keywordSet.contains(word))
          keywordAtt.setKeyword(true);
        

        This is more consistent with what the javadoc says it does: it doesn't speak of unsetting any attributes.

        Show
        Robert Muir added a comment - Simon, i think the key point here is that the code does this: keywordAtt.setKeyword(keywordSet.contains(word)); I am suggesting this be changed to: if (keywordSet.contains(word)) keywordAtt.setKeyword(true); This is more consistent with what the javadoc says it does: it doesn't speak of unsetting any attributes.
        Hide
        Simon Willnauer added a comment -

        Simon, i think the key point here is that the code does this:

        so lets not rush on this - As far as I can recall this has not been released right? so we might rather figure out what makes most sense, fixing code or fixing javadoc. if you have 4 filters A < – A_kw < – B <-- B_kw <-- Tokenizer and the term is in B_kw but not in A_kw the current code does the right thing while the change you suggest doesn't IMO. So I wonder if we should rather fix the javadoc than the code. I mean if somebody needs this behavior this is super simple to implement. Or maybe we have an option to reset keyword or not?

        Show
        Simon Willnauer added a comment - Simon, i think the key point here is that the code does this: so lets not rush on this - As far as I can recall this has not been released right? so we might rather figure out what makes most sense, fixing code or fixing javadoc. if you have 4 filters A < – A_kw < – B <-- B_kw <-- Tokenizer and the term is in B_kw but not in A_kw the current code does the right thing while the change you suggest doesn't IMO. So I wonder if we should rather fix the javadoc than the code. I mean if somebody needs this behavior this is super simple to implement. Or maybe we have an option to reset keyword or not?
        Hide
        Robert Muir added a comment -

        Honestly I didnt know the filter did what drew reported, or i would have raised this bug myself.

        The problem is this is inconsistent even with the other filters that use keyword attribute.
        Its more obvious with the decorator pattern that if you have:

        KeywordFilter( { "dogs", "birds", "trees", "houses" })
        

        That this is equivalent to:

        KeywordFilter( { "dogs", "birds" } )
        KeywordFilter( { "trees", "houses" })
        

        This is how all the other tokenstreams work, e.g. StopFilter etc.

        Show
        Robert Muir added a comment - Honestly I didnt know the filter did what drew reported, or i would have raised this bug myself. The problem is this is inconsistent even with the other filters that use keyword attribute. Its more obvious with the decorator pattern that if you have: KeywordFilter( { "dogs", "birds", "trees", "houses" }) That this is equivalent to: KeywordFilter( { "dogs", "birds" } ) KeywordFilter( { "trees", "houses" }) This is how all the other tokenstreams work, e.g. StopFilter etc.
        Hide
        Simon Willnauer added a comment -

        if (keywordSet.contains(word))
        keywordAtt.setKeyword(true);

        I just looked at the code and that change makes sense actually - it should really be consistent to StopFitler etc.

        simon

        Show
        Simon Willnauer added a comment - if (keywordSet.contains(word)) keywordAtt.setKeyword(true); I just looked at the code and that change makes sense actually - it should really be consistent to StopFitler etc. simon
        Hide
        Drew Farris added a comment -

        Revised patch that implements:

        if (keywordSet.contains(word))
          keywordAtt.setKeyword(true);
        
        Show
        Drew Farris added a comment - Revised patch that implements: if (keywordSet.contains(word)) keywordAtt.setKeyword( true );
        Hide
        Robert Muir added a comment -

        Committed revision 1065621, 1065626 (branch_3x)

        Thanks Drew... glad you caught this one!

        Show
        Robert Muir added a comment - Committed revision 1065621, 1065626 (branch_3x) Thanks Drew... glad you caught this one!
        Hide
        Grant Ingersoll added a comment -

        Bulk close for 3.1

        Show
        Grant Ingersoll added a comment - Bulk close for 3.1

          People

          • Assignee:
            Robert Muir
            Reporter:
            Drew Farris
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development