Lucene - Core
  1. Lucene - Core
  2. LUCENE-4822

Add PatternKeywordTokenFilter to marks keywords based on regular expressions

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 4.2
    • Fix Version/s: 4.3, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      today we need to pass in an explicit set of terms that we want to marks as keywords. It might make sense to allow patterns as well to prevent certain suffixes etc. to be keyworded.

      1. LUCENE-4822.patch
        142 kB
        Simon Willnauer
      2. LUCENE-4822.patch
        143 kB
        Uwe Schindler
      3. LUCENE-4822.patch
        143 kB
        Uwe Schindler
      4. LUCENE-4822.patch
        140 kB
        Simon Willnauer

        Activity

        Hide
        Simon Willnauer added a comment -

        here is a patch (pretty big though). I made KeywordMarkerFilter an abstract base class and added two subclasses PatternKeywordMarkerFilter and SetKeywordMarkerFilter. I also updated the factory to support both based on given arguments. I think this can be useful for others as well if they need to make keyword decisions based on more complex logic.

        Show
        Simon Willnauer added a comment - here is a patch (pretty big though). I made KeywordMarkerFilter an abstract base class and added two subclasses PatternKeywordMarkerFilter and SetKeywordMarkerFilter. I also updated the factory to support both based on given arguments. I think this can be useful for others as well if they need to make keyword decisions based on more complex logic.
        Hide
        Uwe Schindler added a comment -

        In my opinion, the abstract isKeyword() methods should take no arguments and use the CharTermAttribute directly (it would then be consistent to the accept() method of the abstract base class FilteringTokenFilter of e.g. StopFilter). We would also need no spare CharsRef, as CharTermAttribute implements CharSequence, so Matcher can handle it directly without cloning.

        The explicit access to the buffer, offset and length is implementation specific to only the one backed by CharArraySet.

        Show
        Uwe Schindler added a comment - In my opinion, the abstract isKeyword() methods should take no arguments and use the CharTermAttribute directly (it would then be consistent to the accept() method of the abstract base class FilteringTokenFilter of e.g. StopFilter). We would also need no spare CharsRef, as CharTermAttribute implements CharSequence, so Matcher can handle it directly without cloning. The explicit access to the buffer, offset and length is implementation specific to only the one backed by CharArraySet.
        Hide
        Uwe Schindler added a comment -

        In addition: CharArraySet allows to do "contains(CharSequence)", so we dont need to pass the array directly. contains(termAtt) is enough to check for existence in the set.

        If you dont want to remove the parameter of isKeyword, make it CharSequence.

        Show
        Uwe Schindler added a comment - In addition: CharArraySet allows to do "contains(CharSequence)", so we dont need to pass the array directly. contains(termAtt) is enough to check for existence in the set. If you dont want to remove the parameter of isKeyword, make it CharSequence.
        Hide
        Simon Willnauer added a comment -

        ah uwe that makes perfect sense. I will update

        Show
        Simon Willnauer added a comment - ah uwe that makes perfect sense. I will update
        Hide
        Uwe Schindler added a comment -

        Patch with my suggestions.

        Show
        Uwe Schindler added a comment - Patch with my suggestions.
        Hide
        Uwe Schindler added a comment -

        The good thing with isKeyword being parameter-free is the possibility to also mark "keywords" based on other attributes (e.g. when Kumoroji sets a specific base form, or if a token has no position increment, or if you have a certain unicode range like arabic chars that should never be passed to a stemmer,...; the examples maybe nonsense but shows the possibilities).

        Show
        Uwe Schindler added a comment - The good thing with isKeyword being parameter-free is the possibility to also mark "keywords" based on other attributes (e.g. when Kumoroji sets a specific base form, or if a token has no position increment, or if you have a certain unicode range like arabic chars that should never be passed to a stemmer,...; the examples maybe nonsense but shows the possibilities).
        Hide
        Simon Willnauer added a comment -

        good stuff +1 you were like 2 seconds faster than I was

        Show
        Simon Willnauer added a comment - good stuff +1 you were like 2 seconds faster than I was
        Hide
        Uwe Schindler added a comment -

        small change for safety: explicitely cast the termAttribute to CharSequence to be sure, the right (and most effective method) is called in CharArraySet. Otherwise it might happen that contains(Object) is called.

        Show
        Uwe Schindler added a comment - small change for safety: explicitely cast the termAttribute to CharSequence to be sure, the right (and most effective method) is called in CharArraySet. Otherwise it might happen that contains(Object) is called.
        Hide
        Simon Willnauer added a comment -

        added changes.txt entries and markers for BWcompat, made PatternKeywordTokenFilter final and used the array based lookup method in CharArraySet to prevent any confusion.

        I think it's ready

        Show
        Simon Willnauer added a comment - added changes.txt entries and markers for BWcompat, made PatternKeywordTokenFilter final and used the array based lookup method in CharArraySet to prevent any confusion. I think it's ready
        Hide
        Uwe Schindler added a comment -

        +1 - the array based method may be faster on some JVMs, so using the array for the CharArraySet based lookup might be a good idea.

        I am still not sure, if we should break backwards in 4.x. I am happy that the factory did not change its name, so solr is not affcted.

        Show
        Uwe Schindler added a comment - +1 - the array based method may be faster on some JVMs, so using the array for the CharArraySet based lookup might be a good idea. I am still not sure, if we should break backwards in 4.x. I am happy that the factory did not change its name, so solr is not affcted.
        Hide
        Simon Willnauer added a comment -

        I am still not sure, if we should break backwards in 4.x. I am happy that the factory did not change its name, so solr is not affcted.

        I am not too concerned about this really. it's a module and something that is super easy to fix for users. I wouldn't worry about it too much I think nameing and abstraction is more important here.

        Show
        Simon Willnauer added a comment - I am still not sure, if we should break backwards in 4.x. I am happy that the factory did not change its name, so solr is not affcted. I am not too concerned about this really. it's a module and something that is super easy to fix for users. I wouldn't worry about it too much I think nameing and abstraction is more important here.
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Simon Willnauer
        http://svn.apache.org/viewvc?view=revision&revision=1455321

        LUCENE-4822: Add PatternKeywordTokenFilter to marks keywords based on regular expressions

        Show
        Commit Tag Bot added a comment - [trunk commit] Simon Willnauer http://svn.apache.org/viewvc?view=revision&revision=1455321 LUCENE-4822 : Add PatternKeywordTokenFilter to marks keywords based on regular expressions
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Simon Willnauer
        http://svn.apache.org/viewvc?view=revision&revision=1455325

        LUCENE-4822: Add missing files - I blame git...

        Show
        Commit Tag Bot added a comment - [trunk commit] Simon Willnauer http://svn.apache.org/viewvc?view=revision&revision=1455325 LUCENE-4822 : Add missing files - I blame git...
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Simon Willnauer
        http://svn.apache.org/viewvc?view=revision&revision=1455329

        LUCENE-4822: Add PatternKeywordTokenFilter to marks keywords based on regular expressions

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Simon Willnauer http://svn.apache.org/viewvc?view=revision&revision=1455329 LUCENE-4822 : Add PatternKeywordTokenFilter to marks keywords based on regular expressions
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.

          People

          • Assignee:
            Simon Willnauer
            Reporter:
            Simon Willnauer
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development