Lucene - Core / LUCENE-4817

Add KeywordRepeaterFilter to emit tokens twice once as keyword and once not as keyword

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.1
    • Fix Version/s: 4.3, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      If you want both a stemmed and an unstemmed version of a token, one for recall and one for precision, you currently need two fields in most cases. However, most stemmers respect the keyword attribute, so we could add a token filter that emits each token twice, once as a keyword and once plain. Users would most likely need to combine this with RemoveDuplicatesTokenFilter, but that way we can have the stemmed and unstemmed versions in the same field.
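
As a rough illustration of the idea in the description, the sketch below models the three steps (repeat each token with a keyword flag, stem only the non-keyword copy, drop duplicates at the same position) in plain Java. It deliberately does not use Lucene classes: `Token`, `repeat`, `toyStem`, `stem`, `removeDuplicates` and `analyze` are hypothetical names, and the suffix stripper is a stand-in for a real stemmer, not the Porter algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of the proposed pipeline. It does not use Lucene classes;
// Token, repeat, toyStem and the other names here are hypothetical.
final class KeywordRepeatSketch {

    /** A token: its text, a keyword flag, and a position increment (0 = same position as the previous token). */
    static final class Token {
        final String term;
        final boolean keyword;
        final int posInc;

        Token(String term, boolean keyword, int posInc) {
            this.term = term;
            this.keyword = keyword;
            this.posInc = posInc;
        }
    }

    /** Emits every term twice: first flagged as a keyword, then plain at the same position. */
    static List<Token> repeat(List<String> terms) {
        List<Token> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Token(term, true, 1));
            out.add(new Token(term, false, 0));
        }
        return out;
    }

    /** Toy suffix stripper standing in for a real stemmer (this is NOT the Porter algorithm). */
    static String toyStem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        return term;
    }

    /** Stems plain tokens and passes keyword tokens through untouched. */
    static List<Token> stem(List<Token> tokens) {
        List<Token> out = new ArrayList<>();
        for (Token tok : tokens) {
            out.add(tok.keyword ? tok : new Token(toyStem(tok.term), false, tok.posInc));
        }
        return out;
    }

    /** Drops a token whose text has already been seen at the same position. */
    static List<Token> removeDuplicates(List<Token> tokens) {
        List<Token> out = new ArrayList<>();
        Set<String> seenAtPosition = new HashSet<>();
        for (Token tok : tokens) {
            if (tok.posInc > 0) {
                seenAtPosition.clear(); // a new position starts here
            }
            if (seenAtPosition.add(tok.term)) {
                out.add(tok);
            }
        }
        return out;
    }

    /** Runs repeat -> stem -> removeDuplicates and returns the surviving terms in order. */
    static List<String> analyze(List<String> terms) {
        List<String> out = new ArrayList<>();
        for (Token tok : removeDuplicates(stem(repeat(terms)))) {
            out.add(tok.term);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze(Arrays.asList("jumping", "fox")));
    }
}
```

For the input "jumping fox" this keeps both "jumping" and "jump" at the first position, while "fox", which the toy stemmer leaves unchanged, survives only once. That second case is why the duplicate-removal step is suggested.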

      1. docs.patch
        5 kB
        Erick Erickson
      2. docs.patch
        5 kB
        Erick Erickson
      3. LUCENE-4817.patch
        10 kB
        Simon Willnauer
      4. LUCENE-4817.patch
        5 kB
        Simon Willnauer

          Activity

          Simon Willnauer added a comment -

          Here is a simple patch and test.

          Uwe Schindler added a comment -

          This sounds like a good idea. In general, we should have a guideline somewhere on which types of filters should add tokens such as stems and which should only replace tokens.

          Alan Woodward added a comment -

          +1, I've implemented this about half-a-dozen times in the past six months for various projects

          Simon Willnauer added a comment -

          New patch: added a token filter factory and a changes entry, and added the factory to the services file. I will commit shortly.

          Simon Willnauer added a comment -

          Adding the patch again, this time the right one.

          Commit Tag Bot added a comment -

          [trunk commit] Simon Willnauer
          http://svn.apache.org/viewvc?view=revision&revision=1454313

          LUCENE-4817: Add KeywordRepeaterFilter to emit tokens twice once as keyword and once not as keyword

          Commit Tag Bot added a comment -

          [branch_4x commit] Simon Willnauer
          http://svn.apache.org/viewvc?view=revision&revision=1454317

          LUCENE-4817: Add KeywordRepeaterFilter to emit tokens twice once as keyword and once not as keyword

          Varun Thacker added a comment -

          Really useful token filter.

          You've mentioned that a user should use this with a RemoveDuplicatesTokenFilter, which is needed because words that the stemmer leaves unchanged would otherwise appear twice at the same position.

          So the example in the Javadocs for KeywordRepeatFilterFactory.java should use RemoveDuplicatesTokenFilter:

           
          /**
           * Factory for {@link KeywordRepeatFilter}.
           * <pre class="prettyprint" >
           * &lt;fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100"&gt;
           *   &lt;analyzer&gt;
           *     &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
           *     &lt;filter class="solr.KeywordRepeatFilterFactory"/&gt;
           *     &lt;filter class="solr.PorterStemFilterFactory"/&gt;
           *     &lt;filter class="solr.RemoveDuplicatesTokenFilterFactory"/&gt;
           *   &lt;/analyzer&gt;
           * &lt;/fieldType&gt;</pre>
           */
          
          Erick Erickson added a comment -

          At a quick look, Porter, KStem, Snowball and Hunspell all respect the keyword attribute, so I'll make the docs-only change in the attached patch unless I've misrepresented things (have to run precommit, don'tcha know).

          I've included Varun's suggestion as well, thanks!

          It always amazes me how simple some solutions are in the hands of an expert. "Why didn't I think of that?"

          Simon Willnauer added a comment -

          "I've included Varun's suggestion as well, thanks!"

          +1. I still don't get why we have Solr XML in the Lucene analyzer Javadocs; did we settle on that?

          Erick Erickson added a comment -

          Simon:

          Good point, it doesn't belong there. I'll put Varun's suggestion on the Wiki instead.

          Ahmet Arslan added a comment -

          Very clever thinking. SOLR-3231 can be closed now, right?

          Ahmet Arslan added a comment -

          One other benefit of this filter: it eliminates the confusion caused by wildcard searches on a stemmed field. Example: http://search-lucene.com/m/oOv5h2ZqC7Q1
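
A small plain-Java sketch of the wildcard point, under the assumption that the confusion in question is that wildcard terms are typically not stemmed at query time, so a prefix typed from the original word can miss a field that holds only stems. The term lists and the `prefixMatches` helper are hypothetical; this is not Lucene query code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a "prefix*" query against the terms indexed for one field.
final class WildcardOnStemmedFieldSketch {

    /** Returns true if any indexed term starts with the given prefix. */
    static boolean prefixMatches(List<String> indexedTerms, String prefix) {
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Field that indexes only stems of "jumping fox" (assumed stem: "jump").
        List<String> stemmedOnly = Arrays.asList("jump", "fox");
        // Field that also keeps the original token next to its stem.
        List<String> withOriginals = Arrays.asList("jumping", "jump", "fox");

        System.out.println(prefixMatches(stemmedOnly, "jumpi"));   // false
        System.out.println(prefixMatches(withOriginals, "jumpi")); // true
    }
}
```

With only stems in the field, the prefix "jumpi" finds nothing even though the document contained "jumping"; keeping the unstemmed token in the same field lets the prefix match.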

          Erick Erickson added a comment -

          Revised doc with the Solr config taken out; see the analyzers/tokenizers page on the Wiki.

          Simon Willnauer added a comment -

          looks good

          Commit Tag Bot added a comment -

          [branch_4x commit] Erick Erickson
          http://svn.apache.org/viewvc?view=revision&revision=1454392

          Doc-only change for LUCENE-4817 (plus merge cruft)

          Commit Tag Bot added a comment -

          [branch_4x commit] Erick Erickson
          http://svn.apache.org/viewvc?view=revision&revision=1454394

          Doc-only change for LUCENE-4817 (plus merge cruft, try 2)

          Uwe Schindler added a comment -

          Closed after release.


            People

            • Assignee:
              Unassigned
              Reporter:
              Simon Willnauer
            • Votes:
              1
              Watchers:
              5
