Lucene - Core / LUCENE-4817

Add KeywordRepeaterFilter to emit tokens twice once as keyword and once not as keyword

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.1
    • Fix Version/s: 4.3, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      If you want both a stemmed and an unstemmed version of a token, one for recall and one for precision, in most cases you have to use two fields today. Yet most stemmers respect the keyword attribute, so we could add a token filter that emits the same token twice, once as a keyword and once plain. Folks would most likely need to combine this with RemoveDuplicatesTokenFilter, but that way we can have the stemmed and unstemmed versions in the same field.
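
      Editorial sketch of the idea described above. This is a toy model in plain Java and does not use Lucene's API: the Token class, the keyword flag, the suffix-stripping "stemmer" and all method names are hypothetical stand-ins chosen for illustration. It only shows the flow: repeat each token as keyword and as plain, let the stemmer skip keyword tokens, then drop tokens that ended up identical at the same position.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the proposal; NOT Lucene's API. All names here are hypothetical. */
final class KeywordRepeatSketch {

    /** A term with its position and a keyword flag (stand-in for the keyword attribute). */
    static final class Token {
        final String term;
        final int position;
        final boolean keyword;

        Token(String term, int position, boolean keyword) {
            this.term = term;
            this.position = position;
            this.keyword = keyword;
        }
    }

    /** Splits on whitespace and assigns increasing positions. */
    static List<Token> tokenize(String text) {
        List<Token> out = new ArrayList<>();
        int position = 0;
        for (String term : text.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                out.add(new Token(term, position++, false));
            }
        }
        return out;
    }

    /** Emits every token twice at the same position: once as keyword, once plain. */
    static List<Token> repeatAsKeyword(List<Token> in) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            out.add(new Token(t.term, t.position, true));
            out.add(new Token(t.term, t.position, false));
        }
        return out;
    }

    /** Toy stemmer (assumption): strips a trailing "s"; leaves keyword tokens untouched. */
    static List<Token> stem(List<Token> in) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            boolean strip = !t.keyword && t.term.length() > 3 && t.term.endsWith("s");
            String term = strip ? t.term.substring(0, t.term.length() - 1) : t.term;
            out.add(new Token(term, t.position, t.keyword));
        }
        return out;
    }

    /** Drops tokens whose term already occurred at the same position. */
    static List<String> removeDuplicates(List<Token> in) {
        Set<String> seen = new LinkedHashSet<>();
        for (Token t : in) {
            seen.add(t.term + "@" + t.position);
        }
        return new ArrayList<>(seen);
    }

    static List<String> analyze(String text) {
        return removeDuplicates(stem(repeatAsKeyword(tokenize(text))));
    }

    public static void main(String[] args) {
        // "dogs" yields both the original and the stem; "bark" is unchanged by the
        // toy stemmer, so its duplicate at position 1 is removed.
        System.out.println(analyze("dogs bark")); // prints [dogs@0, dog@0, bark@1]
    }
}
```

      In this model the duplicate-removal step only matters for terms the stemmer leaves unchanged, which is why the description suggests pairing the new filter with RemoveDuplicatesTokenFilter.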

      1. docs.patch
        5 kB
        Erick Erickson
      2. docs.patch
        5 kB
        Erick Erickson
      3. LUCENE-4817.patch
        10 kB
        Simon Willnauer
      4. LUCENE-4817.patch
        5 kB
        Simon Willnauer

          Activity

          simonw Simon Willnauer added a comment -

          here is a simple patch and test

          thetaphi Uwe Schindler added a comment -

          This sounds like a good idea. In general, we should have a guideline somewhere on which types of filters should add things like stems and which filters should only replace tokens.

          romseygeek Alan Woodward added a comment -

          +1, I've implemented this about half-a-dozen times in the past six months for various projects

          simonw Simon Willnauer added a comment -

          New patch: added a token filter factory and a CHANGES entry, and added the factory to the services file. I will commit shortly.

          simonw Simon Willnauer added a comment -

          adding patch again, this time the right one.

          commit-tag-bot Commit Tag Bot added a comment -

          [trunk commit] Simon Willnauer
          http://svn.apache.org/viewvc?view=revision&revision=1454313

          LUCENE-4817: Add KeywordRepeaterFilter to emit tokens twice once as keyword and once not as keyword

          commit-tag-bot Commit Tag Bot added a comment -

          [branch_4x commit] Simon Willnauer
          http://svn.apache.org/viewvc?view=revision&revision=1454317

          LUCENE-4817: Add KeywordRepeaterFilter to emit tokens twice once as keyword and once not as keyword

          varunthacker Varun Thacker added a comment -

          Really useful token filter.

          You've mentioned that a user should use this with a RemoveDuplicatesTokenFilter, which is needed because words that don't get stemmed would otherwise produce duplicates at the same position.

          So the Javadocs for KeywordRepeatFilterFactory.java should use RemoveDuplicatesTokenFilter in the example.

           
          /**
           * Factory for {@link KeywordRepeatFilter}.
           * <pre class="prettyprint">
           * &lt;fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100"&gt;
           *   &lt;analyzer&gt;
           *     &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt;
           *     &lt;filter class="solr.KeywordRepeatFilterFactory"/&gt;
           *     &lt;filter class="solr.PorterStemFilterFactory"/&gt;
           *     &lt;filter class="solr.RemoveDuplicatesTokenFilterFactory"/&gt;
           *   &lt;/analyzer&gt;
           * &lt;/fieldType&gt;</pre>
           */
          
          erickerickson Erick Erickson added a comment -

          On a quick look, Porter, KStem, Snowball and Hunspell all appear to respect the keyword attribute. So I'll make the docs-only change in the attached patch unless I've misrepresented things (have to run precommit, don'tcha know).

          I've included Varun's suggestion as well, thanks!

          It always amazes me how simple some solutions are in the hands of an expert. "Why didn't I think of that?".

          simonw Simon Willnauer added a comment -

          "I've included Varun's suggestion as well, thanks!"

          +1. I still don't get why we have Solr XML in the Lucene analyzer Javadocs. Did we settle on that?

          erickerickson Erick Erickson added a comment -

          Simon:

          Good point, it doesn't belong there. I'll put Varun's suggestion on the Wiki instead.

          iorixxx Ahmet Arslan added a comment -

          Very clever thinking. SOLR-3231 can be closed now, right?

          iorixxx Ahmet Arslan added a comment -

          One other benefit of this filter: it eliminates the confusion caused by wildcard searches on a stemmed field. Example: http://search-lucene.com/m/oOv5h2ZqC7Q1
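
          Editorial sketch of one common form of that confusion, assumed here because the linked thread is not reproduced: a wildcard term such as "runn*" is typically not stemmed at query time, while the index holds only the stem. The class and method names are hypothetical, and the term lists are hand-written rather than produced by a real analyzer (Porter stemming does reduce "running" to "run").

```java
import java.util.Arrays;
import java.util.List;

/** Toy illustration only; NOT Lucene's API. Names are hypothetical. */
final class WildcardOnStemmedFieldSketch {

    /** Stand-in for a prefix query such as "runn*": true if any indexed term starts with the prefix. */
    static boolean prefixMatches(List<String> indexedTerms, String prefix) {
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Field that only stores the stem of "running".
        List<String> stemmedOnly = Arrays.asList("run");
        // Field that stores the original term and the stem at the same position.
        List<String> originalAndStemmed = Arrays.asList("running", "run");

        System.out.println(prefixMatches(stemmedOnly, "runn"));        // prints false
        System.out.println(prefixMatches(originalAndStemmed, "runn")); // prints true
    }
}
```

          With the unstemmed term kept in the same field, the prefix still finds the document even though the stem is shorter than the prefix.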

          erickerickson Erick Erickson added a comment -

          Revised doc taking the Solr config out; see the analyzers/tokenizers page on the Wiki.

          simonw Simon Willnauer added a comment -

          looks good

          commit-tag-bot Commit Tag Bot added a comment -

          [branch_4x commit] Erick Erickson
          http://svn.apache.org/viewvc?view=revision&revision=1454392

          Doc-only change for LUCENE-4817 (plus merge cruft)

          commit-tag-bot Commit Tag Bot added a comment -

          [branch_4x commit] Erick Erickson
          http://svn.apache.org/viewvc?view=revision&revision=1454394

          Doc-only change for LUCENE-4817 (plus merge cruft, try 2)

          thetaphi Uwe Schindler added a comment -

          Closed after release.


            People

            • Assignee:
              Unassigned
            • Reporter:
              simonw Simon Willnauer
            • Votes:
              1
            • Watchers:
              5
