  1. Lucene - Core
  2. LUCENE-9009

Support keyword protection when using ICUFoldingFilter with KeywordRepeatFilter


    Details

    • Lucene Fields:
      New

      Description

      It would be great if ICUFoldingFilter supported keyword protection when KeywordRepeatFilter is used, in the same way PorterStemFilter does:

      @Override
      public final boolean incrementToken() throws IOException {
        if (!input.incrementToken())
          return false;

        // Stem only tokens that are not marked as keywords.
        if ((!keywordAttr.isKeyword()) && stemmer.stem(termAtt.buffer(), 0, termAtt.length()))
          termAtt.copyBuffer(stemmer.getResultBuffer(), 0, stemmer.getResultLength());
        return true;
      }
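
To make the requested behaviour concrete, here is a minimal sketch of the same guard applied to folding. It is plain Java with no Lucene dependency: `fold` is a stand-in that only strips combining marks via `java.text.Normalizer` (the real ICUFoldingFilter does considerably more), and the class and method names are hypothetical, not part of any Lucene API.

```java
import java.text.Normalizer;

// Hypothetical model of a keyword-aware folding step; not a Lucene class.
final class KeywordAwareFolding {

    // Stand-in for ICU folding (assumption): decompose, then drop combining marks.
    static String fold(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Same guard as in PorterStemFilter: keyword tokens pass through unchanged.
    static String foldUnlessKeyword(String term, boolean isKeyword) {
        return isKeyword ? term : fold(term);
    }

    public static void main(String[] args) {
        String accented = "gro\u0161"; // "gros" with a caron on the last letter
        System.out.println(foldUnlessKeyword(accented, true));  // left as-is
        System.out.println(foldUnlessKeyword(accented, false)); // folded to "gros"
    }
}
```

In a real filter the same check would sit in `incrementToken()`, reading the token's keyword flag before touching the term buffer.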
      

       

      Scenario:

      We analyze words with accents, for example "groš", and want search to behave as follows:

      1. A search matches all items that have "groš" or "gros" in a given field.

      2. If the search phrase is exactly "groš", items containing "groš" should rank above items containing "gros" (through a higher score). This is why we use KeywordRepeatFilter.
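
The two requirements can be illustrated with a small model of the intended chain (repeat, guarded fold, remove duplicates). This is only a sketch in plain Java: the class and method names are hypothetical, `fold` strips combining marks as a rough stand-in for ICU folding, and the term-overlap count is a crude proxy for relevance, not Lucene's actual scoring.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of the desired analysis chain; not Lucene code.
final class RepeatFoldChainSketch {

    // Stand-in for ICU folding (assumption): decompose, then drop combining marks.
    static String fold(String term) {
        return Normalizer.normalize(term, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // Terms produced for one input token: the protected keyword copy plus the
    // folded copy. The set plays the role of the duplicate-removal step.
    static Set<String> analyze(String token) {
        Set<String> terms = new LinkedHashSet<>();
        terms.add(token);       // keyword copy, left untouched
        terms.add(fold(token)); // non-keyword copy, folded
        return terms;
    }

    // Crude relevance proxy: how many query terms also occur in the document.
    static int overlap(Set<String> queryTerms, Set<String> docTerms) {
        Set<String> shared = new LinkedHashSet<>(queryTerms);
        shared.retainAll(docTerms);
        return shared.size();
    }

    public static void main(String[] args) {
        Set<String> accentedDoc = analyze("gro\u0161");
        Set<String> plainDoc = analyze("gros");
        Set<String> accentedQuery = analyze("gro\u0161");
        System.out.println(overlap(accentedQuery, accentedDoc)); // 2
        System.out.println(overlap(accentedQuery, plainDoc));    // 1
    }
}
```

Under these assumptions a query for "gros" matches both documents on one term, while a query for "groš" matches the accented document on two terms and the unaccented one on a single term, which is the ordering described in the scenario.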

       

      Example field definition:

      <fieldType name="some_text" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.KeywordRepeatFilterFactory" />
            <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
            <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
      
            <filter class="solr.HunspellStemFilterFactory"
                    dictionary="lang/our_en_US.dic"
                    affix="lang/our_en_US.aff"
                    ignoreCase="true"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.KeywordRepeatFilterFactory" />
            <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
            <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
            <filter class="solr.HunspellStemFilterFactory"
                    dictionary="lang/our_en_US.dic"
                    affix="lang/our_en_US.aff"
                    ignoreCase="true"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
        </analyzer>
      </fieldType>
      

       

      The query and index analyzers should both produce output like this:

      text            groš              gros
      raw_bytes       [67 72 6f c5 a1]  [67 72 6f c5 a1]
      start           0                 0
      end             4                 4
      positionLength  1                 1
      type            <ALPHANUM>        <ALPHANUM>
      termFrequency   1                 1
      position        1                 1
      keyword         true              false

            People

            • Assignee:
              Unassigned
              Reporter:
              profimedia Profimedia