SOLR-7926

Hit highlighting with EdgeNGramFilterFactory

Details

    • Type: Bug
    • Status: Reopened
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 5.1, 5.2.1
    • Fix Version/s: None
    • Component/s: highlighter
    • Environment: CentOS 7 (5.2.1), OS X 10.10.5 (5.1)

    Description

      When the field type uses EdgeNGramFilterFactory, hit highlighting marks the whole word instead of only the part that matches the search term.

      In schema.xml I have field type text_ngram:

      <fieldType name="text_ngram" class="solr.TextField">
      <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="3" luceneMatchVersion="4.3"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
      </analyzer>
      <analyzer type="query">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
      </analyzer>
      </fieldType>

      In the Solr Admin Analysis screen, with index value "lucene" and query value "luc", the index analysis shows this:

      LENGTF
      text            luc         luce           lucen             lucene
      raw_bytes       [6c 75 63]  [6c 75 63 65]  [6c 75 63 65 6e]  [6c 75 63 65 6e 65]
      start           0           0              0                 0
      end             6           6              6                 6
      positionLength  1           1              1                 1
      type            word        word           word              word
      position        1           1              1                 1

      Since the end offset is 6 for every gram, the whole word ("lucene") is highlighted.
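To make the effect of the offsets concrete, here is a minimal Python sketch. It is not Solr or Lucene code: `edge_ngrams` and `highlight` are hypothetical helpers, and the sketch assumes the highlighter simply wraps the source text between the start and end offsets of the matching token.

```python
def edge_ngrams(word, min_gram=3, max_gram=20, word_start=0):
    """Emit (text, start, end) for each prefix of the word.

    Assumption: every gram keeps the offsets of the whole source
    word, which mirrors the start/end rows in the output above.
    """
    word_end = word_start + len(word)
    longest = min(max_gram, len(word))
    return [(word[:n], word_start, word_end)
            for n in range(min_gram, longest + 1)]


def highlight(source, tokens, query):
    """Wrap the offset span of the first token equal to the query."""
    for text, start, end in tokens:
        if text == query:
            return (source[:start] + "<em>" + source[start:end]
                    + "</em>" + source[end:])
    return source


tokens = edge_ngrams("lucene")
result = highlight("lucene", tokens, "luc")
print(tokens)
print(result)
```

Under these assumptions the query "luc" matches the gram "luc", but the span that gets wrapped is 0 to 6, so the entire word is marked.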

      If I change to use NGramFilterFactory it shows me this (for the first three items):

      LENGTF
      text            luc         uce         cen
      raw_bytes       [6c 75 63]  [75 63 65]  [63 65 6e]
      start           0           1           2
      end             3           4           5
      positionLength  1           1           1
      type            word        word        word
      position        1           1           1

      The end offset is then correct, and the highlighter highlights only the search term. Note that I have specified luceneMatchVersion="4.3". Without it, the end offsets go back to 6 for NGramFilterFactory as well.
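For comparison, the following Python sketch shows what happens when each gram carries its own offsets. Again, this is an illustration and not Solr code: `ngrams_with_offsets` and `highlight` are hypothetical helpers, the gram ordering (shortest grams first) is assumed from the output above, and the highlighter is assumed to wrap the span given by the matching token's offsets.

```python
def ngrams_with_offsets(word, min_gram=3, max_gram=20):
    """Emit (text, start, end) where the offsets belong to the gram
    itself, not to the whole word. Shortest grams come first."""
    grams = []
    longest = min(max_gram, len(word))
    for size in range(min_gram, longest + 1):
        for start in range(len(word) - size + 1):
            grams.append((word[start:start + size], start, start + size))
    return grams


def highlight(source, tokens, query):
    """Wrap the offset span of the first token equal to the query."""
    for text, start, end in tokens:
        if text == query:
            return (source[:start] + "<em>" + source[start:end]
                    + "</em>" + source[end:])
    return source


tokens = ngrams_with_offsets("lucene")
first_three = tokens[:3]
result = highlight("lucene", tokens, "luc")
print(first_three)
print(result)
```

With per-gram offsets the matching token "luc" spans 0 to 3, so only the matched prefix is wrapped and the rest of the word is left alone.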

          People

            Assignee: Unassigned
            Reporter: Bjørn Hjelle (bjornhjelle)
            Votes: 0
            Watchers: 4
