[SOLR-7926] Hit highlighting with EdgeNGramFilterFactory - ASF JIRA

Details

Type: Bug
Status: Reopened
Priority: Critical
Resolution: Unresolved
Affects Version/s: 5.1, 5.2.1
Fix Version/s: None
Component/s: highlighter
Labels:
- EdgeNGramTokenFilter
- highlighting
Environment:

CentOS 7 (5.2.1), OS X 10.10.5 (5.1)

Description

Hit highlighting marks the whole word, not just the part that matches the search term, when the field type uses EdgeNGramFilterFactory.

In schema.xml I have field type text_ngram:

<fieldType name="text_ngram" class="solr.TextField">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!--tokenizer class="solr.StandardTokenizerFactory"/-->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="3" luceneMatchVersion="4.3"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
</analyzer>
</fieldType>
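
For readers unfamiliar with the last query-side filter: its pattern, read without the stray line breaks, is ^(.{20})(.*)? with replacement $1, which appears to cap each query term at 20 characters, presumably to match maxGramSize="20" on the index side. A minimal Python sketch of that behaviour follows; the function name is hypothetical and Python's regex engine stands in for Java's, so treat it as an illustration rather than Solr's own code.

```python
import re

# Same pattern as the last query-side PatternReplaceFilterFactory,
# with Python's \1 in place of Java's $1 in the replacement.
TRUNCATE = re.compile(r"^(.{20})(.*)?")


def truncate_query_term(term: str) -> str:
    """Keep at most the first 20 characters of a query term (hypothetical helper)."""
    return TRUNCATE.sub(r"\1", term)


print(truncate_query_term("luc"))                         # shorter than 20: unchanged
print(truncate_query_term("abcdefghijklmnopqrstuvwxyz"))  # cut to the first 20 characters
```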

In the Solr Admin Analysis screen, with index value "lucene" and query value "luc", it shows this:

LENGTF
text            luc         luce           lucen             lucene
raw_bytes       [6c 75 63]  [6c 75 63 65]  [6c 75 63 65 6e]  [6c 75 63 65 6e 65]
start           0           0              0                 0
end             6           6              6                 6
positionLength  1           1              1                 1
type            word        word           word              word
position        1           1              1                 1
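
The text and raw_bytes rows above can be reproduced with a small Python sketch of edge n-gram generation, using the minGramSize="3" and maxGramSize="20" values from the schema. The function names are hypothetical, and this only models which grams are produced, not how Lucene assigns offsets.

```python
def edge_ngrams(token: str, min_gram: int = 3, max_gram: int = 20) -> list[str]:
    """Return the prefixes of `token` with length between min_gram and max_gram."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def raw_bytes(gram: str) -> str:
    """Format a gram's UTF-8 bytes in the style of the raw_bytes row."""
    return "[" + " ".join(f"{b:02x}" for b in gram.encode("utf-8")) + "]"


for gram in edge_ngrams("lucene"):
    print(gram, raw_bytes(gram))
```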

Since the end offset is 6 for every token, the whole word ("lucene") is highlighted.
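
To make the effect of the offsets concrete, here is a simplified Python model of an offset-based highlighter. It is an assumption-laden sketch, not Solr's highlighter: the function name is hypothetical and the <em> tags are chosen only for illustration.

```python
def highlight(text: str, start: int, end: int) -> str:
    """Wrap text[start:end] in <em> tags, using the token's start/end offsets."""
    return text[:start] + "<em>" + text[start:end] + "</em>" + text[end:]


stored = "lucene"

# Offsets reported for the gram "luc" in the table above: start=0, end=6.
print(highlight(stored, 0, 6))

# Offsets that would cover only the matched gram "luc": start=0, end=3.
print(highlight(stored, 0, 3))
```

With end=6 the whole stored word is wrapped; with end=3 only the matching prefix is.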

If I switch to NGramFilterFactory, it shows this (first three tokens only):

LENGTF
text            luc         uce         cen
raw_bytes       [6c 75 63]  [75 63 65]  [63 65 6e]
start           0           1           2
end             3           4           5
positionLength  1           1           1
type            word        word        word
position        1           1           1

The end offset is then correct and the highlighter highlights only the search term. Note that I have specified luceneMatchVersion="4.3". Without it, the end offsets go back to 6 for NGramFilterFactory as well.
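
The two offset behaviours described in this report can be contrasted in a short Python sketch. It is only a model of the observed output (the function and its per_gram_offsets flag are hypothetical) and makes no claim about how Lucene implements either mode.

```python
def ngrams_with_offsets(token: str, n: int, per_gram_offsets: bool):
    """Yield (gram, start, end) for every n-character gram of `token`.

    per_gram_offsets=True:  each gram reports its own span in the token.
    per_gram_offsets=False: each gram reports the span of the whole token.
    """
    for i in range(len(token) - n + 1):
        gram = token[i:i + n]
        if per_gram_offsets:
            yield gram, i, i + n
        else:
            yield gram, 0, len(token)


# First three grams with per-gram offsets (end = 3, 4, 5).
print(list(ngrams_with_offsets("lucene", 3, True))[:3])

# First three grams with whole-token offsets (end = 6 throughout).
print(list(ngrams_with_offsets("lucene", 3, False))[:3])
```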

People

Assignee: Unassigned

Reporter: Bjørn Hjelle

Votes: 0

Watchers: 4

Dates

Created: 14/Aug/15 10:52

Updated: 05/Feb/16 10:15