[LUCENE-5381] Lucene highlighter doesn't honor hl.fragsize; it appends all text for last fragment - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 4.0, 4.6
Fix Version/s: 4.9, 6.0
Component/s: modules/highlighter
Labels:
- highlighter
- lucene

Lucene Fields: New, Patch Available

Description

We recently hit a problem with the highlighter: with hl.fragsize set to 300, the highlighting section for one document contained more than 2000 characters.

Looking at the code in org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(TokenStream, String, boolean, int): after the for loop, it appends all of the remaining text to the last fragment, regardless of the fragment size:
if (
    // if there is text beyond the last token considered..
    (lastEndOffset < text.length())
    &&
    // and that text is not too large...
    (text.length() <= maxDocCharsToAnalyze)
)
{
    // append it to the last fragment
    newText.append(encoder.encodeText(text.substring(lastEndOffset)));
}
currentFrag.textEndPos = newText.length();

This code is problematic because in some cases the last fragment is the most relevant section, so it is the one selected and returned to the client, oversized tail included.

I changed the code as shown below, and it now works:
// Test what remains of the original text beyond the point where we stopped analyzing
if (lastEndOffset < text.length())
{
    if (textFragmenter instanceof SimpleFragmenter)
    {
        SimpleFragmenter simpleFragmenter = (SimpleFragmenter) textFragmenter;
        int remain = simpleFragmenter.getFragmentSize() - (newText.length() - currentFrag.textStartPos);
        if (remain > 0)
        {
            int endIndex = lastEndOffset + remain;
            if (endIndex > text.length())
            {
                endIndex = text.length();
            }
            newText.append(encoder.encodeText(text.substring(lastEndOffset, endIndex)));
        }
    }
    else
    {
        newText.append(encoder.encodeText(text.substring(lastEndOffset)));
    }
}
currentFrag.textEndPos = newText.length();
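
To make the difference between the two behaviours concrete, here is a minimal standalone sketch of the tail-clamping arithmetic. It does not use Lucene at all: the class TailClampSketch and the helper clampedTail are hypothetical names introduced only for illustration, and the sketch assumes lengths are measured on the raw text (no encoder involved).

```java
public class TailClampSketch {

    /**
     * Hypothetical helper (not part of Lucene): returns the portion of the
     * text after lastEndOffset that still fits into the current fragment.
     *
     * @param text              the full original text
     * @param lastEndOffset     end offset of the last token that was analyzed
     * @param fragmentSize      the configured fragment size (e.g. hl.fragsize)
     * @param currentFragLength number of characters already in the last fragment
     */
    static String clampedTail(String text, int lastEndOffset,
                              int fragmentSize, int currentFragLength) {
        if (lastEndOffset >= text.length()) {
            return ""; // nothing left beyond the last token
        }
        int remain = fragmentSize - currentFragLength;
        if (remain <= 0) {
            return ""; // the fragment is already full
        }
        int endIndex = Math.min(text.length(), lastEndOffset + remain);
        return text.substring(lastEndOffset, endIndex);
    }

    public static void main(String[] args) {
        // A 4-character "token" followed by 2000 characters of trailing text.
        StringBuilder sb = new StringBuilder("term");
        for (int i = 0; i < 2000; i++) {
            sb.append('x');
        }
        String text = sb.toString();
        int lastEndOffset = 4;

        // Unbounded behaviour: the whole tail is appended.
        String unbounded = text.substring(lastEndOffset);
        // Clamped behaviour: only what fits into a 300-character fragment.
        String clamped = clampedTail(text, lastEndOffset, 300, 4);

        System.out.println(unbounded.length()); // prints 2000
        System.out.println(clamped.length());   // prints 296
    }
}
```

With a fragment size of 300 and 4 characters already in the fragment, the clamped tail is 296 characters, whereas the unbounded version appends all 2000.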

Attachments

LUCENE-5381.patch (1 kB), attached by jefferyyuan on 01/Jan/14 15:22

People

Assignee: Unassigned

Reporter: jefferyyuan

Votes: 0

Watchers: 1

Dates

Created: 01/Jan/14 15:21

Updated: 28/Aug/22 13:58

Time Tracking

Estimated, Remaining, Logged: Not Specified