[LUCENE-4730] SmartChineseAnalyzer got wrong matched offset - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 4.0, 4.1
Fix Version/s: None
Component/s: modules/analysis
Labels: None
Environment: JDK1.7 Linux/Windows
Lucene Fields: New

Description

We found that SmartChineseAnalyzer produces wrong match offsets with the following test code:

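import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
import org.apache.lucene.util.Version;

public void testHighlight() throws Exception {
    // The text ends with TWO spaces; with a single trailing space the highlight is correct.
    String text = "My China  ";
    String queryText = "China";
    StringBuilder builder = new StringBuilder("<html>");
    Analyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_40);
    // Swapping in StandardAnalyzer makes the problem disappear:
    // Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
    QueryParser parser = new QueryParser(Version.LUCENE_40, "text", analyzer);
    Query query = parser.parse(queryText);
    SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<span style=\"background: yellow\">", "</span>");
    TokenStream tokens = analyzer.tokenStream("text", new StringReader(text));
    QueryScorer scorer = new QueryScorer(query, "text");
    Highlighter highlighter = new Highlighter(formatter, scorer);
    highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer));
    String result = highlighter.getBestFragments(tokens, text, 10, "...");
    // Fall back to the raw text when no fragment was produced.
    if (result.length() < text.length()) {
        result = text;
    }
    builder.append("<body>");
    builder.append(result);
    builder.append("</body>");
    builder.append("</html>");
    System.out.println(builder.toString());
}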

This method generates highlighted text, but the highlight position is obviously wrong. If we remove one space from the text, that is, change text from "My China  " (ending with two spaces) to "My China " (ending with one space), the highlight is correct. If we change the analyzer from SmartChineseAnalyzer to StandardAnalyzer, the highlight issue also disappears.
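To illustrate why a wrong offset shows up as a misplaced highlight, here is a minimal plain-Java sketch that does not use Lucene. The class `OffsetCheck`, its two methods, and the shifted offsets `(4, 9)` are hypothetical and chosen only for illustration; they are not the offsets the analyzer actually reported. A highlighter wraps the slice of the original text given by a token's start and end offsets, so any offset that does not cover the term moves the markup.

```java
public class OffsetCheck {
    /**
     * Returns true when the half-open range [start, end) of text covers
     * exactly the given term. Case is ignored because an analyzer may
     * lowercase terms.
     */
    public static boolean offsetsMatch(String text, String term, int start, int end) {
        if (start < 0 || end > text.length() || start > end) {
            return false;
        }
        return text.substring(start, end).equalsIgnoreCase(term);
    }

    /** Wraps the range [start, end) of text in <b> tags, as a highlighter would. */
    public static String highlight(String text, int start, int end) {
        return text.substring(0, start)
                + "<b>" + text.substring(start, end) + "</b>"
                + text.substring(end);
    }

    public static void main(String[] args) {
        String text = "My China  "; // ends with two spaces, as in the report

        // Offsets that cover the term: the highlight lands on "China".
        System.out.println(offsetsMatch(text, "china", 3, 8)); // true
        System.out.println(highlight(text, 3, 8));             // My <b>China</b>, then two spaces

        // Hypothetical offsets shifted by one: the highlight is misplaced.
        System.out.println(offsetsMatch(text, "china", 4, 9)); // false
        System.out.println(highlight(text, 4, 9));             // My C<b>hina </b>, then one space
    }
}
```

A check like `offsetsMatch` could be applied to each token's term and offsets to narrow down which token carries the bad offset.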

Attachments

LUCENE-4730.patch (2 kB), attached 29/Jun/14 03:56 by Michael Dodsworth

Issue Links

is superseded by

LUCENE-4984 Fix ThaiWordFilter

Closed

People

Assignee: Unassigned
Reporter: Jinsong Hu
Votes: 1
Watchers: 4

Dates

Created: 29/Jan/13 10:22
Updated: 28/Aug/22 13:37
Resolved: 09/Jun/16 14:21