Lucene - Core / LUCENE-4730

SmartChineseAnalyzer got wrong matched offset

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 4.0, 4.1
    • None
    • Component/s: modules/analysis
    • None
    • Environment: JDK1.7 Linux/Windows

    • New

    Description

      We found that SmartChineseAnalyzer reports wrong match offsets with the following test code:

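      import java.io.StringReader;
      import org.apache.lucene.analysis.Analyzer;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
      import org.apache.lucene.queryparser.classic.QueryParser;
      import org.apache.lucene.search.Query;
      import org.apache.lucene.search.highlight.Highlighter;
      import org.apache.lucene.search.highlight.QueryScorer;
      import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
      import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
      import org.apache.lucene.util.Version;

      public void testHighlight() throws Exception {
        String text = "My China  "; // note: ends with two spaces
        String queryText = "China";
        StringBuilder builder = new StringBuilder("<html>");
        Analyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_40);
        // Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
        QueryParser parser = new QueryParser(Version.LUCENE_40, "text", analyzer);
        Query query = parser.parse(queryText);
        SimpleHTMLFormatter formatter =
            new SimpleHTMLFormatter("<span style=\"background: yellow\">", "</span>");
        TokenStream tokens = analyzer.tokenStream("text", new StringReader(text));
        QueryScorer scorer = new QueryScorer(query, "text");
        Highlighter highlighter = new Highlighter(formatter, scorer);
        highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer));
        String result = highlighter.getBestFragments(tokens, text, 10, "...");
        // Fall back to the raw text if the highlighted result is shorter than the input.
        if (result.length() < text.length()) {
          result = text;
        }
        builder.append("<body>");
        builder.append(result);
        builder.append("</body>");
        builder.append("</html>");
        System.out.println(builder.toString());
      }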

      This method generates highlighted text, but the highlight position is clearly wrong. If we remove one space from the text, that is, change it from "My China  " (ending with two spaces) to "My China " (ending with one space), the highlight is placed correctly. If we switch the analyzer from SmartChineseAnalyzer to StandardAnalyzer, the problem also disappears.
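
      To make the symptom concrete, here is a minimal, Lucene-free sketch of how an offset-based highlighter places its tags, and of a sanity check on a token's reported offsets. The class OffsetHighlightSketch and its methods highlight and offsetsMatch are hypothetical helpers written for illustration; they are not part of Lucene, and the shifted offsets in main are an invented example, not the values SmartChineseAnalyzer actually produced.

```java
public class OffsetHighlightSketch {

    // Wraps text[start, end) in the given tags, as an offset-based highlighter would.
    static String highlight(String text, int start, int end, String pre, String post) {
        if (start < 0 || end > text.length() || start > end) {
            throw new IllegalArgumentException("offsets out of range: " + start + ".." + end);
        }
        return text.substring(0, start) + pre + text.substring(start, end)
                + post + text.substring(end);
    }

    // True when the reported offsets cover exactly the term's surface form in the text.
    static boolean offsetsMatch(String text, String term, int start, int end) {
        return start >= 0 && end <= text.length() && start <= end
                && text.substring(start, end).equals(term);
    }

    public static void main(String[] args) {
        String text = "My China  "; // two trailing spaces, as in the report

        // Correct offsets for "China" are [3, 8).
        System.out.println(highlight(text, 3, 8, "[", "]"));   // My [China]  
        System.out.println(offsetsMatch(text, "China", 3, 8)); // true

        // Hypothetical offsets shifted by one: the tags land in the wrong place.
        System.out.println(highlight(text, 4, 9, "[", "]"));   // My C[hina ] 
        System.out.println(offsetsMatch(text, "China", 4, 9)); // false
    }
}
```

      A check like offsetsMatch could be run over each token's start and end offsets to see which token is misreported. It compares the surface form only, so an analyzer that lowercases or stems terms would need a looser comparison.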

      Attachments

        1. LUCENE-4730.patch (2 kB, Michael Dodsworth)

            People

              Assignee: Unassigned
              Reporter: Jinsong Hu (cloudwave)
              Votes: 1
              Watchers: 4