  1. Lucene - Core
  2. LUCENE-5381

Lucene highlighter doesn't honor hl.fragsize; it appends all text for last fragment

Details

    • New, Patch Available

    Description

      Recently we hit a problem with the highlighter: I set hl.fragsize = 300, but the highlight section for one document contains more than 2000 characters.

      Looking into the code: in org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(TokenStream, String, boolean, int), after the for loop, the whole remaining text is appended to the last fragment:
      if (
          // if there is text beyond the last token considered..
          (lastEndOffset < text.length())
          &&
          // and that text is not too large...
          (text.length() <= maxDocCharsToAnalyze)
      )
      {
        // append it to the last fragment
        newText.append(encoder.encodeText(text.substring(lastEndOffset)));
      }
      currentFrag.textEndPos = newText.length();

      This code is problematic: in some cases the last fragment is the most relevant section and is the one selected to be returned to the client, so the client receives a fragment far longer than hl.fragsize.
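
      The effect of that tail append can be illustrated with a small standalone sketch. It does not use Lucene; the class TailAppendDemo, its methods, and the offsets (a last fragment starting at 280, analysis stopping at 400, a 2,500-character text) are hypothetical values chosen for illustration, and encoding via encoder.encodeText is left out.

```java
// Standalone illustration only: no Lucene classes, all names are hypothetical.
class TailAppendDemo {

    /** Builds a filler string of the requested length. */
    static String filler(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append('x');
        }
        return sb.toString();
    }

    /**
     * Mirrors the tail logic quoted above: if text remains after the last
     * token and the document is not too large, append ALL of it to the
     * last fragment. Returns the resulting length of that fragment.
     */
    static int lastFragmentLength(String text, int fragStart, int lastEndOffset,
                                  int maxDocCharsToAnalyze) {
        StringBuilder newText = new StringBuilder();
        // Text the token loop already placed in the last fragment.
        newText.append(text, fragStart, lastEndOffset);
        if (lastEndOffset < text.length() && text.length() <= maxDocCharsToAnalyze) {
            // No fragment-size check here: the whole remainder is appended.
            newText.append(text.substring(lastEndOffset));
        }
        return newText.length();
    }

    public static void main(String[] args) {
        String text = filler(2500);
        int length = lastFragmentLength(text, 280, 400, 51200);
        System.out.println("last fragment length = " + length); // prints 2220
    }
}
```

      With these assumed values the last fragment holds 120 characters before the tail append and 2,220 after it, regardless of the requested fragment size.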

      I changed the code as shown below, and it now works:
      // Test what remains of the original text beyond the point where we stopped analyzing
      if (lastEndOffset < text.length())
      {
        if (textFragmenter instanceof SimpleFragmenter)
        {
          SimpleFragmenter simpleFragmenter = (SimpleFragmenter) textFragmenter;
          int remain = simpleFragmenter.getFragmentSize() - (newText.length() - currentFrag.textStartPos);
          if (remain > 0)
          {
            int endIndex = lastEndOffset + remain;
            if (endIndex > text.length())
            {
              endIndex = text.length();
            }
            newText.append(encoder.encodeText(text.substring(lastEndOffset, endIndex)));
          }
        }
        else
        {
          newText.append(encoder.encodeText(text.substring(lastEndOffset)));
        }
      }
      currentFrag.textEndPos = newText.length();
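
      The clamping arithmetic in the change above can be checked in isolation. The following is a minimal sketch, not the patch itself: TailClampDemo and clampedTail are hypothetical names, the method works on plain strings without Lucene, and it ignores the encoder, so lengths are measured in raw characters rather than encoded output.

```java
// Standalone illustration only: no Lucene classes, all names are hypothetical.
class TailClampDemo {

    /**
     * Returns the part of the unanalyzed tail that still fits in the last
     * fragment, given the configured fragment size and the number of
     * characters the fragment already holds.
     */
    static String clampedTail(String text, int lastEndOffset,
                              int fragmentSize, int currentFragLength) {
        if (lastEndOffset >= text.length()) {
            return ""; // nothing left beyond the last token
        }
        int remain = fragmentSize - currentFragLength;
        if (remain <= 0) {
            return ""; // fragment is already at or over its size
        }
        // Never read past the end of the text.
        int endIndex = Math.min(text.length(), lastEndOffset + remain);
        return text.substring(lastEndOffset, endIndex);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2500; i++) {
            sb.append('x');
        }
        String text = sb.toString();
        // Fragment size 300, 120 characters already in the fragment.
        System.out.println(clampedTail(text, 400, 300, 120).length()); // prints 180
    }
}
```

      With a fragment size of 300 and 120 characters already in the fragment, only 180 more characters are appended, so the last fragment ends at the configured size instead of absorbing the rest of the document.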

      Attachments

        1. LUCENE-5381.patch
          1 kB
          jefferyyuan

          People

            Assignee: Unassigned
            Reporter: jefferyyuan (yuanyun.cn)
            Votes: 0
            Watchers: 1

              Time Tracking

                Original Estimate: 4h
                Remaining Estimate: 4h
                Time Spent: Not Specified