  1. Lucene - Core
  2. LUCENE-5734

HTMLStripCharFilter end offset should be left of closing tags

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

    Description

      Consider this simple input:

      <em>hello</em>
      

      to be analyzed by HTMLStripCharFilter and WhitespaceTokenizer.
      You get back one token for "hello". Good. The start offset of this token is at the position of 'h'; also good. But the end offset, surprisingly, lands just past the adjacent </em> closing tag. I argue that it should instead land just past the last character of the token (immediately following 'o').

      FYI, it behaves as I expect when hello is followed by an HTML character entity, as in this example:

      hello&nbsp;

      The end offset immediately follows the 'o'.
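
To make the offsets in question concrete, here is a minimal sketch that computes them with plain string arithmetic. It does not call Lucene; the class and method names (OffsetSketch, startOffset, expectedEndOffset, reportedEndOffset) are hypothetical, and the "reported" value reflects my reading of the description above (end offset just past the closing tag), not output captured from HTMLStripCharFilter.

```java
// Illustration only: offsets into the ORIGINAL (unstripped) input string.
// Nothing here invokes Lucene; all names are hypothetical.
final class OffsetSketch {

    // Offset of the token's first character in the original input.
    static int startOffset(String input, String token) {
        return input.indexOf(token);
    }

    // End offset the report argues for: just past the token's last character.
    static int expectedEndOffset(String input, String token) {
        return input.indexOf(token) + token.length();
    }

    // End offset as the report describes the current behavior (an assumption
    // based on the description): just past the closing tag after the token.
    static int reportedEndOffset(String input, String token, String closingTag) {
        int tokenEnd = expectedEndOffset(input, token);
        return input.indexOf(closingTag, tokenEnd) + closingTag.length();
    }

    public static void main(String[] args) {
        String tagged = "<em>hello</em>";
        System.out.println("start        = " + startOffset(tagged, "hello"));
        System.out.println("expected end = " + expectedEndOffset(tagged, "hello"));
        System.out.println("reported end = " + reportedEndOffset(tagged, "hello", "</em>"));

        // Entity case from the report: the end offset immediately follows 'o'.
        String entity = "hello&nbsp;";
        System.out.println("entity end   = " + expectedEndOffset(entity, "hello"));
    }
}
```

For `<em>hello</em>` the token starts at 4, the end offset argued for is 9, and the end offset past the closing tag is 14. A highlighter that uses the latter would wrap the `</em>` tag along with the word.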

            People

              Assignee: Unassigned
              Reporter: David Smiley (dsmiley)
              Votes: 2
              Watchers: 9
