  Lucene - Core / LUCENE-5734

HTMLStripCharFilter end offset should be left of closing tags

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

      Description

      Consider this simple input:

      <em>hello</em>
      

      to be analyzed by HTMLStripCharFilter and WhitespaceTokenizer.
      You get back one token for "hello". Good. The start offset of this token is at the position of 'h', which is also good. But the end offset, surprisingly, is one past the end of the adjacent </em> closing tag. I argue that it should be one past the last character of the token (immediately following 'o').

      FYI, it behaves as I expect when "hello" is followed by an HTML character entity, as in this example:

      hello&nbsp;

      The end offset immediately follows the 'o'.
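
      To make the offsets concrete, here is a minimal sketch in plain Java. It does not call Lucene at all; it only does string arithmetic on the raw input to show which positions are being discussed. The class and method names (HtmlStripOffsetSketch, proposedEndOffset, reportedEndOffset) are hypothetical, and the "reported" value assumes the reading that the end offset currently lands just past the closing tag.

```java
// Illustration only: plain string arithmetic, no Lucene classes involved.
final class HtmlStripOffsetSketch {

    // Start offset of the token text within the original (unstripped) input.
    static int startOffset(String input, String token) {
        return input.indexOf(token);
    }

    // End offset this report argues for: one past the token's last character.
    static int proposedEndOffset(String input, String token) {
        return input.indexOf(token) + token.length();
    }

    // End offset as this report describes the current behavior (assumption:
    // it lands just past the markup that immediately follows the token).
    static int reportedEndOffset(String input, String token, String trailingMarkup) {
        int end = proposedEndOffset(input, token);
        if (!input.startsWith(trailingMarkup, end)) {
            throw new IllegalArgumentException("markup does not follow the token");
        }
        return end + trailingMarkup.length();
    }

    public static void main(String[] args) {
        String tagged = "<em>hello</em>";
        System.out.println(startOffset(tagged, "hello"));                // 4
        System.out.println(proposedEndOffset(tagged, "hello"));          // 9
        System.out.println(reportedEndOffset(tagged, "hello", "</em>")); // 14

        // Entity case: the end offset already sits right after 'o'.
        String entity = "hello&nbsp;";
        System.out.println(proposedEndOffset(entity, "hello"));          // 5
    }
}
```

      In other words, for the tagged input the request is an end offset of 9 rather than 14, so that a highlighter using the offsets would not pull the closing tag into the match.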

              People

              • Assignee: Unassigned
              • Reporter: David Smiley (dsmiley)
              • Votes: 2
              • Watchers: 9
