Lucene - Core / LUCENE-105

[PATCH] HTML parser should treat <td> as a word break element

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • 1.2
    • None
    • modules/examples
    • None
    • Operating System: All
      Platform: All

    • 19253

    Description

      When parsing HTML code such as " abc</td><dt>xyz ", the HTML parser skips over
      the elements and concatenates the text around them without inserting white
      space, in this case producing abcxyz. A search of the resulting index for abc
      will then find nothing.

      At least for the tags <td>, <p>, <br>, <blockquote>, <dt>, <h1> - <h6>, <li>,
      and <q>, the parser should separate the text on either side of the tag with a
      space. Separating the strings with square brackets, "[" or "]", would also
      work, as they are already used for the text in the ALT attribute of images.
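
The requested behaviour can be illustrated with a minimal sketch. This is not the parser from the Lucene examples module; it is a small, regex-based stand-in written only to show the idea, and the class name WordBreakTagStripper and the method strip are hypothetical. It assumes that tags from the list above become a single space, while all other tags are dropped without a separator.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Lucene.
class WordBreakTagStripper {
    // Tags named in the report as needing a word break on both sides.
    private static final Set<String> BREAK_TAGS = new HashSet<>(Arrays.asList(
            "td", "p", "br", "blockquote", "dt",
            "h1", "h2", "h3", "h4", "h5", "h6", "li", "q"));

    // Matches an opening or closing tag and captures its name.
    private static final Pattern TAG =
            Pattern.compile("</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    static String strip(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1).toLowerCase(Locale.ROOT);
            // Word-break tags become a space; every other tag is removed.
            m.appendReplacement(out, BREAK_TAGS.contains(name) ? " " : "");
        }
        m.appendTail(out);
        // Collapse runs of white space left behind by adjacent tags.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("abc</td><dt>xyz")); // prints "abc xyz"
        System.out.println(strip("a<b>b</b>c"));      // prints "abc"
    }
}
```

With this sketch, the example from the report yields two separate tokens, abc and xyz, while inline markup such as <b> still leaves a word intact.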

      A workaround is to add spaces when authoring the HTML code, but that is not
      always possible when the documents are created by somebody else.

          People

            Assignee: Unassigned
            Reporter: Konrad Kolosowski (konradk@ca.ibm.com)