Lucene - Core / LUCENE-105

[PATCH] HTML parser should treat <td> as a word break element

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: None
    • Component/s: modules/examples
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

    • Bugzilla Id:
      19253

      Description

      When parsing HTML such as "abc</td><dt>xyz", the HTML parser skips over the
      elements and concatenates the text on either side of them without inserting
      white space, in this case producing "abcxyz". A search of the resulting index
      will not find "abc".

      At least for the tags <td>, <p>, <br>, <blockquote>, <dt>, <h1> - <h6>, <li>, and
      <q>, the parser should separate the text on both sides of the tag with a space.
      Using square brackets, "[" or "]", to separate the strings would also work, as
      they are already used for the text in the ALT attribute of images.

      A workaround is to add spaces when authoring the HTML, but that is not always
      possible when the documents are created by somebody else.
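
The requested behaviour can be illustrated with a small standalone sketch. This is not the demo parser's actual code: the class name WordBreakTags, the method extractText, and the regex-based approach are all hypothetical and chosen only to show the idea. The tag list is taken from the tags named above; every other tag is dropped without adding a space.

```java
import java.util.regex.Pattern;

public class WordBreakTags {
    // Assumption: the word-break elements are exactly the tags listed in this report.
    // The \b keeps <p> from matching <pre> and <br> from matching other longer names.
    private static final Pattern BREAK_TAG = Pattern.compile(
        "</?(td|p|br|blockquote|dt|h[1-6]|li|q)\\b[^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Any remaining tag is treated as inline and removed without a separator.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    /** Returns the text of an HTML fragment, with a space wherever a word-break tag stood. */
    public static String extractText(String html) {
        String spaced = BREAK_TAG.matcher(html).replaceAll(" ");
        String stripped = ANY_TAG.matcher(spaced).replaceAll("");
        // Collapse runs of white space so adjacent break tags yield a single space.
        return stripped.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(extractText("abc</td><dt>xyz"));
    }
}
```

With this sketch the fragment from the description yields two separate tokens, "abc" and "xyz", while inline markup such as <b> still leaves the surrounding text joined.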

            People

            • Assignee:
              Unassigned
            • Reporter:
              konradk@ca.ibm.com Konrad Kolosowski