  1. Lucene - Core
  2. LUCENE-105

[PATCH] HTML parser should treat <td> as a word break element

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: None
    • Component/s: modules/examples
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

    • Bugzilla Id:
      19253

      Description

      When parsing HTML such as "abc</td><dt>xyz", the HTML parser skips over elements
      and concatenates the text around them without inserting white space, in
      this case producing "abcxyz". A search of the resulting index then cannot find
      "abc".

      At least for the tags <td>, <p>, <br>, <blockquote>, <dt>, <h1> - <h6>, <li>, and
      <q>, the parser should separate the text on both sides of the tag with a space.
      Using square brackets ("[" or "]") as separators would also work, since they are
      already used for the text in the ALT attribute of images.

      A workaround is to add spaces when authoring the HTML, but that is not always
      possible when the documents are created by somebody else.
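
      The requested behavior can be illustrated with a minimal sketch. It is not
      the Lucene demo HTML parser or the attached patch; the class name
      WordBreakStripper, the method names, and the element set are hypothetical
      and only mirror the tags listed in this report. Tags are dropped, and a
      space is emitted for every opening or closing tag whose element is in the
      word-break set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class WordBreakStripper {
    // Hypothetical set: the elements named in this report as word breaks.
    private static final Set<String> WORD_BREAK_ELEMENTS = new HashSet<>(Arrays.asList(
            "td", "p", "br", "blockquote", "dt", "li", "q",
            "h1", "h2", "h3", "h4", "h5", "h6"));

    /** Removes tags; emits a space in place of each word-break tag. */
    public static String strip(String html) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c != '<') {
                out.append(c);
                i++;
                continue;
            }
            int end = html.indexOf('>', i);
            if (end < 0) {
                // Unterminated tag: keep the rest as plain text.
                out.append(html, i, html.length());
                break;
            }
            String name = elementName(html.substring(i + 1, end));
            if (WORD_BREAK_ELEMENTS.contains(name)) {
                out.append(' ');
            }
            i = end + 1;
        }
        // Collapse runs of white space so adjacent break tags yield one space.
        return out.toString().trim().replaceAll("\\s+", " ");
    }

    /** Extracts the lower-cased element name from the text between '<' and '>'. */
    static String elementName(String tagBody) {
        String s = tagBody.trim();
        if (s.startsWith("/")) {
            s = s.substring(1); // closing tag: same element name
        }
        int n = 0;
        while (n < s.length() && Character.isLetterOrDigit(s.charAt(n))) {
            n++;
        }
        return s.substring(0, n).toLowerCase(Locale.ROOT);
    }
}
```

      With this sketch, WordBreakStripper.strip("abc</td><dt>xyz") yields
      "abc xyz", while inline markup such as "a<b>b</b>c" still yields "abc".
      Matching on the element name rather than on the literal tag text means
      closing tags are covered without listing them separately.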

        Activity

        daniel.naber@t-online.de Daniel Naber added a comment -

        Created an attachment (id=8851)
        patch to fix bug

        daniel.naber@t-online.de Daniel Naber added a comment -

        I added an attachment that fixes this problem. Other elements (h1 etc) should probably
        also be added to the list, as the original bug report suggests.

        daniel.naber@t-online.de Daniel Naber added a comment -

        Created an attachment (id=9055)
        improved patch

        goller@detego-software.de Christoph Goller added a comment -

        Daniel's patch solves the problem.
        A slightly modified (refurbished) version has been committed.

        konradk@ca.ibm.com Konrad Kolosowski added a comment -

        Thanks for fixing this bug.

        The problem also occurs on closing tags. Could a small change be made to the
        set in the Tags class so that it contains "</h1" - "</h5", "</p" ... and thus
        includes closing tags by default? Should I open a separate bug?

        goller@detego-software.de Christoph Goller added a comment -

        Closing tags added.


          People

          • Assignee:
            Unassigned
          • Reporter:
            konradk@ca.ibm.com Konrad Kolosowski
          • Votes:
            0
          • Watchers:
            0
