[LUCENE-105] [PATCH] HTML parser should treat <td> as a word break element - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.2
Fix Version/s: None
Component/s: modules/examples
Labels: None
Environment: Operating System: All; Platform: All
Bugzilla Id: 19253

Description

When parsing HTML such as "abc</td><dt>xyz", the HTML parser skips over the
elements and concatenates the text on either side of them without inserting
white space, in this case producing "abcxyz". A search of the resulting index
then cannot find "abc".

At least for the tags <td>, <p>, <br>, <blockquote>, <dt>, <h1> - <h6>, <li>,
and <q>, the parser should separate the strings on both sides of the tag with a
space. Separating the strings with square brackets, "[" or "]", would also work,
as that is already done for the text in the ALT attribute of images.
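
The requested behavior can be illustrated with a minimal sketch. This is not
the Lucene demo parser and not the attached patch; it uses Python's standard
html.parser module, and the names WORD_BREAK_TAGS, BreakingTextExtractor, and
extract_text are hypothetical, chosen only for this illustration. It assumes
that a single space is an acceptable separator and that runs of white space may
be collapsed.

```python
from html.parser import HTMLParser

# Tags the report lists as needing a word break on both sides (assumed set).
WORD_BREAK_TAGS = {"td", "p", "br", "blockquote", "dt", "li", "q",
                   "h1", "h2", "h3", "h4", "h5", "h6"}


class BreakingTextExtractor(HTMLParser):
    """Collects text, adding a space wherever a word-break tag opens or closes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from HTMLParser.
        if tag in WORD_BREAK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in WORD_BREAK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html):
    """Return the text of html with word-break tags replaced by spaces."""
    parser = BreakingTextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Collapse white-space runs so adjacent break tags yield a single space.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(extract_text("abc</td><dt>xyz"))
```

With this sketch the example from the report yields two separate tokens, while
inline tags that are not in the set (such as <b>) still join the text around
them.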

A workaround is to add spaces when authoring the HTML, but that is not always
possible when the documents are created by somebody else.

Attachments

- ASF.LICENSE.NOT.GRANTED--html_parser.diff (01/Nov/03 03:11, 0.7 kB, Daniel Naber)
- ASF.LICENSE.NOT.GRANTED--html_parser2.diff (11/Nov/03 20:39, 2 kB, Daniel Naber)

People

Assignee: Unassigned

Reporter: Konrad Kolosowski

Votes: 0

Watchers: 0

Dates

Created: 23/Apr/03 23:53

Updated: 28/Aug/22 11:13

Resolved: 03/Sep/05 15:24