[TIKA-343] Some parsers produce glued words - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.5, 0.6
Fix Version/s: 0.6
Component/s: parser
Labels: None

Description

Some parsers ignore word/line delimiters.

For example, the document:
"<html><head></head><body>test<br>test</body></html>"
is decoded by HtmlParser to "testtest" instead of two separate words.

I think the HtmlParser.mapSafeElement method should be extended by:

if ("BR".equals(name)) return "br";
if ("DIV".equals(name)) return "div";
if ("HR".equals(name)) return "hr";
if ("ADDRESS".equals(name)) return "address";
if ("FIELDSET".equals(name)) return "fieldset";
if ("FORM".equals(name)) return "form";
if ("NOSCRIPT".equals(name)) return "noscript";
if ("NOFRAMES".equals(name)) return "noframes";
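
The proposed mapping can be sketched in isolation as follows. This is only an illustration, not Tika's code: the class name SafeElementSketch is hypothetical, and it assumes (as the lines above suggest) that the method receives an upper-case element name and returns the lower-case safe name, or null when the element is not mapped.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-alone sketch; not the real HtmlParser class.
class SafeElementSketch {
    // Line-breaking and block-level elements named in the proposal above.
    private static final Set<String> PROPOSED = new HashSet<>(Arrays.asList(
            "BR", "DIV", "HR", "ADDRESS", "FIELDSET", "FORM", "NOSCRIPT", "NOFRAMES"));

    // Assumed contract: upper-case name in, lower-case safe name out, null if unmapped.
    static String mapSafeElement(String name) {
        return PROPOSED.contains(name) ? name.toLowerCase(Locale.ROOT) : null;
    }

    public static void main(String[] args) {
        System.out.println(mapSafeElement("BR"));    // mapped element
        System.out.println(mapSafeElement("BLINK")); // unmapped element
    }
}
```

Once an element such as BR is mapped rather than dropped, the downstream handler sees an element boundary between the two words and has a chance to emit a delimiter there.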

Also, application/xml documents are parsed by removing unknown tags instead of replacing them with spaces.
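
To illustrate what replacing tags with spaces could look like for XML, here is a minimal sketch built on the JDK's SAX API. It is not Tika's XML parser: the class SpacedTextSketch and the method extractText are hypothetical names, and treating every element boundary as whitespace is an assumption made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: emit a space at every tag so adjacent text is not glued.
class SpacedTextSketch {
    static String extractText(String xml) {
        final StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append(' '); // an opening tag becomes a delimiter
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                out.append(' '); // a closing tag becomes a delimiter
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length); // copy text content unchanged
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        // Collapse the runs of whitespace introduced at nested tag boundaries.
        return out.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(extractText("<doc><a>test</a><b>test</b></doc>")); // prints "test test"
    }
}
```

The trade-off in this sketch is that inline markup inside a word (for example `te<b>st</b>`) would also be split, so a real fix would likely need to distinguish inline from block-like elements where that information is available.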

Issue Links

relates to

SOLR-4908 SolrContentHandler procuces glued words when extracting html

Resolved

People

Assignee: Jukka Zitting

Reporter: Piotr Bartosiewicz

Votes: 0

Watchers: 0

Dates

Created: 07/Dec/09 11:50

Updated: 07/Jun/13 08:29

Resolved: 13/Dec/09 22:04