  Nutch / NUTCH-2318

Text extraction in HtmlParser adds too much whitespace.


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.3.1, 1.15
    • Fix Version/s: 1.19
    • Component/s: parser, plugin
    • Labels: None

      Description

      In parse-html, org.apache.nutch.parse.html.HtmlParser calls DOMContentUtils.getText() to extract the text content. For every text node encountered in the document, the getTextHelper() function first appends a space character to the already-extracted text and then appends the node's text content (stripped of excess whitespace). This means that parsing HTML such as

      <p>behavi<em>ou</em>r</p>

      will lead to this extracted text:

      behavi ou r

      I would have expected a parser not to add whitespace to content that visually (and actually) does not contain any in the first place. The same applies to all similar inline semantic tags, as well as <span>.
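
      For illustration, the text-node handling in getTextHelper() boils down to roughly the following (a paraphrased sketch, not the verbatim Nutch source):

      import org.w3c.dom.Node;

      // Paraphrased sketch of the text-node handling in
      // DOMContentUtils.getTextHelper(); not the verbatim Nutch source.
      class TextExtractSketch {
        static void getTextHelper(StringBuffer sb, Node node) {
          if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().replaceAll("\\s+", " ").trim();
            if (text.length() > 0) {
              if (sb.length() > 0) {
                // A separating space is appended before every non-empty
                // text node, even between inline siblings such as <em>.
                sb.append(' ');
              }
              sb.append(text);
            }
          }
          for (Node child = node.getFirstChild(); child != null;
               child = child.getNextSibling()) {
            getTextHelper(sb, child);
          }
        }
      }

      Running this over the example above visits the text nodes "behavi", "ou", and "r" and puts a space between each of them, which is exactly the reported output.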

      My naive approach would be to remove the lines text = text.trim() and sb.append(' '), but I'm aware that this would break the parsing of markup like <p>foo</p><p>bar</p>, which would then come out as "foobar".
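
      A less naive direction might be to emit the separator only when leaving a block-level element, so inline siblings concatenate directly while <p>foo</p><p>bar</p> still comes out as "foo bar". Something like this untested sketch, where isBlockLevel() and the tag list are hypothetical helpers, not existing Nutch code:

      import java.util.Arrays;
      import java.util.HashSet;
      import java.util.Set;
      import org.w3c.dom.Node;

      // Untested sketch: emit the separating space when leaving a
      // block-level element instead of before every text node.
      class BlockAwareSketch {
        private static final Set<String> BLOCK_TAGS = new HashSet<>(Arrays.asList(
            "p", "div", "li", "ul", "ol", "table", "tr", "td", "th", "br",
            "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre"));

        static boolean isBlockLevel(Node node) {
          return node.getNodeType() == Node.ELEMENT_NODE
              && BLOCK_TAGS.contains(node.getNodeName().toLowerCase());
        }

        static void getTextHelper(StringBuffer sb, Node node) {
          if (node.getNodeType() == Node.TEXT_NODE) {
            // Collapse runs of whitespace, but do not force a separator,
            // so behavi / ou / r concatenate back into "behaviour".
            String text = node.getNodeValue().replaceAll("\\s+", " ");
            boolean redundant = text.equals(" ") && sb.length() > 0
                && sb.charAt(sb.length() - 1) == ' ';
            if (!redundant) {
              sb.append(text);
            }
          }
          for (Node child = node.getFirstChild(); child != null;
               child = child.getNextSibling()) {
            getTextHelper(sb, child);
          }
          // Separate blocks: <p>foo</p><p>bar</p> still yields "foo bar".
          if (isBlockLevel(node) && sb.length() > 0
              && sb.charAt(sb.length() - 1) != ' ') {
            sb.append(' ');
          }
        }
      }

      The trade-off is maintaining a tag list; an alternative would be to invert it and treat only known inline tags (em, strong, span, a, ...) as non-separating, keeping the current behaviour for anything unknown.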

      This is not an issue in parse-tika, since Tika removes all "unimportant" tags beforehand. However, I'd like to keep using parse-html because I need to keep the document reasonably intact for the parse filters applied later.

      I know I could write a parse filter that re-extracts the text content, but this feels like a bug (or at least a shortcoming) in parse-html.
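
      For the record, such a workaround could look roughly like this (an untested sketch against the 1.x HtmlParseFilter extension point; ReextractTextFilter is a hypothetical name, and it reuses the block-aware helper sketched above):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.nutch.parse.HTMLMetaTags;
      import org.apache.nutch.parse.HtmlParseFilter;
      import org.apache.nutch.parse.Parse;
      import org.apache.nutch.parse.ParseResult;
      import org.apache.nutch.parse.ParseText;
      import org.apache.nutch.protocol.Content;
      import org.w3c.dom.DocumentFragment;

      // Untested sketch of a workaround filter that re-extracts the text
      // from the DOM with inline-aware whitespace handling and replaces
      // the parse text.
      public class ReextractTextFilter implements HtmlParseFilter {

        private Configuration conf;

        @Override
        public ParseResult filter(Content content, ParseResult parseResult,
            HTMLMetaTags metaTags, DocumentFragment doc) {
          StringBuffer sb = new StringBuffer();
          BlockAwareSketch.getTextHelper(sb, doc); // helper sketched above
          Parse parse = parseResult.get(content.getUrl());
          if (parse != null) {
            parseResult.put(content.getUrl(),
                new ParseText(sb.toString().trim()), parse.getData());
          }
          return parseResult;
        }

        @Override
        public void setConf(Configuration conf) { this.conf = conf; }

        @Override
        public Configuration getConf() { return conf; }
      }

      That said, every user having to ship such a filter is exactly why this feels like something that should be fixed in DOMContentUtils itself.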

    People

    • Assignee: Unassigned
    • Reporter: fezett (Felix Zett)
    • Votes: 0
    • Watchers: 3
