  Nutch / NUTCH-2318

Text extraction in HtmlParser adds too much whitespace.


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.3.1, 1.15
    • Fix Version/s: 1.19
    • Component/s: parser, plugin
    • Labels: None

      Description

      In parse-html, org.apache.nutch.parse.html.HtmlParser calls DOMContentUtils.getText() to extract the text content. For every text node encountered in the document, the getTextHelper() function first appends a space character to the already-extracted text and then appends the node's text content (stripped of excess whitespace). This means that parsing HTML such as

      <p>behavi<em>ou</em>r</p>

      will lead to this extracted text:

      behavi ou r

      I would have expected a parser not to add whitespace to content that visually (and actually) does not contain any in the first place. The same applies to all similar inline semantic tags, as well as <span>.
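
      For illustration, the text-node handling in getTextHelper() boils down to roughly the following (a paraphrased sketch, not the verbatim Nutch source):

      import org.w3c.dom.Node;

      // Paraphrased sketch of the text-node handling in
      // DOMContentUtils.getTextHelper(); not the verbatim Nutch source.
      class TextExtractSketch {
        static void getTextHelper(StringBuffer sb, Node node) {
          if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().replaceAll("\\s+", " ").trim();
            if (text.length() > 0) {
              if (sb.length() > 0) {
                // A separating space is appended before every non-empty
                // text node, even between inline siblings such as <em>.
                sb.append(' ');
              }
              sb.append(text);
            }
          }
          for (Node child = node.getFirstChild(); child != null;
               child = child.getNextSibling()) {
            getTextHelper(sb, child);
          }
        }
      }

      Running this over the example above visits the text nodes "behavi", "ou", and "r" and puts a space between each of them, which is exactly the reported output.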

      My naive approach would be to remove the lines text = text.trim() and sb.append(' '), but I'm aware that this would break the parsing of markup like <p>foo</p><p>bar</p>, which would then come out as "foobar".
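
      A less naive direction might be to emit the separator only when leaving a block-level element, so inline siblings concatenate directly while <p>foo</p><p>bar</p> still comes out as "foo bar". Something like this untested sketch, where isBlockLevel() and the tag list are hypothetical helpers, not existing Nutch code:

      import java.util.Arrays;
      import java.util.HashSet;
      import java.util.Set;
      import org.w3c.dom.Node;

      // Untested sketch: emit the separating space when leaving a
      // block-level element instead of before every text node.
      class BlockAwareSketch {
        private static final Set<String> BLOCK_TAGS = new HashSet<>(Arrays.asList(
            "p", "div", "li", "ul", "ol", "table", "tr", "td", "th", "br",
            "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre"));

        static boolean isBlockLevel(Node node) {
          return node.getNodeType() == Node.ELEMENT_NODE
              && BLOCK_TAGS.contains(node.getNodeName().toLowerCase());
        }

        static void getTextHelper(StringBuffer sb, Node node) {
          if (node.getNodeType() == Node.TEXT_NODE) {
            // Collapse runs of whitespace, but do not force a separator,
            // so behavi / ou / r concatenate back into "behaviour".
            String text = node.getNodeValue().replaceAll("\\s+", " ");
            boolean redundant = text.equals(" ") && sb.length() > 0
                && sb.charAt(sb.length() - 1) == ' ';
            if (!redundant) {
              sb.append(text);
            }
          }
          for (Node child = node.getFirstChild(); child != null;
               child = child.getNextSibling()) {
            getTextHelper(sb, child);
          }
          // Separate blocks: <p>foo</p><p>bar</p> still yields "foo bar".
          if (isBlockLevel(node) && sb.length() > 0
              && sb.charAt(sb.length() - 1) != ' ') {
            sb.append(' ');
          }
        }
      }

      The trade-off is maintaining a tag list; an alternative would be to invert it and treat only known inline tags (em, strong, span, a, ...) as non-separating, keeping the current behaviour for anything unknown.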

      This is not an issue in parse-tika, since Tika removes all "unimportant" tags beforehand. However, I'd like to keep using parse-html because I need to keep the document reasonably intact for the parse filters applied later.

      I know I could write a parse filter that re-extracts the text content, but this feels like a bug (or at least a shortcoming) in parse-html.
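
      For the record, such a workaround could look roughly like this (an untested sketch against the 1.x HtmlParseFilter extension point; ReextractTextFilter is a hypothetical name, and it reuses the block-aware helper sketched above):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.nutch.parse.HTMLMetaTags;
      import org.apache.nutch.parse.HtmlParseFilter;
      import org.apache.nutch.parse.Parse;
      import org.apache.nutch.parse.ParseResult;
      import org.apache.nutch.parse.ParseText;
      import org.apache.nutch.protocol.Content;
      import org.w3c.dom.DocumentFragment;

      // Untested sketch of a workaround filter that re-extracts the text
      // from the DOM with inline-aware whitespace handling and replaces
      // the parse text.
      public class ReextractTextFilter implements HtmlParseFilter {

        private Configuration conf;

        @Override
        public ParseResult filter(Content content, ParseResult parseResult,
            HTMLMetaTags metaTags, DocumentFragment doc) {
          StringBuffer sb = new StringBuffer();
          BlockAwareSketch.getTextHelper(sb, doc); // helper sketched above
          Parse parse = parseResult.get(content.getUrl());
          if (parse != null) {
            parseResult.put(content.getUrl(),
                new ParseText(sb.toString().trim()), parse.getData());
          }
          return parseResult;
        }

        @Override
        public void setConf(Configuration conf) { this.conf = conf; }

        @Override
        public Configuration getConf() { return conf; }
      }

      That said, every user having to ship such a filter is exactly why this feels like something that should be fixed in DOMContentUtils itself.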

    People

    • Assignee: Unassigned
    • Reporter: fezett (Felix Zett)
    • Votes: 0
    • Watchers: 3
