Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.7.2, 0.8
    • Component/s: None
    • Labels:
      None

      Description

      I found some sites where the response header says: "Content-Type: text/html; charset=". The empty charset value causes an exception in the HtmlParser. My suggestion:

      Index: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
      ===================================================================
      --- src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java (revision 279397)
      +++ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java (working copy)
      @@ -120,7 +120,7 @@
       byte[] contentInOctets = content.getContent();
       InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
       String encoding = StringUtil.parseCharacterEncoding(contentType);
      -if (encoding!=null) {
      +if (encoding!=null && !"".equals(encoding)) {
       metadata.put("OriginalCharEncoding", encoding);
       if ((encoding = StringUtil.resolveEncodingAlias(encoding)) != null) {
       metadata.put("CharEncodingForConversion", encoding);
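      The guard in the patch can be tried outside Nutch with a minimal, self-contained sketch. Everything here is an assumption made for illustration: the class CharsetGuard and the helper encodingMetadata are hypothetical, and parseCharset is only a simplified stand-in for StringUtil.parseCharacterEncoding, not its actual implementation. It shows that a header ending in "charset=" yields an empty string rather than null, so a null check alone lets the empty value through.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class CharsetGuard {

    // Hypothetical stand-in for StringUtil.parseCharacterEncoding:
    // returns the text after "charset=" (possibly empty), or null if absent.
    static String parseCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        String key = "charset=";
        int start = contentType.toLowerCase(Locale.ROOT).indexOf(key);
        if (start < 0) {
            return null;
        }
        String value = contentType.substring(start + key.length());
        int end = value.indexOf(';');
        if (end >= 0) {
            value = value.substring(0, end);
        }
        value = value.trim();
        // Strip one pair of surrounding double quotes, if present.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return value;
    }

    // Mirrors the patched condition: record the encoding only when it is
    // neither null nor the empty string.
    static Map<String, String> encodingMetadata(String contentType) {
        Map<String, String> metadata = new LinkedHashMap<>();
        String encoding = parseCharset(contentType);
        if (encoding != null && !"".equals(encoding)) {
            metadata.put("OriginalCharEncoding", encoding);
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Empty charset value: nothing is recorded.
        System.out.println(encodingMetadata("text/html; charset="));
        // Well-formed header: the declared encoding is recorded.
        System.out.println(encodingMetadata("text/html; charset=UTF-8"));
    }
}
```

      With the extra check, the empty value is treated the same as a missing charset, so later encoding lookups never see an empty name.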

        Activity

        pkosiorowski Piotr Kosiorowski added a comment -

        Committed with a small extension. Thanks.


          People

          • Assignee:
            Unassigned
            Reporter:
            mnebel Michael Nebel
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved:
