  1. Nutch
  2. NUTCH-369

StringUtil.resolveEncodingAlias is useless.


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: None
    • Component/s: fetcher
    • Labels: None
    • Environment: all

      Description

      Even after defining an encoding alias map in StringUtil, HTML parsing still uses the original encoding.

      I found that NekoHTML, which HtmlParser uses, reads the charset from the page's meta tag.

      We can set its feature "http://cyberneko.org/html/features/scanner/ignore-specified-charset" to true
      so that NekoHTML uses the encoding we set.

      Concretely:

      private DocumentFragment parseNeko(InputSource input) throws Exception {
        DOMFragmentParser parser = new DOMFragmentParser();
        // some plugins, e.g., creativecommons, need to examine html comments
        try {
      +   parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true);
          parser.setFeature("http://apache.org/xml/features/include-comments", true);
          ....

      Note that it must be added at the front of the try block, because the statement that follows it
      (parser.setFeature("http://apache.org/xml/features/include-comments", true)) will throw an exception.
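
      The behaviour described above can be modelled in plain Java. This is only a rough sketch and is not Nutch or NekoHTML code: the class name, the alias entries, and the ignoreSpecifiedCharset flag are all hypothetical stand-ins. It shows why a resolved alias has no effect while the parser prefers the charset declared in the page's meta tag, and how ignoring that declaration lets the configured encoding win.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical model of the encoding choice; not the actual Nutch or NekoHTML code.
final class EncodingChoiceSketch {

    // Illustrative alias table only; the real entries in StringUtil may differ.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("iso-8859-1", "windows-1252");
        ALIASES.put("gb2312", "GB18030");
    }

    // Returns the alias for an encoding name, or the name itself if no alias is defined.
    static String resolveAlias(String encoding) {
        if (encoding == null) {
            return null;
        }
        String alias = ALIASES.get(encoding.toLowerCase(Locale.ROOT));
        return alias != null ? alias : encoding;
    }

    // Models the parser's decision: unless told to ignore it, the charset
    // declared in the document overrides the encoding supplied by the caller.
    static String effectiveEncoding(String configured, String metaCharset,
                                    boolean ignoreSpecifiedCharset) {
        if (ignoreSpecifiedCharset || metaCharset == null) {
            return configured;
        }
        return metaCharset;
    }

    public static void main(String[] args) {
        String configured = resolveAlias("gb2312");
        // Meta charset is honoured, so the alias is effectively discarded.
        System.out.println(effectiveEncoding(configured, "gb2312", false)); // prints gb2312
        // Meta charset is ignored, so the aliased encoding is used.
        System.out.println(effectiveEncoding(configured, "gb2312", true));  // prints GB18030
    }
}
```

      In the real parser the flag corresponds to the ignore-specified-charset feature named above; the sketch only illustrates the precedence between the two encodings.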

        Attachments

        1. remover.diff
          3 kB
          Renaud Richardet
        2. patch.diff
          2 kB
          Renaud Richardet


            People

            • Assignee: Dogacan Guney (dogacan)
            • Reporter: King Kong (chinawab)