Nutch / NUTCH-369

StringUtil.resolveEncodingAlias is useless.

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: None
    • Component/s: fetcher
    • Labels: None
    • Environment: all

    Description

      Even after an encoding alias map is defined in StringUtil, HTML parsing still uses the original encoding.

      I found that NekoHTML, which HtmlParser uses, reads the charset from the page's meta tag.
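The effect of decoding with the page-declared charset instead of the resolved alias can be sketched with plain JDK classes. This is only an illustration, not Nutch code: the `ALIASES` map, its single ISO-8859-1 to windows-1252 entry, and the `resolveAlias` and `decode` helpers are all assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class EncodingAliasSketch {
    // Hypothetical alias map; the real entries in StringUtil may differ.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("iso-8859-1", "windows-1252");
    }

    // Returns the alias target if one is defined, otherwise the name unchanged.
    static String resolveAlias(String declared) {
        String target = ALIASES.get(declared.toLowerCase(Locale.ROOT));
        return target != null ? target : declared;
    }

    // Decodes raw page bytes with the given charset name.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0x80};      // the euro sign in windows-1252
        String declared = "ISO-8859-1";  // what the page's meta tag claims

        String viaMeta = decode(raw, declared);
        String viaAlias = decode(raw, resolveAlias(declared));

        System.out.println((int) viaMeta.charAt(0));   // 128  (U+0080, a control character)
        System.out.println((int) viaAlias.charAt(0));  // 8364 (U+20AC, the euro sign)
    }
}
```

If the parser picks the charset from the meta tag, the first result is what ends up in the parsed text, regardless of what the alias map says.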

      We can set its feature "http://cyberneko.org/html/features/scanner/ignore-specified-charset" to true
      so that NekoHTML uses the encoding we set.

      Concretely:

      private DocumentFragment parseNeko(InputSource input) throws Exception {
        DOMFragmentParser parser = new DOMFragmentParser();
        // some plugins, e.g., creativecommons, need to examine html comments
        try {
      +   parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true);
          parser.setFeature("http://apache.org/xml/features/include-comments", true);
          ....

      BTW, it must be added at the front of the try block, because the following statement
      (parser.setFeature("http://apache.org/xml/features/include-comments", true)) throws an exception,
      so anything placed after it in the try block is never executed.
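Why the position inside the try block matters can be sketched without NekoHTML. The `StubParser` class below is a hypothetical stand-in, not the real DOMFragmentParser: it simply assumes, as described above, that setting the include-comments feature throws, and that the surrounding code catches the exception and carries on.

```java
import java.util.ArrayList;
import java.util.List;

class FeatureOrderSketch {
    static final String IGNORE_CHARSET =
        "http://cyberneko.org/html/features/scanner/ignore-specified-charset";
    static final String INCLUDE_COMMENTS =
        "http://apache.org/xml/features/include-comments";

    // Hypothetical stand-in for the parser: records accepted features and
    // rejects include-comments, mimicking the exception described above.
    static class StubParser {
        final List<String> enabled = new ArrayList<>();

        void setFeature(String name, boolean value) throws Exception {
            if (INCLUDE_COMMENTS.equals(name)) {
                throw new Exception("feature not recognized: " + name);
            }
            if (value) {
                enabled.add(name);
            }
        }
    }

    // Applies the two features in the given order inside a single try block
    // and returns the features that were actually enabled.
    static List<String> configure(String first, String second) {
        StubParser parser = new StubParser();
        try {
            parser.setFeature(first, true);
            parser.setFeature(second, true);
        } catch (Exception e) {
            // Assumed behavior: the failure is tolerated and setup continues.
        }
        return parser.enabled;
    }

    public static void main(String[] args) {
        // Charset feature first: it is applied before the exception occurs.
        System.out.println(configure(IGNORE_CHARSET, INCLUDE_COMMENTS).size()); // 1
        // Charset feature second: the exception skips it entirely.
        System.out.println(configure(INCLUDE_COMMENTS, IGNORE_CHARSET).size()); // 0
    }
}
```

An alternative, under the same assumptions, would be to give each setFeature call its own try block so that one failure cannot mask another.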

      Attachments

        1. remover.diff
          3 kB
          Renaud Richardet
        2. patch.diff
          2 kB
          Renaud Richardet

          People

            Assignee: Dogacan Guney (dogacan)
            Reporter: King Kong (chinawab)