Nutch / NUTCH-1733

parse-html to support HTML5 charset definitions

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8, 2.2.1
    • Fix Version/s: 2.3, 1.9
    • Component/s: parser
    • Labels:
      None

      Description

      HTML5 allows the character encoding of a page to be specified via

      • <meta charset="...">
      • Unicode Byte Order Mark (BOM)

      These are allowed in addition to the existing HTTP header and http-equiv Content-Type declarations, see [1].

      Parse-html ignores both the meta charset and the BOM and falls back to the default encoding (cp1252). Parse-tika sets the encoding correctly.
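
      To illustrate the expected behaviour, here is a minimal sketch of charset detection that honours both a BOM and a meta charset. It is not the code from the attached patches: the class name CharsetSniffer, its methods, the 1024-byte scan window, and the fallback constant are all hypothetical. The BOM is checked first, since HTML5 gives it precedence over a meta declaration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, for illustration only (not part of Nutch). */
public class CharsetSniffer {

    // Assumed fallback, standing in for the cp1252 default named above.
    static final String DEFAULT_ENCODING = "windows-1252";

    // Matches the charset value in <meta charset="..."> and also in the
    // older <meta http-equiv="Content-Type" content="...; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\s+[^>]*?charset\\s*=\\s*[\"']?([a-zA-Z0-9_\\-:.]+)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the encoding implied by a leading BOM, or null if none. */
    static String fromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    /** Scans the first bytes of the page for a meta charset declaration. */
    static String fromMeta(byte[] b) {
        int len = Math.min(b.length, 1024); // assumed scan window
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives.
        String head = new String(b, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    /** BOM first, then meta charset, then the default encoding. */
    static String detect(byte[] content) {
        String enc = fromBom(content);
        if (enc == null) {
            enc = fromMeta(content);
        }
        return enc != null ? enc : DEFAULT_ENCODING;
    }
}
```

      The meta value is returned exactly as written in the page, so a caller would still need to normalise it (for example with java.nio.charset.Charset.forName) and handle unsupported names. Pages in UTF-16 without a BOM are not covered by this sketch, because the meta scan assumes ASCII-compatible markup.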

        Attachments

        1. NUTCH-1733-trunk.patch
          8 kB
          Sebastian Nagel
        2. NUTCH-1733-2.x.patch
          9 kB
          Sebastian Nagel
        3. charset_html5.html
          0.5 kB
          Sebastian Nagel
        4. charset_bom_utf16_html5.html
          1 kB
          Sebastian Nagel
        5. charset_bom_html5.html
          0.5 kB
          Sebastian Nagel

            People

            • Assignee: Unassigned
            • Reporter: Sebastian Nagel (snagel)
            • Votes: 1
            • Watchers: 5

              Dates

              • Created:
                Updated:
                Resolved: