  1. Nutch
  2. NUTCH-1733

parse-html to support HTML5 charset definitions

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8, 2.2.1
    • Fix Version/s: 2.3, 1.9
    • Component/s: parser
    • Labels: None

    Description

      HTML5 allows the character encoding of a page to be specified by

      • <meta charset="...">
      • Unicode Byte Order Mark (BOM)

      These are allowed in addition to the existing HTTP and http-equiv Content-Type declarations, see [1].

      Parse-html ignores both the meta charset declaration and the BOM, and falls back to the default encoding (cp1252). Parse-tika sets the encoding correctly.
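
      The detection order described above can be sketched as follows. This is only an illustration under stated assumptions, not the code from the attached patches: the class name CharsetSniffer, the method names, the 2000-byte look-ahead, and the regular expression are all hypothetical. It checks for a BOM first, then looks for a charset declaration in a meta tag, and only then falls back to the default encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: sniffs the declared encoding of an HTML page.
final class CharsetSniffer {

    // Assumption: only the first bytes of the page are inspected.
    private static final int CHUNK_SIZE = 2000;

    // Matches both <meta charset="..."> and the older
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\s+[^>]*?charset\\s*=\\s*[\"']?([a-zA-Z0-9_.:-]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns the encoding implied by a Unicode byte order mark, or null if none is present.
    static String detectBom(byte[] content) {
        if (content.length >= 3
            && (content[0] & 0xFF) == 0xEF
            && (content[1] & 0xFF) == 0xBB
            && (content[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (content.length >= 2) {
            int b0 = content[0] & 0xFF;
            int b1 = content[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return "UTF-16BE";
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return "UTF-16LE";
            }
        }
        return null;
    }

    // BOM first, then a meta charset declaration, then the supplied default.
    static String sniff(byte[] content, String defaultEncoding) {
        String fromBom = detectBom(content);
        if (fromBom != null) {
            return fromBom;
        }
        int len = Math.min(content.length, CHUNK_SIZE);
        // Decode as ASCII: charset names and the meta tag itself are plain ASCII.
        String head = new String(content, 0, len, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            return m.group(1);
        }
        return defaultEncoding;
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"utf-8\"></head></html>"
            .getBytes(StandardCharsets.US_ASCII);
        // "windows-1252" is Java's canonical name for cp1252.
        System.out.println(sniff(page, "windows-1252"));
    }
}
```

      The sketch returns the declared name as written in the page, so a caller would still need to validate it (for example with java.nio.charset.Charset.isSupported) before decoding the content. A UTF-16 page without a BOM is not covered, because the ASCII pattern cannot match its meta tag.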

      Attachments

        1. charset_bom_html5.html
          0.5 kB
          Sebastian Nagel
        2. charset_html5.html
          0.5 kB
          Sebastian Nagel
        3. NUTCH-1733-trunk.patch
          8 kB
          Sebastian Nagel
        4. charset_bom_utf16_html5.html
          1 kB
          Sebastian Nagel
        5. NUTCH-1733-2.x.patch
          9 kB
          Sebastian Nagel

          People

            Assignee: Unassigned
            Reporter: Sebastian Nagel (snagel)
            Votes: 1
            Watchers: 5
