Apache Any23 (Retired): ANY23-385

Improve charset detection for (x)html documents


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 2.3
    • Fix Version: 2.3
    • Component: encoding
    • Labels: None

    Description

      When attempting to detect a document's encoding, our TikaEncodingDetector does not take into account the following elements, which may occur in HTML/XHTML documents:

      HTML:
      <meta http-equiv="content-type" content="text/html; charset=xyz"/>

      HTML5:
      <meta charset="xyz">

      XHTML:
      <?xml encoding='xyz'?>
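A minimal, self-contained sketch of what scanning a document prefix for these three declaration forms could look like (the class and method names here are hypothetical, not Any23's or Tika's actual API; real-world attribute order and quoting vary more than these regexes cover):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSniffer {
    // Patterns for the three declaration forms listed above (case-insensitive).
    private static final Pattern HTML_META = Pattern.compile(
        "<meta[^>]+http-equiv\\s*=\\s*[\"']?content-type[\"']?[^>]*charset\\s*=\\s*[\"']?([\\w-]+)",
        Pattern.CASE_INSENSITIVE);
    private static final Pattern HTML5_META = Pattern.compile(
        "<meta\\s+charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern XML_DECL = Pattern.compile(
        "<\\?xml[^>]*encoding\\s*=\\s*[\"']([\\w-]+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset, if any of the three forms is present and valid. */
    public static Optional<Charset> sniff(byte[] prefix) {
        // Decode the prefix as ISO-8859-1: it maps every byte value, so the
        // ASCII markup we are matching against survives intact.
        String head = new String(prefix, StandardCharsets.ISO_8859_1);
        for (Pattern p : new Pattern[] {XML_DECL, HTML5_META, HTML_META}) {
            Matcher m = p.matcher(head);
            if (m.find()) {
                try {
                    return Optional.of(Charset.forName(m.group(1)));
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: keep looking.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        byte[] doc = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(doc).map(Charset::name).orElse("unknown")); // prints "UTF-8"
    }
}
```

A declaration-aware pass like this can be consulted before (or alongside) byte-statistics detection, so that an explicitly declared charset is not overridden by a statistical guess.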

      In addition, the TikaEncodingDetector sniffs only the first 12000 bytes of the document. If, for example, the first UTF-8 encoded character occurs later than that, the detector may misidentify the encoding as ISO-8859-1 or Windows-1252 instead of UTF-8, even when UTF-8 is specified in the page's meta charset element.
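This failure mode is easy to construct: if every byte inside the sniffed window is plain ASCII, a byte-statistics detector has no evidence to distinguish UTF-8 from ISO-8859-1 or Windows-1252, regardless of what the page declares. A small demonstration (the 12000-byte window size is the limit described above):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SniffWindowDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream doc = new ByteArrayOutputStream();
        doc.write("<meta charset=\"utf-8\">".getBytes(StandardCharsets.US_ASCII));
        // Pad with ASCII so the first non-ASCII byte lands past the sniff window.
        for (int i = doc.size(); i < 12000; i++) doc.write('a');
        doc.write("é".getBytes(StandardCharsets.UTF_8)); // first multi-byte sequence
        byte[] bytes = doc.toByteArray();

        boolean windowAllAscii = true;
        for (int i = 0; i < 12000; i++) {
            if ((bytes[i] & 0x80) != 0) windowAllAscii = false;
        }
        // A detector that only sees the first 12000 bytes finds no UTF-8
        // byte sequences here, despite the declaration at the top of the page.
        System.out.println(windowAllAscii); // prints "true"
    }
}
```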

      I have seen this problem occur with, e.g., the webpage http://losangeles.eventful.com/events/september, where the UTF-8 charset was properly specified at the top of the page, but the first UTF-8 encoded characters occurred far past the 12000 byte mark, in JSON-LD content towards the bottom of the page. The TikaEncodingDetector therefore misidentified the encoding as ISO-8859-1, and certain JSON-LD strings came out looking like gibberish.
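The "gibberish" effect reproduces in one line: each byte of a multi-byte UTF-8 sequence, decoded under the misidentified single-byte charset, becomes a separate Latin-1 character:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "café"; // 'é' is two bytes in UTF-8: 0xC3 0xA9
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        // Misdetected encoding: decode the UTF-8 bytes as ISO-8859-1.
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled); // prints "cafÃ©"
    }
}
```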


            People

              Assignee: hansbrende (Hans Brende)
              Reporter: hansbrende (Hans Brende)
              Votes: 0
              Watchers: 3
