Apache Any23 (Retired) / ANY23-411

Use Content-Type to help determine encoding

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.3
    • Component/s: encoding
    • Labels: None

    Description

      Incredibly enough, it seems that our encoding detector does not take the Content-Type header into account at all when trying to guess a document's charset encoding!

      This has caused a problem for me with the page: http://w3c.github.io/microdata-rdf/tests/0065.html

      Even though the Content-Type header is set to "text/html; charset=utf-8", we're guessing the charset to be "IBM500", which in turn renders the page as complete gibberish.

      This must be a bug in Tika, because even when I set the declared encoding of the charset detector to UTF-8, IBM500 is still the most confident result.

      Cf. https://issues.apache.org/jira/browse/TIKA-2771
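
To illustrate the behavior this issue asks for, here is a minimal sketch in plain Java of reading the charset parameter from a Content-Type header and preferring it over a statistical guess. It is not the fix that was committed for this issue and it does not use the Any23 or Tika APIs. The class name ContentTypeCharset and the methods declaredCharset and chooseCharset are hypothetical, and the policy of letting the declared charset always win is an assumption made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of Any23 or Tika.
public class ContentTypeCharset {

    /**
     * Returns the charset named by the "charset" parameter of a
     * Content-Type header value, or empty if the parameter is absent
     * or names a charset this JVM does not know.
     */
    static Optional<Charset> declaredCharset(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        // parts[0] is the media type; the remaining parts are parameters.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // Strip optional double quotes around the value.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            try {
                return Optional.of(Charset.forName(value));
            } catch (IllegalArgumentException e) {
                // Covers illegal and unsupported charset names.
                return Optional.empty();
            }
        }
        return Optional.empty();
    }

    /**
     * Assumed policy: trust the declared charset when there is one,
     * otherwise fall back to whatever the detector guessed.
     */
    static Charset chooseCharset(String contentType, Charset detected) {
        return declaredCharset(contentType).orElse(detected);
    }

    public static void main(String[] args) {
        // ISO-8859-1 stands in for a wrong statistical guess such as IBM500.
        Charset guessed = StandardCharsets.ISO_8859_1;
        System.out.println(chooseCharset("text/html; charset=utf-8", guessed));
        System.out.println(chooseCharset("text/html", guessed));
    }
}
```

A real detector might instead treat the declared charset as a strong prior and still override it when the bytes clearly contradict it. The sketch only shows that the header value is cheap to extract and should not be ignored.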

          People

            Assignee: Hans Brende (hansbrende)
            Reporter: Hans Brende (hansbrende)