TIKA-2671: HtmlEncodingDetector doesn't take provided metadata into account


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: detector

    Description

      org.apache.tika.parser.html.HtmlEncodingDetector ignores the document's metadata. When it is used to detect the charset of an HTML document that arrived with a conflicting charset declared at the transport layer (for example, in an HTTP Content-Type header), the encoding declared inside the file is used instead.

      This behavior does not conform to the rules specified by the W3C for determining the character encoding of HTML pages, which give transport-layer information precedence over declarations inside the document. It causes bugs similar to NUTCH-2599.

      If HtmlEncodingDetector is not meant to take meta-information about the document into account, then perhaps another detector should be provided: a CompositeDetector that includes, in this order:

      • a new, simple MetadataEncodingDetector that just returns the encoding declared in the metadata
      • the existing HtmlEncodingDetector
      • a generic detector, like UniversalEncodingDetector
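
The ordering proposed above can be sketched as follows. This is only an illustration of the precedence, not Tika code: the `Detector` interface, the `Map<String, String>` metadata keyed by `Content-Type`, and the class name `EncodingChainSketch` are all assumptions made for the example. The in-document step is a crude regex stand-in for HtmlEncodingDetector, and the constant UTF-8 fallback stands in for a statistical detector such as UniversalEncodingDetector.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncodingChainSketch {

    /** Hypothetical stand-in for an encoding detector; NOT Tika's EncodingDetector interface. */
    interface Detector {
        Optional<Charset> detect(byte[] content, Map<String, String> metadata);
    }

    // Matches "charset=NAME" in a Content-Type value or in a <meta> tag.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"'>]+)", Pattern.CASE_INSENSITIVE);

    /** Resolves a charset name, returning empty for illegal or unsupported names. */
    static Optional<Charset> lookup(String name) {
        try {
            return Optional.of(Charset.forName(name));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }

    /** Extracts and resolves the first charset parameter found in the text, if any. */
    static Optional<Charset> charsetParam(String text) {
        if (text == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET_PARAM.matcher(text);
        if (m.find()) {
            return lookup(m.group(1));
        }
        return Optional.empty();
    }

    /** Step 1: the charset declared in the (assumed) "Content-Type" metadata entry. */
    static final Detector METADATA =
            (content, metadata) -> charsetParam(metadata.get("Content-Type"));

    /** Step 2: crude in-document detection; scans the first 8192 bytes for charset=. */
    static final Detector HTML_META = (content, metadata) -> {
        int n = Math.min(content.length, 8192);
        return charsetParam(new String(content, 0, n, StandardCharsets.ISO_8859_1));
    };

    /** Step 3: generic fallback; a real chain would use a statistical detector here. */
    static final Detector FALLBACK =
            (content, metadata) -> Optional.of(StandardCharsets.UTF_8);

    /** Runs the detectors in order and returns the first answer. */
    static Charset detect(byte[] content, Map<String, String> metadata) {
        for (Detector d : List.of(METADATA, HTML_META, FALLBACK)) {
            Optional<Charset> result = d.detect(content, metadata);
            if (result.isPresent()) {
                return result.get();
            }
        }
        throw new IllegalStateException("FALLBACK always returns a charset");
    }

    public static void main(String[] args) {
        byte[] html = "<meta charset=\"ISO-8859-1\"><p>hi</p>"
                .getBytes(StandardCharsets.US_ASCII);
        // Transport-level charset conflicts with the in-document one and wins.
        System.out.println(detect(html, Map.of("Content-Type", "text/html; charset=UTF-8")));
        // Without metadata, the in-document declaration is used.
        System.out.println(detect(html, Map.of()));
    }
}
```

With this ordering, a document served as `text/html; charset=UTF-8` that declares ISO-8859-1 in a meta tag is decoded as UTF-8, and the in-document declaration only applies when the metadata carries no usable charset.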

          People

            Assignee: Unassigned
            Reporter: Gerard Bouchar (gbouchar)
