Tika / TIKA-539

Encoding detection is too biased by encoding in meta tag


Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.8, 0.9, 0.10
    • Fix Version/s: 1.17, 2.0.0-BETA, 2.1.0
    • Component/s: metadata, parser
    • Labels: None

    Description

      If the encoding declared in the meta tag is wrong, that encoding is still detected,
      even when the correct encoding has already been set in the metadata beforehand (for example, from the HTTP response header).

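      Test code to reproduce:

      import java.io.ByteArrayInputStream;
      import java.io.IOException;
      import java.io.InputStream;
      import org.apache.tika.exception.TikaException;
      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.AutoDetectParser;
      import org.apache.tika.parser.ParseContext;
      import org.apache.tika.sax.BodyContentHandler;
      import org.xml.sax.SAXException;

      // Class name is illustrative; the original report shows only the members.
      public class MetaCharsetRepro {

          // The bytes are UTF-8, but the meta tag claims iso-8859-1.
          static String content = "<html><head>\n"
                  + "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />"
                  + "</head><body>Über den Wolken\n</body></html>";

          /**
           * @param args
           * @throws IOException
           * @throws TikaException
           * @throws SAXException
           */
          public static void main(String[] args) throws IOException, SAXException, TikaException {
              Metadata metadata = new Metadata();
              metadata.set(Metadata.CONTENT_TYPE, "text/html");
              // Correct encoding supplied up front, e.g. from an HTTP response header.
              metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
              System.out.println(metadata.get(Metadata.CONTENT_ENCODING));

              InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
              AutoDetectParser parser = new AutoDetectParser();
              BodyContentHandler h = new BodyContentHandler(10000);
              parser.parse(in, h, metadata, new ParseContext());

              System.out.print(h.toString());
              // Encoding recorded in the metadata after parsing.
              System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
          }
      }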
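
      To illustrate why a wrong charset in the meta tag matters for this content, here is a minimal sketch that uses only the JDK and makes no Tika calls. The class name MetaCharsetMismatchDemo and the helper decodeUtf8BytesAs are hypothetical, introduced only for this illustration. It shows what happens to the body text from the report when its UTF-8 bytes are decoded as ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Tika.
public class MetaCharsetMismatchDemo {

    // Encodes the text as UTF-8, then decodes those bytes with the given charset.
    static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, charset);
    }

    public static void main(String[] args) {
        // "Über den Wolken", written with an escape so the source file encoding does not matter.
        String body = "\u00DCber den Wolken";

        String asUtf8 = decodeUtf8BytesAs(body, StandardCharsets.UTF_8);
        String asLatin1 = decodeUtf8BytesAs(body, StandardCharsets.ISO_8859_1);

        // Decoding with the right charset round-trips the text.
        System.out.println(asUtf8.equals(body));
        // Decoding as ISO-8859-1 turns the two UTF-8 bytes of U+00DC into two
        // separate characters, so the string grows by one character.
        System.out.println(asLatin1.length() - body.length());
    }
}
```

      With the wrong charset, the leading Ü is replaced by two unrelated characters, which is the kind of corruption a caller would see in the extracted text if the meta tag value is used in place of the correct encoding.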


          People

            Assignee: Kenneth William Krugler (kkrugler)
            Reporter: Reinhard Pötz (reinhard)

            Dates

              Created:
              Updated:
