Tika / TIKA-1001

Tika no longer seems to honor the HTTP meta tag for Arabic text in the ISO-8859-6 charset

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      The attached document extracts correctly in Tika 1.1.
      The attached document extracts incorrectly in Tika 1.2.

      The difference appears to be that Tika 1.1 honors the HTTP meta Content-Type tag, which specifies the charset as ISO-8859-6, and correctly converts the output to UTF-8.
      Tika 1.2 appears to ignore the charset specified in the meta tag.
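
      The expected behavior can be illustrated with a minimal standalone sketch. This is not Tika's code and does not use Tika's API; the sample page, the regular expression, and the helper names (`sniff_meta_charset`, `decode_honoring_meta`, `decode_ignoring_meta`) are all assumptions made for illustration. It compares decoding the same ISO-8859-6 bytes with the charset declared in the meta tag against decoding them with a fixed ISO-8859-1 fallback, which is one plausible way to end up with garbage output.

```python
import re

# Hypothetical sample page: Arabic body text stored as ISO-8859-6 bytes,
# with the charset declared in an http-equiv meta tag.
ARABIC = "مرحبا"
HTML_BYTES = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=iso-8859-6"></head><body>'
    + ARABIC.encode("iso-8859-6")
    + b"</body></html>"
)

# Simplified pattern (an assumption, not a full HTML encoding sniffer).
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def sniff_meta_charset(raw: bytes, default: str = "iso-8859-1") -> str:
    """Return the first charset declared near the top of the page, else the default."""
    match = META_CHARSET.search(raw[:1024])
    return match.group(1).decode("ascii").lower() if match else default


def decode_honoring_meta(raw: bytes) -> str:
    """Decode using the charset declared in the meta tag."""
    return raw.decode(sniff_meta_charset(raw))


def decode_ignoring_meta(raw: bytes) -> str:
    """Decode with a fixed fallback charset, ignoring the meta tag."""
    return raw.decode("iso-8859-1")


if __name__ == "__main__":
    good = decode_honoring_meta(HTML_BYTES)
    bad = decode_ignoring_meta(HTML_BYTES)
    print("meta charset:", sniff_meta_charset(HTML_BYTES))
    print("Arabic preserved when meta is honored:", ARABIC in good)
    print("Arabic preserved when meta is ignored:", ARABIC in bad)
    # Once decoded correctly, re-encoding as UTF-8 is lossless.
    utf8_bytes = good.encode("utf-8")
    print("UTF-8 round trip ok:", utf8_bytes.decode("utf-8") == good)
```

      In this sketch the only difference between the two code paths is which charset is used for decoding; the input bytes are identical, which matches the symptom described above.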

      Some experimentation seems to indicate that the problem is the charset handling.

      It doesn't matter which mode Tika is used in (server, app mode, etc.). Even if the content type is specified with a charset, the output is still garbage.

      Attachments

        1. TIKA-1001v1.tar.gz
          4 kB
          Tim Allison
        2. badarabic.html
          2 kB
          david lemon

          People

            Assignee: Unassigned
            Reporter: david lemon (david_lemon)
            Votes: 0
            Watchers: 2
