Tika / TIKA-1001

Tika no longer seems to honor the HTTP meta tag for Arabic text in the ISO-8859-6 charset

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      The attached document extracts correctly in Tika 1.1 but incorrectly in Tika 1.2.

      The difference appears to be that Tika 1.1 honors the HTTP meta Content-Type tag, which specifies the charset as ISO-8859-6, and correctly converts the output to UTF-8.
      Tika 1.2 appears to ignore the charset specified in the meta tag.

      Some noodling seems to indicate that the problem is the charset.

      It doesn't matter which mode Tika is used in (server, app mode, etc.): even if Content-Type is specified with a charset, the output is still garbage.
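
      To illustrate what "honoring the meta tag" would mean here, the following is a minimal sketch using only the Java standard library. It is not Tika's actual detection code; the class name, the regular expression, and the 8192-byte sniffing window are assumptions made for illustration, and it assumes the JDK in use ships the ISO-8859-6 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: decodes HTML bytes using the
// charset declared in a <meta> tag, falling back to a default otherwise.
class MetaCharsetSketch {
    // Assumed pattern: matches charset=NAME anywhere inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_.:+-]*)",
            Pattern.CASE_INSENSITIVE);

    static Charset sniffCharset(byte[] html, Charset fallback) {
        // The markup itself is ASCII, so a Latin-1 view of the first bytes
        // is enough to locate the tag without knowing the real charset yet.
        int window = Math.min(html.length, 8192);
        String head = new String(html, 0, window, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return fallback;
    }

    static String decode(byte[] html, Charset fallback) {
        return new String(html, sniffCharset(html, fallback));
    }

    public static void main(String[] args) {
        // Build a small page with Arabic text, encoded as ISO-8859-6.
        String page = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=iso-8859-6\"></head>"
                + "<body>\u0645\u0631\u062D\u0628\u0627</body></html>";
        byte[] bytes = page.getBytes(Charset.forName("ISO-8859-6"));

        // Honoring the meta tag round-trips the text.
        System.out.println(page.equals(decode(bytes, StandardCharsets.UTF_8)));
        // Ignoring it (here: forcing Latin-1) does not.
        System.out.println(page.equals(new String(bytes, StandardCharsets.ISO_8859_1)));
    }
}
```

      Decoding the same bytes with any charset other than the declared one yields the kind of garbage output described above, which is consistent with the charset being ignored rather than the text extraction itself failing.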

      1. TIKA-1001v1.tar.gz
        4 kB
        Tim Allison
      2. badarabic.html
        2 kB
        david lemon

        Issue Links

          Activity

          david lemon created issue -
          david lemon made changes -
          Field Original Value New Value
          Attachment badarabic.html [ 12547599 ]
          david lemon made changes -
          Description: appended the final paragraph (the output is garbage regardless of Tika mode, even when Content-Type is specified with a charset); the rest of the text is unchanged from the Description above.
          david lemon made changes -
          Link This issue is related too TIKA-593 [ TIKA-593 ]
          david lemon made changes -
          Link This issue is related too TIKA-593 [ TIKA-593 ]
          david lemon made changes -
          Link This issue is related too TIKA-431 [ TIKA-431 ]
          Gavin made changes -
          Link This issue is related to TIKA-431 [ TIKA-431 ]
          Gavin made changes -
          Link This issue is related to TIKA-431 [ TIKA-431 ]
          Tim Allison made changes -
          Attachment TIKA-1001v1.tar.gz [ 12597480 ]
          Tim Allison made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Tim Allison made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Unassigned
              Reporter:
              david lemon
            • Votes:
              0
              Watchers:
              2

              Dates

              • Created:
                Updated:
                Resolved:
