[TIKA-1001] Tika no longer seems to honor HTTP meta tag for Arabic text in ISO-8859-6 charset - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.2
Fix Version/s: None
Component/s: parser
Labels: None

Description

The attached document extracts correctly in Tika 1.1.
The attached document extracts incorrectly in Tika 1.2.

The difference appears to be that Tika 1.1 honors the HTTP meta Content-Type tag, which specifies the charset as ISO-8859-6, and correctly converts the output to UTF-8.
Tika 1.2 appears to ignore the charset specified in the meta tag.
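
The difference between honoring and ignoring the declared charset can be illustrated with a minimal sketch in plain Java, independent of Tika. This is not Tika's detection code: the class `MetaCharsetSketch`, its helpers `sniffCharset`, `decode`, and `samplePage`, the regular expression, and the sample page are all hypothetical, and the sketch assumes the running JDK ships the ISO-8859-6 charset. It reads the charset from the meta tag, decodes the bytes with it, and compares that with a decode that ignores the declaration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Tika.
final class MetaCharsetSketch {
    // Arabic sample text, written with Unicode escapes so the source stays ASCII.
    static final String GREETING = "\u0645\u0631\u062D\u0628\u0627";

    // Matches charset=<name> inside a <meta ...> tag (simplified, assumed pattern).
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_\\-]*)",
            Pattern.CASE_INSENSITIVE);

    // Builds a small HTML page that declares ISO-8859-6 in its meta tag.
    static String samplePage() {
        return "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=iso-8859-6\"></head>"
                + "<body>" + GREETING + "</body></html>";
    }

    // Returns the charset declared in the meta tag, or the fallback if none is usable.
    static Charset sniffCharset(byte[] html, Charset fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII markup is readable
        // no matter which charset the body actually uses.
        int len = Math.min(html.length, 8192);
        String head = new String(html, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return fallback;
    }

    // Decodes the page using the declared charset when one is found.
    static String decode(byte[] html, Charset fallback) {
        return new String(html, sniffCharset(html, fallback));
    }

    public static void main(String[] args) {
        byte[] bytes = samplePage().getBytes(Charset.forName("ISO-8859-6"));
        // Honoring the meta tag recovers the Arabic text.
        System.out.println(decode(bytes, StandardCharsets.UTF_8).contains(GREETING));
        // Ignoring it and assuming ISO-8859-1 yields unrelated Latin characters.
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1).contains(GREETING));
    }
}
```

In this sketch the first check succeeds because the bytes are decoded as ISO-8859-6, while the second decode produces Latin-1 characters in place of the Arabic letters, which matches the kind of garbage output reported here.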

Some experimentation suggests that the problem lies in the charset handling.

It doesn't matter which mode Tika is used in (server, app mode, etc.). Even if the content-type is specified with a charset, the output is still garbage.

Attachments

TIKA-1001v1.tar.gz, 4 kB, attached 12/Aug/13 14:22 by Tim Allison
badarabic.html, 2 kB, attached 03/Oct/12 21:29 by david lemon

Issue Links

is related to

TIKA-431 (Resolved): Tika currently misuses the HTTP Content-Encoding header, and does not seem to use the charset part of the Content-Type header properly.

People

Assignee: Unassigned

Reporter: david lemon

Votes: 0

Watchers: 2

Dates

Created: 03/Oct/12 21:28

Updated: 16/Aug/13 14:34

Resolved: 16/Aug/13 11:50