[TIKA-469] The Parser is not correctly outputting Arabic text documents - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.7
Fix Version/s: None
Component/s: parser
Labels: None
Environment: Windows XP

Description

The parser does not preserve the character encoding when parsing Arabic UTF-8 documents, specifically .pdf and .doc files. The resulting output is undecipherable or consists only of question marks.
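
For illustration only, the two symptoms described above (question marks and undecipherable characters) can be reproduced with the Java standard library alone, without Tika. This is a minimal sketch under the assumption that the text passes through a non-UTF-8 charset somewhere between extraction and output. The issue does not confirm this as the cause. The class name `EncodingLossDemo`, its helper methods, and the sample word are hypothetical, and ISO-8859-1 merely stands in for a legacy platform default charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingLossDemo {

    // Encode the text with the given charset, then decode it with the same charset.
    // Characters the charset cannot represent are replaced during encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // Encode the text as UTF-8, then decode the bytes with a different charset.
    // This yields one garbled character per UTF-8 byte for single-byte charsets.
    static String misdecode(String text, Charset wrongCharset) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrongCharset);
    }

    public static void main(String[] args) {
        // Hypothetical sample: a three-letter Arabic word written with Unicode escapes.
        String arabic = "\u062D\u0645\u0649";

        // Assumption: a legacy single-byte charset is used when writing the output.
        String questionMarks = roundTrip(arabic, StandardCharsets.ISO_8859_1);
        System.out.println(questionMarks);

        // Assumption: UTF-8 bytes are read back with the wrong charset.
        String garbled = misdecode(arabic, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());

        // Using UTF-8 on both sides preserves the text.
        System.out.println(roundTrip(arabic, StandardCharsets.UTF_8).equals(arabic));
    }
}
```

If this assumption holds, writing the extracted text with an explicit UTF-8 writer rather than the platform default would be one thing to check when reproducing the problem.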

Attachments

ASF.LICENSE.NOT.GRANTED--TEST_WORD.doc (29 kB), attached 03/Aug/10 20:45 by Robert Cullen
ASF.LICENSE.NOT.GRANTED--fever_factsheet_arabic.pdf (37 kB), attached 03/Aug/10 20:45 by Robert Cullen

People

Assignee: Unassigned

Reporter: Robert Cullen

Votes: 1

Watchers: 1

Dates

Created: 22/Jul/10 14:51

Updated: 24/Apr/13 16:03