[TIKA-2257] Arabic vowel marks displaced when reading from PDF - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.14
Fix Version/s: None
Component/s: parser
Labels: None

Description

The attached file, in its second line, contains “العَرَبِيَّة”. It was created on Mac OS X 10.11.6 by selecting “Save As PDF” from the system print dialog started from Microsoft Word.

When the text is read from the PDF file, the short-a vowel marks (fatha, U+064E) are displaced. Tika returns
U+0627 U+0644 U+064E U+0639 U+064E U+0631 U+0628 U+0650 U+06CC U+0651 U+064E U+0629 instead of the expected
U+0627 U+0644 U+0639 U+064E U+0631 U+064E U+0628 U+0650 U+064A U+064E U+0651 U+0629 (الَعَربِیَّة instead of العَرَبِيَّة).
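The comparison above can be reproduced without Tika. The following is a minimal Java sketch, not part of the original report, that prints a string as U+XXXX code points and reports the first position where two strings differ. The class name `CodePointDiff` and its helper methods are hypothetical names chosen for illustration; the two constants are the sequences quoted above, written as escapes.

```java
import java.util.stream.Collectors;

public class CodePointDiff {
    // Sequences copied from the description above. They are written as
    // escapes so that the encoding of this source file cannot alter them.
    static final String ACTUAL =
        "\u0627\u0644\u064E\u0639\u064E\u0631\u0628\u0650\u06CC\u0651\u064E\u0629";
    static final String EXPECTED =
        "\u0627\u0644\u0639\u064E\u0631\u064E\u0628\u0650\u064A\u064E\u0651\u0629";

    /** Formats a string as space-separated code points in U+XXXX form. */
    static String toCodePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    /** Returns the code point index of the first mismatch, or -1 if equal. */
    static int firstDifference(String a, String b) {
        int[] x = a.codePoints().toArray();
        int[] y = b.codePoints().toArray();
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            if (x[i] != y[i]) {
                return i;
            }
        }
        // One string is a prefix of the other, or they are identical.
        return x.length == y.length ? -1 : n;
    }

    public static void main(String[] args) {
        System.out.println("actual:   " + toCodePoints(ACTUAL));
        System.out.println("expected: " + toCodePoints(EXPECTED));
        System.out.println("first difference at code point index "
                + firstDifference(ACTUAL, EXPECTED));
    }
}
```

For the two sequences in this report, the first mismatch is at index 2 (counting from 0), where the returned text has U+064E and the expected text has U+0639.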

Here is the (MATLAB) code used for reading:

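% fullname holds the path to the attached PDF file.
handler = org.apache.tika.sax.ToXMLContentHandler();
parser = org.apache.tika.parser.AutoDetectParser();
metadata = org.apache.tika.metadata.Metadata();
fh = java.io.FileInputStream(fullname);
% Parse the stream; the extracted content accumulates in the handler.
parser.parse(fh, handler, metadata);
% Convert the handler's XML output to a MATLAB string.
s = string(handler.toString());
fh.close();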