[PDFBOX-679] Corruption of Arabic output due to Japanese bug fix - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.1.0
Fix Version/s: 1.2.0
Component/s: None
Labels: None

Description

The recent Japanese bug fix in org.apache.pdfbox.pdmodel.font.PDFont defines a set of encoding names that are given special CJK treatment. This set is too broad. For example, it stipulates that the 'Identity-H' encoding should be processed as JIS.

We have Arabic PDFs where compound Arabic glyphs use the 'Identity-H' encoding. With PDFBox 1.0.0 they produced Arabic output, but now they produce garbage, because the Arabic Unicode data is sent to the CJK converter.

I've copied that description from the users mailing list [1]

[1] http://markmail.org/thread/w5iof5hr3yqhthsp
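
The failure mode in the description can be illustrated with a toy example that uses only the JDK. This is a sketch under assumptions, not the PDFont code: the class and method names are hypothetical, the sample word is arbitrary, and the real converter in PDFBox may work differently. It only shows that bytes holding Arabic Unicode data stop being Arabic once a JIS-style decoder is applied to them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; none of these names come from PDFBox.
class CjkMisdecodeSketch {

    /** Decodes the bytes with the given charset, standing in for a converter. */
    static String convert(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    /**
     * Shift_JIS is an optional charset in a Java runtime. If it is missing,
     * US-ASCII is used as a stand-in so that the sketch still runs.
     */
    static Charset jisLikeCharset() {
        return Charset.isSupported("Shift_JIS")
                ? Charset.forName("Shift_JIS")
                : StandardCharsets.US_ASCII;
    }

    public static void main(String[] args) {
        // An arbitrary Arabic sample word, written with Unicode escapes.
        String arabic = "\u0645\u0631\u062D\u0628\u0627";
        byte[] unicodeBytes = arabic.getBytes(StandardCharsets.UTF_16BE);

        // Wrong path: Unicode data pushed through the JIS-style decoder.
        String wrong = convert(unicodeBytes, jisLikeCharset());
        // Right path: the data is decoded as what it actually is.
        String right = convert(unicodeBytes, StandardCharsets.UTF_16BE);

        System.out.println(arabic.equals(wrong)); // prints "false"
        System.out.println(arabic.equals(right)); // prints "true"
    }
}
```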

Attachments

zzz.pdf (247 kB), attached by Yigal Dayan on 06/Apr/10 10:04

Activity

Yigal Dayan added a comment - 06/Apr/10 10:04

Hi Takashi,

I'm attaching an Arabic PDF to use as a test case.

Yigal

Andreas Lehmkühler added a comment - 06/Apr/10 17:07

With version 931224, the converter is no longer used if a Unicode mapping is given.

@Yigal: Please test that version, if possible. I guess it is easier for you to check the result, as I'm not able to read Arabic.
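
The rule stated in this comment (skip the converter whenever a Unicode mapping is given) can be sketched roughly as follows. This is not the committed change: the class, the method, its parameters, and the contents of the encoding set are all hypothetical, and only 'Identity-H' is taken from this issue.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical sketch of the decision order; not the actual PDFont code.
class ToUnicodeFirstSketch {

    // Assumed contents. The real set in PDFont is not shown in this issue.
    static final Set<String> CJK_ENCODINGS =
            new HashSet<>(Arrays.asList("90ms-RKSJ-H", "Identity-H"));

    /**
     * @param encodingName  name of the font encoding
     * @param toUnicodeText text from the font's Unicode mapping, or null if none
     * @param rawText       text obtained without any mapping
     * @param cjkConverter  stand-in for the CJK conversion step
     */
    static String decode(String encodingName, String toUnicodeText,
                         String rawText, UnaryOperator<String> cjkConverter) {
        if (toUnicodeText != null) {
            // A Unicode mapping is given: trust it and skip the converter.
            return toUnicodeText;
        }
        if (CJK_ENCODINGS.contains(encodingName)) {
            // No mapping: fall back to the CJK treatment for listed encodings.
            return cjkConverter.apply(rawText);
        }
        return rawText;
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062D\u0628\u0627";
        UnaryOperator<String> fakeConverter = s -> "<converted>";

        // With a mapping, 'Identity-H' text is returned untouched.
        System.out.println(arabic.equals(
                decode("Identity-H", arabic, arabic, fakeConverter))); // prints "true"
        // Without a mapping, the converter still runs.
        System.out.println(
                decode("Identity-H", null, "raw", fakeConverter)); // prints "<converted>"
    }
}
```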

Yigal Dayan added a comment - 07/Apr/10 11:51

Version 931224 fixed this issue, thanks!

The other two issues I mentioned in the mail are still not fixed, so I'll go ahead and open bugs for them. Unfortunately I can't provide a fix, but I'll attach a PDF plus the results before and after the fix, to make the problem stand out.

Andreas Lehmkühler added a comment - 07/Apr/10 16:46

Thanks for the feedback!

Setting this to resolved.

Franklin added a comment - 16/Jun/11 13:07

I tried extracting Arabic text like this:

// 'is' is an InputStream opened on the PDF file
PDFParser parser = new PDFParser(is);
parser.parse();
COSDocument cosDoc = parser.getDocument();

// Wrap the parsed document and extract its text
PDFTextStripper stripper = new PDFTextStripper();
PDDocument document = new PDDocument(cosDoc);
String docText = stripper.getText(document);
document.close();

The docText is proper Arabic when I try it with your PDF file (zzz.pdf).

However, I get ????? when I try it with my own PDF file.

Is there any reason for this?

Andreas Lehmkühler added a comment - 16/Jun/11 17:43

First of all, please don't post on issues that are already closed!

Try to use a proper encoding. Have a look at org.apache.pdfbox.ExtractText for further details. If the issue still exists, file a new issue on JIRA and attach the PDF in question.
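
As background for the encoding hint, one common way to end up with question marks is to write correctly extracted text through a charset that cannot represent Arabic. Whether that is what happened with the PDF reported here cannot be told from this thread. The sketch below uses only the JDK; the class and method names are hypothetical and the sample word is arbitrary.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of an output encoding that cannot hold Arabic.
class OutputEncodingSketch {

    /** Encodes the text as it would be written out, then reads it back. */
    static String roundTrip(String text, Charset outputCharset) {
        // getBytes replaces every unmappable character with the charset's
        // replacement byte, which is '?' for ISO-8859-1.
        byte[] written = text.getBytes(outputCharset);
        return new String(written, outputCharset);
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062D\u0628\u0627";

        // ISO-8859-1 has no Arabic letters, so each one becomes '?'.
        System.out.println(roundTrip(arabic, StandardCharsets.ISO_8859_1)); // prints "?????"

        // UTF-8 can represent the text, so it survives unchanged.
        System.out.println(roundTrip(arabic, StandardCharsets.UTF_8).equals(arabic)); // prints "true"
    }
}
```

If the extracted string is already wrong before it is written anywhere, the output encoding is not the cause, and a new issue with the PDF attached is the better route.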

Franklin added a comment - 17/Jun/11 06:56

Extremely sorry about that.

Thanks for the hints. I shall look into that and get back.

Regards,
Franklin

People

Assignee: Unassigned
Reporter: Andreas Lehmkühler
Votes: 0
Watchers: 0

Dates

Created: 05/Apr/10 13:10
Updated: 17/Jun/11 06:56
Resolved: 07/Apr/10 16:46
