PDFBox / PDFBOX-679

Corruption of Arabic output due to Japanese bug fix

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.2.0
    • Component/s: None
    • Labels: None

    Description

      The recent Japanese bug fix in org.apache.pdfbox.pdmodel.font.PDFont defines a set of encoding names that are given special CJK treatment. This set is too broad. For example, it stipulates that the 'Identity-H' encoding should be processed as JIS.

      We have Arabic PDFs where compound Arabic glyphs use the 'Identity-H' encoding. In PDFBox 1.0.0 they were output as Arabic, but now they come out as garbage, because the Arabic Unicode data is sent to the CJK converter.
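
      To make the failure mode concrete, here is a minimal sketch of a name-based routing decision like the one described above. It is not the actual PDFont source: the class name, the method name, and the contents of the set are assumptions chosen for illustration. The point is only that 'Identity-H' says nothing about the language of the text, so including it in a CJK set misroutes Arabic fonts as well.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; names and set contents are illustrative, not PDFBox code.
class BroadCjkSetSketch {
    // The first two names are Japanese-specific CMaps. Identity-H only states
    // that two-byte character codes map one-to-one to CIDs, in any script.
    static final Set<String> CJK_ENCODINGS = new HashSet<>(
            Arrays.asList("90ms-RKSJ-H", "UniJIS-UCS2-H", "Identity-H"));

    // Decides by encoding name alone whether text is sent to the CJK converter.
    static boolean routedToCjkConverter(String encodingName) {
        return CJK_ENCODINGS.contains(encodingName);
    }

    public static void main(String[] args) {
        // An Arabic font that uses Identity-H is misrouted purely because of its name.
        System.out.println(routedToCjkConverter("Identity-H"));
        // An encoding outside the set is left alone.
        System.out.println(routedToCjkConverter("WinAnsiEncoding"));
    }
}
```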

      I've copied that description from the users mailing list [1]

      [1] http://markmail.org/thread/w5iof5hr3yqhthsp

      Attachments

        1. zzz.pdf
          247 kB
          Yigal Dayan

        Activity

          ydayan Yigal Dayan added a comment -

          Hi Takashi,

          I'm attaching an Arabic PDF used as a testcase.

          Yigal


          lehmi Andreas Lehmkühler added a comment -

          With version 931224, the converter isn't used if a Unicode mapping is given.

          @Yigal: Please test that version, if possible. I guess it is easier for you to check the result, as I'm not able to read Arabic.

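
          The rule stated in this comment (skip the converter whenever a Unicode mapping is available) can be sketched roughly as follows. This is an assumption-laden illustration, not the committed change: `resolve`, its parameters, and the converter function are hypothetical names, and the real code in PDFont may be structured quite differently.

```java
import java.util.function.UnaryOperator;

// Hypothetical sketch of the guard; not the actual PDFBox change.
class ConverterGuardSketch {
    // mappedUnicode: result of the font's Unicode mapping, or null if none exists.
    // rawCode: the undecoded character code as a string.
    // cjkEncoding: whether the encoding name is in the CJK set.
    static String resolve(String mappedUnicode, String rawCode,
                          boolean cjkEncoding, UnaryOperator<String> cjkConverter) {
        if (mappedUnicode != null) {
            // A Unicode mapping is given, so the converter is never consulted.
            return mappedUnicode;
        }
        if (cjkEncoding) {
            // No mapping: fall back to the CJK converter for CJK encodings.
            return cjkConverter.apply(rawCode);
        }
        return rawCode;
    }

    public static void main(String[] args) {
        UnaryOperator<String> fakeConverter = code -> "<converted:" + code + ">";
        // Arabic letter seen (U+0633) with a mapping: returned untouched.
        System.out.println(resolve("\u0633", "0x01", true, fakeConverter));
        // No mapping and a CJK encoding: the converter is applied.
        System.out.println(resolve(null, "0x01", true, fakeConverter));
    }
}
```

          With this ordering, an Arabic font that uses 'Identity-H' and carries a Unicode mapping is unaffected by whether its encoding name is in the CJK set.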
          ydayan Yigal Dayan added a comment -

          Version 931224 fixed this issue, thanks!

          The other two issues I mentioned in the mail are still not fixed, so I'll go ahead and open bugs on them. Unfortunately I can't provide the fix, but I'll attach a PDF plus the results before and after the fix, to make the problem stand out.


          lehmi Andreas Lehmkühler added a comment -

          Thanks for the feedback!

          Set this to resolved

          frankee787 Franklin added a comment -

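          I have tried the extraction of Arabic like this:

          // PDFBox 1.x package names
          import org.apache.pdfbox.cos.COSDocument;
          import org.apache.pdfbox.pdfparser.PDFParser;
          import org.apache.pdfbox.pdmodel.PDDocument;
          import org.apache.pdfbox.util.PDFTextStripper;

          // "is" is an InputStream opened on the PDF file
          PDFParser parser = new PDFParser(is);
          parser.parse();
          COSDocument cosDoc = parser.getDocument();

          PDDocument document = new PDDocument(cosDoc);
          PDFTextStripper stripper = new PDFTextStripper();
          String docText = stripper.getText(document);
          document.close(); // release the underlying COSDocument

          The docText is proper Arabic when I try it out with your PDF file (zzz.pdf).

          However, I get ????? when I try it out with my PDF file.

          Is there any reason for this?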


          lehmi Andreas Lehmkühler added a comment -

          First of all, please don't use already closed issues!!

          Try to use a proper encoding. Have a look at org.apache.pdfbox.ExtractText for further details. If the issue still exists, file a new issue on JIRA and attach the PDF in question.

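
          One common way to end up with question marks is to encode correctly extracted text with a charset that has no Arabic characters. Whether that is what happened here is an assumption; the thread does not confirm it. The sketch below uses only the JDK (no PDFBox calls), and the class and method names are hypothetical. It shows that encoding Arabic as ISO-8859-1 replaces every letter with '?', while UTF-8 preserves the text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: shows how the output charset alone can produce "?????".
class OutputEncodingSketch {
    // Encodes the text with the given charset and decodes it again, which is
    // what effectively happens when text is written to a file and read back.
    static String roundTrip(String text, Charset charset) {
        // String.getBytes(Charset) substitutes the charset's replacement byte
        // ('?' for ISO-8859-1) for every character it cannot represent.
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062D\u0628\u0627"; // five Arabic letters
        // ISO-8859-1 cannot represent Arabic, so each letter is lost.
        System.out.println(roundTrip(arabic, StandardCharsets.ISO_8859_1));
        // UTF-8 can represent every Unicode character, so the text survives.
        System.out.println(roundTrip(arabic, StandardCharsets.UTF_8).equals(arabic));
    }
}
```

          If the extracted string itself already contains question marks before any output step, the cause lies elsewhere and a new issue with the PDF attached is the right next step.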
          frankee787 Franklin added a comment -

          Extremely sorry about that.

          Thanks for the hints. I shall look into that and get back.

          Regards,
          Franklin


          People

            Assignee: Unassigned
            Reporter: Andreas Lehmkühler (lehmi)
            Votes: 0
            Watchers: 0
