PDFBox / PDFBOX-684

Incorrect ordering of compound Arabic glyphs

Details

    Description

      Some Arabic PDFs contain compound glyphs for stylistic reasons.
      Such glyphs encode two letters: FI, SI, LI, LJ, LM, etc.

      Before a line is passed to the bidirectional (bidi) algorithm, its characters have already been sorted into visual order, with the exception of these pairs: each pair is decoded from a single glyph, so it is handled as one unit and keeps its original (logical) order. The bidi algorithm then puts most characters into the correct order but reverses the letters within each glyph pair.
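
      The effect can be illustrated with a minimal sketch. It assumes a pure right-to-left run, for which the bidi step is approximated by a plain character reversal; the class name, method name, and the three-letter sequence (LAM, ALEF, MEEM) are hypothetical and chosen only for illustration.

```java
class LigatureOrderDemo {
    // Simplified stand-in for the bidi step on a pure right-to-left run:
    // visual order is converted to logical order by reversing the characters.
    // (Assumption: the real bidi algorithm is far more involved.)
    static String visualToLogical(String visualLine) {
        return new StringBuilder(visualLine).reverse().toString();
    }

    public static void main(String[] args) {
        // Intended logical text: LAM, ALEF, MEEM
        String logical = "\u0644\u0627\u0645";

        // One compound glyph decodes to LAM + ALEF, in logical order.
        String pair = "\u0644\u0627";

        // The line is in visual order (MEEM first), but the pair is not.
        String visualLine = "\u0645" + pair;
        System.out.println(visualToLogical(visualLine).equals(logical)); // false

        // Reversing the pair before the bidi step restores the text.
        String fixedLine = "\u0645" + visualToLogical(pair);
        System.out.println(visualToLogical(fixedLine).equals(logical)); // true
    }
}
```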

      To fix this, the output of font.encode() should be examined and, if it contains a pair of Arabic characters, reversed on the spot. This may require adding a stub method to PDFStreamEngine (called from processEncodedText) that PDFTextStripper can override (in sort mode only).
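
      A minimal sketch of the proposed fix, under stated assumptions: the classes below are hypothetical stand-ins and not the actual PDFBox classes or the attached patches. StreamEngineSketch plays the role of PDFStreamEngine with a stub hook, TextStripperSketch plays the role of PDFTextStripper overriding it in sort mode, and the test for "Arabic" is assumed to be a Unicode block check.

```java
class ArabicPairFix {
    // Assumption: a character counts as Arabic if it lies in the basic
    // Arabic block or in one of the two presentation-form blocks.
    static boolean isArabic(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.ARABIC
            || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
            || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    // Reverses the decoded text of one glyph if it consists of two or
    // more Arabic characters; any other text is returned unchanged.
    static String reverseIfArabicCompound(String decoded) {
        if (decoded == null || decoded.codePointCount(0, decoded.length()) < 2) {
            return decoded;
        }
        if (!decoded.codePoints().allMatch(ArabicPairFix::isArabic)) {
            return decoded;
        }
        return new StringBuilder(decoded).reverse().toString();
    }
}

// Hypothetical stand-in for the stream engine.
class StreamEngineSketch {
    // Stub hook: the base engine leaves the decoded text untouched.
    protected String inspectDecodedText(String decoded) {
        return decoded;
    }

    // Stand-in for the point where the decoded glyph text is available.
    String handleDecoded(String decoded) {
        return inspectDecodedText(decoded);
    }
}

// Hypothetical stand-in for the text stripper.
class TextStripperSketch extends StreamEngineSketch {
    private final boolean sortByPosition;

    TextStripperSketch(boolean sortByPosition) {
        this.sortByPosition = sortByPosition;
    }

    @Override
    protected String inspectDecodedText(String decoded) {
        // Only reverse in sort mode, as the description suggests.
        return sortByPosition
            ? ArabicPairFix.reverseIfArabicCompound(decoded)
            : decoded;
    }
}
```

      Keeping the hook a no-op in the base class means that only a stripper running in sort mode changes its output.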

      Attachments

        1. zzz.pdf (247 kB, Yigal Dayan)
        2. zzz.before_fix.txt (17 kB, Yigal Dayan)
        3. zzz.after_fix.txt (17 kB, Yigal Dayan)
        4. PDFTextStripper.patch (1 kB, Yigal Dayan)
        5. PDFStreamEngine.patch (1 kB, Yigal Dayan)

            People

              jukkaz Jukka Zitting
              ydayan Yigal Dayan

                Time Tracking

                  Estimated: 3h
                  Remaining: 3h
                  Logged: Not Specified