PDFBox / PDFBOX-4531

Extraction of Arabic PDF has incorrect ordering of normalized ligatures


Details

    Description

      As reported by Elias Peterson on the mailing list:

      I think I'm seeing some issues concerning the handling of the Arabic lam-with-alef ligature. I'm attempting to process the PDF here:
      https://www.rand.org/content/dam/rand/pubs/perspectives/PE100/PE122/RAND_PE122z1.arabic.pdf

      When I run the ExtractText command with 2.0.15 I get the following:
      $ java -jar pdfbox-app-2.0.15.jar ExtractText -encoding UTF-8 RAND_PE122z1.arabic.pdf output.txt
      $ head output.txt
      C O R P O R A T I O N
      منظور تحليلي
      رؤى خبير بشأن قضايا السياسات اآلنية
      االتفاق مع إيران
      األيام التي تلي
      ...

      The problem is in the last two lines of the snippet above. My understanding is that the ligature لا was normalized, but the two letters that compose it came out in the wrong order. PDFBOX-684 sounded similar, so I ran the same PDF through 1.8.16. There the ligature is normalized the way I think is expected (although the interspersed English-language words are reversed).

      $ java -jar pdfbox-app-1.8.16.jar ExtractText -encoding UTF-8 RAND_PE122z1.arabic.pdf output.txt
      ...
      $ head output.txt
      N O I T A R O P R O C
      منظور تحليلي
      رؤى خبير بشأن قضايا السياسات الآنية
      الاتفاق مع إيران
      الأيام التي تلي
      ...
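
The difference between the two outputs can be reproduced outside PDFBox. The following is a minimal sketch, not PDFBox's actual code path: it assumes the extracted glyphs arrive in visual (left-to-right) order and contain the lam-alef ligature as the single presentation-form character U+FEFB. The class and method names are hypothetical. It shows that decomposing the ligature before reversing the right-to-left run swaps its two letters, while reversing first keeps them in order.

```java
import java.text.Normalizer;

class LigatureOrderSketch {

    // Hypothetical helper: reverse a run of BMP characters, turning a purely
    // right-to-left run from visual order into logical order.
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // NFKC replaces the presentation form U+FEFB (lam-alef ligature)
    // with its two letters: lam (U+0644) followed by alef (U+0627).
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Decompose while still in visual order, then reverse:
    // the reversal also flips lam and alef.
    static String decomposeThenReverse(String visual) {
        return reverse(decompose(visual));
    }

    // Reverse to logical order first, then decompose:
    // lam and alef stay in reading order.
    static String reverseThenDecompose(String visual) {
        return decompose(reverse(visual));
    }

    public static void main(String[] args) {
        // The first word of the fourth output line in visual order:
        // qaf, alef, feh, teh, lam-alef ligature (U+FEFB), alef.
        String visual = "\u0642\u0627\u0641\u062A\uFEFB\u0627";
        System.out.println(decomposeThenReverse(visual));
        System.out.println(reverseThenDecompose(visual));
    }
}
```

Under these assumptions, the first call yields االتفاق, the ordering seen in the 2.0.15 output, and the second yields الاتفاق, the ordering seen in the 1.8.16 output. Whether this is the actual cause in 2.0.15 is not established by the report.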

      Attachments

        1. content_diffs_with_exceptions.xlsx
          744 kB
          Tilman Hausherr
        2. bidi-ligature-2.pdf
          39 kB
          Masaki Komedani
        3. bidi-ligature-1.pdf
          5 kB
          Masaki Komedani
        4. bidi-ligature.patch
          2 kB
          Masaki Komedani
        5. diff-output.zip
          109 kB
          Tilman Hausherr
        6. PDFBOX-679-toobig.pdf
          247 kB
          Tilman Hausherr
        7. artikel1_20_arab.pdf
          1.55 MB
          Tilman Hausherr
        8. FES-GGArabisch-p112.pdf
          128 kB
          Tilman Hausherr
        9. RAND_PE122z1.arabic.pdf
          227 kB
          Tilman Hausherr
        10. PDFBOX-4531-reduced.pdf
          7 kB
          Tilman Hausherr

            People

              Tilman Hausherr
              Votes: 1
              Watchers: 6
