Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 2.0.15
Description
As reported by Elias Peterson in the mailing list:
I think I'm seeing some issues concerning the handling of the Arabic lam-with-alef ligature. I'm attempting to process the PDF here:
https://www.rand.org/content/dam/rand/pubs/perspectives/PE100/PE122/RAND_PE122z1.arabic.pdf
When I run the ExtractText command with 2.0.15 I get the following:
$ java -jar pdfbox-app-2.0.15.jar ExtractText -encoding UTF-8 RAND_PE122z1.arabic.pdf output.txt
$ head output.txt
C O R P O R A T I O N
منظور تحليلي
رؤى خبير بشأن قضايا السياسات اآلنية
االتفاق مع إيران
األيام التي تلي
...
The issue is with the last two lines in the above snippet: my understanding is that the ligature لا was normalized, but the two letters that compose it are in the wrong order. I was thinking that PDFBOX-684 sounded similar, and running the same PDF through 1.8.16 I see the ligature is normalized in the way I think is expected (although the interspersed English-language words are backwards here).
$ java -jar pdfbox-app-1.8.16.jar ExtractText -encoding UTF-8 RAND_PE122z1.arabic.pdf output.txt
...
$ head output.txt
N O I T A R O P R O C
منظور تحليلي
رؤى خبير بشأن قضايا السياسات الآنية
الاتفاق مع إيران
الأيام التي تلي
...
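For reference, the expected letter order can be checked outside PDFBox. The sketch below is not PDFBox code and makes no claim about how PDFBox normalizes text internally. It assumes the ligature is one of the Unicode presentation forms U+FEF5 to U+FEFC and uses only the JDK's java.text.Normalizer; the class and method names are hypothetical. Under NFKC each lam-with-alef form expands to lam (U+0644) followed by the alef variant, which matches the 1.8.16 output above, while the 2.0.15 output has the alef variant before the lam.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox: shows the logical order that
// Unicode compatibility normalization assigns to lam-with-alef ligatures.
public class LamAlefOrder {

    // Expand presentation-form ligatures into their base letters using NFKC.
    static String expand(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Format a string as space-separated code points, e.g. "U+0644 U+0627".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Isolated forms: lam with alef madda, hamza above, hamza below, plain alef.
        String[] ligatures = {"\uFEF5", "\uFEF7", "\uFEF9", "\uFEFB"};
        for (String lig : ligatures) {
            System.out.println(codePoints(lig) + " -> " + codePoints(expand(lig)));
        }
    }
}
```

Each printed pair begins with U+0644 (lam) on the right-hand side, so a word such as الاتفاق should start with alef, lam, alef once the ligature is expanded.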
Attachments
Issue Links
- relates to
  - PDFBOX-4481 Text extraction error with Thai combined glyph depending on space after it (Open)
  - PDFBOX-684 Incorrect ordering of compound Arabic glyphs (Closed)
- links to