[PDFBOX-4531] Extraction of Arabic PDF has incorrect ordering of normalized ligatures - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.15
Fix Version/s: 2.0.28, 3.0.0 PDFBox
Component/s: Text extraction
Labels:
- Arabic
- regression

Description

As reported by Elias Peterson on the mailing list:

I think I'm seeing some issues concerning the handling of the Arabic lam-with-alef ligature. I'm attempting to process the PDF here:
https://www.rand.org/content/dam/rand/pubs/perspectives/PE100/PE122/RAND_PE122z1.arabic.pdf

When I run the ExtractText command with 2.0.15 I get the following:
$ java -jar pdfbox-app-2.0.15.jar ExtractText -encoding UTF-8 RAND_PE122z1.arabic.pdf output.txt
$ head output.txt
C O R P O R A T I O N
منظور تحليلي
رؤى خبير بشأن قضايا السياسات اآلنية
االتفاق مع إيران
األيام التي تلي
...

The problem is in the last three lines of the snippet above: as I understand it, the ligature لا was normalized, but the two letters that compose it come out in the wrong order. PDFBOX-684 sounded similar, so I ran the same PDF through 1.8.16. There the ligature is normalized in what I believe is the expected order, although the interspersed English-language words come out backwards.

$ java -jar pdfbox-app-1.8.16.jar ExtractText -encoding UTF-8 RAND_PE122z1.arabic.pdf output.txt
...
$ head output.txt
N O I T A R O P R O C
منظور تحليلي
رؤى خبير بشأن قضايا السياسات الآنية
الاتفاق مع إيران
الأيام التي تلي
...
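
To make the expected ordering concrete, here is a minimal sketch using only the JDK's `java.text.Normalizer` (not PDFBox's own extraction code, whose internals are not shown in this report). The class and method names are hypothetical. It decomposes three lam-with-alef presentation forms with NFKC and prints the resulting code points. In each case lam (U+0644) comes first, which matches the 1.8.16 output (الآنية), whereas in the 2.0.15 output (اآلنية) the alef variant precedes the lam.

```java
import java.text.Normalizer;

// Hypothetical illustration class; not part of PDFBox.
public class LamAlefOrder {

    // Decompose presentation-form ligatures into their base letters
    // using Unicode compatibility normalization (NFKC).
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Render a string as a space-separated list of code points, e.g. "U+0644 U+0627".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String[] ligatures = {
            "\uFEF5", // lam with alef with madda above, isolated form
            "\uFEF7", // lam with alef with hamza above, isolated form
            "\uFEFB"  // lam with alef, isolated form
        };
        for (String lig : ligatures) {
            // Each line shows the ligature followed by its decomposition.
            System.out.println(codePoints(lig) + " -> " + codePoints(decompose(lig)));
        }
    }
}
```

Comparing extracted text against this logical order (lam before alef) is one way to check whether a given PDFBox version is affected.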

Attachments

- content_diffs_with_exceptions.xlsx, 01/Apr/23 08:37, 744 kB, Tilman Hausherr
- bidi-ligature-2.pdf, 22/Feb/23 06:00, 39 kB, Masaki Komedani
- bidi-ligature-1.pdf, 22/Feb/23 06:00, 5 kB, Masaki Komedani
- bidi-ligature.patch, 20/Feb/23 06:05, 2 kB, Masaki Komedani
- diff-output.zip, 19/Feb/23 13:10, 109 kB, Tilman Hausherr
- PDFBOX-679-toobig.pdf, 19/Feb/23 08:56, 247 kB, Tilman Hausherr
- artikel1_20_arab.pdf, 19/Feb/23 08:55, 1.55 MB, Tilman Hausherr
- FES-GGArabisch-p112.pdf, 19/Feb/23 08:55, 128 kB, Tilman Hausherr
- RAND_PE122z1.arabic.pdf, 30/Apr/19 09:06, 227 kB, Tilman Hausherr
- PDFBOX-4531-reduced.pdf, 30/Apr/19 09:05, 7 kB, Tilman Hausherr

Issue Links

relates to

- PDFBOX-4481 Text extraction error with Thai combined glyph depending on space after it (Open)
- PDFBOX-684 Incorrect ordering of compound Arabic glyphs (Closed)

links to

GitHub Pull Request #154

GitHub Pull Request #156

People

Assignee: Tilman Hausherr
Reporter: Tilman Hausherr
Votes: 1
Watchers: 6

Dates

Created: 30/Apr/19 09:04
Updated: 13/Apr/23 14:54
Resolved: 26/Feb/23 09:05