[PDFBOX-5125] Slightly slanted line with right side higher than the left confuses PDFTextStripper with sortByPosition=true - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Bug
Affects Version/s: 2.0.22
Fix Version/s: None
Component/s: Text extraction
Labels: None

Description

The attached PDF, when run through PDFTextStripper with sortByPosition=true, yields incorrectly ordered text: the beginning of each affected line is printed after the end of the same line, separated by a superfluous line break. There are also some additional erroneous line breaks that do not reorder the text, like the one inside "keretmegállapodásos".

PDFBox extracts:

lőállító eszközök szállítása és kapcsolódó szolgáltatások 2013”
„Nyomat e
árgyban lefolytatott központosított közbeszerzési keretmegállapodáso
s eljárás 2. része
t
(Általános Multifunkciós eszközök) eredményeképpen a Beszerző és El
adó között
keretmegállapodás jött létre (továbbiakban: KM).

The same PDF opened in Adobe Reader, and all the text in it copied out:

„Nyomat előállító eszközök szállítása és kapcsolódó szolgáltatások 2013”
tárgyban lefolytatott központosított közbeszerzési keretmegállapodásos eljárás 2. része
(Általános Multifunkciós eszközök) eredményeképpen a Beszerző és Eladó között
keretmegállapodás jött létre (továbbiakban: KM).

(The word "teljesítése" is missing in both extractions due to an OCR error; that is a Tesseract problem and unrelated to this issue.)

In Firefox (pdf.js), we get:

„Nyomatelőállítóeszközökszállításaés kapcsolódószolgáltatások2013”tárgybanlefolytatottközpontosítottközbeszerzésikeretmegállapodásoseljárás2.  része(ÁltalánosMultifunkcióseszközök)eredményeképpena  Beszerzőés  Eladóközöttkeretmegállapodásjöttlétre(továbbiakban:KM).

(The missing spaces are a well-known incompatibility between Tesseract 4.0 and pdf.js, worked around in Tesseract 4.1, but the order of the text remains correct.)
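
For illustration, here is a minimal sketch of how a slight slant could produce this kind of reordering under a naive position sort. It is a toy model, not PDFBox's actual comparator: the `Fragment` class, the `extractLines` method, the coordinates, and the tolerance value are all hypothetical. It assumes fragments are grouped into lines by their vertical position first and only then ordered left to right.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model only: NOT PDFBox's real sorting code or data structures.
class SlantedLineSortDemo {

    /** Hypothetical text fragment: its content and the top-left corner of its box. */
    static final class Fragment {
        final String text;
        final double x;
        final double y; // measured from the top of the page, so smaller means higher

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups fragments into lines top to bottom, then orders each line left to right.
     * A fragment starts a new line when it sits more than `tolerance` below the
     * topmost fragment of the current line.
     */
    static List<String> extractLines(List<Fragment> fragments, double tolerance) {
        List<Fragment> byY = new ArrayList<>(fragments);
        byY.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> lines = new ArrayList<>();
        double lineTop = 0;
        for (Fragment f : byY) {
            if (lines.isEmpty() || f.y - lineTop > tolerance) {
                lines.add(new ArrayList<>());
                lineTop = f.y;
            }
            lines.get(lines.size() - 1).add(f);
        }

        List<String> out = new ArrayList<>();
        for (List<Fragment> line : lines) {
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Perfectly level line: all fragments share the same y.
        List<Fragment> level = Arrays.asList(
                new Fragment("left", 50, 100.0),
                new Fragment("middle", 120, 100.0),
                new Fragment("right", 250, 100.0));
        System.out.println(extractLines(level, 1.5)); // [left middle right]

        // Slightly slanted line: the right side is higher (smaller y) than the left.
        List<Fragment> slanted = Arrays.asList(
                new Fragment("left", 50, 100.0),
                new Fragment("middle", 120, 99.0),
                new Fragment("right", 250, 98.0));
        System.out.println(extractLines(slanted, 1.5)); // [middle right, left]
    }
}
```

In this model the higher right-hand part of the slanted line is emitted first and the left-hand start of the line follows after a line break, which resembles the "lőállító eszközök ..." / "„Nyomat e" pattern in the PDFBox output above. Whether PDFBox behaves this way internally is not confirmed here.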

Attachments

BB-8541-1-ocr.pdf
08/Mar/21 22:07
71 kB
Gábor Stefanik

People

Assignee: Unassigned

Reporter: Gábor Stefanik

Votes: 0

Watchers: 2

Dates

Created: 08/Mar/21 22:18

Updated: 11/Mar/21 18:26

Resolved: 11/Mar/21 18:26