[PDFBOX-4101] Word ordering / line detection failures in text extraction - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.8
Fix Version/s: None
Component/s: Text extraction
Labels: None

Description

Dear Apache contributors,

I use PDFBox mainly for text extraction. In some cases the word order is incorrect, and line detection may fail as well.

In the attached PDF:

1st page: the first letter "D" is not extracted before "uis sit amet..." but at the end of the page;
2nd page: the text "scolaire ferry" is extracted just before "réouverture du musée", which is wrong because the two are not in the same column.

Handling these cases would be more than welcome. A.
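
The second symptom (text from one column emitted next to text from another) is the kind of result a purely position-based ordering can produce. The sketch below is only an illustration of that effect under assumed coordinates; it is not PDFBox code, the issue does not establish this as the cause, and `Fragment`, `sort_by_position`, `sort_by_column` and `column_split` are hypothetical names introduced here.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """A piece of text with an assumed position on the page."""
    text: str
    x: float  # left edge of the fragment
    y: float  # distance from the top of the page


def sort_by_position(fragments: List[Fragment]) -> List[str]:
    """Naive reading order: top to bottom, then left to right."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def sort_by_column(fragments: List[Fragment], column_split: float) -> List[str]:
    """Read the whole left column first, then the right column.

    column_split is a hypothetical, hand-picked x coordinate that
    separates the two columns; real layouts need actual detection.
    """
    left = [f for f in fragments if f.x < column_split]
    right = [f for f in fragments if f.x >= column_split]
    return sort_by_position(left) + sort_by_position(right)


# Two columns whose lines share the same vertical positions.
fragments = [
    Fragment("left column, line 1", 50, 100),
    Fragment("left column, line 2", 50, 120),
    Fragment("right column, line 1", 300, 100),
    Fragment("right column, line 2", 300, 120),
]

naive = sort_by_position(fragments)      # interleaves the two columns
by_column = sort_by_column(fragments, column_split=200)

print(naive)
print(by_column)
```

With the naive ordering, each left-column line is immediately followed by the right-column line at the same height, which mirrors the reported pattern; splitting by column first keeps each column contiguous.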

Attachments

fails_line_detection.pdf (06/Feb/18 21:26, 103 kB, Alexandre)
fails_line_detection-sort.txt (06/Feb/18 21:49, 6 kB, Tilman Hausherr)
fails_line_detection-unsort.txt (06/Feb/18 21:49, 6 kB, Tilman Hausherr)
hardtests-11.png (06/Feb/18 21:53, 1.75 MB, Alexandre)

People

Assignee: Unassigned
Reporter: Alexandre
Votes: 0
Watchers: 2

Dates

Created: 06/Feb/18 21:31
Updated: 07/Feb/18 05:45