[PDFBOX-4236] PDFTextStripper diacritic merge sometimes chooses wrong base glyph - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.0.0 PDFBox
Fix Version/s: None
Component/s: Text extraction
Labels: None

Description

While answering a Stack Overflow question, I saw that text extraction from the example file pattern3.pdf exposes an error in the diacritic merging code: the wrong base glyph is chosen.

From the bottom of my answer there:

By the way, your test file exposes an error in the PDFBox determination of the base glyph to merge a diacritic with: The "स[1434]ु[1441]न[1418]" is meant to be rendered as "सुन", i.e. the vowel sign u "ु" is combined with the letter sa "स", but PDFBox combines it with the subsequent letter na "न" as "सनु".

The cause is that PDFBox determines the letter to combine the diacritic with by the diacritic's origin, which here does indeed lie in the range of the following letter na "न". But because the vowel sign glyph is rendered before its origin (it is drawn in an area with a negative x coordinate), PDFBox determines the wrong association.
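The difference between the two ways of picking a base glyph can be illustrated with a minimal sketch. This is not PDFBox code: the `Glyph` class, the method names `baseByOrigin` and `baseByOverlap`, and all coordinates are hypothetical and chosen only to mirror the sa / vowel sign u / na situation described above. The overlap-based variant is one possible alternative, not a proposed patch.

```java
import java.util.Arrays;
import java.util.List;

public class DiacriticMergeSketch {

    // Hypothetical glyph model: the origin x plus the horizontal range actually painted.
    static final class Glyph {
        final String name;
        final double originX;
        final double inkLeft;
        final double inkRight;

        Glyph(String name, double originX, double inkLeft, double inkRight) {
            this.name = name;
            this.originX = originX;
            this.inkLeft = inkLeft;
            this.inkRight = inkRight;
        }
    }

    // Origin-based choice: the base whose painted range contains the diacritic's origin.
    static Glyph baseByOrigin(Glyph diacritic, List<Glyph> bases) {
        for (Glyph base : bases) {
            if (diacritic.originX >= base.inkLeft && diacritic.originX < base.inkRight) {
                return base;
            }
        }
        return null;
    }

    // Extent-based choice: the base whose painted range overlaps the diacritic's painted range most.
    static Glyph baseByOverlap(Glyph diacritic, List<Glyph> bases) {
        Glyph best = null;
        double bestOverlap = 0;
        for (Glyph base : bases) {
            double overlap = Math.min(diacritic.inkRight, base.inkRight)
                    - Math.max(diacritic.inkLeft, base.inkLeft);
            if (overlap > bestOverlap) {
                bestOverlap = overlap;
                best = base;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed layout: sa occupies x in [0, 10), na occupies x in [10, 18).
        Glyph sa = new Glyph("sa", 0, 0, 10);
        Glyph na = new Glyph("na", 10, 10, 18);
        // The vowel sign's origin is at x = 10, but it is painted at negative
        // offsets relative to that origin, i.e. in [3, 8], underneath sa.
        Glyph u = new Glyph("vowel sign u", 10, 3, 8);

        List<Glyph> bases = Arrays.asList(sa, na);
        System.out.println("by origin:  " + baseByOrigin(u, bases).name);
        System.out.println("by overlap: " + baseByOverlap(u, bases).name);
    }
}
```

With these assumed coordinates, the origin-based lookup lands on na while the overlap-based lookup lands on sa, which matches the mismatch reported here.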

Attachments

SA-U-NA.png, 04/Jun/18 15:53, 50 kB, Michael Klink
pattern3.pdf, 04/Jun/18 15:45, 275 kB, Michael Klink

People

Assignee: Unassigned

Reporter: Michael Klink

Votes: 0

Watchers: 1

Dates

Created: 04/Jun/18 15:54

Updated: 04/Jun/18 15:58