[PDFBOX-2548] Problems with character extraction (fi ligature) - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Not A Problem
Affects Version/s: 1.8.7
Fix Version/s: None
Component/s: Text extraction
Labels: None
Environment: Windows 7 Professional, Java SE 8, Eclipse Kepler

Description

I have a PDF document whose font is OpenType (Garamond OpenType). As a result, PDFBox text extraction can also extract special characters (for example small capital letters), which caused problems when the underlying font was a simple Type 1 font.

However, text extraction now causes a different problem. In my case, when the character sequences "fi" or "fl" occur in the text, PDFTextStripper#getText(PDDocument doc) extracts each of them as a single character, 'ﬁ' or 'ﬂ', and inserts a space character to its right.

(Surprisingly, if I access the list of characters of a page via the charactersByArticle field of PDFTextStripper, or via the PDFTextStripper#processText(TextPosition pos) method, the same characters show up as separate, ordinary characters: f i / f l.)

My assumption is that the advantage of the underlying OpenType font turns into this particular disadvantage, because PDFTextStripper recognizes the character sequences f i / f l as the special characters ﬁ / ﬂ. This might be related to the fact that getText() determines things like whitespace characters from distances and positional placement.

Background: The given document is a dictionary with very densely printed text.

See this link for code and output:
http://stackoverflow.com/questions/27333499/problems-with-extracting-opentypefont-text-using-pdfbox

My question: is there anything I can do to avoid this problem?

thanks in advance ...
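
One possible workaround is to post-process the string returned by getText() and expand the ligature characters back into plain letters. The following is only a minimal sketch under assumptions: the class name `LigatureCleanup` and the method `expandLigatures` are hypothetical and not part of PDFBox, and the sketch relies only on the JDK's `java.text.Normalizer`. It handles the Latin ligature code points U+FB00 to U+FB06 and leaves all other characters untouched. It does not attempt to remove the extra space that follows the ligature.

```java
import java.text.Normalizer;

public class LigatureCleanup {

    // Hypothetical helper (not a PDFBox API): expands the Latin ligature
    // code points U+FB00..U+FB06 (ff, fi, fl, ffi, ffl, long-s-t, st)
    // into their plain-letter equivalents via compatibility decomposition.
    static String expandLigatures(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '\uFB00' && c <= '\uFB06') {
                // NFKD maps a single ligature character to its component letters.
                out.append(Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKD));
            } else {
                // Everything else is copied unchanged.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the string that getText() would return.
        String extracted = "de\uFB01nition, \uFB02uent";
        System.out.println(expandLigatures(extracted)); // prints "definition, fluent"
    }
}
```

Restricting the expansion to the ligature range, instead of normalizing the whole string, is a deliberate choice here: full compatibility normalization would also rewrite unrelated characters such as superscripts.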

Attachments

preflight.png
08/Dec/14 19:22
14 kB
John Hewson

Activity

People

Assignee: Unassigned

Reporter: Matthias Bösinger

Votes: 0

Watchers: 1

Dates

Created: 08/Dec/14 16:21

Updated: 09/Dec/14 11:45

Resolved: 09/Dec/14 11:45