[PDFBOX-2023] Text extraction gets zero font height for type3 fonts - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.0.0
Component/s: Text extraction
Labels:
- type3

Description

Fred Andrews posted this to the user list and I can confirm that text extraction gets nothing:

I am using PDFTextStripper on some PDF statements from Bank of America, and everything is coming through as zero height. I traced it down to getFontHeight in org.apache.pdfbox.pdmodel.font.PDSimpleFont, which is indeed getting zero. The font is a type 3 font and I'm not sure how it should work, but getFontHeight is calling getAFM() and that is returning a null because its not a type 1 font. Then in the next section in getFontHeight there are no font descriptors, and the zero just flows through all the way through getFontHeight.

I searched for anything I could key on to calculate the font height but couldn't find it. The font size is claimed to be 20 by getFontSize(), although it appears to be more like 8. I did trace to where it got a font size command of twenty, but somehow I'm assuming that would need to be scaled, and I can't see where that might come from.

The font width on the other hand looks accurate, and I would think something similar to that would be needed, but would really appreciate some guidance on how it should work. If I have clue on how it should work I can see what I can do to implement it.

This file displays fine in Acrobat and edits fine in Nitro, so it can't be that invalid.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

zero_height.pdf
17/Jul/14 06:02
67 kB
Joel Hirsh
PDFBOX-2023.pdf
11/Apr/14 05:10
51 kB
Tilman Hausherr

Activity

People

Assignee:: Unassigned

Reporter:: Tilman Hausherr

Votes:: 4 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 11/Apr/14 05:07

Updated:: 17/Mar/16 19:06

Resolved:: 03/Feb/15 18:58