[PDFBOX-2158] ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.8.5, 2.0.0
Fix Version/s: 2.0.0
Component/s: Parsing, Text extraction
Labels:
None
Environment:
Windows x64

Description

The attached PDF file loses most of its text when processed by the ExtractText example program.

I traced it down to PDFontDescriptorDictionary.getFontBoundingBox() reading a rectangle for COSName.FONT_BBOX that contained a ymin value of minus infinity. That method then creates a PDRectangle, which ends up with a bounding box whose ymin is -65,329. This produces an enormous text size, and things go downhill from there: the text positions cannot be matched up, and most of the text ends up being discarded.

I was able to hack a fix by adding a check in the constructor PDRectangle.PDRectangle( COSArray array ) for big negative numbers and setting them to 0. With that change, all the text came through as expected. However, I am not familiar enough with the code to know what a real fix ought to look like.
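The workaround described above can be sketched in isolation as follows. This is only an illustration under stated assumptions, not the change that was committed to PDFBox: the class `BBoxSanitizer`, its method names, and the threshold `MAX_ABS` are all hypothetical, and the sketch works on a plain `float[]` of `[xmin, ymin, xmax, ymax]` rather than on a COSArray. It also rejects NaN and large positive values, which goes slightly beyond the "big negative numbers" check mentioned in the report.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of PDFBox. */
final class BBoxSanitizer {
    // Assumed cut-off: any coordinate with a larger magnitude is treated as bogus.
    static final float MAX_ABS = 32767f;

    /** Returns 0 for NaN, infinite, or implausibly large values; otherwise the value itself. */
    static float sanitizeValue(float v) {
        if (Float.isNaN(v) || Float.isInfinite(v) || Math.abs(v) > MAX_ABS) {
            return 0f;
        }
        return v;
    }

    /** Sanitizes a bounding box given as [xmin, ymin, xmax, ymax]. */
    static float[] sanitizeBox(float[] bbox) {
        if (bbox == null || bbox.length != 4) {
            throw new IllegalArgumentException("bounding box must have 4 entries");
        }
        float[] out = new float[4];
        for (int i = 0; i < 4; i++) {
            out[i] = sanitizeValue(bbox[i]);
        }
        return out;
    }

    /** Height of the box: ymax minus ymin. */
    static float height(float[] bbox) {
        return bbox[3] - bbox[1];
    }

    public static void main(String[] args) {
        // Made-up values mimicking a font bounding box whose ymin is minus infinity.
        float[] raw = {0f, Float.NEGATIVE_INFINITY, 1000f, 900f};
        float[] fixed = sanitizeBox(raw);
        System.out.println("raw height:   " + height(raw));
        System.out.println("fixed box:    " + Arrays.toString(fixed));
        System.out.println("fixed height: " + height(fixed));
    }
}
```

Replacing the bad coordinate with 0 keeps the box height finite, so any text size derived from it stays in a plausible range. Whether 0 is the right substitute, or whether such a bounding box should be ignored altogether, is exactly the open question raised above.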

The PDF file looks fine in other programs such as Acrobat and NitroPDF.

Attachments

negative.text.box.pdf, 110 kB, 22/Jun/14 20:41, Joel Hirsh
PDFBOX-2158-077702.pdf, 1.65 MB, 03/Aug/14 13:27, Tilman Hausherr

Issue Links

duplicates

PDFBOX-3130 Recent regression in PDFTextStripper, text getting garbled

Closed

People

Assignee: Tilman Hausherr

Reporter: Joel Hirsh

Votes: 0

Watchers: 5

Dates

Created: 22/Jun/14 20:39

Updated: 17/Mar/16 19:07

Resolved: 24/Nov/15 20:22