PDFBOX-2158

ExtractText missing most of text in this PDF file, due to font bounding box with minus infinity

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.5, 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: Parsing, Text extraction
    • Labels: None
    • Environment: Windows x64

    Description

      The attached PDF file loses most of its text when processed by the ExtractText example program.

      I traced it down to PDFontDescriptorDictionary.getFontBoundingBox() getting a rectangle for COSName.FONT_BBOX that contained a ymin value of minus infinity. That method then creates a PDRectangle, which calculates a bounding box with a ymin value of -65,329. This results in an enormous text size, and things go downhill from there: the text cannot be matched up, and most of it ends up being discarded.

      I was able to hack a fix by doing a check in the constructor PDRectangle.PDRectangle( COSArray array ) for big negative numbers and setting them to 0. With that change, all the text came through as expected. However, I don't have enough familiarity with the code to understand what a real fix ought to look like.
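
      The workaround described above can be sketched roughly as follows. This is a standalone illustration, not the actual PDFBox code or the fix that was eventually committed: the class BBoxSanitizer, the method sanitize, and the cutoff value LIMIT are all hypothetical names and numbers chosen for the example, and the bounding box is modelled as a plain float array rather than a COSArray.

```java
import java.util.Arrays;

// Hypothetical helper illustrating the "clamp big negative numbers to 0" hack.
// It does not use or reproduce any PDFBox class.
final class BBoxSanitizer {
    // Assumed cutoff: anything below -LIMIT is treated as bogus.
    // The value is arbitrary and only serves this illustration.
    static final float LIMIT = 32000f;

    // Replace a single implausibly large negative coordinate with 0.
    static float sanitize(float value) {
        // Float.NEGATIVE_INFINITY also satisfies this comparison.
        return value < -LIMIT ? 0f : value;
    }

    // Return a sanitized copy of a bounding box; the input is left untouched.
    static float[] sanitize(float[] bbox) {
        float[] result = new float[bbox.length];
        for (int i = 0; i < bbox.length; i++) {
            result[i] = sanitize(bbox[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // xmin, ymin, xmax, ymax with a ymin of minus infinity, as in the report.
        float[] raw = {-10f, Float.NEGATIVE_INFINITY, 1000f, 900f};
        System.out.println(Arrays.toString(sanitize(raw)));
    }
}
```

      A check like this only hides the symptom: it silently rewrites whatever the file declares, so a legitimately large negative coordinate would also be replaced, which is one reason it is a hack rather than a real fix.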

      The PDF file appears to be fine in other programs such as Acrobat and NitroPDF.

      Attachments

        1. negative.text.box.pdf
          110 kB
          Joel Hirsh
        2. PDFBOX-2158-077702.pdf
          1.65 MB
          Tilman Hausherr

            People

              Assignee: Tilman Hausherr
              Reporter: Joel Hirsh