Uploaded image for project: 'PDFBox'
  1. PDFBox
  2. PDFBOX-1084

java.lang.NumberFormatException when getting PDF text of some PDF file if dup line does not contains font index

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.0, 1.6.0
    • Fix Version/s: 1.8.0
    • Component/s: Text extraction
    • Labels:

      Description

      Get the following exception when getting text of some PDF if dup line does not contains font index (I can send a sample PDF file)
      java.lang.NumberFormatException: For input string: "8#40"
      at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
      at java.lang.Integer.parseInt(Integer.java:458)
      at java.lang.Integer.parseInt(Integer.java:499)
      at org.apache.pdfbox.pdmodel.font.PDType1Font.getEncodingFromFont(PDType1Font.java:341)
      at org.apache.pdfbox.pdmodel.font.PDType1Font.determineEncoding(PDType1Font.java:276)
      at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:181)
      at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:83)
      at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:152)
      at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:108)
      at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
      at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
      at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:243)
      at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
      at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
      at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
      at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
      at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:242)
      at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:255)

      Suggested correction is :
      in org.apache.pdfbox.pdmodel.font.PDType1Font.java in method getEncodingFromFont add try/catch block line 341 to avoid java.lang.NumberFormatException if dup line does not contains font index.

        Attachments

        1. w_4_2007.pdf
          225 kB
          Sandra Grenier

          Issue Links

            Activity

              People

              • Assignee:
                lehmi Andreas Lehmkühler
                Reporter:
                sgrenier Sandra Grenier
              • Votes:
                1 Vote for this issue
                Watchers:
                1 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: