Uploaded image for project: 'PDFBox'
  1. PDFBox
  2. PDFBOX-1988

PDFBox ExtractText issue of PDF with no embedded fonts

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.4
    • Fix Version/s: 1.8.5, 2.0.0
    • Component/s: Rendering, Text extraction
    • Labels:
    • Environment:
      Windows 7
      Also, PASE on IBM i

      Description

      I have been using PDFBox 1.8.4 to extract text from several different PDF files fine. I use the latest PDFBox app with ExtractText command line. There is one PDF that PDFBox (and iText) fails to extract any text even though I can extract the text with Adobe Reader and also pdftotext.exe part of XPdf. "java -jar pdfbox-app-1.8.4.jar ExtractText Test1.pdf Out.txt". I don't want to have to rely on using pdftotext.exe from a PC since this is part of an automated application. I think the error relates to an unknown font type and having to use the few fonts installed in the jar file. I tried running the API classes and trying to force a font from a certain location but I still got errors. I thought I loaded the font with the loadTTF method but I don't know if that did anything with the font. I would really like to have this working straight from the ExtractText class anyway.
      Here are the errors I am getting. I tried this from both a Windows 7 PC and our IBM i in the PASE environment but I get the same errors. The section starting processEncodedText and on repeats a few times so I just included the first entries.

      Mar 10, 2014 3:50:44 PM org.apache.pdfbox.pdmodel.font.PDFontFactory createFont
      WARNING: Substituting TrueType for unknown font subtype=
      Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
      WARNING: java.lang.NullPointerException
      Throwable occurred: java.lang.NullPointerException
      at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:375)
      at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.ensureFontDescriptor(PDTrueTypeFont.java:221)

      at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:119)
      at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:121)
      at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:204)
      at org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:604)
      at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:54)
      at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
      at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
      at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
      at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
      at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
      at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
      at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
      at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
      at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
      at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
      Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processEncodedText
      WARNING: java.lang.NullPointerException

      Throwable occurred: java.lang.NullPointerException
      at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
      at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
      at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
      at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
      at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
      at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
      at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
      at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
      at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
      at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
      at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
      at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
      Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
      WARNING: java.lang.NullPointerException
      Throwable occurred: java.lang.NullPointerException
      at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:364)

      at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
      at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
      at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
      at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
      at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
      at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
      at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
      at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
      at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
      at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
      at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)

      Thanks,

      Craig Strong

        Attachments

        1. Test1.pdf
          4 kB
          Craig Strong

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              craigstrong Craig Strong
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - 120h
                120h
                Remaining:
                Remaining Estimate - 120h
                120h
                Logged:
                Time Spent - Not Specified
                Not Specified