PDFBox / PDFBOX-4986

Text can't be extracted from a document

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 2.0.21
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None
    • Environment: Windows 10, AdoptOpenJDK 11.0.8, 64-bit

    Description

      Hello everyone,

      PDFBox cannot extract the text from the attached document. It extracts only the first page, which contains "Please wait...", and the remaining pages are missing. Loading the document in PDFDebugger also shows only the first page. The document opens correctly in Adobe, where all the text is visible. I suspect the content is generated dynamically.

      Sample code to reproduce the issue:

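      import java.io.File;
      import org.apache.pdfbox.pdmodel.PDDocument;
      import org.apache.pdfbox.text.PDFTextStripper;

      // The empty string is the password argument of PDDocument.load(File, String).
      try (PDDocument document = PDDocument.load(new File("c0015_re_1375881383129_eng[1].pdf"), "")) {
          PDFTextStripper stripper = new PDFTextStripper();
          String text = stripper.getText(document);
          System.out.println("Text: " + text);
      }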
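
      A "Please wait..." placeholder page is typical of forms whose visible pages are built at view time from an embedded XFA form definition. Whether the attached file is such a form is an assumption, since the issue does not state it. A minimal sketch of a quick check follows, using only the Java standard library: it scans the raw bytes for the `/XFA` dictionary key. `XfaProbe` and `mentionsXfa` are hypothetical names, not part of PDFBox, and the scan is only a heuristic because it cannot see keys stored inside compressed object streams.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of PDFBox.
final class XfaProbe {

    // Returns true if the raw file bytes contain the literal name "/XFA".
    // Heuristic only: a key inside a compressed object stream is not found.
    static boolean mentionsXfa(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so no byte is lost.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/XFA");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java XfaProbe <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(mentionsXfa(bytes)
                ? "File mentions /XFA; the pages may be rendered from an XFA form."
                : "No uncompressed /XFA key found.");
    }
}
```

      If the key is present, the text shown in Adobe probably lives in the XFA data rather than in the page content streams, which would explain why a page-based text stripper returns only the placeholder page.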
      

      Thanks.

      Attachments

        1. c0015_re_1375881383129_eng[1].pdf
          96 kB
          Igor
        2. screenshot-1.png
          88 kB
          Tilman Hausherr

          People

            Assignee: Unassigned
            Reporter: Igor (igor35)
            Votes: 0
            Watchers: 3
