PDFBox / PDFBOX-4986

Text can't be extracted from a document

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 2.0.21
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None
    • Environment: Windows 10, AdoptOpenJDK 11.0.8, 64-bit

      Description

      Hello everyone,

      PDFBox cannot extract the text from the attached document. It extracts only the first page, which contains "Please wait...", and the other pages are missing. PDFDebugger also shows only the first page. The document opens fine in Adobe, and all of the text is visible there. I suspect the content is generated dynamically.

      Sample code to reproduce the issue:

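      import java.io.File;
      import org.apache.pdfbox.pdmodel.PDDocument;
      import org.apache.pdfbox.text.PDFTextStripper;

      // Load with an empty password, then extract the text of all pages.
      try (PDDocument document = PDDocument.load(new File("c0015_re_1375881383129_eng[1].pdf"), "")) {
          PDFTextStripper stripper = new PDFTextStripper();
          String text = stripper.getText(document);
          System.out.println("Text: " + text);
      }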

      Thanks.
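
      A single "Please wait..." page is the usual placeholder in PDFs whose real content is a dynamic XFA form, which would fit the suspicion above. The report itself does not confirm this for the attached file. The following is a minimal sketch of a crude check that uses only the Java standard library: it searches the raw bytes of a file for the PDF name /XFA. The class name XfaMarkerCheck and the method containsXfaKey are hypothetical names chosen for illustration, and the check is a heuristic only, because the key can sit inside a compressed object stream where a plain byte search will not find it.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox: a rough heuristic only.
class XfaMarkerCheck {

    // Returns true if the raw PDF bytes contain the name "/XFA".
    static boolean containsXfaKey(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary
        // content survives the conversion unchanged.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/XFA");
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("Usage: java XfaMarkerCheck <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("Contains /XFA key: " + containsXfaKey(bytes));
    }
}
```

      A negative result proves nothing, for the reason given above. With PDFBox on the classpath, calling hasXFA() on the document's PDAcroForm is likely the more reliable test.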

        Attachments

        1. screenshot-1.png
          88 kB
          Tilman Hausherr
        2. c0015_re_1375881383129_eng[1].pdf
          96 kB
          Igor

            People

            • Assignee: Unassigned
            • Reporter: Igor (igor35)
            • Votes: 0
            • Watchers: 3
