[PDFBOX-4986] Text can't be extracted from a document - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Bug
Affects Version/s: 2.0.21
Fix Version/s: None
Component/s: Text extraction
Labels: None
Environment: Windows 10, AdoptOpenJDK 11.0.8, 64-bit

Description

Hello everyone,

PDFBox cannot extract the text from the attached document. It extracts only the first page, which reads "Please wait..."; the other pages are missing. I also tried loading the file in PDFDebugger, but it shows only the first page as well. The document opens fine in Adobe and all the text is visible there. I suspect the content is dynamically generated in some way.

Sample code to reproduce the issue:

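import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Load the attached file with an empty password and extract all text.
try (PDDocument document = PDDocument.load(new File("c0015_re_1375881383129_eng[1].pdf"), "")) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(document);
    System.out.println("Text: " + text);
}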

Thanks.

Attachments

c0015_re_1375881383129_eng[1].pdf, 12/Oct/20 01:47, 96 kB, Igor
screenshot-1.png, 13/Oct/20 17:34, 88 kB, Tilman Hausherr

People

Assignee: Unassigned
Reporter: Igor
Votes: 0
Watchers: 3

Dates

Created: 12/Oct/20 01:56
Updated: 14/Oct/20 00:10
Resolved: 13/Oct/20 17:35