[PDFBOX-4768] Unable to extract text from PDF - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Bug
Affects Version/s: 2.0.18
Fix Version/s: None
Component/s: Text extraction
Labels: None

Description

I have a PDF document (see attachment) that can be viewed in Evince, but Tika text extraction does not work. I think that this is due to a crash in PDFBox.

I'm also a bit puzzled by the message: "You do not have permission to extract text".

Here is the output of the ExtractText command:

java -jar pdfbox-app-2.0.19-20200206.060243-86.jar ExtractText kst-31430-3-b3_unextractable.pdf tekst_jan.txt
Feb 07, 2020 11:03:15 AM org.apache.pdfbox.pdfparser.COSParser validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 211564, length: 3336, expected end position: 214900
Feb 07, 2020 11:03:15 AM org.apache.pdfbox.pdfparser.COSParser parseCOSStream
WARNING: stream ends with 'endobj' instead of 'endstream' at offset 225134
Exception in thread "main" java.io.IOException: You do not have permission to extract text
    at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:223)
    at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)
    at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
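
A note on the "permission" message: the stack trace shows the exception is thrown by the ExtractText tool itself (ExtractText.startExtraction), not by the parser. One plausible reading, which this ticket's text does not confirm for the attached file, is that the PDF is encrypted with an owner password and that its permission flags disallow content extraction. In PDFBox 2.x such flags are exposed through AccessPermission.canExtractContent(). The sketch below does not use PDFBox; it only illustrates how the permission bits of the /P integer from a PDF encryption dictionary can be decoded according to the PDF specification. The class name PdfPermissionBits and the sample values are hypothetical and are not taken from the attachment.

```java
/**
 * Minimal sketch: decodes selected user-access permission bits from the /P
 * value of a PDF encryption dictionary. Hypothetical helper, not PDFBox code.
 */
final class PdfPermissionBits {
    // Bit positions are 1-based, following the permission table in the PDF specification.
    static final int PRINT_BIT = 3;
    static final int EXTRACT_BIT = 5;
    static final int EXTRACT_FOR_ACCESSIBILITY_BIT = 10;

    private PdfPermissionBits() {
        // Static helpers only.
    }

    /** Returns true if the 1-based permission bit is set in the /P value. */
    static boolean isBitSet(int p, int bit) {
        return (p & (1 << (bit - 1))) != 0;
    }

    /** Copying or otherwise extracting text and graphics (bit 5). */
    static boolean canExtractContent(int p) {
        return isBitSet(p, EXTRACT_BIT);
    }

    /** Extraction in support of accessibility tools (bit 10). */
    static boolean canExtractForAccessibility(int p) {
        return isBitSet(p, EXTRACT_FOR_ACCESSIBILITY_BIT);
    }

    public static void main(String[] args) {
        // Illustrative /P values only (assumptions, not read from any real file).
        int[] samples = {-1, -3904};
        for (int p : samples) {
            System.out.printf("P=%d extract=%b accessibility=%b%n",
                    p, canExtractContent(p), canExtractForAccessibility(p));
        }
    }
}
```

If the flags do turn out to be the cause, a viewer such as Evince may still display the document, because showing pages and extracting text are governed by separate permissions.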

Attachments

kst-31430-3-b3_unextractable.pdf (630 kB), attached 07/Feb/20 10:11 by Jan Vlug

People

Assignee: Tilman Hausherr

Reporter: Jan Vlug

Votes: 0

Watchers: 3

Dates

Created: 07/Feb/20 10:11

Updated: 08/Feb/20 11:37

Resolved: 08/Feb/20 09:18