Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.8
Fix Version/s: None
Component/s: PDModel, Text extraction
Labels: None
Environment: Debian 9 / macOS (not OS related)
Description
Hello,
I have a PDF document that produces the following stack trace during text extraction:
INFO: OpenType Layout tables used in font FreeSans are not implemented in PDFBox and will be ignored
Exception in thread "Thread-1" java.lang.IllegalStateException: Expected 'Page' but found COSName{Font}
    at org.apache.pdfbox.pdmodel.PDPageTree.sanitizeType(PDPageTree.java:227)
    at org.apache.pdfbox.pdmodel.PDPageTree.access$300(PDPageTree.java:38)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.next(PDPageTree.java:189)
    at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.next(PDPageTree.java:153)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:314)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
I found a similar problem reported on the pdfbox-users mailing list: https://mail-archives.apache.org/mod_mbox/pdfbox-users/201610.mbox/%3C2e858989-2fb9-d000-5320-b644fcc71f81@t-online.de%3E
I understand that the problem comes from the PDF itself, but given that some readers recover from it, are there any plans to add a similar recovery mechanism to PDFBox?
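To make the request concrete, here is a minimal sketch of the kind of lenient traversal I have in mind. It is a toy model and not PDFBox code: page tree nodes are plain Java maps rather than COS objects, and the names LenientPageTreeSketch, node, sampleTree, collectStrict and collectLenient are all made up for illustration. The strict mode imitates the behaviour in the stack trace (fail on the first kid that is neither Pages nor Page), while the lenient mode simply skips such a kid. It also assumes the tree has no cycles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a page tree; hypothetical, not the PDFBox implementation.
class LenientPageTreeSketch {

    // Build a node with a Type entry and a (possibly empty) Kids list.
    static Map<String, Object> node(String type, List<Map<String, Object>> kids) {
        Map<String, Object> n = new HashMap<>();
        n.put("Type", type);
        n.put("Kids", kids);
        return n;
    }

    // A broken tree: a Font dictionary sits where a Page is expected.
    static Map<String, Object> sampleTree() {
        List<Map<String, Object>> none = Collections.emptyList();
        return node("Pages", Arrays.asList(
                node("Page", none),
                node("Font", none),
                node("Page", none)));
    }

    // Strict: throws on the first unexpected node type.
    static List<Map<String, Object>> collectStrict(Map<String, Object> root) {
        List<Map<String, Object>> pages = new ArrayList<>();
        walk(root, pages, false);
        return pages;
    }

    // Lenient: skips nodes whose type is neither Pages nor Page.
    static List<Map<String, Object>> collectLenient(Map<String, Object> root) {
        List<Map<String, Object>> pages = new ArrayList<>();
        walk(root, pages, true);
        return pages;
    }

    @SuppressWarnings("unchecked")
    private static void walk(Map<String, Object> node,
                             List<Map<String, Object>> out,
                             boolean lenient) {
        Object type = node.get("Type");
        if ("Pages".equals(type)) {
            // Intermediate node: descend into every kid that is a dictionary.
            Object kids = node.get("Kids");
            if (kids instanceof List) {
                for (Object kid : (List<?>) kids) {
                    if (kid instanceof Map) {
                        walk((Map<String, Object>) kid, out, lenient);
                    }
                }
            }
        } else if ("Page".equals(type)) {
            out.add(node);
        } else if (!lenient) {
            throw new IllegalStateException("Expected 'Page' but found " + type);
        }
        // In lenient mode an unexpected node is ignored.
    }

    public static void main(String[] args) {
        // The lenient walk keeps the two real pages and drops the Font node.
        System.out.println(collectLenient(sampleTree()).size()); // prints 2
    }
}
```

With the sample tree above, collectStrict fails the same way my document does, whereas collectLenient returns the two valid pages. Whether silently skipping is acceptable, or whether a warning should be logged, is of course a design decision for the project.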
Thanks