  1. PDFBox
  2. PDFBOX-4019

Expected 'Page' but found COSName{Font} in PDPageTree

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.8
    • Fix Version/s: None
    • Component/s: PDModel, Text extraction
    • Labels:
      None
    • Environment:
      Debian 9 / macOS (not OS related)

      Description

      Hello,

      I have a PDF document that produces the following stack trace during text extraction:

      INFO: OpenType Layout tables used in font FreeSans are not implemented in PDFBox and will be ignored
      Exception in thread "Thread-1" java.lang.IllegalStateException: Expected 'Page' but found COSName{Font}
      	at org.apache.pdfbox.pdmodel.PDPageTree.sanitizeType(PDPageTree.java:227)
      	at org.apache.pdfbox.pdmodel.PDPageTree.access$300(PDPageTree.java:38)
      	at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.next(PDPageTree.java:189)
      	at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.next(PDPageTree.java:153)
      	at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:314)
      	at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
      	at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
      

      I found a similar problem reported here: https://mail-archives.apache.org/mod_mbox/pdfbox-users/201610.mbox/%3C2e858989-2fb9-d000-5320-b644fcc71f81@t-online.de%3E

      I understand that the problem comes from the PDF itself, but given that some readers recover from it, is there any plan to add similar recovery to PDFBox?
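
      To illustrate the kind of recovery meant here, below is a minimal sketch of a lenient page-tree walk. It is not PDFBox code: the class LenientPageWalk, the method collectPages and the use of plain Maps to stand in for dictionaries are all hypothetical, chosen so the example runs without any library. It assumes that a leaf node with a missing Type may be treated as a page and that a leaf with any other Type (such as Font) can be skipped instead of raising an exception.

      ```java
      import java.util.ArrayList;
      import java.util.Collections;
      import java.util.IdentityHashMap;
      import java.util.List;
      import java.util.Map;
      import java.util.Set;

      // Hypothetical plain-Java model of a PDF page tree; this is NOT PDFBox API.
      class LenientPageWalk {

          // Returns the leaf nodes that can be used as pages, skipping bad ones.
          static List<Map<String, Object>> collectPages(Map<String, Object> root) {
              List<Map<String, Object>> pages = new ArrayList<>();
              // Identity-based set so that a malformed, cyclic tree cannot loop forever.
              Set<Map<String, Object>> seen =
                      Collections.newSetFromMap(new IdentityHashMap<>());
              walk(root, pages, seen);
              return pages;
          }

          private static void walk(Map<String, Object> node,
                                   List<Map<String, Object>> pages,
                                   Set<Map<String, Object>> seen) {
              if (node == null || !seen.add(node)) {
                  return; // missing node or node already visited
              }
              Object kids = node.get("Kids");
              if (kids instanceof List) {
                  // Intermediate node: descend into every dictionary-like child.
                  for (Object kid : (List<?>) kids) {
                      if (kid instanceof Map) {
                          @SuppressWarnings("unchecked")
                          Map<String, Object> child = (Map<String, Object>) kid;
                          walk(child, pages, seen);
                      }
                  }
                  return;
              }
              Object type = node.get("Type");
              // Assumption of this sketch: a leaf without Type counts as a page.
              if (type == null || "Page".equals(type)) {
                  pages.add(node);
              }
              // Any other Type (for example "Font") is skipped rather than thrown on.
          }

          public static void main(String[] args) {
              Map<String, Object> page = Map.of("Type", "Page");
              Map<String, Object> stray = Map.of("Type", "Font");
              Map<String, Object> root =
                      Map.of("Type", "Pages", "Kids", List.of(page, stray));
              System.out.println(collectPages(root).size()); // prints 1
          }
      }
      ```

      Skipping the stray node silently loses whatever that entry was meant to be, so a real implementation would probably want to log a warning for each skipped node.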

      Thanks

            People

            • Assignee:
              Unassigned
            • Reporter:
              nmarlk Nicolas M