PDFBox / PDFBOX-5879

Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page


Details

    Description

      java -jar pdfbox-app-3.0.3.jar export:text -console -rotationMagic -i="MVM_Aram_augusztus.pdf" 

      fails with the following error:

      java.lang.ClassCastException: class org.apache.pdfbox.cos.COSObject cannot be cast to class org.apache.pdfbox.cos.COSArray (org.apache.pdfbox.cos.COSObject and org.apache.pdfbox.cos.COSArray are in unnamed module of loader 'app')
              at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:336)
              at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:225)
              at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:62)
              at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
              at picocli.CommandLine.access$1500(CommandLine.java:148)
              at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
              at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
              at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
              at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
              at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
              at picocli.CommandLine.execute(CommandLine.java:2174)
              at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76) 

      The same command succeeds in 3.0.2.

      The triggering PDF can be downloaded from https://nagykorosiallatmentok.hu/wp-content/uploads/2023/09/MVM_Aram_augusztus.pdf, and is also attached.

      The root cause appears to be this change, made for PDFBOX-5841: https://github.com/apache/pdfbox/commit/b03d12d56dd74e5c52d80cf0b80c5bfb1f3209b2
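
      To illustrate the failure mode suggested by the exception message, here is a minimal, self-contained sketch. It does not use PDFBox: the classes Base, Stream, Array and Ref are hypothetical stand-ins for COSBase, COSStream, COSArray and COSObject, and the method names are made up for illustration. It assumes that the page's contents entry is an indirect reference to an array of streams, which is what the ClassCastException and the multiple content streams hint at, but this has not been verified against the linked commit.

```java
import java.util.ArrayList;
import java.util.List;

public class IndirectContentsSketch {
    // Hypothetical stand-ins for the COS object model; NOT the PDFBox classes.
    static class Base {}

    static class Stream extends Base {
        final String data;
        Stream(String data) { this.data = data; }
    }

    static class Array extends Base {
        final List<Base> items = new ArrayList<>();
    }

    // Models an indirect reference that wraps the real object.
    static class Ref extends Base {
        final Base target;
        Ref(Base target) { this.target = target; }
    }

    // Follows indirect references until a direct object (or null) is reached.
    static Base resolve(Base b) {
        while (b instanceof Ref) {
            b = ((Ref) b).target;
        }
        return b;
    }

    // Mirrors the reported failure: casting the raw entry without resolving it.
    static boolean blindCastFails(Base contents) {
        try {
            Array ignored = (Array) contents;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    // Defensive variant: resolve first, then accept a single stream or an array.
    static List<Stream> contentStreams(Base contents) {
        List<Stream> result = new ArrayList<>();
        Base resolved = resolve(contents);
        if (resolved instanceof Stream) {
            result.add((Stream) resolved);
        } else if (resolved instanceof Array) {
            for (Base item : ((Array) resolved).items) {
                Base direct = resolve(item);
                if (direct instanceof Stream) {
                    result.add((Stream) direct);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Array array = new Array();
        array.items.add(new Ref(new Stream("first")));
        array.items.add(new Stream("second"));
        Base contents = new Ref(array); // indirect reference to the array

        System.out.println("blind cast fails: " + blindCastFails(contents));
        System.out.println("streams found: " + contentStreams(contents).size());
    }
}
```

      The point of the sketch is only that the entry has to be dereferenced before its type is checked, and that both the single-stream and the array form need handling. How the actual fix in ExtractText should look is left to the maintainers.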


          People

            tilman Tilman Hausherr
            Googulator Gábor Stefanik