PDFBOX-5153

New FlateFilter exception on Tika unit test files with 3.0.0-RC1


Details

    • Type: Task
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 3.0.0 PDFBox
    • Fix Version/s: 3.0.0 PDFBox
    • Component/s: Parsing
    • Labels: None

    Description

      On TIKA-3347, we're integrating PDFBox 3.0.0-RC1. We're getting new flate filter exceptions on a set of files that I think I created with PDFBox a while ago.

      Looks like we're also getting xref exceptions.

      I would not be surprised in the least to learn that I did something wrong in the creation of these files and that they are corrupt!

      I can replicate this issue with:

      java -jar pdfbox-app-3.0.0-RC1.jar export:text

      SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
      Error extracting text for document [IOException]: java.util.zip.DataFormatException: invalid block type
      

      One of the files: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/test/resources/test-documents/testPDF_no_extract_yes_accessibility_owner_user.pdf
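
      For context on the log message above: a java.util.zip.DataFormatException is raised by the JDK's java.util.zip.Inflater when the bytes it is given are not a valid zlib/deflate stream. The sketch below is not PDFBox code and does not reproduce this issue; it only shows, under that assumption, how corrupt input surfaces as this exception. The class name CorruptFlateDemo, the method isCorrupt, and the hand-built byte array are all hypothetical.

```java
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class CorruptFlateDemo {

    /**
     * Returns true if the JDK inflater rejects the bytes with a
     * DataFormatException, false if it inflates them without error.
     */
    public static boolean isCorrupt(byte[] data) {
        // Default constructor expects a zlib-wrapped stream (2-byte header + deflate data).
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            byte[] buffer = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // Stop if the inflater cannot make progress (e.g. truncated input);
                // that case is not reported as an exception.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break;
                }
            }
            return false;
        } catch (DataFormatException e) {
            System.out.println("DataFormatException: " + e.getMessage());
            return true;
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        // Hypothetical input: a valid zlib header (0x78 0x9C) followed by a byte
        // whose deflate block-type bits are set to the reserved value 3.
        byte[] corrupt = {0x78, (byte) 0x9C, 0x07, 0x00, 0x00};
        System.out.println("corrupt? " + isCorrupt(corrupt));
    }
}
```

      If a stream's bytes are damaged or otherwise not what the decoder expects, an error of this kind is the likely result; whether that is what happens with these test files is not established here.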


          People

            Assignee: Andreas Lehmkühler (lehmi)
            Reporter: Tim Allison (tallison)
            Votes: 0
            Watchers: 4
