PDFBox / PDFBOX-5153

New FlateFilter exception on Tika unit test files with 3.0.0-RC1

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 3.0.0 PDFBox
    • Fix Version/s: 3.0.0 PDFBox
    • Component/s: Parsing
    • Labels:
      None

      Description

      On TIKA-3347, we're integrating PDFBox 3.0.0-RC1. We're getting new flate filter exceptions on a set of files that I think I created with PDFBox a while ago.

      Looks like we're also getting xref exceptions.

      I would not be surprised in the least to learn that I did something wrong in the creation of these files and that they are corrupt!

      I can replicate this issue with java -jar pdfbox-app-3.0.0-RC1.jar export:text

      SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
      Error extracting text for document [IOException]: java.util.zip.DataFormatException: invalid block type
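
      For context on the message itself, here is a minimal sketch that uses only java.util.zip from the JDK, not PDFBox code. It shows how Inflater reports a DataFormatException when the bytes handed to it are not valid deflate data. The class name FlateProbe and its helper methods are hypothetical, and the corrupt input is hand-built for illustration; it is not taken from the Tika test file, and it says nothing about why that file fails.

      ```java
      import java.nio.charset.StandardCharsets;
      import java.util.Arrays;
      import java.util.zip.DataFormatException;
      import java.util.zip.Deflater;
      import java.util.zip.Inflater;

      public class FlateProbe {

          /** Compresses a small input into zlib-wrapped deflate data. */
          static byte[] deflate(byte[] raw) {
              Deflater deflater = new Deflater();
              deflater.setInput(raw);
              deflater.finish();
              // Buffer is generous enough for the tiny inputs used in this sketch.
              byte[] buf = new byte[raw.length + 64];
              int n = deflater.deflate(buf);
              deflater.end();
              return Arrays.copyOf(buf, n);
          }

          /**
           * Tries to inflate the data. Returns null if no format error was
           * detected, otherwise the message of the DataFormatException.
           */
          static String probe(byte[] data) {
              Inflater inflater = new Inflater();
              inflater.setInput(data);
              byte[] buf = new byte[1024];
              try {
                  while (!inflater.finished()) {
                      int n = inflater.inflate(buf);
                      // Stop if the inflater cannot make progress (e.g. truncated input).
                      if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                          break;
                      }
                  }
                  return null;
              } catch (DataFormatException e) {
                  return e.getMessage();
              } finally {
                  inflater.end();
              }
          }

          public static void main(String[] args) {
              byte[] valid = deflate("hello".getBytes(StandardCharsets.UTF_8));
              // A valid zlib header (0x78 0x9C) followed by a byte whose
              // block-type bits are set to the reserved value 3.
              byte[] corrupt = {0x78, (byte) 0x9C, 0x07};

              System.out.println("valid:   " + probe(valid));
              System.out.println("corrupt: " + probe(corrupt));
          }
      }
      ```

      A stream that decrypts or decodes to the wrong bytes would fail in the same way, so the message alone does not tell us whether the files are corrupt or whether the reading side changed.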
      

      One of the files: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/test/resources/test-documents/testPDF_no_extract_yes_accessibility_owner_user.pdf

            People

            • Assignee: Andreas Lehmkühler (lehmi)
            • Reporter: Tim Allison (tallison)
