PDFBox / PDFBOX-5682

Long/permanent hang in PDFBox 3.x

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.1 PDFBox, 4.0.0
    • Component/s: None
    • Labels: None

    Description

      I found two files in the regression tests that now time out at 3 minutes but did not before. Unfortunately, PDFBox's export:text works on both, so the cause is probably some other structural feature, or perhaps a problem in Tika?

      This file halts after printing out the header for Table 19 on page 46: https://corpora.tika.apache.org/base/docs/govdocs1/078/078656.pdf

      Pure PDFBox's export:text complains multiple times: "Page skipped due to an invalid or missing type null", but it does finish quickly.

      This file halts after extracting "854,793,592":
      https://corpora.tika.apache.org/base/docs/commoncrawl3_refetched/G7/G7BO7PNCCREVF2BCY5YSYOPYDLMBYASY

      Pure PDFBox's export:text processes this without problem.
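
A minimal sketch of how a hang like this can be told apart from a slow parse: run the extraction on a worker thread and give up after a fixed limit, as the 3-minute regression timeout does. The class name `HangProbe`, the method `finishesWithin`, and the placeholder task are hypothetical and only for illustration; the real extraction call for the file under test would replace the placeholder.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HangProbe {

    /**
     * Runs the task on a daemon worker thread and reports whether it
     * completed within the given limit. Hypothetical helper, not part
     * of PDFBox or Tika.
     */
    public static boolean finishesWithin(Callable<?> task, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "hang-probe");
            t.setDaemon(true); // a stuck parse must not keep the JVM alive
            return t;
        });
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Best-effort interrupt; a tight loop that never checks the
            // interrupt flag will keep running on the daemon thread.
            future.cancel(true);
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder task (assumption): substitute the real extraction call.
        Callable<String> extraction = () -> "extracted text";
        boolean ok = finishesWithin(extraction, TimeUnit.MINUTES.toMillis(3));
        System.out.println(ok ? "finished" : "timed out");
    }
}
```

Running the same probe once against pure PDFBox text extraction and once against the Tika parser for the same file would show which layer exceeds the limit.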

          People

            Assignee: Andreas Lehmkühler (lehmi)
            Reporter: Tim Allison (tallison)
            Votes: 0
            Watchers: 5