  1. PDFBox
  2. PDFBOX-1744

Be resilient to PDFs with missing version info

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.8.2
    • Fix Version/s: 1.8.3, 2.0.0
    • Component/s: Parsing
    • Labels: None
    • Environment: PDFBox 1.8.2, IntelliJ IDEA 12.1.6, Mac OS X 10.7.5, Java 1.7, Maven 2.2.1

    Description

      Proposed addition to parseHeader() in pdfbox/src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java (1.8.2): default the PDF version to 1.4 when it is missing (yes, there really are docs out there like this!).
      This prevents an exception caused by a negative substring offset calculation: "String index out of range: -3"
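
      To illustrate the idea, here is a minimal standalone sketch of a header version parser that falls back to 1.4. It is not the attached patch and not PDFBox's actual parseHeader() code; the class name HeaderVersionSketch, the method parseVersion and the constants are hypothetical, and it assumes the header line has already been read into a String.

```java
// Hypothetical sketch, not PDFBox code: shows one way to default the version
// instead of computing a negative substring offset on a short header.
class HeaderVersionSketch {
    // Assumed marker that precedes the version number in a PDF header line.
    static final String PDF_MARKER = "%PDF-";
    // Fallback suggested in this issue for files with no version info.
    static final float DEFAULT_VERSION = 1.4f;

    static float parseVersion(String headerLine) {
        int start = headerLine.indexOf(PDF_MARKER);
        if (start < 0) {
            // Choice made for this sketch only: a missing marker also yields the default.
            return DEFAULT_VERSION;
        }
        // Everything after the marker; may be empty when the version is missing.
        String tail = headerLine.substring(start + PDF_MARKER.length()).trim();

        // Collect the leading run of digits and dots, e.g. "1.6" from "1.6 garbage".
        int end = 0;
        while (end < tail.length()
                && (Character.isDigit(tail.charAt(end)) || tail.charAt(end) == '.')) {
            end++;
        }
        if (end == 0) {
            // Nothing that looks like a version number: use the default.
            return DEFAULT_VERSION;
        }
        try {
            return Float.parseFloat(tail.substring(0, end));
        } catch (NumberFormatException e) {
            // Malformed numbers such as "." or "1.4.5" also fall back to the default.
            return DEFAULT_VERSION;
        }
    }
}
```

      With this sketch, a call such as HeaderVersionSketch.parseVersion("%PDF-") returns the default rather than throwing, because the length of the remaining text is checked before any substring is taken.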

      I have floated the question on the users@pdfbox.apache.org mailing list (10th October 2013) and it was suggested I default the PDF version to 1.4 in this scenario. I have tested it locally and it works (apparently PDFBox doesn't take the version number into account anyway).

      Now over to you guys to decide if this is a good idea or not in the wider scope.

      Should you give the green light, I attach:
      1) a sample file which causes the exception
      2) a patch file
      3) patching instructions.

      My goal is text extraction, even on broken files (if possible).

      Attachments

        1. pdfbox.patch
          2 kB
          Chris Bamford
        2. no_version.pdf
          0.0 kB
          Chris Bamford

          People

            Assignee: Andreas Lehmkühler (lehmi)
            Reporter: Chris Bamford (bammers)