StringIndexOutOfBoundsException when doing DateConverter.parseDate()

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.9, 1.8.10
    • Fix Version/s: 1.8.10
    • Component/s: Parsing
    • Labels: None

      Description

      From Kevin J. in the user mailing list:

      We are currently using Apache Solr / Tika to index documents for searching. The exact version that is being used is version 1.8.8 of PDFBox.

      We came across a document that produced this stack trace (trimmed to the relevant part of PDFBox):

      Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 1
              at java.lang.String.charAt(String.java:658)
              at org.apache.pdfbox.util.DateConverter.parseDate(DateConverter.java:679)
              at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:808)
              at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:780)
              at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:753)
              at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:849)
              at org.apache.pdfbox.pdmodel.PDDocumentInformation.getCreationDate(PDDocumentInformation.java:212)
      

      Inspection of the document's binary revealed that it contained a creationDate consisting of a single white space (ASCII 0x20), which is probably illegal. I managed to create a small reproduction of the error using the following code:

      File file = new File("/path/to/document/bad.pdf");
      InputStream stream = new FileInputStream(file);
      PDFParser parser = new PDFParser(stream);
      parser.parse();
      PDDocumentInformation info = parser.getPDDocument().getDocumentInformation();
      Calendar creationDate = info.getCreationDate();
      System.out.println(creationDate.toString());
      

      This produces the same stack trace. I verified this against the latest 1.8.9 build from the site, and the behavior remains. This looks very similar to PDFBOX-1803; however, that issue is marked as resolved in 1.8.5. So, my questions:

      • Is the exception expected behavior? Ideally Tika would just index the document anyway; the creation date isn't important to us. Tika had an issue for this, TIKA-1233, which was marked as fixed by swallowing the exception, but according to its comments the try/catch was removed in r1593983 because the problem was marked as fixed here.
      • Is this a regression, or slightly different somehow from 1803? Shall I create a new issue or get the existing 1803 re-opened?
      • The PDF that reproduces the issue can be downloaded here: https://www.dropbox.com/s/tll5rscrlt95xuc/bad.pdf?dl=0
      1. PDFBOX-2823.pdf
        42 kB
        Tilman Hausherr

          Activity

          tilman Tilman Hausherr added a comment -

          I can confirm that it happens with 1.8.9, but not for 2.0. I don't think it is a regression... Yours is "one space and then nothing", the old issue was "nothing".
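The difference between "nothing" and "one space and then nothing" can be illustrated with a minimal sketch. The method below is a hypothetical stand-in and is not the actual DateConverter code: it assumes a parser that guards against the empty string but then reads the second character unconditionally.

```java
public class SingleSpaceDateSketch {

    // Hypothetical stand-in, NOT the real DateConverter code: it guards
    // against the empty string but then assumes a second character exists.
    static boolean startsWithDatePrefix(String text) {
        if (text.isEmpty()) {
            return false;                 // the "nothing" case is handled
        }
        char second = text.charAt(1);     // throws when text has length 1
        return text.charAt(0) == 'D' && second == ':';
    }

    // Reports whether the hypothetical check fails with the exception
    // type seen in the stack trace.
    static boolean throwsOutOfBounds(String text) {
        try {
            startsWithDatePrefix(text);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("empty string fails: " + throwsOutOfBounds(""));
        System.out.println("single space fails: " + throwsOutOfBounds(" "));
        System.out.println("\"D:2015\" fails: " + throwsOutOfBounds("D:2015"));
    }
}
```

Under these assumptions only the single-space input trips the exception, which would be consistent with an earlier fix for the empty case not covering this one.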

          jira-bot ASF subversion and git services added a comment -

          Commit 1683595 from Tilman Hausherr in branch 'pdfbox/branches/1.8'
          [ https://svn.apache.org/r1683595 ]

          PDFBOX-2823: avoid StringIndexOutOfBoundsException with one space date, better exception
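The commit message describes the intent (no StringIndexOutOfBoundsException for a one-space date, and a better exception) but not the code. One way such a guard could look is sketched below. The method name `requireDateText`, the null handling, and the message text are all assumptions made for illustration and are not taken from r1683595.

```java
import java.io.IOException;

public class BlankDateGuardSketch {

    // Hypothetical pre-check: the method name, the null handling, and the
    // message text are assumptions, not taken from the actual commit.
    static String requireDateText(String raw) throws IOException {
        if (raw == null) {
            return null;                       // no date entry at all
        }
        String trimmed = raw.trim();
        if (trimmed.isEmpty()) {
            // A checked, descriptive exception instead of an index error.
            throw new IOException("Error converting date: '" + raw + "'");
        }
        return trimmed;
    }

    // Convenience wrapper: true if the input is rejected by the guard.
    static boolean rejects(String raw) {
        try {
            requireDateText(raw);
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("single space rejected: " + rejects(" "));
        System.out.println("normal date rejected: " + rejects("D:20150601"));
    }
}
```

Trimming before any index access means a whitespace-only value is reported as an unparseable date rather than surfacing as a low-level string error.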

          vcsjones Kevin Jones added a comment -

          Thank you very much!

          tallison@mitre.org Tim Allison added a comment -

          added catch block back to Tika...hopefully before 1.9 rc2 is cut.

          r1683656
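For callers who do not need the creation date, a defensive wrapper along these lines keeps a malformed date from aborting the whole extraction. This is a generic sketch and not the Tika change itself; `readDateOrNull` is a hypothetical helper, and in real use the lookup would wrap a call such as `info.getCreationDate()`.

```java
import java.util.Calendar;
import java.util.concurrent.Callable;

public class OptionalDateSketch {

    // Hypothetical helper: runs a date lookup and treats any failure as
    // "no date available" so the rest of the document is still processed.
    static Calendar readDateOrNull(Callable<Calendar> lookup) {
        try {
            return lookup.call();
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Simulates the failure from the stack trace in this issue.
        Calendar bad = readDateOrNull(() -> {
            throw new StringIndexOutOfBoundsException(1);
        });
        Calendar good = readDateOrNull(Calendar::getInstance);
        System.out.println("bad date present: " + (bad != null));
        System.out.println("good date present: " + (good != null));
    }
}
```

Catching `Exception` here is deliberately broad, because both the unchecked index error and a checked parsing error should be treated the same way for optional metadata.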


            People

            • Assignee: Tilman Hausherr
            • Reporter: Tilman Hausherr