PDFBox / PDFBOX-465

invalid date formats

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0-incubator
    • Fix Version/s: 1.8.3, 2.0.0
    • Component/s: Parsing
    • Labels: None

    Description

      This is with the latest from svn, Revision: 773978

      From a sample of 13304 PDF documents generated in a very wide variety of ways, I got 94 invalid date format errors.

      All of them seem to share the following stack trace:

      Caused by: java.io.IOException: Error converting date:Friday, July 11, 2008
      at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:240)
      at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:120)
      at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:783)
      at org.apache.pdfbox.pdmodel.PDDocumentInformation.getCreationDate(PDDocumentInformation.java:218)
      at message_analyzer.extractor.PDFExtractor.getContent(PDFExtractor.java:50)

      Some examples of the invalid dates:

      20070430193647+713'00'
      Tue Aug 21 10:35:22 2007
      Tuesday, November 04, 2008
      200712172:2:3
      Unknown
      20090319 200122
      9:47 5/12/2008
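
As a rough illustration of how some of the samples above could be handled, here is a minimal sketch of a fallback parser that tries a few extra patterns and returns null when none fits. It is not PDFBox's DateConverter. The class name `LenientDateSketch`, the method `parseOrNull`, and the pattern list are all hypothetical. It also assumes English month and day names, UTC for values that carry no time zone, and month/day order for `5/12/2008`, none of which the samples confirm.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class LenientDateSketch {
    // Hypothetical fallback patterns, chosen only to cover some of the
    // samples in this report; a real list would need to be much longer.
    private static final String[] PATTERNS = {
        "EEE MMM d HH:mm:ss yyyy",   // Tue Aug 21 10:35:22 2007
        "EEEE, MMMM dd, yyyy",       // Tuesday, November 04, 2008
        "yyyyMMdd HHmmss",           // 20090319 200122
        "H:mm M/d/yyyy"              // 9:47 5/12/2008 (month/day order is an assumption)
    };

    // Assumption: values without a time zone are interpreted as UTC.
    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    /** Returns the parsed date, or null if no pattern matches the whole string. */
    public static Calendar parseOrNull(String text) {
        if (text == null) {
            return null;
        }
        String trimmed = text.trim();
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
            format.setTimeZone(UTC);
            format.setLenient(false); // reject out-of-range fields such as month 13
            ParsePosition pos = new ParsePosition(0);
            Date date = format.parse(trimmed, pos);
            // Accept only if the pattern consumed the entire input.
            if (date != null && pos.getIndex() == trimmed.length()) {
                Calendar cal = Calendar.getInstance(UTC, Locale.ENGLISH);
                cal.setTime(date);
                return cal;
            }
        }
        return null; // e.g. "Unknown" or "200712172:2:3"
    }
}
```

Values such as `Unknown`, `200712172:2:3` and `20070430193647+713'00'` match none of these patterns, so the sketch returns null for them instead of guessing.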

      I don't think there is any hope of parsing all these date formats. It would be nice if this were not a fatal error, so that the parser could continue without a creation date.

      Is it the policy of PDFBox to be as forgiving as possible when reading PDF documents? Maybe toCalendar should return a new Calendar if parsing fails, rather than throwing.
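
Until the library itself is more forgiving, a caller can make the failure non-fatal on its own side. The sketch below shows one way to do that, using only the JDK. The names `NonFatalDateSketch`, `DateSource` and `tryGet` are hypothetical and not part of PDFBox. It assumes the date accessor throws `java.io.IOException`, as the stack trace above indicates for this revision.

```java
import java.io.IOException;
import java.util.Calendar;
import java.util.Optional;

public class NonFatalDateSketch {
    /** Hypothetical stand-in for any accessor that may throw while converting a date. */
    @FunctionalInterface
    public interface DateSource {
        Calendar get() throws IOException;
    }

    /**
     * Runs the accessor and turns a conversion failure (or a null result)
     * into an empty Optional, so the caller can carry on without a date.
     */
    public static Optional<Calendar> tryGet(DateSource source) {
        try {
            return Optional.ofNullable(source.get());
        } catch (IOException e) {
            // Unparseable date: treat it as "no date" rather than a fatal error.
            return Optional.empty();
        }
    }
}
```

With a `PDDocumentInformation` instance named `info` (an assumed variable), the call in the extractor could then look like `NonFatalDateSketch.tryGet(() -> info.getCreationDate())`. Unlike returning a freshly created Calendar, an empty result keeps "no usable date" distinguishable from a real timestamp.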

      Attachments

        1. SimpleDateParsingTest.java
          5 kB
          Peter_Lenahan@ibi.com

            People

              Assignee: Andreas Lehmkühler (lehmi)
              Reporter: Sean Bridges (sgbridges)
              Votes: 0
              Watchers: 1
