Description
This is with the latest from svn, Revision: 773978
From a sample of 13304 pdf documents generated in a very wide variety of ways, I got 94 invalid date formats,
It seems that all of these have the stack trace of,
Caused by: java.io.IOException: Error converting date:Friday, July 11, 2008
at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:240)
at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:120)
at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:783)
at org.apache.pdfbox.pdmodel.PDDocumentInformation.getCreationDate(PDDocumentInformation.java:218)
at message_analyzer.extractor.PDFExtractor.getContent(PDFExtractor.java:50)
Some examples of invalid dates are,
20070430193647+713'00'
Tue Aug 21 10:35:22 2007
Tuesday, November 04, 2008
200712172:2:3
Unknown
20090319 200122
9:47 5/12/2008
i don't think there is any hope of parsing all these date formats. If would be nice if this was not a fatal error, and the parser could continue without a creation date.
Is the policy of pdfbox to be as forgiving as possible when reading pdf documents? Maybe toCalendar should return a new Calendar() if parsing fails, rather than throwing.
Attachments
Attachments
Issue Links
- incorporates
-
PDFBOX-643 Date conversion errors
- Closed
-
PDFBOX-701 Additional date formats
- Closed
- is superceded by
-
PDFBOX-1633 DateConverter needs to work
- Closed