  1. Tika
  2. TIKA-1233

PDFBox can throw StringIndexOutOfBoundsException on some dates

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 1.5
    • Fix Version/s: 1.10
    • Component/s: parser
    • Labels:

      Description

      PDFBox's date parser can throw a StringIndexOutOfBoundsException if the date string to be parsed is empty or contains only spaces. A few of my test PDFs have this "feature."
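
      As a rough illustration of how this kind of failure can arise, the sketch below slices fixed positions out of the input without checking its length. It is not PDFBox's actual parsing code; `NaiveDateParser` and `parseYear` are hypothetical names, and the "D:YYYYMMDD..." input shape is assumed only for the example.

```java
public class NaiveDateParser {
    // Hypothetical stand-in for a date parser; NOT PDFBox's implementation.
    // Assumes input shaped like "D:20140205120000" and reads the four-digit year.
    static int parseYear(String date) {
        String trimmed = date.trim();
        // No length check: substring throws StringIndexOutOfBoundsException
        // when fewer than six characters remain after trimming.
        return Integer.parseInt(trimmed.substring(2, 6));
    }

    public static void main(String[] args) {
        System.out.println(parseYear("D:20140205120000"));
        for (String bad : new String[] {"", "   "}) {
            try {
                parseYear(bad);
            } catch (StringIndexOutOfBoundsException e) {
                // Both the empty and the space-only string end up here.
                System.out.println("failed on [" + bad + "]");
            }
        }
    }
}
```

      Because StringIndexOutOfBoundsException is unchecked, a caller that only catches IOException will let it propagate.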

      Until PDFBOX-1803 is resolved, we can add an extra catch to prevent this from causing problems in Tika.

      @@ -171,6 +171,9 @@
                   addMetadata(metadata, TikaCoreProperties.CREATED, info.getCreationDate());
               } catch (IOException e) {
                   // Invalid date format, just ignore
      +        } catch (StringIndexOutOfBoundsException e){
      +            //remove after PDFBOX-1883 is fixed
      +            // Invalid date format, just ignore
               }
               try {
                   Calendar modified = info.getModificationDate();
      @@ -178,6 +181,9 @@
                   addMetadata(metadata, TikaCoreProperties.MODIFIED, modified);
               } catch (IOException e) {
                   // Invalid date format, just ignore
      +        } catch (StringIndexOutOfBoundsException e){
      +            //remove after PDFBOX-1883 is fixed
      +            // Invalid date format, just ignore
               }
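
      The pattern in the patch above can be sketched in isolation as follows. This is a minimal sketch rather than Tika code: `SafeDateRead`, `DateSource`, and `readOrNull` are hypothetical names, and `DateSource` merely stands in for calls such as info.getCreationDate(), which the existing catch blocks suggest are declared to throw IOException.

```java
import java.io.IOException;
import java.util.Calendar;

public class SafeDateRead {
    // Hypothetical interface standing in for a date getter that may throw IOException.
    interface DateSource {
        Calendar get() throws IOException;
    }

    // Returns the date, or null if the underlying parser rejects the value.
    static Calendar readOrNull(DateSource source) {
        try {
            return source.get();
        } catch (IOException e) {
            return null; // invalid date format, just ignore
        } catch (StringIndexOutOfBoundsException e) {
            return null; // empty or space-only date string, just ignore
        }
    }

    public static void main(String[] args) {
        Calendar ok = readOrNull(Calendar::getInstance);
        Calendar bad = readOrNull(() -> {
            // Simulates the parser failing on an empty date string.
            throw new StringIndexOutOfBoundsException("empty date");
        });
        System.out.println(ok != null);
        System.out.println(bad == null);
    }
}
```

      Catching the specific exception type, rather than RuntimeException, keeps unrelated programming errors visible while the workaround is in place.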
      
      

              People

              • Assignee: Unassigned
              • Reporter: Tim Allison (tallison@mitre.org)