Uploaded image for project: 'Tika'
  1. Tika
  2. TIKA-1233

PDFBox can throw StringIndexOutOfBoundsException on some dates

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Trivial
    • Resolution: Fixed
    • 1.5
    • 1.10
    • parser

    Description

      PDFBOX's date parser can throw a StringIndexOutOfBoundsException if a date string for parsing is empty or contains only spaces. A few of my test pdfs have this "feature."

      Until PDFBOX-1803 is resolved, we can add an extra catch to prevent this from causing problems in TIKA

      @@ -171,6 +171,9 @@
                   addMetadata(metadata, TikaCoreProperties.CREATED, info.getCreationDate());
               } catch (IOException e) {
                   // Invalid date format, just ignore
      +        } catch (StringIndexOutOfBoundsException e){
      +            //remove after PDFBOX-1883 is fixed
      +            // Invalid date format, just ignore
               }
               try {
                   Calendar modified = info.getModificationDate();
      @@ -178,6 +181,9 @@
                   addMetadata(metadata, TikaCoreProperties.MODIFIED, modified);
               } catch (IOException e) {
                   // Invalid date format, just ignore
      +        } catch (StringIndexOutOfBoundsException e){
      +            //remove after PDFBOX-1883 is fixed
      +            // Invalid date format, just ignore
               }
      
      

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              tallison Tim Allison
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: