Tika / TIKA-1233

PDFBox can throw StringIndexOutOfBoundsException on some dates

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 1.5
    • Fix Version/s: 1.10
    • Component/s: parser
    • Labels:

      Description

      PDFBox's date parser can throw a StringIndexOutOfBoundsException if the date string to parse is empty or contains only spaces. A few of my test PDFs have this "feature."

      Until PDFBOX-1803 is resolved, we can add an extra catch to prevent this from causing problems in Tika.

      @@ -171,6 +171,9 @@
                   addMetadata(metadata, TikaCoreProperties.CREATED, info.getCreationDate());
               } catch (IOException e) {
                   // Invalid date format, just ignore
      +        } catch (StringIndexOutOfBoundsException e){
      +            //remove after PDFBOX-1803 is fixed
      +            // Invalid date format, just ignore
               }
               try {
                   Calendar modified = info.getModificationDate();
      @@ -178,6 +181,9 @@
                   addMetadata(metadata, TikaCoreProperties.MODIFIED, modified);
               } catch (IOException e) {
                   // Invalid date format, just ignore
      +        } catch (StringIndexOutOfBoundsException e){
      +            //remove after PDFBOX-1803 is fixed
      +            // Invalid date format, just ignore
               }
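
As a minimal sketch of the pattern in the patch above, the following self-contained example shows why the extra catch matters. Note that `fragileParse` is a hypothetical stand-in written for illustration, not PDFBox's actual date parser; it is only assumed to fail the same way, by indexing into the string without first checking its length. `safeParse` and the class name are likewise illustrative.

```java
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.TimeZone;

public class DateGuardSketch {

    // Hypothetical stand-in for a fragile date parser (NOT PDFBox code).
    // It reports malformed dates as IOException, but it indexes into the
    // string without a length check, so empty or whitespace-only input
    // escapes as an unchecked StringIndexOutOfBoundsException.
    public static Calendar fragileParse(String raw) throws IOException {
        String trimmed = raw.trim();
        // charAt(0) throws StringIndexOutOfBoundsException when trimmed is empty.
        String digits = trimmed.charAt(0) == 'D' ? trimmed.substring(2) : trimmed;
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss");
        fmt.setLenient(false);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
            cal.setTime(fmt.parse(digits));
            return cal;
        } catch (ParseException e) {
            throw new IOException("Invalid date: " + raw, e);
        }
    }

    // Mirrors the shape of the patch: treat both failure modes as
    // "invalid date, just ignore" and return null instead of propagating.
    public static Calendar safeParse(String raw) {
        try {
            return fragileParse(raw);
        } catch (IOException e) {
            return null; // invalid date format
        } catch (StringIndexOutOfBoundsException e) {
            return null; // empty or whitespace-only date string
        }
    }

    public static void main(String[] args) {
        String[] samples = {"D:20140205120000", "", "   ", "not a date"};
        for (String s : samples) {
            Calendar c = safeParse(s);
            System.out.println("[" + s + "] -> " + (c == null ? "ignored" : "parsed"));
        }
    }
}
```

Without the second catch, the empty and whitespace-only inputs would propagate an unchecked exception out of metadata extraction and fail the whole parse, even though a missing date is harmless.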
      
      

          Activity

          tallison@mitre.org Tim Allison added a comment -

          Upgraded to PDFBox 1.8.10 with r1692341

          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #730 (See https://builds.apache.org/job/tika-trunk-jdk1.7/730/)
          TIKA-1233 reopened (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1683656)

          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
          tallison@mitre.org Tim Allison added a comment -

          Kevin Jones opened PDFBOX-2823 for a similar issue, which Tilman Hausherr identified as one space vs. the old TIKA-1233 "no space" problem.

          I've added back the catch block in r1683656. Hopefully this will make it into Tika 1.9.

          I'll leave this open until PDFBox 1.8.10 is integrated.

          tallison@mitre.org Tim Allison added a comment -

          Hindsight and current eval methodology turn out to be 20/20...at least in this case. I just ran Tika 1.5 against a small 10,000 pdf test set from govdocs1. There were 13 DateFormatter exceptions in that test set, by far the most common exception. With the current eval methodology (nascent TIKA-1302 code), we would have caught the importance of this before the 1.5 release.

          tallison@mitre.org Tim Allison added a comment -

          Luis Filipe Nassif, please reopen if you are still finding problems on your test set with trunk.

          tallison@mitre.org Tim Allison added a comment -

          After the upgrade to PDFBox 1.8.5, confirmed that the catch blocks for StringIndexOutOfBoundsException are no longer needed. Catch blocks removed in r1593983.

          tallison@mitre.org Tim Allison added a comment -

          Yes. I believe the issue was caused by an upgrade to the date parser in PDFBox. PDFBOX-1803 has been fixed in trunk, but those mods still need to be made to 1.8...and then we have to wait until the next release of PDFBox.

          lfcnassif Luis Filipe Nassif added a comment - - edited

          I also got this with Tika 1.5 on ~1500 of the 8500 PDF files in my base; it did not happen with 1.4.

          tallison@mitre.org Tim Allison added a comment -

          Added extra catch blocks for now in r1566910. Once PDFBOX-1803 is applied, we should be able to get rid of all catch blocks and use isBad() (if that proposed modification is accepted).


            People

            • Assignee: Unassigned
            • Reporter: tallison@mitre.org Tim Allison
            • Votes: 0
            • Watchers: 4
