Tika / TIKA-1737

PDFBox 1.8.10 is still a basket case

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.10
    • Fix Version/s: None
    • Component/s: general
    • Labels: None
    • Environment: Linux, Solaris

    Description

      In TIKA-1471 I reported OOM errors when parsing PDF files. According to that issue, the problems were fixed in 1.7. I've just updated to Tika 1.10, and rather than PDFBox being better, it's actually far, far worse. With the same corpus, Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, while Tika 1.10 (PDFBox 1.8.10) has 453. As far as I can tell, the memory leaks are also worse in 1.8.10.

      I've had to resort to destroying the Tika instances and starting over each time there's an error indexing a PDF file. It's so bad I'm going to switch to running pdftotext (part of Xpdf) as an external process. Note that many of the errors in PDFBox are clearly caused by programming errors, e.g. ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and EOFException.

      I strongly recommend that Tika either revert to PDFBox 1.8.6 or find a replacement for PDFBox, as 1.8.10 just isn't fit for purpose.
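
      For anyone wanting to try the same workaround, here is a rough sketch of running pdftotext as an external process from Java, with a timeout so a bad file cannot hang the indexer. It is not part of Tika: the class name PdfToTextRunner and both method names are hypothetical, and it assumes a pdftotext binary is on the PATH. The -q and -enc options are standard pdftotext options.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika or PDFBox.
final class PdfToTextRunner {

    // Build the command line. Assumes "pdftotext" is on the PATH.
    // -q suppresses messages; -enc UTF-8 selects the output encoding.
    static List<String> buildCommand(Path pdf, Path out) {
        return List.of("pdftotext", "-q", "-enc", "UTF-8",
                pdf.toString(), out.toString());
    }

    // Extract text from one PDF in a separate process, so a crash or
    // memory blow-up in the extractor cannot take down the caller's JVM.
    static String extract(Path pdf, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path out = Files.createTempFile("pdftotext", ".txt");
        try {
            Process p = new ProcessBuilder(buildCommand(pdf, out))
                    .redirectErrorStream(true)
                    // Discard console output so the child never blocks on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly(); // kill a hung extraction
                throw new IOException("pdftotext timed out on " + pdf);
            }
            if (p.exitValue() != 0) {
                throw new IOException(
                        "pdftotext exited with " + p.exitValue() + " for " + pdf);
            }
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out); // always clean up the temp file
        }
    }
}
```

      Writing to a temporary file rather than reading the child's stdout keeps the sketch free of pipe-draining threads. The timeout value is left to the caller.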

      Attachments

        1. pdfbox.txt
          184 kB
          Alan Burlison

            People

              Assignee: Unassigned
              Reporter: Alan Burlison (alanbur)
              Votes: 0
              Watchers: 3