Description
I am using PDFBox 1.7.1 and when parsing some PDF files, it is throwing exceptions and it's filling the Tomcat log very quickly (100MB in few seconds). There was another bug filed related to this issue. I tried the patch supplied in that bug but the issue is still there. I want to mention that the text gets extracted successfully from the PDF. But it just throws a log of WARN messages in the logs. As a workaround, I have set the LOG level to ERROR to avoid those WARN messages.
Here is the problematic PDF file:
http://doratst.uark.edu/fedora/repository/default%3A1590/OBJ/Traveler20120822.pdf
Related bug:
https://issues.apache.org/jira/browse/PDFBOX-1359#comment-13584669
I am getting the following exception:
WARN 2013-02-22 14:41:19,519 (PDFStreamEngine) java.lang.NullPointerException
java.lang.NullPointerException
WARN 2013-02-22 14:41:19,519 (PDFStreamEngine) java.lang.NullPointerException
java.lang.NullPointerException
WARN 2013-02-22 14:41:19,519 (PDFStreamEngine) java.io.IOException: Error: Could not find font(COSName
java.io.IOException: Error: Could not find font(COSName{F53.0}
) in map=
{F50.1=org.apache.pdfbox.pdmodel.font.PDType1Font@50246923, F51.0=org.apache.pdfbox.pdmodel.font.PDType1Font@672a1f0} at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:57)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:556)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:270)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:237)
at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:556)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:270)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:237)
at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:556)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:270)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:237)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:217)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:448)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:372)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:328)
at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:247)
at dk.defxws.fedoragsearch.server.TransformerToText.getTextFromPDF(TransformerToText.java:335)
at dk.defxws.fedoragsearch.server.TransformerToText.getText(TransformerToText.java:194)
at dk.defxws.fedoragsearch.server.GenericOperationsImpl.getDatastreamText(GenericOperationsImpl.java:668)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.xalan.extensions.ExtensionHandlerJavaClass.callFunction(ExtensionHandlerJavaClass.java:399)
at org.apache.xalan.extensions.ExtensionHandlerJavaClass.callFunction(ExtensionHandlerJavaClass.java:438)
at org.apache.xalan.extensions.ExtensionsTable.extFunction(ExtensionsTable.java:220)
at org.apache.xalan.transformer.TransformerImpl.extFunction(TransformerImpl.java:473)
at org.apache.xpath.functions.FuncExtFunction.execute(FuncExtFunction.java:206)
at org.apache.xpath.Expression.executeCharsToContentHandler(Expression.java:311)