TIKA-778

NullPointerException in tika-app, parsing PDF content

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: 1.1
    • Component/s: gui, parser
    • Labels:
      None

      Description

      I am trying to extract text from some PDF files with the Tika app. In version 0.10 the error
      ERROR - Error: Could not parse predefined CMAP file for '--UCS2'
      is printed on the command line, but text extraction works and the result is correct.

      In version 1.0 I get the same error message on the command line, but I also get an exception and no text is extracted:
      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@62bc36ff
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
      at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
      at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:238)
      at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1995)
      at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2318)
      at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
      at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
      at javax.swing.AbstractButton.doClick(AbstractButton.java:357)
      at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:809)
      at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:850)
      at java.awt.Component.processMouseEvent(Component.java:6288)
      at javax.swing.JComponent.processMouseEvent(JComponent.java:3267)
      at java.awt.Component.processEvent(Component.java:6053)
      at java.awt.Container.processEvent(Container.java:2041)
      at java.awt.Component.dispatchEventImpl(Component.java:4651)
      at java.awt.Container.dispatchEventImpl(Container.java:2099)
      at java.awt.Component.dispatchEvent(Component.java:4481)
      at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4577)
      at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4238)
      at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4168)
      at java.awt.Container.dispatchEventImpl(Container.java:2085)
      at java.awt.Window.dispatchEventImpl(Window.java:2478)
      at java.awt.Component.dispatchEvent(Component.java:4481)
      at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:643)
      at java.awt.EventQueue.access$000(EventQueue.java:84)
      at java.awt.EventQueue$1.run(EventQueue.java:602)
      at java.awt.EventQueue$1.run(EventQueue.java:600)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
      at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:98)
      at java.awt.EventQueue$2.run(EventQueue.java:616)
      at java.awt.EventQueue$2.run(EventQueue.java:614)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
      at java.awt.EventQueue.dispatchEvent(EventQueue.java:613)
      at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
      at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
      at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:174)
      at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:169)
      at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:161)
      at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)
      Caused by: java.lang.NullPointerException
      at com.sun.org.apache.xml.internal.serializer.ToHTMLStream.endElement(ToHTMLStream.java:907)
      at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.endElement(TransformerHandlerImpl.java:273)
      at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
      at org.apache.tika.gui.TikaGUI$2.endElement(TikaGUI.java:519)
      at org.apache.tika.sax.TeeContentHandler.endElement(TeeContentHandler.java:94)
      at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
      at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256)
      at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
      at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
      at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
      at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:273)
      at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:216)
      at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:112)
      at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:323)
      at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:96)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      ... 43 more

      I tried the same PDF files with both versions (I can switch back and forth between 0.10 and 1.0, and the behavior is stable), and it looks like the exact same PDFBox version is bundled in tika-app-0.10.jar and tika-app-1.0.jar. It would be great if version 1.0 could do what 0.10 can. Sorry that I cannot provide the PDF.
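The trace above reports a wrapping TikaException first and the real failure only under "Caused by". A minimal sketch of walking such a cause chain to its root follows. The class and method names are hypothetical, and a plain RuntimeException stands in for TikaException so the example needs nothing beyond the JDK.

```java
// Hypothetical helper, not part of Tika: unwraps nested exceptions.
class RootCauseSketch {

    // Follow getCause() until there is no further cause.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for the wrapped exception in the report.
        Exception wrapped = new RuntimeException(
                "Unexpected RuntimeException from parser",
                new NullPointerException());
        System.out.println(rootCause(wrapped).getClass().getName());
    }
}
```

Logging the root cause alongside the wrapper makes it easier to see which layer failed, here the JDK serializer frames rather than the PDF parser frames.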

        Activity

        Jukka Zitting added a comment -

        Looks like the problem is coming from the HTML serializer rather than from PDFBox.

        I can't reproduce this locally. Instead of sharing the test PDF here publicly, is it possible for you to send it to me in private? Alternatively, can you check if the following command works as expected (should produce valid HTML):

        $ java -jar tika-app-1.0.jar --html /path/to/test.pdf
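For context on the "HTML serializer" mentioned above: the bottom frames of the trace (TransformerHandlerImpl, ToHTMLStream) belong to the JDK's built-in SAX-to-HTML serializer. The sketch below drives that serializer directly with a small, balanced stream of SAX events. It does not reproduce the bug or use any Tika code. The class name, method name, and element structure are illustrative assumptions.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class: feeds SAX events to the JDK HTML serializer.
class HtmlSerializerSketch {

    static String serialize(String text) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        // Identity handler: serializes the SAX events it receives.
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        AttributesImpl noAttrs = new AttributesImpl();
        char[] chars = text.toCharArray();

        // Every startElement is matched by an endElement.
        handler.startDocument();
        handler.startElement("", "html", "html", noAttrs);
        handler.startElement("", "body", "body", noAttrs);
        handler.startElement("", "p", "p", noAttrs);
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "p", "p");
        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize("extracted text"));
    }
}
```

Because the failing frame is an endElement call inside this serializer, comparing the event stream the GUI passes to it with the one the command line passes is one place to start looking. That is a debugging suggestion, not a confirmed diagnosis.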
        
        Bastian Mathes added a comment -

        Calling the extraction directly on the command line actually works (with or without --html), so the issue is probably not as important as I thought. It is only opening the file from within the Tika application (the GUI) that causes this exception (in 1.0, not in 0.10). I am sending you a PDF by email.


          People

          • Assignee:
            Michael McCandless
            Reporter:
            Bastian Mathes