Uploaded image for project: 'Tika'
  1. Tika
  2. TIKA-778

NullPointerException in tika-app, parsing PDF content

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 1.0
    • 1.1
    • gui, parser
    • None

    Description

      I try to extract text from some pdf files with the tika app. In version 0.10 the error
      ERROR - Error: Could not parse predefined CMAP file for '--UCS2'
      is printed on the command line, but text extraction works and is correct.

      In version 1.0 I get the same error message on the command line, but also receive an exception and no text is extracted:
      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@62bc36ff
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
      at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
      at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:238)
      at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1995)
      at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2318)
      at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
      at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
      at javax.swing.AbstractButton.doClick(AbstractButton.java:357)
      at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:809)
      at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:850)
      at java.awt.Component.processMouseEvent(Component.java:6288)
      at javax.swing.JComponent.processMouseEvent(JComponent.java:3267)
      at java.awt.Component.processEvent(Component.java:6053)
      at java.awt.Container.processEvent(Container.java:2041)
      at java.awt.Component.dispatchEventImpl(Component.java:4651)
      at java.awt.Container.dispatchEventImpl(Container.java:2099)
      at java.awt.Component.dispatchEvent(Component.java:4481)
      at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4577)
      at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4238)
      at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4168)
      at java.awt.Container.dispatchEventImpl(Container.java:2085)
      at java.awt.Window.dispatchEventImpl(Window.java:2478)
      at java.awt.Component.dispatchEvent(Component.java:4481)
      at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:643)
      at java.awt.EventQueue.access$000(EventQueue.java:84)
      at java.awt.EventQueue$1.run(EventQueue.java:602)
      at java.awt.EventQueue$1.run(EventQueue.java:600)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
      at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:98)
      at java.awt.EventQueue$2.run(EventQueue.java:616)
      at java.awt.EventQueue$2.run(EventQueue.java:614)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
      at java.awt.EventQueue.dispatchEvent(EventQueue.java:613)
      at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
      at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
      at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:174)
      at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:169)
      at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:161)
      at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)
      Caused by: java.lang.NullPointerException
      at com.sun.org.apache.xml.internal.serializer.ToHTMLStream.endElement(ToHTMLStream.java:907)
      at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.endElement(TransformerHandlerImpl.java:273)
      at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
      at org.apache.tika.gui.TikaGUI$2.endElement(TikaGUI.java:519)
      at org.apache.tika.sax.TeeContentHandler.endElement(TeeContentHandler.java:94)
      at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
      at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256)
      at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
      at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
      at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
      at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:273)
      at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:216)
      at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:112)
      at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:323)
      at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:96)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      ... 43 more

      I tried the same pdf files (and can switch forth and back between version 0.10 and 1.0, this behavior is stable) and it looks like the exact same pdfbox version is inside the tika-app-0.10.jar and tika-app-1.0.jar. It would be great if version 1.0 could do what 0.10 can. Sorry that I cannot provide the pdf.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            mikemccand Michael McCandless
            bmathes Bastian Mathes
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment