Tika / TIKA-3319

Caused by: java.lang.NullPointerException (and more!)


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.24.1
    • Fix Version/s: None
    • Component/s: general
    • Labels:
      None
    • Environment:

      Windows 10
      Tika 1.24.1.jar
      Tika 1.24 python module
      python 3.9.2
      tesseract-ocr-w64-setup-v5.0.0-alpha.20201127
      (anything else that may be relevant?)

      Description

      In sum:
      1) Tika somehow doesn't "point" to a parser (but it kind of does).
      2) It says that I'm excluding Tesseract from Tika; I don't know how this happened to begin with.
      3) urllib, as used by the tika Python package, suddenly can't figure out that Tika exists.

      Please assist. Thank you in advance. 

      01 Tika-1.24.1.jar and the 1.24 Python module had been running well for months on my machine.
      02 Then I got Tesseract and a couple of other things to integrate with it.
      03 Then I upgraded Python from 3.8.2 to 3.9.2.
      04 I have always set the Windows 10 $env: variable to something like TIKA_SERVER_JAR="<yourpath>/tika-server.jar"
      05 Then I ran the tika Python module and got this urllib problem:
      urllib.error.URLError: <urlopen error unknown url type: c>
      06 Supposedly this is fixed by setting the $env: variable to something like
      TIKA_SERVER_JAR="file:///<yourpath>/tika-server.jar"
      07 I did this and messed around with it; no dice.
      08 So then I tried running Tika in PowerShell:
      java -jar "C:\PATH\TO\tika-app-1.24.1.jar" --gui
      This brings up the GUI, but it now gives me these warnings:

       

      Mar 14, 2021 10:33:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
      WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
      See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
      for optional dependencies.

      Mar 14, 2021 10:33:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
      WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
      you've excluded the TesseractOCRParser from the default parser.
      Tesseract may dramatically slow down content extraction (TIKA-2359).
      As of Tika 1.15 (and prior versions), Tesseract is automatically called.
      In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
      Mar 14, 2021 10:33:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
      WARNING: org.xerial's sqlite-jdbc is not loaded.
      Please provide the jar on your classpath to parse sqlite files.
      See tika-parsers/pom.xml for the correct version.
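
On the `unknown url type: c` error from step 05: one plausible reading is that urllib splits the value at the first colon, so the drive letter of a bare Windows path is taken for a URL scheme. The sketch below shows that behavior and one way to build a `file:///` value of the kind mentioned in step 06. The jar location is a hypothetical path chosen for illustration, and whether the tika Python module accepts the resulting URI is an assumption that was not verified here.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlparse

# Hypothetical install location, used only for illustration.
jar_path = r"C:\tools\tika\tika-server.jar"

# urllib splits on the first colon, so the drive letter "C" is read as a scheme.
drive_as_scheme = urlparse(jar_path).scheme

# PureWindowsPath builds a file:// URI without touching the filesystem,
# so this also runs on non-Windows machines.
jar_uri = PureWindowsPath(jar_path).as_uri()

print(drive_as_scheme)  # c
print(jar_uri)          # file:///C:/tools/tika/tika-server.jar
```

Building the URI with `as_uri()` rather than by string concatenation also percent-encodes characters such as spaces in the path.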

      09 Now when I use the --gui to parse a file I have parsed before, it shows this message:

       

      Apache Tika was unable to parse the document at C:\CODING\Apache Tika\Test03.pdf.
      The full exception stack trace is included below:
      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@473cb131
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:293)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
      at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
      at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
      at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:358)
      at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:309)
      at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:267)
      at java.desktop/javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1967)
      at java.desktop/javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2308)
      at java.desktop/javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:405)
      at java.desktop/javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:262)
      at java.desktop/javax.swing.AbstractButton.doClick(AbstractButton.java:369)
      at java.desktop/javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:1020)
      at java.desktop/javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:1064)
      at java.desktop/java.awt.Component.processMouseEvent(Component.java:6636)
      at java.desktop/javax.swing.JComponent.processMouseEvent(JComponent.java:3342)
      at java.desktop/java.awt.Component.processEvent(Component.java:6401)
      at java.desktop/java.awt.Container.processEvent(Container.java:2263)
      at java.desktop/java.awt.Component.dispatchEventImpl(Component.java:5012)
      at java.desktop/java.awt.Container.dispatchEventImpl(Container.java:2321)
      at java.desktop/java.awt.Component.dispatchEvent(Component.java:4844)
      at java.desktop/java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4919)
      at java.desktop/java.awt.LightweightDispatcher.processMouseEvent(Container.java:4548)
      at java.desktop/java.awt.LightweightDispatcher.dispatchEvent(Container.java:4489)
      at java.desktop/java.awt.Container.dispatchEventImpl(Container.java:2307)
      at java.desktop/java.awt.Window.dispatchEventImpl(Window.java:2764)
      at java.desktop/java.awt.Component.dispatchEvent(Component.java:4844)
      at java.desktop/java.awt.EventQueue.dispatchEventImpl(EventQueue.java:772)
      at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:721)
      at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:715)
      at java.base/java.security.AccessController.doPrivileged(AccessController.java:391)
      at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85)
      at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:95)
      at java.desktop/java.awt.EventQueue$5.run(EventQueue.java:745)
      at java.desktop/java.awt.EventQueue$5.run(EventQueue.java:743)
      at java.base/java.security.AccessController.doPrivileged(AccessController.java:391)
      at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85)
      at java.desktop/java.awt.EventQueue.dispatchEvent(EventQueue.java:742)
      at java.desktop/java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:203)
      at java.desktop/java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:124)
      at java.desktop/java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:113)
      at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:109)
      at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
      at java.desktop/java.awt.EventDispatchThread.run(EventDispatchThread.java:90)
      Caused by: java.lang.NullPointerException
      at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractXMPXFA(AbstractPDF2XHTML.java:209)
      at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:678)
      at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:267)
      at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:96)
      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:174)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      ... 44 more

      10 Most notably these lines:

      A) org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@473cb131
      B) Caused by: java.lang.NullPointerException

      11 Here is the output of java -jar tika-app-1.24.1.jar --dump-current-config

      Mar 14, 2021 10:15:23 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
      WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
      See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
      for optional dependencies.

      Mar 14, 2021 10:15:24 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
      WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
      you've excluded the TesseractOCRParser from the default parser.
      Tesseract may dramatically slow down content extraction (TIKA-2359).
      As of Tika 1.15 (and prior versions), Tesseract is automatically called.
      In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
      Mar 14, 2021 10:15:24 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
      WARNING: org.xerial's sqlite-jdbc is not loaded.
      Please provide the jar on your classpath to parse sqlite files.
      See tika-parsers/pom.xml for the correct version.
      <?xml version="1.0" encoding="UTF-8" standalone="no"?>
      <properties>
      <!--for example: <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml"/>-->
      <service-loader dynamic="true" loadErrorHandler="IGNORE"/>
      <encodingDetectors>
      <encodingDetector class="org.apache.tika.detect.DefaultEncodingDetector"/>
      </encodingDetectors>
      <translator class="org.apache.tika.language.translate.DefaultTranslator"/>
      <detectors>
      <detector class="org.apache.tika.detect.DefaultDetector"/>
      </detectors>
      <parsers>
      <parser class="org.apache.tika.parser.DefaultParser"/>
      </parsers>
      </properties>
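
Regarding point 2) in the summary: the Tesseract warning says OCR will be applied unless TesseractOCRParser has been excluded, and the dumped config above lists only the DefaultParser with nothing excluded. A rough way to check a config for exclusions is sketched below. It assumes exclusions are written as `parser-exclude` children of a `parser` element, as in Tika's XML config format; the function name and the second sample config are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The <parsers> section as it appears in the --dump-current-config output.
DUMPED_CONFIG = """<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
  </parsers>
</properties>"""

# Illustrative config showing what an explicit Tesseract exclusion looks like.
CONFIG_WITH_EXCLUDE = """<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>"""


def excluded_parsers(config_xml):
    """Return the class names listed in <parser-exclude> elements."""
    root = ET.fromstring(config_xml)
    return [element.get("class") for element in root.iter("parser-exclude")]


print(excluded_parsers(DUMPED_CONFIG))        # []
print(excluded_parsers(CONFIG_WITH_EXCLUDE))  # ['org.apache.tika.parser.ocr.TesseractOCRParser']
```

An empty list for the dumped config would be consistent with Tesseract still being active rather than excluded.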

      12 Any help would be greatly appreciated.
      13A The odd thing is that when I run something like
      java -jar tika-app-1.24.1.jar -t Test03.pdf output.txt

      13B it prints the document text in PowerShell, then prints this below it (which I have never gotten before):

      Exception in thread "main" java.net.MalformedURLException: no protocol: output.txt
      at java.base/java.net.URL.<init>(URL.java:672)
      at java.base/java.net.URL.<init>(URL.java:568)
      at java.base/java.net.URL.<init>(URL.java:515)
      at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:488)
      at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)
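
The trace in 13B shows TikaCLI.process constructing a java.net.URL from `output.txt`, which suggests the trailing argument is being treated as another input to parse rather than as an output file. If that reading is right, the extracted text has to be captured from standard output instead. A minimal sketch, assuming `java` is on the PATH and using hypothetical helper names:

```python
import subprocess


def tika_text_command(jar, document):
    """Build the argument list for plain-text extraction (-t), with no output argument."""
    return ["java", "-jar", jar, "-t", document]


def extract_text_to_file(jar, document, output_path):
    """Run tika-app and write whatever it prints on stdout to output_path."""
    with open(output_path, "wb") as out:
        # check=True raises CalledProcessError if java exits with a non-zero status.
        subprocess.run(tika_text_command(jar, document), stdout=out, check=True)


print(tika_text_command("tika-app-1.24.1.jar", "Test03.pdf"))
# ['java', '-jar', 'tika-app-1.24.1.jar', '-t', 'Test03.pdf']
```

Calling `extract_text_to_file("tika-app-1.24.1.jar", "Test03.pdf", "output.txt")` would then play the role of a shell redirect such as `java -jar tika-app-1.24.1.jar -t Test03.pdf > output.txt`.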

       

              People

              • Assignee:
                Unassigned
                Reporter:
                rickuls Richard Kraus