Tika / TIKA-1845

Unable to extract content from certain RTFs using tika-server versions since 1.5

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6, 1.9, 1.11
    • Fix Version/s: 1.13
    • Component/s: server
    • Labels: None
    • Environment: Windows

    Description

      I have some patient letters that are RTF documents. When I extract the text from these documents using tika-server-1.5.jar, it works fine.

      However, in tika-server-1.6.jar and later versions (I've tried 1.6, 1.9 and 1.11), it fails with the stack trace and error shown below.

      I can provide a sample RTF that is failing.

      I wonder whether the error might be related to the following change, which was introduced in 1.6:

      • Made RTFParser's list handling slightly more robust against corrupt
        list metadata (TIKA-1305)

      It's possible that there is some issue with the RTF documents, but they are real patient letters and they open in Microsoft Word without any problems.

      Many thanks
      Ian

      Steps to reproduce issue
      ====================

      1. HTTP PUT to Tika server using curl:

      C:\Downloads\Apache Tika>curl -X PUT --data-binary @test-anonymised-letter.rtf http://localhost:9998/tika --header "Content-Type: application/rtf" --header "Accept: text/plain"

      --> this works fine when running tika-server-1.5.jar, but fails with tika-server-1.6.jar
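For repeating the test against several server versions, the same request can be scripted. The following is a minimal sketch in Python and is not part of the original report: the helper names `build_tika_request` and `extract_text` are hypothetical, and it assumes the server is listening on http://localhost:9998 as in the curl command above and returns UTF-8 text.

```python
import urllib.request

# Same endpoint as in the curl command; adjust if the server runs elsewhere.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(rtf_bytes, url=TIKA_URL):
    """Build the PUT request that the curl command sends (hypothetical helper)."""
    return urllib.request.Request(
        url,
        data=rtf_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/rtf",
            "Accept": "text/plain",
        },
    )


def extract_text(path, url=TIKA_URL):
    """Send the file to a running tika-server and return the extracted text.

    Assumes the response body is UTF-8 encoded plain text.
    """
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("test-anonymised-letter.rtf")` once with tika-server-1.5.jar running and once with a later version should show the difference described above. The report does not state which HTTP status the failing versions return, so the sketch does not handle errors.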

      2. Console output from the server (captured here while running version 1.9):
      INFO: Starting Apache Tika 1.9 server
      Feb 01, 2016 2:26:10 PM org.apache.cxf.endpoint.ServerImpl initDestination
      INFO: Setting the server's publish address to be http://localhost:9998/
      Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
      INFO: jetty-8.y.z-SNAPSHOT
      Feb 01, 2016 2:26:10 PM org.slf4j.impl.JCLLoggerAdapter info
      INFO: Started SelectChannelConnector@localhost:9998
      Feb 01, 2016 2:26:10 PM org.apache.tika.server.TikaServerCli main
      INFO: Started
      Feb 01, 2016 2:26:24 PM org.apache.tika.server.resource.TikaResource logRequest
      INFO: tika (application/rtf)
      Feb 01, 2016 2:26:25 PM org.apache.tika.server.resource.TikaResource parse
      WARNING: tika: Text extraction failed
      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@32a6dc
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
      at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:163)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244)
      at org.apache.tika.server.resource.TikaResource$4.write(TikaResource.java:321)
      at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:164)
      at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1363)
      at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:244)
      at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:117)
      at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:80)
      at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
      at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
      at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
      at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
      at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
      at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
      at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
      at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
      at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
      at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
      at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
      at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
      at org.eclipse.jetty.server.Server.handle(Server.java:370)
      at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
      at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
      at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
      at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:651)
      at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
      at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
      at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
      at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
      at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
      at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
      at java.lang.Thread.run(Unknown Source)
      Caused by: java.lang.NullPointerException
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:113)
      at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
      at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:103)
      at org.apache.tika.parser.rtf.RTFEmbObjHandler.extractObj(RTFEmbObjHandler.java:230)
      at org.apache.tika.parser.rtf.RTFEmbObjHandler.handleCompletedObject(RTFEmbObjHandler.java:198)
      at org.apache.tika.parser.rtf.TextExtractor.processGroupEnd(TextExtractor.java:1357)
      at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:456)
      at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:439)
      at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:86)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
      ... 34 more

      Feb 01, 2016 2:26:25 PM org.apache.cxf.jaxrs.utils.JAXRSUtils logMessageHandlerProblem
      SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
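In the trace above, the innermost cause is a NullPointerException thrown in AutoDetectParser.parse, reached through the embedded-object handling in RTFEmbObjHandler.extractObj. When comparing logs from several server versions, that innermost section can be pulled out mechanically. The following is a minimal Python sketch, not part of the original report: the function name `root_cause` is hypothetical, and it assumes the usual Java trace layout with "Caused by:" lines followed by "at " frames.

```python
def root_cause(trace):
    """Return (exception, frames) for the last 'Caused by:' section of a Java trace.

    Returns (None, []) when the text contains no 'Caused by:' line.
    """
    cause = None
    frames = []
    collecting = False
    for raw in trace.splitlines():
        line = raw.strip()
        if line.startswith("Caused by:"):
            # A later 'Caused by:' replaces an earlier one, so the last wins.
            cause = line[len("Caused by:"):].strip()
            frames = []
            collecting = True
        elif collecting and line.startswith("at "):
            frames.append(line[3:])
        else:
            # Any other line (e.g. '... 34 more') ends the current section.
            collecting = False
    return cause, frames
```

Applied to the log above, the first frame returned is the AutoDetectParser.parse call at AutoDetectParser.java:113.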

      Attachments

        1. example-that-fails.rtf (43 kB, Ian Williams)

            People

              Assignee: Tim Allison (tallison)
              Reporter: Ian Williams (ianw)
              Votes: 0
              Watchers: 4