TIKA-3261

Text file is parsed by "EmptyParser" but the file does contain what looks like valid text

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 1.20, 1.24.1
    • Fix Version/s: None
    • Component/s: detector
    • Labels:
      None

      Description

      I tried to parse the attached file with both Tika 1.20 and 1.24.1 (to reproduce, first extract choke.txt from choke.zip). The file looks valid in my text editor and appears to be a plain tab-delimited table with a mix of Hebrew and Latin characters. With 1.20 the server throws an exception; with 1.24.1 I get JSON metadata back but no content.
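
      A file that renders cleanly in an editor can still contain bytes that a type detector does not treat as plain text. The sketch below is one rough way to look at the start of a file for a byte-order mark, unexpected control bytes, and the share of non-ASCII bytes. The function name `inspect_prefix`, the 512-byte window, and the allow-list of control bytes are assumptions made for illustration; this is not Tika's detection logic.

```python
from collections import Counter

# Bytes below 0x20 that commonly appear in plain text (tab, LF, FF, CR, ESC).
# This allow-list is an assumption for illustration, not Tika's actual rule.
ALLOWED_CONTROL = {0x09, 0x0A, 0x0C, 0x0D, 0x1B}

# A few well-known byte-order marks and the encodings they usually indicate.
BOMS = {
    b"\xef\xbb\xbf": "UTF-8",
    b"\xff\xfe": "UTF-16LE",
    b"\xfe\xff": "UTF-16BE",
}


def inspect_prefix(data: bytes, limit: int = 512) -> dict:
    """Summarise the first `limit` bytes: BOM, odd control bytes, high-bit ratio."""
    prefix = data[:limit]
    bom = next((name for sig, name in BOMS.items() if prefix.startswith(sig)), None)
    # Count control bytes that are not on the allow-list (NUL is a typical example).
    control = Counter(b for b in prefix if b < 0x20 and b not in ALLOWED_CONTROL)
    # Bytes >= 0x80 are expected for Hebrew text in UTF-8 or a legacy code page.
    high = sum(1 for b in prefix if b >= 0x80)
    return {
        "bom": bom,
        "control_bytes": {f"0x{b:02x}": n for b, n in sorted(control.items())},
        "high_bit_ratio": high / len(prefix) if prefix else 0.0,
    }
```

      Calling `inspect_prefix(open("/tmp/choke.txt", "rb").read())` and finding a non-empty `control_bytes` entry would suggest the file is not plain text at the byte level, which is worth ruling out before treating the result as a detector bug.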

      My command line:

      curl -X PUT --upload-file /tmp/choke.txt http://localhost:9998/rmeta/text

      1.24.1  Result:

      [
      {"Content-Type":"application/octet-stream","X-Parsed-By":"org.apache.tika.parser.EmptyParser","X-TIKA:embedded_depth":"0","X-TIKA:parse_time_millis":"10"}
      ]
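
      A client can spot this case programmatically by checking the `X-Parsed-By` field of each metadata object in the response. The following is a minimal sketch that uses only the field names and values visible in the output above; the helper name `unparsed_entries` is hypothetical, and handling both a string and a list value is a defensive assumption.

```python
import json

EMPTY_PARSER = "org.apache.tika.parser.EmptyParser"


def unparsed_entries(rmeta_json: str) -> list:
    """Return the metadata objects whose X-Parsed-By names EmptyParser."""
    entries = json.loads(rmeta_json)
    flagged = []
    for entry in entries:
        parsed_by = entry.get("X-Parsed-By", [])
        # Accept either a single parser name or a list of names (assumption).
        if isinstance(parsed_by, str):
            parsed_by = [parsed_by]
        if EMPTY_PARSER in parsed_by:
            flagged.append(entry)
    return flagged


# The response reported for 1.24.1, as a single JSON array.
sample = (
    '[{"Content-Type":"application/octet-stream",'
    '"X-Parsed-By":"org.apache.tika.parser.EmptyParser",'
    '"X-TIKA:embedded_depth":"0","X-TIKA:parse_time_millis":"10"}]'
)

for entry in unparsed_entries(sample):
    print(entry["Content-Type"], "was handled by EmptyParser; no content extracted")
```

      Here the detected type is `application/octet-stream`, so the missing content follows from type detection rather than from a failure inside a text parser.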

       

      1.20 Result:

      INFO Starting Apache Tika 1.20 server
      INFO Setting the server's publish address to be http://localhost:9998/
      INFO Logging initialized @1704ms to org.eclipse.jetty.util.log.Slf4jLog
      INFO jetty-9.4.z-SNAPSHOT; built: 2018-08-30T13:59:14.071Z; git: 27208684755d94a92186989f695db2d7b21ebc51; jvm 8.0.6.10 - pwa6480sr6fp10-20200408_01(SR6 FP10)
      INFO Started ServerConnector@7b09f799{HTTP/1.1,[http/1.1]}{localhost:9998}
      INFO Started @2085ms
      WARN Empty contextPath
      INFO Started o.e.j.s.h.ContextHandler@-405fdc63{/,null,AVAILABLE}
      INFO Started Apache Tika server at http://localhost:9998/
      INFO rmeta/text (autodetecting type)
      WARN rmeta/text: Text extraction failed (null)
      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.server.resource.TikaResource$1@74f007b
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:224)
        at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:401)
        at org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:144)
        at org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:121)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:90)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
        at java.lang.reflect.Method.invoke(Method.java:508)
        at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
        at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
        at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:193)
        at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:103)
        at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
        at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
        at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
        at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
        at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
        at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
        at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1340)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1242)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at org.eclipse.jetty.server.Server.handle(Server.java:503)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:364)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
        at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683)
        at java.lang.Thread.run(Thread.java:820)
      Caused by: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported Media Type
        at org.apache.tika.server.resource.TikaResource$1.parse(TikaResource.java:127)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 37 more

       

        Attachments

        1. choke.zip
          248 kB
          Josh Burchard

            People

            • Assignee: Unassigned
            • Reporter: Josh Burchard (jmbox80)
            • Votes: 0
            • Watchers: 2
