TIKA-835: TNEF parsing unstable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.0
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None
    • Environment:

      CentOS 4.x/5.x/6.x
      Java 6

      Description

      We are seeing Tika throw exceptions while parsing TNEF content inside Solr. Sometimes we see an OutOfMemoryError like this:

      SEVERE: java.lang.OutOfMemoryError: Java heap space
              at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:50)
              at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
              at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
              at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
              at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
              at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
              at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
              at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:79)
              at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
              at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
              at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
              at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:195)
              at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
              at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
              at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:244)
              at org.apache.solr.core.SolrCore.execute(SolrCore.java:1478)
              at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
              at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
              at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
              at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
              at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
              at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
              at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
              at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
              at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
              at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
              at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
      

      Other times, we see errors like this one:

      Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
              at org.apache.poi.util.LittleEndian.readUShort(LittleEndian.java:302)
              at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:53)
              at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
              at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
              at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
              at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
              at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
              at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
              at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:79)
              at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
              ... 26 more
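
      Both traces end in the TNEFAttribute constructor, a few lines apart. One plausible reading, which we have not confirmed against the POI source, is that a length field taken from a corrupt or truncated file is used to size a buffer before it is validated. The sketch below is not POI code. The class BoundedRecordReader, the MAX_RECORD_BYTES cap and the record layout (4-byte little-endian length, payload, 2-byte checksum) are hypothetical, chosen only to illustrate how checking the declared length first could turn both failures into an ordinary IOException.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; this is NOT the POI TNEFAttribute code.
class BoundedRecordReader {
    // Assumed upper bound for a single record; pick a value that suits your data.
    static final int MAX_RECORD_BYTES = 16 * 1024 * 1024;

    // Reads a 4-byte little-endian int, failing cleanly if the stream ends early.
    static int readIntLE(InputStream in) throws IOException {
        int b0 = in.read(), b1 = in.read(), b2 = in.read(), b3 = in.read();
        if ((b0 | b1 | b2 | b3) < 0) {
            throw new EOFException("truncated length field");
        }
        return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
    }

    // Reads one record: length, payload, then a 2-byte checksum (read but not verified here).
    static byte[] readRecord(InputStream in) throws IOException {
        int length = readIntLE(in);
        // Validate the declared length BEFORE allocating, so a garbage value
        // cannot trigger a huge allocation and an OutOfMemoryError.
        if (length < 0 || length > MAX_RECORD_BYTES) {
            throw new IOException("implausible record length: " + length);
        }
        byte[] data = new byte[length];
        // readFully throws EOFException if fewer than 'length' bytes remain.
        new DataInputStream(in).readFully(data);
        int c0 = in.read(), c1 = in.read();
        if ((c0 | c1) < 0) {
            throw new EOFException("truncated checksum");
        }
        return data;
    }
}
```

      With a guard of this kind, a caller sees an IOException for a damaged file instead of a JVM-wide heap exhaustion.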
      

      I am able to reproduce these failures with tika-app-1.0.jar. I cannot share the content because it is proprietary. The OOM error is particularly problematic: it crashes Solr, and our document indexing pipeline backs up while it waits for Solr to restart. Please see also Solr ticket https://issues.apache.org/jira/browse/SOLR-2990, which contains the original report of the problem and some details of the environment where the tests are being performed.
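
      Since the failures reproduce with the standalone tika-app jar, one possible way to keep an OOM from taking Solr down is to run extraction in a child JVM with its own heap cap. The following is a minimal sketch under stated assumptions, not something this ticket prescribes: the class IsolatedTikaRun and its method names are hypothetical, it needs Java 8 or later (newer than the Java 6 listed under Environment), and it assumes tika-app exits with a nonzero status when parsing fails, which we have not verified.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs tika-app in a separate JVM so that a heap
// exhaustion kills only the child process, not the calling server.
class IsolatedTikaRun {

    // Builds: java -Xmx<N>m -jar <tika-app jar> --text <input file>
    static List<String> buildCommand(String tikaAppJar, int maxHeapMb, String inputFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx" + maxHeapMb + "m"); // heap cap applies to the child JVM only
        cmd.add("-jar");
        cmd.add(tikaAppJar);
        cmd.add("--text");                 // tika-app option for plain-text output
        cmd.add(inputFile);
        return cmd;
    }

    // Returns true if the child finished in time with exit status 0.
    static boolean extract(String tikaAppJar, int maxHeapMb, String inputFile,
                           File output, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(tikaAppJar, maxHeapMb, inputFile))
                .redirectOutput(output)                       // extracted text goes to a file
                .redirectError(ProcessBuilder.Redirect.INHERIT) // stack traces stay on stderr
                .start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly(); // child is stuck; give up on this document
            return false;
        }
        return p.exitValue() == 0;
    }
}
```

      A caller could invoke extract("tika-app-1.0.jar", 256, "mail.dat", new File("mail.txt"), 60) and skip or quarantine the document when it returns false. The file names and limits here are placeholders.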

            People

            • Assignee: Unassigned
            • Reporter: Rob Tulloh (rtulloh)