SOLR-11142

NotOLE2FileException when adding MSG files with attachments

    Details

      Description

      When adding MSG files that have attachments, we consistently get this error:

      ERROR (qtp1013423070-16) [   x:default] o.a.s.s.HttpSolrCall null:org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x0A1A0A0D474E5089, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
      	at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:162)
      	at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:112)
      	at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:302)
      	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:111)
      	at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
      	at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:103)
      	at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:129)
      	at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:238)
      	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
      	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
      	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
      	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:69)
      	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:155)
      	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2082)
      	at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:651)
      	at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:458)
      	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:229)
      	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:184)
      	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
      	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
      	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
      	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
      	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
      	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
      	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
      	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
      	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
      	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
      	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
      	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
      	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
      	at org.eclipse.jetty.server.Server.handle(Server.java:499)
      	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
      	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
      	at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
      	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
      	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
      	at java.lang.Thread.run(Thread.java:745)
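
The two values in the error message are the first eight bytes of the stream, read as a little-endian 64-bit integer. As a minimal sketch (the class and helper names here are ours, not part of POI or Tika), decoding them back into file order suggests that the attachment being parsed starts with the PNG signature, while the parser expected the OLE2 one:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of POI or Tika.
final class SignatureDecoder {

    // Convert the 64-bit value from the error message back into the
    // eight bytes as they appear at the start of the file.
    static byte[] toFileBytes(long signature) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    // Render bytes as space-separated upper-case hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] read = toFileBytes(0x0A1A0A0D474E5089L);
        byte[] expected = toFileBytes(0xE11AB1A1E011CFD0L);
        System.out.println("read:     " + toHex(read));     // 89 50 4E 47 0D 0A 1A 0A
        System.out.println("expected: " + toHex(expected)); // D0 CF 11 E0 A1 B1 1A E1
        // Bytes 1..3 of the value that was read are plain ASCII letters.
        System.out.println(new String(read, 1, 3, StandardCharsets.US_ASCII)); // PNG
    }
}
```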
      

      After inspecting the Solr code, the problem appears to come from here:

      In the ExtractingDocumentLoader class we have:

      context.set(Parser.class, parser);
      

      In our case the parser is an instance of OfficeParser.

      When processing an MSG file, the OutlookExtractor class is used by the OfficeParser.

      To process the attachments of the MSG file, the OutlookExtractor calls the ParsingEmbeddedDocumentExtractor.

      To parse an attachment, the ParsingEmbeddedDocumentExtractor uses the DelegatingParser.

      The DelegatingParser determines the parser to use by just looking at the parser set in the context.

      protected Parser getDelegateParser(ParseContext context) {
          return context.get(Parser.class, EmptyParser.INSTANCE);
      }
      

      So in our case, every attachment is processed with the OfficeParser, even if the attachment is not an MS Office document!
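
To make the chain above easier to follow, here is a simplified, JDK-only model of the lookup. The nested Parser and ParseContext types are stand-ins we wrote for illustration, not the real Tika classes, and the "ole2"/"png" strings are hypothetical placeholders for the attachment content. It only shows that the embedded document is handed to whatever is registered under Parser.class:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model for illustration; these are NOT the real Tika classes.
final class DelegationModel {

    interface Parser {
        String parse(String attachmentType);
    }

    // Stand-in for a type-keyed context with a default value on lookup.
    static final class ParseContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    // Hypothetical parser that only accepts OLE2 input.
    static final Parser OFFICE_PARSER = type -> {
        if (!type.equals("ole2")) {
            throw new IllegalStateException("not a valid OLE2 document: " + type);
        }
        return "parsed " + type;
    };

    // Stand-in for the fallback used when nothing is registered.
    static final Parser EMPTY_PARSER = type -> "";

    // Mirrors the quoted lookup: no detection, just the registered parser.
    static String parseEmbedded(ParseContext context, String attachmentType) {
        Parser delegate = context.get(Parser.class, EMPTY_PARSER);
        return delegate.parse(attachmentType);
    }

    public static void main(String[] args) {
        ParseContext context = new ParseContext();
        context.set(Parser.class, OFFICE_PARSER);
        System.out.println(parseEmbedded(context, "ole2"));
        try {
            parseEmbedded(context, "png");
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

In this model the PNG attachment fails for the same structural reason as in the stack trace: the delegate is fixed by the context, not chosen per attachment.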

      To make this work correctly, an AutoDetectParser should be set in the context when working with MSG files:

      context.set(Parser.class, new AutoDetectParser());
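
The reason this should help is that an auto-detecting parser chooses a handler per document from its detected type instead of assuming one format. The sketch below is a much-simplified, JDK-only stand-in for that idea, using only the two signatures that appear in the error message. It is not how Tika implements detection, and the class, method and handler names are hypothetical:

```java
import java.util.Arrays;

// Illustrative stand-in for type detection; not Tika code.
final class SignatureDispatch {

    // OLE2 compound document signature (the "expected" value in the error).
    static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // PNG signature (the value that was actually read).
    static final byte[] PNG = {
        (byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A
    };

    static boolean startsWith(byte[] data, byte[] signature) {
        return data.length >= signature.length
                && Arrays.equals(Arrays.copyOf(data, signature.length), signature);
    }

    // Pick a handler name from the leading bytes of the attachment.
    static String chooseHandler(byte[] attachment) {
        if (startsWith(attachment, OLE2)) {
            return "office";
        }
        if (startsWith(attachment, PNG)) {
            return "image";
        }
        return "fallback";
    }

    public static void main(String[] args) {
        System.out.println(chooseHandler(PNG));
        System.out.println(chooseHandler(OLE2));
        System.out.println(chooseHandler(new byte[] {0x01, 0x02}));
    }
}
```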
      

        Attachments

        1. test.msg
          25 kB
          Olivier Masseau

            People

            • Assignee:
              Unassigned
              Reporter:
              maol Olivier Masseau