Description
When adding MSG files that have attachments, we systematically get this error:
ERROR (qtp1013423070-16) [ x:default] o.a.s.s.HttpSolrCall null:org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x0A1A0A0D474E5089, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:162)
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:112)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:302)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:111)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:103)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:129)
    at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:238)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:69)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:155)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2082)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:651)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:458)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:229)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:184)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:499)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Thread.java:745)
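The two signatures in the error message can be decoded to see what kind of file was actually handed to POI. The sketch below assumes the signature is read as a little-endian 64-bit value, which is consistent with the expected value matching the OLE2 magic bytes; the class and method names are hypothetical helpers, not part of POI or Tika.

SignatureDecoder.java:
```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class SignatureDecoder {
    /** Returns the 8 file bytes behind a signature that was read as a little-endian long (assumption). */
    static byte[] toFileBytes(long signature) {
        return ByteBuffer.allocate(Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] read = toFileBytes(0x0A1A0A0D474E5089L);     // value from "read ..."
        byte[] expected = toFileBytes(0xE11AB1A1E011CFD0L); // value from "expected ..."
        System.out.println("read:     " + hex(read));       // 89 50 4E 47 0D 0A 1A 0A
        System.out.println("expected: " + hex(expected));   // D0 CF 11 E0 A1 B1 1A E1
        // Bytes 1..3 of the read signature are printable ASCII.
        System.out.println(new String(read, 1, 3, StandardCharsets.US_ASCII)); // PNG
    }
}
```

The bytes that were read are the standard PNG file signature, so the attachment in this case was a PNG image that was passed to a parser expecting an OLE2 container.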
After inspecting the Solr code, the problem appears to come from the ExtractingDocumentLoader class, which sets the parser in the context like this:
context.set(Parser.class, parser);
In our case the parser is an instance of OfficeParser.
When processing an MSG file, the OutlookExtractor class is used by the OfficeParser.
To process the attachments of the MSG file, the OutlookExtractor calls the ParsingEmbeddedDocumentExtractor.
To parse an attachment, the ParsingEmbeddedDocumentExtractor uses the DelegatingParser.
The DelegatingParser determines the parser to use by just looking at the parser set in the context.
protected Parser getDelegateParser(ParseContext context) {
    return context.get(Parser.class, EmptyParser.INSTANCE);
}
So in our case every attachment is processed with the OfficeParser, even if the attachment is not an MS Office document!
To make this work correctly, an AutoDetectParser should be set in the context when working with MSG files:
context.set(Parser.class, new AutoDetectParser());
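The delegation chain and the proposed fix can be modelled without Tika on the classpath. The following is only a simplified sketch: Parser, Context, OfficeOnlyParser and DetectingParser are hypothetical stand-ins for Tika's Parser, ParseContext, OfficeParser and AutoDetectParser, and media-type strings stand in for real content detection.

DelegationSketch.java:
```java
import java.util.HashMap;
import java.util.Map;

public class DelegationSketch {
    /** Hypothetical stand-in for Tika's Parser: returns a label instead of extracting text. */
    interface Parser {
        String parse(String mediaType);
    }

    /** Fallback used when no parser is registered, similar in spirit to EmptyParser.INSTANCE. */
    static final Parser EMPTY = type -> "skipped";

    /** Hypothetical stand-in for ParseContext: a map keyed by class. */
    static class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Accepts only the Outlook type, like the OfficeParser in the report. */
    static class OfficeOnlyParser implements Parser {
        @Override
        public String parse(String mediaType) {
            if (!mediaType.equals("application/vnd.ms-outlook")) {
                throw new IllegalStateException("Not an OLE2 document: " + mediaType);
            }
            return "parsed as Outlook message";
        }
    }

    /** Chooses a parser per document type, which is the role AutoDetectParser plays. */
    static class DetectingParser implements Parser {
        private final Parser office = new OfficeOnlyParser();

        @Override
        public String parse(String mediaType) {
            if (mediaType.equals("application/vnd.ms-outlook")) {
                return office.parse(mediaType);
            }
            return "parsed as " + mediaType;
        }
    }

    /** Mirrors getDelegateParser: the attachment parser is whatever the context holds. */
    static String parseEmbedded(Context context, String attachmentType) {
        Parser delegate = context.get(Parser.class, EMPTY);
        return delegate.parse(attachmentType);
    }

    public static void main(String[] args) {
        Context broken = new Context();
        broken.set(Parser.class, new OfficeOnlyParser());
        try {
            parseEmbedded(broken, "image/png");
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage()); // failed: Not an OLE2 document: image/png
        }

        Context fixed = new Context();
        fixed.set(Parser.class, new DetectingParser());
        System.out.println(parseEmbedded(fixed, "image/png")); // parsed as image/png
    }
}
```

With the office-only parser in the context, the PNG attachment fails in the same way as in the stack trace; with a detecting parser in the context, the same attachment is routed by its type.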