Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.24.1
None
None
Description
I tried to extract content from 200MBFile.zip and got an exception: TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser
Code for reproducing:
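import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.poi.util.IOUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Wrapper class added so the snippet compiles; the name is illustrative only.
public class Reproducer {
    public static void main(String[] args) throws Exception {
        // Lift POI's byte array size limit so the large file can be read.
        IOUtils.setByteArrayMaxOverride(Integer.MAX_VALUE);
        // try-with-resources closes the stream even when parsing fails.
        try (FileInputStream fileInputStream = new FileInputStream(new File("200MBFile.doc"))) {
            String content = extractContent(fileInputStream);
            System.out.println(content);
        }
    }

    public static String extractContent(InputStream stream) throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        // -1 disables the handler's write limit.
        ContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        parser.parse(stream, handler, metadata, context);
        return handler.toString();
    }
}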
Actual result:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6b95f8e
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at com.apacheTikaService.TikaAnalysis.extractContentUsingParser(TikaAnalysis.java:28)
    at com.apacheTikaService.Servlets.ApacheTikaParserServlet.doPost(ApacheTikaParserServlet.java:22)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at org.eclipse.jetty.servlet.ServletHolder$NotAsync.service(ServletHolder.java:1418)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:763)
    at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1633)
    at org.eclipse.jetty.websocket.server.WebSocketUpgradeFilter.doFilter(WebSocketUpgradeFilter.java:228)
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1609)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:561)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:602)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1612)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1434)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1582)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1349)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
    at org.eclipse.jetty.server.Server.handle(Server.java:516)
    at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)
    at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:556)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:273)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
    at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:135)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:773)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:905)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
    at java.util.ArrayList.elementData(ArrayList.java:422)
    at java.util.ArrayList.get(ArrayList.java:435)
    at org.apache.poi.hwpf.usermodel.Range.binarySearchEnd(Range.java:913)
    at org.apache.poi.hwpf.usermodel.Range.findRange(Range.java:962)
    at org.apache.poi.hwpf.usermodel.Range.initCharacterRuns(Range.java:857)
    at org.apache.poi.hwpf.usermodel.Range.numCharacterRuns(Range.java:303)
    at org.apache.poi.hwpf.model.PicturesTable.getAllPictures(PicturesTable.java:226)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:733)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:723)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:175)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:176)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:132)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 44 more
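The TikaException in the trace only wraps the real failure, the ArrayIndexOutOfBoundsException raised inside POI's Range class. When triaging, it may help to log that innermost cause directly. The following is a minimal sketch using only the JDK; the class name RootCause and its method are hypothetical helpers, not part of Tika or POI.

```java
public final class RootCause {
    private RootCause() {}

    /** Follows getCause() links and returns the innermost throwable. */
    public static Throwable of(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for the wrapped exception seen in the report.
        Exception wrapped = new Exception(
                "Unexpected RuntimeException from parser",
                new ArrayIndexOutOfBoundsException("-1"));
        Throwable root = of(wrapped);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
        // prints "java.lang.ArrayIndexOutOfBoundsException: -1"
    }
}
```

In the reproducer, the call to parser.parse could be wrapped in a try/catch for TikaException and the caught exception passed to RootCause.of before rethrowing.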
Expected result:
The content is extracted successfully, without an exception.