  1. Tika
  2. TIKA-3214

Tika Fails to extract content from MS Word

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.24.1
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      Extracting content from 200MBFile.doc (see the attached 200MBFile.zip) fails with an exception: TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser

      Code for reproducing:
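          import java.io.FileInputStream;
          import java.io.IOException;
          import java.io.InputStream;
          import org.apache.poi.util.IOUtils; // POI's IOUtils, which provides setByteArrayMaxOverride
          import org.apache.tika.exception.TikaException;
          import org.apache.tika.metadata.Metadata;
          import org.apache.tika.parser.AutoDetectParser;
          import org.apache.tika.parser.ParseContext;
          import org.apache.tika.parser.Parser;
          import org.apache.tika.sax.BodyContentHandler;
          import org.xml.sax.ContentHandler;
          import org.xml.sax.SAXException;

          public class Reproducer { // class name is arbitrary
              public static void main(String[] args) throws Exception {
                  // Lift POI's cap on byte-array allocations before parsing.
                  IOUtils.setByteArrayMaxOverride(Integer.MAX_VALUE);
                  try (InputStream stream = new FileInputStream("200MBFile.doc")) {
                      System.out.println(extractContent(stream));
                  }
              }

              public static String extractContent(InputStream stream)
                      throws IOException, TikaException, SAXException {
                  Parser parser = new AutoDetectParser();
                  ContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
                  Metadata metadata = new Metadata();
                  ParseContext context = new ParseContext();
                  parser.parse(stream, handler, metadata, context);
                  return handler.toString();
              }
          }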
      
      Actual result:
      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6b95f8e
          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
          at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
          at com.apacheTikaService.TikaAnalysis.extractContentUsingParser(TikaAnalysis.java:28)
          at com.apacheTikaService.Servlets.ApacheTikaParserServlet.doPost(ApacheTikaParserServlet.java:22)
          at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
          at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
          at org.eclipse.jetty.servlet.ServletHolder$NotAsync.service(ServletHolder.java:1418)
          at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:763)
          at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1633)
          at org.eclipse.jetty.websocket.server.WebSocketUpgradeFilter.doFilter(WebSocketUpgradeFilter.java:228)
          at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
          at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1609)
          at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:561)
          at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
          at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:602)
          at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
          at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
          at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1612)
          at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
          at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1434)
          at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
          at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
          at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1582)
          at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
          at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1349)
          at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
          at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
          at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
          at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
          at org.eclipse.jetty.server.Server.handle(Server.java:516)
          at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)
          at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:556)
          at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)
          at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:273)
          at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
          at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
          at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
          at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
          at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
          at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
          at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:135)
          at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:773)
          at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:905)
          at java.lang.Thread.run(Thread.java:748)
      Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
          at java.util.ArrayList.elementData(ArrayList.java:422)
          at java.util.ArrayList.get(ArrayList.java:435)
          at org.apache.poi.hwpf.usermodel.Range.binarySearchEnd(Range.java:913)
          at org.apache.poi.hwpf.usermodel.Range.findRange(Range.java:962)
          at org.apache.poi.hwpf.usermodel.Range.initCharacterRuns(Range.java:857)
          at org.apache.poi.hwpf.usermodel.Range.numCharacterRuns(Range.java:303)
          at org.apache.poi.hwpf.model.PicturesTable.getAllPictures(PicturesTable.java:226)
          at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:733)
          at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:723)
          at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:175)
          at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:176)
          at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:132)
          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
          ... 44 more
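
      The TikaException above only wraps the real failure: the "Caused by" section shows an ArrayIndexOutOfBoundsException raised inside POI's HWPF Range code. When triaging a wrapped failure like this, it can help to log the innermost cause rather than the wrapper. The following is a minimal sketch of that idea using only the JDK; the class name RootCause and the stand-in exceptions in main are hypothetical and are not part of Tika or POI.

```java
public final class RootCause {
    private RootCause() {}

    /** Follows getCause() links and returns the innermost throwable. */
    public static Throwable of(Throwable t) {
        Throwable current = t;
        // getCause() returns null once the end of the chain is reached.
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins for the exceptions in the trace; no Tika or POI classes are needed here.
        Throwable inner = new ArrayIndexOutOfBoundsException("-1");
        Exception outer = new Exception("Unexpected RuntimeException from OfficeParser", inner);
        System.out.println(RootCause.of(outer));
    }
}
```

      In the reproducer, one possible use would be to catch the TikaException around parser.parse(...) and pass it to RootCause.of(...) before logging, so the POI frame appears first in the report.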
      
      Expected result:

      The content is extracted successfully.

      Related bug:

      https://bz.apache.org/bugzilla/show_bug.cgi?id=64853

      Attachments

        1. 200MBFile.zip
          881 kB
          Sergey Smolyakov

          People

            Assignee: Unassigned
            Reporter: Sergey Smolyakov (maslbl4)
            Votes: 0
            Watchers: 1
