Created attachment 28901 [details]
offending word doc

Out of bounds exception occurs (stack trace below) when parsing attached word 97 doc

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@393e6226
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:133)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:400)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 18
	at org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:163)
	at org.apache.poi.hwpf.model.Colorref.<init>(Colorref.java:81)
	at org.apache.poi.hwpf.model.types.SHDAbstractType.fillFields(SHDAbstractType.java:56)
	at org.apache.poi.hwpf.usermodel.ShadingDescriptor.<init>(ShadingDescriptor.java:38)
	at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.unCompressCHPOperation(CharacterSprmUncompressor.java:582)
	at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.uncompressCHP(CharacterSprmUncompressor.java:65)
	at org.apache.poi.hwpf.model.StyleSheet.createChp(StyleSheet.java:288)
	at org.apache.poi.hwpf.model.StyleSheet.<init>(StyleSheet.java:121)
	at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:346)
	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:77)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	... 5 more
We're running into this same issue with many of our DOC files. When will it be addressed? Thank you.
Created attachment 29349 [details] Blank DOC file that generates the same error in POI. This file contains no text (completely blank) and it still generates the POI exception: ArrayIndexOutOfBounds
Fixed in trunk. Our handling of sprmCShd80 (0x4866, operation 0x66) was incorrect: the operand was parsed as an Shd structure instead of an Shd80.
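For anyone wondering why that mix-up produces an ArrayIndexOutOfBoundsException, here is a minimal sketch. It is not the POI code: the class and method names (ShdOperandSketch, readShd80, readShd) are made up for illustration, and the field layouts are my reading of the Word binary format (Shd80 is one 16-bit word holding icoFore, icoBack and ipat; Shd is 10 bytes holding two 4-byte COLORREFs and a 2-byte pattern). Since sprm 0x4866 carries a 2-byte operand, reading a 10-byte Shd from it runs past the end of the buffer.

```java
import java.util.Arrays;

// Illustrative sketch only; these are NOT the POI classes or method names.
final class ShdOperandSketch {

    // Little-endian unsigned 16-bit read.
    static int getUShort(byte[] data, int offset) {
        return (data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8);
    }

    // Little-endian 32-bit read; throws if fewer than 4 bytes remain.
    static int getInt(byte[] data, int offset) {
        return (data[offset] & 0xFF)
                | ((data[offset + 1] & 0xFF) << 8)
                | ((data[offset + 2] & 0xFF) << 16)
                | ((data[offset + 3] & 0xFF) << 24);
    }

    // Shd80 (2 bytes): icoFore in bits 0-4, icoBack in bits 5-9, ipat in bits 10-15.
    static int[] readShd80(byte[] data, int offset) {
        int value = getUShort(data, offset);
        return new int[] { value & 0x1F, (value >> 5) & 0x1F, (value >> 10) & 0x3F };
    }

    // Shd (10 bytes): cvFore (4 bytes), cvBack (4 bytes), ipat (2 bytes).
    static int[] readShd(byte[] data, int offset) {
        return new int[] {
            getInt(data, offset), getInt(data, offset + 4), getUShort(data, offset + 8)
        };
    }

    public static void main(String[] args) {
        // sprm 0x4866 in little-endian order, followed by its 2-byte operand.
        byte[] grpprl = { 0x66, 0x48, 0x41, 0x04 };
        int operandOffset = 2;

        // Correct: decode the operand as Shd80.
        System.out.println(Arrays.toString(readShd80(grpprl, operandOffset)));

        // Incorrect: decode the same operand as Shd, which needs 10 bytes.
        try {
            readShd(grpprl, operandOffset);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Shd read overran the operand: " + e.getMessage());
        }
    }
}
```

The out-of-range index in the real traces (18, 7) depends on where the sprm sits in the style's data, which would explain why different files report different numbers.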
Thanks, Sergey, for fixing this :) Where can we download the latest build? Thanks again!
I believe it will be here https://builds.apache.org/job/POI/lastSuccessfulBuild/artifact/build/dist/ today (USA time?)
Thanks, Sergey. Your build with the fix will probably show up today; the one listed there right now is from 10 September.
(In reply to comment #6)
> Thanks, Sergey. Your build with the fix will probably show up today; the one
> listed there right now is from 10 September.

Hi guys, Looks like the build (46) failed. Any chance of getting one out today? :-)
It was an internal error in Jenkins:

FATAL: Cannot find executable from the choosen Ant installation "Ant (latest)"
Build step 'Invoke Ant' marked build as failure
[WARNINGS] Skipping publisher since build result is FAILURE
Archiving artifacts

Today's rebuild #47 was successful.

Yegor

(In reply to comment #7)
> (In reply to comment #6)
> > Thanks, Sergey. Your build with the fix will probably show up today; the one
> > listed there right now is from 10 September.
>
> Hi guys, Looks like the build (46) failed. Any chance of getting one out
> today? :-)
Thank you, Sergey and Yegor. The issue has been resolved with Build #47: https://builds.apache.org/job/POI/47/
Created attachment 29398 [details] build 47 fixed some but not all of the errors with old word 97 docs. Attached still throws an exception (array out of bounds)
Still some issues with old Word 97 docs
Tim, all three files are opened without exceptions. Please try again. Sergey
Thanks for the fix. The latest build (49) is broken right now: https://builds.apache.org/job/POI/49/
There was an additional problem with the 3rd document provided by Tim. It was caused by a broken internal structure of the list information in the document (i.e. the document was not well-formed). Today I refactored list processing and added a "safe path" for extracting text (HTML, FO) from such documents. All HWPF tests pass, so we need to wait for the next build :)
Created attachment 29416 [details] Bug persists with Word DOC files and latest build (50) The ArrayIndexOutOfBounds bug persists with the latest build (#50) of POI. Please test using the attached blank_2.doc Word DOC file to reproduce.
acougarm, the current code doesn't throw any errors on simple file parsing or text extraction. Could you please attach the stack trace?
Thanks, Sergey. We downloaded the latest build from here: https://builds.apache.org/job/POI/50/artifact/build/dist/poi-bin-3.9-beta1-20120924.tar.gz

Here is the stack trace from a Curl command against Solr, using the above build files:

curl "http://localhost:8983/solr/update/extract?extractOnly=true&fmap.content=text" -F "myfile=@blank_2.doc"

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">500</int><int name="QTime">356</int></lst><lst name="error"><str name="msg">org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2c164804</str><str name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2c164804
	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:230)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
	at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1656)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:454)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:275)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
	at org.eclipse.jetty.server.Server.handle(Server.java:351)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
	at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
	at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
	at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
	at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2c164804
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:224)
	... 31 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 7
	at org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:163)
	at org.apache.poi.hwpf.model.Colorref.<init>(Colorref.java:81)
	at org.apache.poi.hwpf.model.types.SHDAbstractType.fillFields(SHDAbstractType.java:56)
	at org.apache.poi.hwpf.usermodel.ShadingDescriptor.<init>(ShadingDescriptor.java:38)
	at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.unCompressCHPOperation(CharacterSprmUncompressor.java:582)
	at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.uncompressCHP(CharacterSprmUncompressor.java:65)
	at org.apache.poi.hwpf.model.StyleSheet.createChp(StyleSheet.java:288)
	at org.apache.poi.hwpf.model.StyleSheet.<init>(StyleSheet.java:121)
	at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:346)
	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:77)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	... 34 more
</str><int name="code">500</int></lst>
</response>
acougarm, that stack trace is from an old version. Current SVN has no such code at line CharacterSprmUncompressor.java:582, nor any call to ShadingDescriptor.<init> from CharacterSprmUncompressor::unCompressCHPOperation().
Sorry about that, Sergey! Please attribute this to operator error :) I hadn't replaced all the old POI files, so some of the previous build's files were still lingering around. Once I deleted those, everything worked beautifully! Thanks again for your patience.
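In case it helps anyone else who hits stale jars: a small sketch of how one might check which jar a class is actually loaded from. The WhichJar class and its locationOf helper are hypothetical names of my own; the only real APIs used are the standard Class.getProtectionDomain() and CodeSource.getLocation(). It assumes the class you ask about (by default org.apache.poi.hwpf.HWPFDocument) is on the classpath, otherwise Class.forName throws ClassNotFoundException.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical diagnostic helper; not part of POI, Tika, or Solr.
final class WhichJar {

    // Returns the jar or directory a class was loaded from,
    // or null when the JVM does not record one (e.g. core JDK classes).
    static URL locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? null : source.getLocation();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Class name to look up; defaults to the HWPF entry point seen in the traces.
        String name = args.length > 0 ? args[0] : "org.apache.poi.hwpf.HWPFDocument";
        Class<?> type = Class.forName(name);
        System.out.println(name + " loaded from " + locationOf(type));
    }
}
```

Run with the same classpath as the application (here, the Solr webapp's lib directory would be the assumption). If the printed location points at an older POI jar, that jar is shadowing the new build.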