When trying to extract text from a visio file I encounter a java.lang.NegativeArraySizeException which is caused by a negative length of a ChunkHeader. That seems to be caused by casting the result of LittleEndian.getUInt(byte[],int) to int.

Stacktrace:
java.lang.NegativeArraySizeException
    at org.apache.poi.hdgf.chunks.ChunkFactory.createChunk(ChunkFactory.java:155)
    at org.apache.poi.hdgf.streams.ChunkStream.findChunks(ChunkStream.java:54)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:92)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:99)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:99)
    at org.apache.poi.hdgf.HDGFDiagram.<init>(HDGFDiagram.java:89)
    at org.apache.poi.hdgf.extractor.VisioTextExtractor.<init>(VisioTextExtractor.java:44)
    at org.apache.poi.hdgf.extractor.VisioTextExtractor.<init>(VisioTextExtractor.java:48)
    at org.raikoeckstein.search.indexer.handlingtypes.visio.POIVisioHandler.getDocument(POIVisioHandler.java:35)
    at test.org.forflow.search.indexer.handlingtypes.POIVisioHandlerTest.testGetDocument(POIVisioHandlerTest.java:55)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.junit.internal.runners.TestMethodRunner.executeMethodBody(TestMethodRunner.java:99)
    at org.junit.internal.runners.TestMethodRunner.runUnprotected(TestMethodRunner.java:81)
    at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
    at org.junit.internal.runners.TestMethodRunner.runMethod(TestMethodRunner.java:75)
    at org.junit.internal.runners.TestMethodRunner.run(TestMethodRunner.java:45)
    at org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod(TestClassMethodsRunner.java:71)
    at org.junit.internal.runners.TestClassMethodsRunner.run(TestClassMethodsRunner.java:35)
    at org.junit.internal.runners.TestClassRunner$1.runUnprotected(TestClassRunner.java:42)
    at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
    at org.junit.internal.runners.TestClassRunner.run(TestClassRunner.java:52)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:38)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
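For anyone following along, the suspected mechanism can be reproduced without POI. The sketch below is only an illustration: readUInt32LE is a hypothetical stand-in for LittleEndian.getUInt(byte[],int), not POI's actual code, and the four header bytes are made up. It shows how an unsigned 32-bit value above Integer.MAX_VALUE turns negative when narrowed to int, and what happens when that value is used as an array size.

```java
// Minimal sketch, not POI code: readUInt32LE is a hypothetical stand-in
// for LittleEndian.getUInt(byte[], int).
final class UIntCastDemo {

    /** Reads 4 bytes, little-endian, as an unsigned 32-bit value held in a long. */
    static long readUInt32LE(byte[] data, int offset) {
        return (data[offset] & 0xFFL)
             | (data[offset + 1] & 0xFFL) << 8
             | (data[offset + 2] & 0xFFL) << 16
             | (data[offset + 3] & 0xFFL) << 24;
    }

    public static void main(String[] args) {
        // Made-up header bytes whose top bit is set (0xFFFFFFF0 little-endian).
        byte[] header = {(byte) 0xF0, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF};

        long unsigned = readUInt32LE(header, 0);
        int narrowed = (int) unsigned; // narrowing keeps only the low 32 bits

        System.out.println(unsigned); // prints 4294967280
        System.out.println(narrowed); // prints -16

        try {
            byte[] contents = new byte[narrowed];
            System.out.println("allocated " + contents.length + " bytes");
        } catch (NegativeArraySizeException e) {
            System.out.println("NegativeArraySizeException, as in the report");
        }
    }
}
```

Any length field with the high bit set behaves this way, so a negative length here most likely means the bytes being read are not really a length field at all.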
Can you upload the problem file? That way we can check to see if there are any other issues, and we'll have a testcase to ensure we don't break this again once it's fixed :)
Created attachment 21022 [details] Problematic visio file
Java arrays need to be indexed by an int, not a long, so if the chunk length really was that large we'd be stuffed anyway. Looking at the header for that chunk, all the values look implausibly large. I'm not sure if the problem is that we're decompressing the stream incorrectly (it's in a compressed stream), or if we're getting the size of a previous chunk wrong (so we wind on the wrong amount to get to this chunk). It's going to need some more investigating, probably comparing lots of things with vsdump, but that'll have to happen another time :/
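Whichever of the two explanations turns out to be right, a plausibility check on the declared length would at least flag the problem where it starts. The following is a rough sketch under assumptions: the class and method names are hypothetical and nothing here is taken from ChunkFactory. It only checks that a declared chunk length fits in the bytes left in the stream.

```java
// Hypothetical helper, not part of POI: checks whether a declared chunk
// length can possibly fit in what is left of the stream.
final class ChunkBoundsCheck {

    /**
     * @param declaredLength length read from the chunk header, kept as a long
     * @param offset         position in the stream where the chunk contents start
     * @param streamLength   total number of bytes in the (decompressed) stream
     */
    static boolean fitsInStream(long declaredLength, int offset, int streamLength) {
        if (offset < 0 || offset > streamLength) {
            return false; // we are already pointing outside the stream
        }
        long remaining = (long) streamLength - offset;
        return declaredLength >= 0 && declaredLength <= remaining;
    }
}
```

A length such as 4294967280 in a stream of a few kilobytes fails this check immediately, which would suggest that the offset is wrong (or the decompression is), rather than that the chunk is really that big.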
*** Bug 44596 has been marked as a duplicate of this bug. ***
Created attachment 21769 [details] The attachment that
Created attachment 21770 [details] The attachment that's causing this error
This should now be fixed
Still throws an exception.
I still get this bug with POI 3.1 and 3.2.
Same problem in 3.5-beta7-20090630
Created attachment 23910 [details] file that causes exception
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@15ee671
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:85)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:116)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:57)
Caused by: java.lang.IllegalArgumentException: Found a chunk with a negative length, which isn't allowed
    at org.apache.poi.hdgf.chunks.ChunkFactory.createChunk(ChunkFactory.java:120)
    at org.apache.poi.hdgf.streams.ChunkStream.findChunks(ChunkStream.java:59)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:93)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:100)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:100)
    at org.apache.poi.hdgf.HDGFDiagram.<init>(HDGFDiagram.java:98)
    at org.apache.poi.hdgf.extractor.VisioTextExtractor.<init>(VisioTextExtractor.java:52)
    at org.apache.poi.hdgf.extractor.VisioTextExtractor.<init>(VisioTextExtractor.java:49)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:132)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
I also still see this exact exception on my 4 .vsd files I put through 3.5 beta 6.

java.lang.IllegalArgumentException: Found a chunk with a negative length, which isn't allowed
    at org.apache.poi.hdgf.chunks.ChunkFactory.createChunk(ChunkFactory.java:120)
    at org.apache.poi.hdgf.streams.ChunkStream.findChunks(ChunkStream.java:59)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:93)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:100)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:100)
    at org.apache.poi.hdgf.HDGFDiagram.<init>(HDGFDiagram.java:98)
    at org.apache.poi.hdgf.HDGFDiagram.<init>(HDGFDiagram.java:59)

and something slightly different:

Needed 19 bytes to create the next chunk header, but only found 4 bytes, ignoring rest of data
Needed 19 bytes to create the next chunk header, but only found 4 bytes, ignoring rest of data
Needed 19 bytes to create the next chunk header, but only found 4 bytes, ignoring rest of data
Needed 19 bytes to create the next chunk header, but only found 4 bytes, ignoring rest of data
java.lang.IllegalArgumentException: Found a chunk with a negative length, which isn't allowed
    at org.apache.poi.hdgf.chunks.ChunkFactory.createChunk(ChunkFactory.java:120)
    at org.apache.poi.hdgf.streams.ChunkStream.findChunks(ChunkStream.java:59)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:93)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:100)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:100)
    at org.apache.poi.hdgf.HDGFDiagram.<init>(HDGFDiagram.java:98)
    at org.apache.poi.hdgf.HDGFDiagram.<init>(HDGFDiagram.java:59)
That's a bugger. Speaking of POI 3.5 beta 6, did it pass all the other tests? I think the last one I tried was before that one. Have been waiting for them to produce one which doesn't fail our existing tests so that we can at least upgrade to the beta and pass a few more which have been shelved for new features which are only in 3.5.
Ack. That comment wasn't supposed to be here... sorry about that. (and about this redundant one too.)
Is there an update on this issue? Thanks.
I just tested this and can verify that the problem still exists in POI 3.6-FINAL. The original NegativeArraySizeException is just replaced with an IllegalArgumentException.
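To illustrate why swapping the exception type does not help callers, here is a small sketch of what such a guard might look like. This is an assumption about the shape of the check, not the actual ChunkFactory code; only the exception message is copied from the traces above, and the class and method names are hypothetical.

```java
// Hypothetical sketch of a length guard; not the actual ChunkFactory code.
final class ChunkLengthGuard {

    /** Narrows an unsigned 32-bit length to int, rejecting values that go negative. */
    static int checkedLength(long rawLength) {
        int length = (int) rawLength;
        if (length < 0) {
            // Message copied from the stack traces reported in this bug.
            throw new IllegalArgumentException(
                    "Found a chunk with a negative length, which isn't allowed");
        }
        return length;
    }

    public static void main(String[] args) {
        System.out.println(checkedLength(19L)); // prints 19
        try {
            checkedLength(4294967280L);
        } catch (IllegalArgumentException e) {
            // The bad header is now reported more clearly, but parsing still aborts.
            System.out.println(e.getMessage());
        }
    }
}
```

The guard makes the failure easier to read, but the file still cannot be parsed, because the header being read was wrong before the length was ever checked.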
Created attachment 24895 [details] Proposed patch

I inspected the troublesome files and the byte patterns in the chunk streams seem to indicate that the parsing logic is not correctly detecting some separator bytes. The proposed patch adds the missing logic for all the misdiagnosed entries in the attached example files, though it seems likely that there are also other cases out there where the current logic would fail. Without better information about the semantics of the chunk header fields it's hard to do anything better. With this patch all the attached files get parsed without problems.

The patch also contains a change to the chunks_parse_cmds.tbl file for avoiding incorrect parsing of a chunk in attachment 21770 [details]. The entry that I commented out seemed vague in the first place, so I don't believe this change will cause (m)any regressions.
Thanks for investigating this in detail, Jukka. I've applied your patch for the v11 chunk header. As vsdump didn't have an issue with the short string on type 45 / format 52, I decided to just have a string length chunk, and treat those cases as an empty string. The result is that we can extract text without error from the files! :)
*** Bug 44781 has been marked as a duplicate of this bug. ***