Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
When trying to detect the content type of a zip file with Tika 1.10 like this:
byte[] content = ... // whole zip file
String name = "TR_01.ZIP";
Tika tika = new Tika();
return tika.detect(content, name);
it throws an exception:
java.lang.ArrayIndexOutOfBoundsException: 13
    at org.apache.commons.compress.archivers.zip.X7875_NewUnix.parseFromLocalFileData(X7875_NewUnix.java:199)
    at org.apache.commons.compress.archivers.zip.X7875_NewUnix.parseFromCentralDirectoryData(X7875_NewUnix.java:220)
    at org.apache.commons.compress.archivers.zip.ExtraFieldUtils.parse(ExtraFieldUtils.java:174)
    at org.apache.commons.compress.archivers.zip.ZipArchiveEntry.setCentralDirectoryExtra(ZipArchiveEntry.java:476)
    at org.apache.commons.compress.archivers.zip.ZipFile.readCentralDirectoryEntry(ZipFile.java:575)
    at org.apache.commons.compress.archivers.zip.ZipFile.populateFromCentralDirectory(ZipFile.java:492)
    at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:216)
    at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:192)
    at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:153)
    at org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:141)
    at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
    at org.apache.tika.Tika.detect(Tika.java:155)
    at org.apache.tika.Tika.detect(Tika.java:183)
    at org.apache.tika.Tika.detect(Tika.java:223)
The zip file contains two .jpg images and is not a "special" (JAR, OpenOffice, ...) zip file.
Unfortunately, the contents of the zip file are confidential, so I cannot attach it to this ticket. I can, however, provide the parameters passed to
org.apache.commons.compress.archivers.zip.X7875_NewUnix.parseFromLocalFileData (X7875_NewUnix.java:199), as captured in the debugger:
data = {byte[13]@2103}
0 = 85
1 = 84
2 = 5
3 = 0
4 = 7
5 = -112
6 = -108
7 = 51
8 = 85
9 = 117
10 = 120
11 = 0
12 = 0
offset = 13
length = 0
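It seems the method tries to read more bytes than are actually available in the buffer: offset (13) already equals the array length and length is 0, yet the method still reads data[offset], which matches the index reported in the exception.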
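To make the debugger dump easier to read, here is a minimal sketch that splits the 13 bytes above into extra-field records, assuming the usual ZIP extra-field layout (2-byte little-endian header ID, 2-byte little-endian data size, then the data). The class `ExtraFieldDump` and its helpers `u16` and `split` are hypothetical names made up for this illustration; they are not part of Tika or Commons Compress.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for this ticket only; not part of Tika or Commons Compress.
class ExtraFieldDump {

    // Read an unsigned little-endian 16-bit value starting at off.
    static int u16(byte[] b, int off) {
        return (b[off] & 0xFF) | ((b[off + 1] & 0xFF) << 8);
    }

    // Split an extra-field buffer into records of {headerId, dataSize, dataOffset}.
    // Assumes each record is: 2-byte ID, 2-byte size, then 'size' bytes of data.
    static List<int[]> split(byte[] extra) {
        List<int[]> fields = new ArrayList<>();
        int pos = 0;
        while (pos + 4 <= extra.length) {
            int id = u16(extra, pos);
            int size = u16(extra, pos + 2);
            fields.add(new int[] {id, size, pos + 4});
            pos += 4 + size; // skip the header and the declared data
        }
        return fields;
    }

    public static void main(String[] args) {
        // The bytes reported by the debugger in this ticket.
        byte[] data = {85, 84, 5, 0, 7, -112, -108, 51, 85, 117, 120, 0, 0};
        for (int[] f : split(data)) {
            System.out.printf("id=0x%04X size=%d dataOffset=%d%n", f[0], f[1], f[2]);
        }
    }
}
```

Run on the bytes above, this reports a 0x5455 record (extended timestamp) with 5 data bytes, followed by a 0x7875 record (the field handled by X7875_NewUnix) with 0 data bytes whose data would start at offset 13. That is consistent with the offset = 13, length = 0 values seen in the debugger: the 0x7875 header is present, but it carries no payload for the parser to read.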
Note that both 7zip and unzip extract the file without even a warning, so it does not appear to be corrupted.
Issue Links
- relates to: TIKA-1949 Upgrade to Commons Compress 1.11 (Resolved)