Tika / TIKA-1717

Tika throws exception on detecting content-type of a zip file

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      When detecting the content type of a zip file with Tika 1.10 like this:

              byte[] content = ... // whole zip file.
              String name = "TR_01.ZIP";
              Tika tika = new Tika();
              return tika.detect(content, name);
      

      it throws an exception:

      java.lang.ArrayIndexOutOfBoundsException: 13
      	at org.apache.commons.compress.archivers.zip.X7875_NewUnix.parseFromLocalFileData(X7875_NewUnix.java:199)
      	at org.apache.commons.compress.archivers.zip.X7875_NewUnix.parseFromCentralDirectoryData(X7875_NewUnix.java:220)
      	at org.apache.commons.compress.archivers.zip.ExtraFieldUtils.parse(ExtraFieldUtils.java:174)
      	at org.apache.commons.compress.archivers.zip.ZipArchiveEntry.setCentralDirectoryExtra(ZipArchiveEntry.java:476)
      	at org.apache.commons.compress.archivers.zip.ZipFile.readCentralDirectoryEntry(ZipFile.java:575)
      	at org.apache.commons.compress.archivers.zip.ZipFile.populateFromCentralDirectory(ZipFile.java:492)
      	at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:216)
      	at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:192)
      	at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:153)
      	at org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:141)
      	at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
      	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
      	at org.apache.tika.Tika.detect(Tika.java:155)
      	at org.apache.tika.Tika.detect(Tika.java:183)
      	at org.apache.tika.Tika.detect(Tika.java:223)
      

      The zip file contains two .jpg images and is not a "special" (JAR, OpenOffice, ...) zip file.

      Unfortunately, the contents of the zip file are confidential, so I cannot attach it to this ticket as it is. I can, however, provide the parameters passed to
      org.apache.commons.compress.archivers.zip.X7875_NewUnix.parseFromLocalFileData(X7875_NewUnix.java:199), as captured in the debugger:

      data = {byte[13]@2103}
       0 = 85
       1 = 84
       2 = 5
       3 = 0
       4 = 7
       5 = -112
       6 = -108
       7 = 51
       8 = 85
       9 = 117
       10 = 120
       11 = 0
       12 = 0
      offset = 13
      length = 0
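
      For reference, the 13 bytes above can be read as a sequence of ZIP extra-field records, each made of a 2-byte little-endian header ID, a 2-byte little-endian data size, and then the data. The sketch below is only an illustration written for this ticket (the class and method names are hypothetical and it is not the commons-compress code). It walks the dumped bytes and suggests that they hold a 0x5455 (extended timestamp) record followed by a 0x7875 (new Unix) record whose data size is 0, which would explain offset = 13 and length = 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika or commons-compress.
final class ExtraFieldWalk {
    // The 13 bytes from the debugger dump in this report.
    static final byte[] DUMP = {85, 84, 5, 0, 7, -112, -108, 51, 85, 117, 120, 0, 0};

    /** Reads a little-endian unsigned 16-bit value starting at index i. */
    static int u16(byte[] b, int i) {
        return (b[i] & 0xFF) | ((b[i + 1] & 0xFF) << 8);
    }

    /** Returns one {headerId, dataSize, dataOffset} triple per extra-field record. */
    static List<int[]> walk(byte[] extra) {
        List<int[]> records = new ArrayList<>();
        int pos = 0;
        // Each record needs at least 4 bytes for its header ID and size.
        while (pos + 4 <= extra.length) {
            int id = u16(extra, pos);
            int size = u16(extra, pos + 2);
            records.add(new int[] {id, size, pos + 4});
            pos += 4 + size;
        }
        return records;
    }

    public static void main(String[] args) {
        for (int[] r : walk(DUMP)) {
            System.out.printf("id=0x%04x size=%d dataOffset=%d%n", r[0], r[1], r[2]);
        }
    }
}
```

      Run against the dump, this prints `id=0x5455 size=5 dataOffset=4` and `id=0x7875 size=0 dataOffset=13`. The second record's data starts exactly at the end of the array.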
      

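      It seems the method tries to read more bytes than are actually available in the buffer: with offset = 13 and length = 0 on a 13-byte array, any read at data[offset] is out of bounds, which matches the index 13 in the exception.
      Note that 7zip and unzip can both extract the file without even a warning, so the file does not appear to be corrupted.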
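
      To illustrate the suspected failure mode, here is a minimal sketch with the same parameters as in the debugger (a 13-byte array, offset 13, length 0). Both methods are hypothetical stand-ins and not the actual X7875_NewUnix code. One reads the first byte unconditionally, the other checks the length first, which is the kind of guard that would presumably avoid the exception.

```java
// Hypothetical stand-in for illustration; NOT the commons-compress implementation.
final class ZeroLengthGuard {
    /** Reads the first byte of a field without any bounds check. */
    static int unguardedFirstByte(byte[] data, int offset, int length) {
        return data[offset] & 0xFF; // throws when offset == data.length
    }

    /** Returns -1 for an empty or out-of-range field instead of reading past the buffer. */
    static int guardedFirstByte(byte[] data, int offset, int length) {
        if (length <= 0 || offset < 0 || offset >= data.length) {
            return -1; // nothing to parse
        }
        return data[offset] & 0xFF;
    }

    public static void main(String[] args) {
        byte[] data = new byte[13]; // same size as the buffer in this report
        System.out.println("guarded: " + guardedFirstByte(data, 13, 0));
        try {
            unguardedFirstByte(data, 13, 0);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unguarded read failed: " + e.getMessage());
        }
    }
}
```

      Whether an empty 0x7875 field should be treated as "no data" or rejected is a decision for the parser; the sketch only shows that a length check is enough to avoid the out-of-bounds read.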

              People

              • Assignee:
                Unassigned
                Reporter:
                mpl Martin Petricek
              • Votes:
                0
                Watchers:
                3
