Tika / TIKA-2591

Some TIFFs (big-endian with fax compression) are detected as x-tar


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.16
    • Fix Version/s: 1.18, 2.0.0
    • Component/s: core
    • Environment: Tika, running in a Java application and a unit test (Windows and Mac environments)

    Description

      I have found that a certain TIFF that we manage is now reported by Tika as application/x-tar, where it was previously reported as a TIFF (image/tiff).

      Consider this code from the detect method of ArchiveStreamFactory (Apache Commons Compress):

        // COMPRESS-117 - improve auto-recognition
        if (signatureLength >= TAR_HEADER_SIZE) {
            TarArchiveInputStream tais = null;
            try {
                tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
                // COMPRESS-191 - verify the header checksum
                if (tais.getNextTarEntry().isCheckSumOK()) {
                    return TAR;
                }
            } catch (final Exception e) { // NOPMD // NOSONAR
                // can generate IllegalArgumentException as well
                // as IOException
                // autodetection, simply not a TAR
                // ignored
            } finally {
                IOUtils.closeQuietly(tais);
            }
        }

      What I find is that most TIFFs fail with an exception when they reach tais.getNextTarEntry() (i.e. they fall into the "simply not a TAR" case). However, this TIFF does NOT fail there. This somewhat makes sense, as the internal structure of a fax-compressed TIFF has a tar-like structure.
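      To illustrate the kind of check that isCheckSumOK() implies, here is a minimal sketch of the ustar header checksum rule: the eight-byte checksum field at offset 148 of the 512-byte header stores, in octal, the sum of all header bytes with that field counted as spaces. This is not the Commons Compress implementation; the class name TarChecksumSketch and its helper methods are made up for illustration. It shows that any 512-byte prefix whose bytes at offsets 148-155 happen to parse to the matching sum would pass such a check, which may or may not be what happens with the TIFF in question.

```java
import java.nio.charset.StandardCharsets;

public class TarChecksumSketch {
    static final int HEADER_SIZE = 512;
    static final int CHKSUM_OFFSET = 148;
    static final int CHKSUM_LENGTH = 8;

    /** Sums all header bytes as unsigned values, counting the checksum field as spaces. */
    public static long computeChecksum(byte[] header) {
        long sum = 0;
        for (int i = 0; i < HEADER_SIZE; i++) {
            boolean inChecksumField = i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LENGTH;
            sum += inChecksumField ? ' ' : (header[i] & 0xFF);
        }
        return sum;
    }

    /** Parses the stored octal checksum; returns -1 if the field holds no octal digits. */
    public static long storedChecksum(byte[] header) {
        int i = CHKSUM_OFFSET;
        int end = CHKSUM_OFFSET + CHKSUM_LENGTH;
        while (i < end && header[i] == ' ') {
            i++; // skip leading padding
        }
        int start = i;
        long value = 0;
        while (i < end && header[i] >= '0' && header[i] <= '7') {
            value = value * 8 + (header[i] - '0');
            i++;
        }
        return i == start ? -1 : value;
    }

    public static boolean isChecksumOk(byte[] header) {
        return header.length >= HEADER_SIZE && storedChecksum(header) == computeChecksum(header);
    }

    /** Builds a bare-bones header (name and checksum only) for demonstration purposes. */
    public static byte[] buildHeader(String name) {
        byte[] header = new byte[HEADER_SIZE];
        byte[] nameBytes = name.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(nameBytes, 0, header, 0, Math.min(nameBytes.length, 100));
        // Six octal digits, then NUL and space, as commonly written by tar tools.
        byte[] digits = String.format("%06o", computeChecksum(header))
                .getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(digits, 0, header, CHKSUM_OFFSET, digits.length);
        header[CHKSUM_OFFSET + 6] = 0;
        header[CHKSUM_OFFSET + 7] = ' ';
        return header;
    }

    public static void main(String[] args) {
        byte[] tarLike = buildHeader("a.txt");
        byte[] tiffLike = new byte[HEADER_SIZE];
        tiffLike[0] = 'M'; tiffLike[1] = 'M'; tiffLike[2] = 0; tiffLike[3] = 42; // big-endian TIFF magic
        System.out.println("tar-like header passes:  " + isChecksumOk(tarLike));
        System.out.println("zero-padded TIFF passes: " + isChecksumOk(tiffLike));
    }
}
```

      A zero-padded buffer fails because its checksum field contains no octal digits. A real file would have to carry a plausible octal number at exactly that offset to be mistaken for a tar header under this rule.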

      Note that the CompositeDetector class does eventually see it recognized as a proper TIFF while looping through its detectors in its detect method. The TIFF is detected by the MimeTypes class, which is one of the implementations of the Detector interface.

       

          public MediaType detect(InputStream input, Metadata metadata)
                  throws IOException {
              MediaType type = MediaType.OCTET_STREAM;
              for (Detector detector : getDetectors()) {
                  //short circuit via OverrideDetector
                  //can't rely on ordering because subsequent detector may
                  //change Override's to a specialization of Override's
                  if (detector instanceof OverrideDetector &&
                          metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
                      return detector.detect(input, metadata);
                  }
                  MediaType detected = detector.detect(input, metadata);
                  if (registry.isSpecializationOf(detected, type)) {
                      type = detected;
                  }
              }
              return type;
          }

      However, since image/tiff isn't a specialization of application/x-tar, the loop does not replace the type with image/tiff.

      My fix was to add a <sub-class-of type="application/x-tar"/> element to the definition of image/tiff in the tika-mimetypes.xml file.
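      The effect of the loop and of the sub-class-of workaround can be reproduced with a small stand-alone model. The sketch below does not use Tika classes: SpecializationSketch, its string-based type names, and its parent map are hypothetical stand-ins for the media type registry, and it assumes that every type without a declared parent descends from application/octet-stream.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SpecializationSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    // Declared parent types, playing the role of sub-class-of entries.
    private final Map<String, String> parents = new HashMap<>();

    public void addSuperType(String type, String parent) {
        parents.put(type, parent);
    }

    /** Assumption: every type without a declared parent descends from octet-stream. */
    String superTypeOf(String type) {
        if (type.equals(OCTET_STREAM)) {
            return null;
        }
        return parents.getOrDefault(type, OCTET_STREAM);
    }

    /** True if b is a strict ancestor of a. */
    public boolean isSpecializationOf(String a, String b) {
        for (String t = superTypeOf(a); t != null; t = superTypeOf(t)) {
            if (t.equals(b)) {
                return true;
            }
        }
        return false;
    }

    /** Mirrors the shape of the loop quoted above: keep a result only if it refines the current type. */
    public String detect(List<String> detectorResults) {
        String type = OCTET_STREAM;
        for (String detected : detectorResults) {
            if (isSpecializationOf(detected, type)) {
                type = detected;
            }
        }
        return type;
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList("application/x-tar", "image/tiff");

        SpecializationSketch before = new SpecializationSketch();
        System.out.println("without sub-class-of: " + before.detect(results));

        SpecializationSketch after = new SpecializationSketch();
        after.addSuperType("image/tiff", "application/x-tar");
        System.out.println("with sub-class-of:    " + after.detect(results));
    }
}
```

      In this model the outcome depends on detector order: if the TIFF result arrived first, the later tar result would not displace it, because application/x-tar is not a specialization of image/tiff either.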

       

       

       


          People

            Assignee: Unassigned
            Reporter: daniel schmidt (schmiddc)
            Votes: 0
            Watchers: 6


              Time Tracking

                Estimated: 24h
                Remaining: 24h
                Logged: Not Specified
