Tika / TIKA-4220

Commons-compress too lenient on headless tar detection

Details

    • Type: Task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.0, 2.9.3
    • Component/s: None
    • Labels: None

    Description

      In recent regression tests for TIKA-4218, we noticed a fairly major change: a higher rate of false positives from the headless tar detection in commons-compress.

      For now, I think we should fork (copy/paste) the headless tar detection and then improve it, revert it, or possibly remove it for our 2.9.2 release.

      On this ticket, I'll look into what changed recently in headless tar detection in commons-compress and experiment with fixing it.

      One challenge is that our magic-byte detection runs after our custom detectors, which means that we can't assign a low confidence to what comes out of our custom detectors and let the magic detection override it. We could implement an x-tar special case, but I really don't like that.

      Let's see what we can do...
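
      As one illustration of what a stricter headless tar check could look like, here is a minimal sketch in plain Java. It is not the commons-compress or Tika implementation; the class and method names (HeadlessTarCheck, looksLikeTarHeader, buildHeader) are hypothetical. It assumes the standard 512-byte tar header layout, with the 8-byte octal checksum field at offset 148, and only accepts a buffer when the stored checksum parses strictly and matches the recomputed one.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not Tika or commons-compress code. */
final class HeadlessTarCheck {
    private static final int HEADER_LEN = 512;
    private static final int CHKSUM_OFFSET = 148;
    private static final int CHKSUM_LEN = 8;

    /** True only if the first 512 bytes pass a few tar header sanity checks. */
    static boolean looksLikeTarHeader(byte[] buf) {
        if (buf == null || buf.length < HEADER_LEN) {
            return false;
        }
        // The entry name starts at offset 0 and must not be empty.
        if (buf[0] == 0) {
            return false;
        }
        long stored = parseOctal(buf, CHKSUM_OFFSET, CHKSUM_LEN);
        return stored >= 0 && stored == computeChecksum(buf);
    }

    /** Unsigned sum of the header bytes, counting the checksum field as spaces. */
    static long computeChecksum(byte[] buf) {
        long sum = 0;
        for (int i = 0; i < HEADER_LEN; i++) {
            boolean inChecksumField = i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LEN;
            sum += inChecksumField ? ' ' : (buf[i] & 0xFF);
        }
        return sum;
    }

    /**
     * Strict octal parse: optional leading spaces, at least one octal digit,
     * then only NUL or space padding. Returns -1 if the field is malformed.
     */
    static long parseOctal(byte[] buf, int offset, int len) {
        int i = offset;
        int end = offset + len;
        while (i < end && buf[i] == ' ') {
            i++;
        }
        long value = 0;
        int digits = 0;
        while (i < end && buf[i] >= '0' && buf[i] <= '7') {
            value = (value << 3) + (buf[i] - '0');
            digits++;
            i++;
        }
        if (digits == 0) {
            return -1;
        }
        for (; i < end; i++) {
            if (buf[i] != 0 && buf[i] != ' ') {
                return -1;
            }
        }
        return value;
    }

    /** Builds a minimal header with a valid checksum, for demonstration only. */
    static byte[] buildHeader(String name) {
        byte[] h = new byte[HEADER_LEN];
        byte[] n = name.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(n, 0, h, 0, Math.min(n.length, 100));
        h[156] = '0'; // typeflag: regular file
        byte[] c = String.format("%06o", computeChecksum(h)).getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(c, 0, h, CHKSUM_OFFSET, c.length);
        h[CHKSUM_OFFSET + 6] = 0;
        h[CHKSUM_OFFSET + 7] = ' ';
        return h;
    }
}
```

      A checksum match alone can still produce false positives on small or sparse binary files, so a real fix would likely combine it with further field checks; which checks are appropriate is exactly what this ticket sets out to investigate.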

      The counts below are the number of files identified as A (in Tika 2.9.1) -> B (in tika-2.9.2-pre-rc1).

      application/octet-stream -> application/x-tar 826
      multipart/appledouble -> application/x-tar 701
      image/x-tga -> application/x-tar 322
      image/vnd.microsoft.icon -> application/x-tar 312
      application/vnd.iccprofile -> application/x-tar 221
      video/mp4 -> application/x-tar 177
      audio/mpeg -> application/x-tar 59
      video/x-m4v -> application/x-tar 59
      application/x-font-printer-metric -> application/x-tar 36
      audio/mp4 -> application/x-tar 25
      application/x-tex-tfm -> application/x-tar 18
      image/x-pict -> application/x-tar 15
      image/png -> application/x-tar 8
      text/plain; charset=ISO-8859-1 -> application/x-tar 8
      application/x-endnote-style -> application/x-tar 7
      application/x-font-ttf -> application/x-tar 6
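
      For reference, counts like those above can be produced by comparing per-file detection results from two runs. The following is a rough sketch, not the actual regression tooling; the class name MimeTransitionTally and the input shape (a map from file id to detected type for each run) are assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration; not part of Tika's regression tooling. */
final class MimeTransitionTally {

    /**
     * Counts "A -> B" transitions for files whose detected type differs
     * between the two runs. Files missing from the second run are skipped.
     */
    static Map<String, Integer> tally(Map<String, String> before, Map<String, String> after) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, String> e : before.entrySet()) {
            String oldType = e.getValue();
            String newType = after.get(e.getKey());
            if (newType == null || newType.equals(oldType)) {
                continue; // no comparable result, or no change
            }
            counts.merge(oldType + " -> " + newType, 1, Integer::sum);
        }
        return counts;
    }
}
```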

            People

              Assignee: Unassigned
              Reporter: Tim Allison (tallison)