  1. Tika
  2. TIKA-2986

Edge case (?) in file type detection

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Trivial
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      One of my colleagues, Philip Southam, recently came across a file that was identified as an Acrobat fdf file. The particular file was some kind of binary file with a ".fdf" extension, but not an Acrobat fdf.

      Our current MimeTypes algorithm runs magic detection first and then consults the file extension. If the extension suggests a child mime type of what magic found, the child type is used. With this file, the magic %FDF- was not found, so the magic step returned application/octet-stream. The ".fdf" extension was then selected because application/vnd.fdf is a child of application/octet-stream.
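
      To make the failure mode concrete, here is a minimal, self-contained sketch of the two-step logic described above. It is not Tika's actual implementation: the class name, the tiny in-memory registry, and the helper methods are all hypothetical, and the parent/child check is reduced to "everything is a child of application/octet-stream".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;

public class CurrentDetectionSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    // Hypothetical registry, standing in for the real mime definitions.
    static final Map<String, String> TYPE_BY_EXTENSION =
            Map.of(".fdf", "application/vnd.fdf");
    static final Map<String, byte[]> MAGIC_BY_TYPE =
            Map.of("application/vnd.fdf", "%FDF-".getBytes(StandardCharsets.US_ASCII));

    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Step 1: magic. Falls back to octet-stream when nothing matches.
    static String detectByMagic(byte[] header) {
        for (Map.Entry<String, byte[]> e : MAGIC_BY_TYPE.entrySet()) {
            if (startsWith(header, e.getValue())) {
                return e.getKey();
            }
        }
        return OCTET_STREAM;
    }

    // Simplification (assumption): every type is a child of octet-stream.
    static boolean isSpecializationOf(String candidate, String parent) {
        return candidate.equals(parent) || parent.equals(OCTET_STREAM);
    }

    // Step 2: the extension wins if it names a child of the magic result.
    static String detect(byte[] header, String fileName) {
        String byMagic = detectByMagic(header);
        int dot = fileName.lastIndexOf('.');
        String ext = dot >= 0 ? fileName.substring(dot).toLowerCase(Locale.ROOT) : "";
        String byName = TYPE_BY_EXTENSION.get(ext);
        if (byName != null && isSpecializationOf(byName, byMagic)) {
            return byName;
        }
        return byMagic;
    }
}
```

      With this sketch, an arbitrary binary header combined with the name "data.fdf" comes back as application/vnd.fdf even though %FDF- was never seen, which mirrors the behavior reported here.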

      It feels like we might want to add a rule: if a mime definition has a defined magic and that magic is not found, we should not fall back to the file extension for that type. Or is there a better way to prevent this from happening? Or is this just an edge case that we should ignore?
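
      One way the proposed rule could look is sketched below. This is only an illustration under stated assumptions, not a patch against Tika: the class, the registry contents, and the ".xyz" / application/x-hypothetical type (a type with no magic defined) are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;

public class ProposedRuleSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    // Hypothetical registry: .fdf has magic defined, .xyz does not.
    static final Map<String, String> TYPE_BY_EXTENSION = Map.of(
            ".fdf", "application/vnd.fdf",
            ".xyz", "application/x-hypothetical");
    static final Map<String, byte[]> MAGIC_BY_TYPE =
            Map.of("application/vnd.fdf", "%FDF-".getBytes(StandardCharsets.US_ASCII));

    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    static String detectByMagic(byte[] header) {
        for (Map.Entry<String, byte[]> e : MAGIC_BY_TYPE.entrySet()) {
            if (startsWith(header, e.getValue())) {
                return e.getKey();
            }
        }
        return OCTET_STREAM;
    }

    // Simplification (assumption): every type is a child of octet-stream.
    static boolean isSpecializationOf(String candidate, String parent) {
        return candidate.equals(parent) || parent.equals(OCTET_STREAM);
    }

    static String detect(byte[] header, String fileName) {
        String byMagic = detectByMagic(header);
        int dot = fileName.lastIndexOf('.');
        String ext = dot >= 0 ? fileName.substring(dot).toLowerCase(Locale.ROOT) : "";
        String byName = TYPE_BY_EXTENSION.get(ext);
        if (byName == null || byName.equals(byMagic)) {
            return byMagic;
        }
        // Proposed rule: the extension's type defines magic, but that magic
        // did not match (otherwise byMagic would equal byName), so the
        // extension is not trusted.
        if (MAGIC_BY_TYPE.containsKey(byName)) {
            return byMagic;
        }
        // Types without magic keep the existing extension fallback.
        return isSpecializationOf(byName, byMagic) ? byName : byMagic;
    }
}
```

      Under this sketch, a binary file named "data.fdf" stays application/octet-stream, while a type that defines no magic still resolves by extension. Whether real definitions with weak or optional magic would be hurt by such a rule is exactly the open question raised above.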

              People

              • Assignee: Unassigned
              • Reporter: Tim Allison (tallison)
              • Votes: 0
              • Watchers: 3
