  1. Tika
  2. TIKA-2986

Edge case (?) in file type detection

Details

    • Type: Improvement
    • Status: Open
    • Priority: Trivial
    • Resolution: Unresolved

    Description

      One of my colleagues, Philip Southam, recently came across a file that was identified as an Acrobat FDF file. The file was some kind of binary file with a ".fdf" extension, but it was not an Acrobat FDF.

      Our current MimeTypes algorithm runs magic detection first and then consults the file extension. If the extension suggests a child mime type of the one found via magic, the child type is used. The problem with this file was that the magic %FDF- was not found, so the magic step returned application/octet-stream. The ".fdf" extension was then selected because application/vnd.fdf is a child of application/octet-stream.
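
      The flow described above can be sketched roughly as follows. This is a simplified, self-contained model and not Tika's actual code: the class CurrentDetectionSketch and all of its methods are hypothetical, the registry holds a single type, and the hierarchy check assumes every type is a child of application/octet-stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical, simplified model of "magic first, then extension".
public class CurrentDetectionSketch {
    static final String OCTET_STREAM = "application/octet-stream";
    static final String FDF = "application/vnd.fdf";
    static final byte[] FDF_MAGIC = "%FDF-".getBytes(StandardCharsets.US_ASCII);

    // Step 1: magic detection. Returns the root type when no magic matches.
    static String detectByMagic(byte[] prefix) {
        if (prefix.length >= FDF_MAGIC.length
                && Arrays.equals(Arrays.copyOf(prefix, FDF_MAGIC.length), FDF_MAGIC)) {
            return FDF;
        }
        return OCTET_STREAM;
    }

    // Step 2: extension hint. Only ".fdf" is known to this toy registry.
    static String detectByName(String name) {
        return name.toLowerCase().endsWith(".fdf") ? FDF : OCTET_STREAM;
    }

    // Assumption: a flat hierarchy in which every type specializes octet-stream.
    static boolean isSpecializationOf(String child, String parent) {
        return child.equals(parent) || parent.equals(OCTET_STREAM);
    }

    // The extension wins whenever it names a child of the magic result.
    static String detect(byte[] prefix, String name) {
        String byMagic = detectByMagic(prefix);
        String byName = detectByName(name);
        return isSpecializationOf(byName, byMagic) ? byName : byMagic;
    }

    public static void main(String[] args) {
        byte[] notFdf = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05};
        // The magic is absent, yet the extension alone selects application/vnd.fdf.
        System.out.println(detect(notFdf, "report.fdf"));
    }
}
```

      In this model, any unrecognized binary content named "*.fdf" is reported as application/vnd.fdf, which matches the behavior seen with the file in question.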

      It feels like we might want to add a rule: if a mime definition has a defined magic and that magic is not found, we should not fall back to the file extension. Or is there a better way to prevent this from happening? Or is this just an edge case that we should ignore?
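
      One possible shape for such a rule is sketched below. It is only an illustration under stated assumptions, not a patch against Tika: ProposedRuleSketch, its two lookup tables, and its methods are hypothetical, and text/csv merely stands in for a type that has no magic in this toy registry.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;

// Hypothetical model of the rule: ignore the extension when the type it
// names defines a magic and that magic did not match.
public class ProposedRuleSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    // Toy registry: types that define a magic, and extension hints.
    static final Map<String, byte[]> MAGIC = Map.of(
            "application/vnd.fdf", "%FDF-".getBytes(StandardCharsets.US_ASCII));
    static final Map<String, String> EXTENSIONS = Map.of(
            ".fdf", "application/vnd.fdf",
            ".csv", "text/csv"); // no magic defined in this toy registry

    static String detectByMagic(byte[] prefix) {
        for (Map.Entry<String, byte[]> e : MAGIC.entrySet()) {
            byte[] magic = e.getValue();
            if (prefix.length >= magic.length
                    && Arrays.equals(Arrays.copyOf(prefix, magic.length), magic)) {
                return e.getKey();
            }
        }
        return OCTET_STREAM;
    }

    static String detectByName(String name) {
        int dot = name.lastIndexOf('.');
        if (dot < 0) {
            return OCTET_STREAM;
        }
        return EXTENSIONS.getOrDefault(name.substring(dot).toLowerCase(), OCTET_STREAM);
    }

    static String detect(byte[] prefix, String name) {
        String byMagic = detectByMagic(prefix);
        String byName = detectByName(name);
        if (byName.equals(byMagic)) {
            return byMagic;
        }
        // Proposed rule: the extension names a type with a defined magic,
        // but that magic was not found, so the extension is not trusted.
        if (MAGIC.containsKey(byName)) {
            return byMagic;
        }
        // Types without a magic may still be chosen by extension alone.
        return byMagic.equals(OCTET_STREAM) ? byName : byMagic;
    }

    public static void main(String[] args) {
        byte[] notFdf = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05};
        System.out.println(detect(notFdf, "report.fdf"));
        System.out.println(detect(notFdf, "table.csv"));
    }
}
```

      Under this rule the mislabeled ".fdf" file stays application/octet-stream, while types that define no magic can still be picked up from the extension.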

            People

              Assignee: Unassigned
              Reporter: Tim Allison (tallison)
