Tika / TIKA-4351

More restrictive MIME type validation

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: core, mime
    • Labels: None

    Description

      Background:

      While looking at more examples, digging deeper, and trying to improve the detection code in Nutch, I came up with the following points regarding the validation of MIME types in MimeTypes#forName. The method is used by both Nutch and Tika (in MimeTypes#detect(...)):

      • "forName" accepts non-ASCII Unicode characters as part of the MIME type (foo/bär). This is not covered by RFC 2045, which allows only US-ASCII characters. Of course, one might argue that the HTTP header parser should already filter out such headers, but ...
      • the grammar in RFC 2045 is interpreted laxly, that is, a type or subtype may contain the allowed characters in any order
        • (sub)types not registered at IANA are accepted even if they do not start with "x-" / "X-" / "x."
        • RFC 6838 is more restrictive, e.g.,
          • (sub)types are required to start with a letter or number
          • fewer non-letter/number characters are allowed
      • Nutch passes the Content-Type HTTP header value and the URL as metadata hints to MimeTypes.detect(inputstream, metadata). This has helped to improve detection, especially for types that are subclasses of application/zip. At least in the past, this was necessary to handle various Office document types.
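
To make the difference between the two grammars concrete, here is a minimal sketch in plain Java that checks a bare "type/subtype" string against the RFC 2045 token rule and against the stricter RFC 6838 restricted-name rule. The class MimeNameCheck and its methods are hypothetical names chosen for illustration; they are not part of Tika or Nutch and do not reflect how MimeTypes#forName is implemented. The sketch assumes that parameters (such as "; charset=...") have already been stripped.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Tika or Nutch. */
final class MimeNameCheck {

    // RFC 2045 "token": printable US-ASCII (0x21-0x7E) minus the tspecials
    // ( ) < > @ , ; : \ " / [ ] ? =   -- in any order, any first character.
    private static final Pattern RFC2045_TOKEN =
            Pattern.compile("[\\x21-\\x7E&&[^()<>@,;:\\\\\"/\\[\\]?=]]+");

    // RFC 6838 section 4.2 "restricted-name": first character is a letter or
    // digit, followed by at most 126 characters from a smaller set.
    private static final Pattern RFC6838_NAME =
            Pattern.compile("[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]{0,126}");

    static boolean isRfc2045Type(String mime) {
        return check(mime, RFC2045_TOKEN);
    }

    static boolean isRfc6838Type(String mime) {
        return check(mime, RFC6838_NAME);
    }

    // Splits at the first '/' and requires both halves to match the pattern.
    // A second '/' fails because neither pattern admits that character.
    private static boolean check(String mime, Pattern part) {
        if (mime == null) {
            return false;
        }
        int slash = mime.indexOf('/');
        if (slash < 0) {
            return false;
        }
        return part.matcher(mime.substring(0, slash)).matches()
                && part.matcher(mime.substring(slash + 1)).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "application/zip", "foo/b\u00e4r", "-foo/.bar", "text/a*b"
        };
        for (String s : samples) {
            System.out.println(s + "  RFC 2045: " + isRfc2045Type(s)
                    + "  RFC 6838: " + isRfc6838Type(s));
        }
    }
}
```

Under these rules, a name such as "-foo/.bar" passes the RFC 2045 token check but is rejected by the RFC 6838 check, while "foo/bär" is rejected by both because it contains a non-ASCII character. Whether unregistered types should additionally be required to carry an "x-" / "X-" / "x." prefix is a separate policy question that this sketch does not address.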

          People

            Assignee: Unassigned
            Reporter: Sebastian Nagel (snagel)
            Votes: 0
            Watchers: 2
