  2. TIKA-1692

Enable getExtension() for mime strings with parameters

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10
    • Component/s: core
    • Labels:
      None

      Description

      getExtension() offers a handy way to add a "detected" extension from a MimeType for a file that didn't come with an extension. However, this functionality doesn't work with texty files: html, xml, css, csv, etc.

      Let's add a static helper class (or build it into MimeType?) that will output an extension for all mime types including texty mime types.

        Activity

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #798 (See https://builds.apache.org/job/tika-trunk-jdk1.7/798/)
        TIKA-1692 : allow MimeTypes to look for a registered mime type that may or may not have parameters. (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1692283)

        • /tika/trunk/CHANGES.txt
        • /tika/trunk/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java
        • /tika/trunk/tika-core/src/test/java/org/apache/tika/mime/MimeTypesReaderTest.java
        tallison@mitre.org Tim Allison added a comment -

        r1692283.

        Thank you, Nick Burch!

        gagravarr Nick Burch added a comment -

        I think losing those parameters on the Mime Type (but not the Media Type) is correct. If you want the full details, stay in the Media Type world. If you want the details as defined in the well-known mime types database, then losing parameters which aren't in the database seems OK to me.

        // Media Type always keeps details / parameters
        String name = "application/xml; charset=UTF-8";
        MediaType mt = MediaType.parse(name);
        assertEquals(name, mt.toString());
        
        // Mime type loses details not in the file
        MimeType mimeType = types.getRegisteredMimeType(name);
        assertEquals("application/xml", mimeType.toString());
        assertEquals(".xml", mimeType.getExtension());
        
        // But well-known parameters stay
        name = "application/dita+xml;format=map";
        mt = MediaType.parse(name);
        assertEquals(name, mt.toString());
        mimeType = types.getRegisteredMimeType(name);
        assertEquals(name, mimeType.toString());
        assertEquals(".ditamap", mimeType.getExtension());
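
As a rough illustration of what "keeping the parameters" means on the Media Type side, here is a minimal, self-contained sketch. ParsedMediaType is a hypothetical stand-in written for this example; it is not Tika's MediaType class, and it skips the quoting and validation rules a real parser would need.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in (not Tika's MediaType): a base type plus its parameters. */
final class ParsedMediaType {
    final String baseType;
    final Map<String, String> parameters;

    private ParsedMediaType(String baseType, Map<String, String> parameters) {
        this.baseType = baseType;
        this.parameters = parameters;
    }

    /** Splits "type/subtype; key=value; ..." into its base type and parameters. */
    static ParsedMediaType parse(String raw) {
        String[] parts = raw.split(";");
        String base = parts[0].trim().toLowerCase(Locale.ROOT);
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            String[] keyValue = parts[i].split("=", 2);
            if (keyValue.length == 2) {
                // Parameter names are lowercased; values are kept as written.
                params.put(keyValue[0].trim().toLowerCase(Locale.ROOT), keyValue[1].trim());
            }
        }
        return new ParsedMediaType(base, params);
    }

    boolean hasParameters() {
        return !parameters.isEmpty();
    }
}
```

With this sketch, parsing "application/xml; charset=UTF-8" yields the base type "application/xml" and keeps the charset parameter alongside it, so nothing is lost until a registry lookup decides to drop it.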
        
        tallison@mitre.org Tim Allison added a comment - - edited

        Hmmm....
        If we modify getRegisteredMimeType to this:

                if (type != null) {
                    MediaType normalisedType = registry.normalize(type);
                    // Prefer an exact match, parameters included
                    MimeType candidate = types.get(normalisedType);
                    if (candidate != null) {
                        return candidate;
                    }
                    // Otherwise fall back to the parameter-free base type
                    if (normalisedType.hasParameters()) {
                        return types.get(normalisedType.getBaseType());
                    }
                    // Not a registered type
                    return null;
                } else {
                    throw new MimeTypeException("Invalid media type name: " + name);
                }
        

        then we lose the parameters in the returned value:

            @Test
            public void testGetExtensionForMimesWithParameters() throws Exception {
                MimeType mt = this.mimeTypes.getRegisteredMimeType("text/html; charset=UTF-8");
                assertEquals("text/html", mt.toString());
                assertEquals("text/html", mt.getName());
                assertEquals(".html", mt.getExtension());
            }
        

        I don't think this is what you were expecting in your test above, but I guess it could make sense. If you want the type that is actually registered, it often isn't the one with parameters. If instead you want the full type from a string, parameters included, use parse.

        Another option is to move this logic into a static getExtension(String) and/or getExtension(MediaType)...

        tallison@mitre.org Tim Allison added a comment -

        Y. That's perfect. Will give it a try.

        gagravarr Nick Burch added a comment -

        You'd get something similar with a type of application/vnd.ms-excel and a hypothetical application/vnd.ms-excel; version=10, the latter being a child of the former. The former is known in the mime types file, and so has lots of details; the latter is a newly-registered child of it which was created by the call to types.forName.

        Maybe we want to tweak the getRegisteredMimeType method so that it tries the with-parameters type first, the without-parameters type second, and returns null if neither is known? (Some of our defined mimetypes in the file do have parameters, so we can't just ignore them.) This method doesn't register new types today and returns null if the type isn't known; we'd just add the "try dropping parameters if you don't know them" logic.

        You could then do something like

        String name = "application/xml; charset=UTF-8";
        MimeType mimeType = types.getRegisteredMimeType(name);
        if (mimeType != null) {
            assertEquals(".xml", mimeType.getExtension());
            assertEquals(name, mimeType.toString());
        } else {
            System.err.println("Sorry, this type isn't one we know about: " + name);
        }
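
The lookup order proposed above (with parameters first, without parameters second, null otherwise) can be sketched without Tika at all. ToyMimeRegistry and its methods are hypothetical names invented for this example, and the registered entries are only the types and extensions mentioned in this issue; it is a sketch of the idea, not the code that was committed.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy registry (not Tika's MimeTypes): maps registered type names to extensions. */
final class ToyMimeRegistry {
    private final Map<String, String> extensions = new HashMap<>();

    void register(String typeName, String extension) {
        extensions.put(normalize(typeName), extension);
    }

    /** Returns the extension of the best registered match, or null if none is known. */
    String lookupExtension(String typeName) {
        String normalized = normalize(typeName);
        // 1. Try the full name first, parameters included.
        String extension = extensions.get(normalized);
        if (extension != null) {
            return extension;
        }
        // 2. Fall back to the base type by dropping everything after the first ';'.
        int semicolon = normalized.indexOf(';');
        if (semicolon >= 0) {
            return extensions.get(normalized.substring(0, semicolon));
        }
        // 3. No parameters to drop and no exact match: unknown type.
        return null;
    }

    /** Lowercases and strips whitespace so "a/b; c=d" and "a/b;c=d" share one key. */
    private static String normalize(String typeName) {
        return typeName.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
    }
}
```

Registering "application/xml" and "application/dita+xml;format=map" and then looking up "application/xml; charset=UTF-8" falls through to the base type, while the DITA map type matches on the first attempt because its parameter is part of the registered name.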
        
        tallison@mitre.org Tim Allison added a comment - - edited

        Nothing like unit tests... So, all is well for straight mime strings for texty files; however, if there is an encoding attached (as we currently receive in the Metadata from an auto-detected document), we run into the problem that initially inspired this issue.

            @Test
            public void testCurrent() throws Exception {
                MimeTypes types = config.getMimeRepository();
        
                assertEquals("application/xml", MediaType.APPLICATION_XML.toString());
                MimeType mimeType = types.forName(MediaType.APPLICATION_XML.toString());
                assertEquals(".xml", mimeType.getExtension());
        
                mimeType = types.forName("application/xml; charset=UTF-8");
                assertEquals("", mimeType.getExtension());
                assertEquals("application/xml; charset=UTF-8", mimeType.toString());
            }
        
        tallison@mitre.org Tim Allison added a comment - - edited

        So the use case is: you've already done file type identification to figure out the MimeType and now you want to save the extension-less file with a human-friendly extension.
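
For concreteness, the renaming step in that use case might look like the sketch below. ExtensionNamer and withExtension are hypothetical names for this example; the extension string is assumed to come from something like getExtension(), leading dot included, and may be empty when no extension is known.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: gives an extension-less file a detected extension. */
final class ExtensionNamer {
    /**
     * Returns the path with the extension appended, or the path unchanged if the
     * file name already has an extension or no extension was detected.
     * Assumes the path has a file name component (i.e. it is not a root).
     */
    static Path withExtension(Path file, String extension) {
        String fileName = file.getFileName().toString();
        // A dot at index 0 (e.g. ".bashrc") is treated as a hidden file, not an extension.
        boolean hasExtension = fileName.lastIndexOf('.') > 0;
        if (hasExtension || extension == null || extension.isEmpty()) {
            return file;
        }
        return file.resolveSibling(fileName + extension);
    }

    public static void main(String[] args) {
        System.out.println(withExtension(Paths.get("downloads", "report"), ".html"));
    }
}
```

The empty-string check matters here because, as reported in this issue, an unresolved type yields an empty extension rather than null.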

        The current getExtension() on MimeTypes that are text/html returns an empty string. Is this a bug...or did I miss how to do this properly?

        I have no doubt that I may be reinventing the wheel.

        Thank you, Nick Burch!

        gagravarr Nick Burch added a comment -

        Could you write a short unit test that shows the problem? As long as we can identify the type, we ought to be able to report the glob / globs for that type, whether application/ or video/ or text/


          People

          • Assignee:
            Unassigned
            Reporter:
            tallison@mitre.org Tim Allison
          • Votes:
            0
            Watchers:
            3
