Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: 1.1
    • Component/s: mime
    • Labels:
      None

      Description

      When the mime type of an M4V file is detected from its name only, the detector returns video/x-m4v. When it is detected from the InputStream (hence utilising the MagicDetector), the detector incorrectly returns video/quicktime.

      Using the sample M4V file from Apple's knowledge base:

      TikaTest.java
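      import java.io.File;
      import java.io.InputStream;

      import org.apache.tika.detect.DefaultDetector;
      import org.apache.tika.detect.Detector;
      import org.apache.tika.io.TikaInputStream;
      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.mime.MimeTypes;

      public class TikaTest {

      	public static void main(String[] args) throws Exception {
      		String userHome = System.getProperty("user.home");
      		File file = new File(userHome + "/Desktop/sample_iPod.m4v");

      		Detector detector = new DefaultDetector(
      			MimeTypes.getDefaultMimeTypes());

      		// Metadata carrying only the file name, for name-based detection.
      		Metadata metadata = new Metadata();
      		metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());

      		// The same stream is passed to detect() twice.
      		InputStream is = TikaInputStream.get(file);
      		try {
      			System.out.println("File + filename: " + detector.detect(is, metadata));
      			System.out.println("File only:       " + detector.detect(is, new Metadata()));
      			System.out.println("Filename only:   " + detector.detect(null, metadata));
      		} finally {
      			is.close();
      		}
      	}

      }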
      

      This produces the output:

      File + filename: video/quicktime
      File only:       video/quicktime
      Filename only:   video/x-m4v
      

      Moreover, if the same test is run against an M4A file, the results are worse: content-based detection reports a video type for an audio file, and the filename alone is not recognised at all:

      File + filename: video/quicktime
      File only:       video/quicktime
      Filename only:   application/octet-stream
      
      1. TIKA-851.patch
        1.0 kB
        Alexander Chow

        Activity

        Alexander Chow added a comment -

        Thanks!

        Nick Burch added a comment -

        I've added the m4b extension to audio/mp4 in r1237113.

        Alexander Chow added a comment - edited

        Nick, although you added the ftyp for M4B (the bookmarkable format), you didn't take its extension, .m4b, into account. Do you think you can add that?

        Alexander Chow added a comment -

        Thanks Nick for adding the alias.

        Nick Burch added a comment -

        I've added the audio/x-m4a alias in r1236734.

        Nick Burch added a comment -

        From http://developer.apple.com/library/mac/#documentation/QuickTime/QTFF/QTFFChap1/qtff1.html#//apple_ref/doc/uid/TP40000939-CH203-BBCGDDDF
        "Generally speaking, atoms can be present in any order. Do not conclude that a particular atom is not present until you have parsed all the atoms in the file.

        An exception is the file type atom, which typically identifies the file as a QuickTime movie. If present, this atom precedes any movie atom, movie data, preview, or free space atoms. If you encounter one of these other atom types prior to finding a file type atom, you may assume the file type atom is not present. (This atom is introduced in the QuickTime File Format Specification for 2004, and is not present in QuickTime movie files created prior to 2004)."

        So, if there is an ftyp atom, it should be first, and if the first atom isn't an ftyp then there isn't one. The AtomParsley link is handy; it should help with producing a metadata-extracting parser.
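
        As a rough illustration of the ordering rule quoted above, the sketch below walks the top-level atoms of a stream and reports their types in file order, so one can check whether ftyp comes first. AtomLister and its methods are hypothetical names, not part of Tika. It assumes the usual atom header layout: a 32-bit big-endian size, then a 4-character type, with size 1 meaning a 64-bit size follows and size 0 meaning the atom runs to the end of the file.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Tika.
class AtomLister {

    /** Lists the 4-character types of the top-level atoms, in file order. */
    static List<String> topLevelAtoms(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<String> types = new ArrayList<>();
        try {
            while (true) {
                long size = data.readInt() & 0xFFFFFFFFL; // 32-bit big-endian size
                byte[] type = new byte[4];
                data.readFully(type);
                types.add(new String(type, StandardCharsets.US_ASCII));

                long headerLength = 8;
                if (size == 1) {            // 64-bit extended size follows the type
                    size = data.readLong();
                    headerLength = 16;
                } else if (size == 0) {     // atom runs to the end of the file
                    break;
                }
                long remaining = size - headerLength;
                if (remaining < 0) {
                    break;                  // malformed size; stop rather than guess
                }
                while (remaining > 0) {     // skip the atom body
                    long skipped = data.skip(remaining);
                    if (skipped <= 0) {
                        return types;       // truncated file
                    }
                    remaining -= skipped;
                }
            }
        } catch (EOFException endOfFile) {
            // Ran out of bytes while reading a header: no more atoms.
        }
        return types;
    }

    /** True if the first top-level atom is ftyp. */
    static boolean startsWithFtyp(InputStream in) throws IOException {
        List<String> types = topLevelAtoms(in);
        return !types.isEmpty() && "ftyp".equals(types.get(0));
    }

    public static void main(String[] args) throws IOException {
        // Hand-built sample: a 16-byte ftyp atom, an empty free atom, a small mdat atom.
        byte[] sample = {
            0, 0, 0, 16, 'f', 't', 'y', 'p', 'M', '4', 'V', ' ', 0, 0, 0, 0,
            0, 0, 0, 8, 'f', 'r', 'e', 'e',
            0, 0, 0, 12, 'm', 'd', 'a', 't', 1, 2, 3, 4
        };
        System.out.println(topLevelAtoms(new ByteArrayInputStream(sample)));
    }
}
```

        The sample bytes in main are made up for the demonstration; with a real file, a FileInputStream could be passed instead. A detector would only need the first header, so reading every atom here is purely for inspection.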

        Alexander Chow added a comment -

        Sorry Nick, I didn't notice you had updated SVN. It looks like you need to change your mime type, though, from audio/x-mp4a to audio/x-m4a.

        Alexander Chow added a comment -

        I've added a patch file that I think should fix the problem for both M4V and M4A.

        According to AtomParsley, "The ftyp atom is ALWAYS first." This seems consistent with Apple's spec discussion of "The Movie Profile Atom".

        Nick Burch added a comment -

        It looks like most files (though I'm not sure it's all of them) have an ftyp atom at byte 4. This is "ftyp" followed by a 4-byte string (space padded if needed) naming the main type. There's a list of the common ones at http://www.ftyps.com/

        I've added more specific matches for the common types in r1236700. Using the tika-app jar, I can now correctly detect mp4 video, Apple m4v video, mp4 audio and old quicktime movs (using the lower priority fallback).

        I'm not sure whether the ftyp atom has to be first; if it doesn't, this detection won't work. Longer term, a proper file-format-aware detector would be best, ideally one that can also understand the rest of the format to report on the different streams etc.
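
        To make the byte layout described above concrete, here is a minimal sketch that checks for "ftyp" at byte 4 and reads the 4-byte main type that follows. FtypSniffer is a hypothetical name, and the brand-to-type table is an illustrative assumption, not the set of magic entries committed in r1236700.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper for illustration; not part of Tika.
class FtypSniffer {

    // Illustrative table only: these brands and types are assumptions for the
    // sketch, not the list that Tika ships.
    static final Map<String, String> BRAND_TO_TYPE = Map.of(
            "M4V ", "video/x-m4v",
            "M4A ", "audio/mp4",
            "M4B ", "audio/mp4",
            "qt  ", "video/quicktime",
            "isom", "video/mp4",
            "mp42", "video/mp4");

    /** Returns the 4-character main type, or null if bytes 4..7 are not "ftyp". */
    static String majorBrand(byte[] header) {
        if (header == null || header.length < 12) {
            return null; // too short to hold size + "ftyp" + main type
        }
        String atomType = new String(header, 4, 4, StandardCharsets.US_ASCII);
        if (!"ftyp".equals(atomType)) {
            return null;
        }
        return new String(header, 8, 4, StandardCharsets.US_ASCII);
    }

    /** Maps the main type to a media type; unknown or missing brands fall through. */
    static String guessType(byte[] header) {
        String brand = majorBrand(header);
        if (brand == null) {
            return "application/octet-stream";
        }
        return BRAND_TO_TYPE.getOrDefault(brand, "application/octet-stream");
    }

    public static void main(String[] args) {
        // Hand-built header: 4-byte size, "ftyp", then the space-padded main type.
        byte[] header = { 0, 0, 0, 32, 'f', 't', 'y', 'p', 'M', '4', 'V', ' ' };
        System.out.println(guessType(header)); // prints "video/x-m4v"
    }
}
```

        Only the first 12 bytes are needed, which is why this kind of check fits plain mime magic. Files without a leading ftyp atom fall through to the generic type here, whereas the change described above relies on a lower priority fallback for old quicktime files.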

        Nick Burch added a comment -

        I'm not sure we're going to be able to differentiate between .mov, .mp4 and .m4v with only mime magic, as I believe they all use the same container format.

        We may need to look at a detector that opens the files up and checks them in a container-aware manner.


          People

          • Assignee:
            Unassigned
          • Reporter:
            Alexander Chow
