Tika / TIKA-1955 MIME types updates and additions for Scientific Data based on TREC-DD-Polar / TIKA-1882

Scientific MIME updates to .cab, .xar, .mobi and .mov files based on TREC-DD-Polar analysis

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.13
    • Component/s: mime
    • Flags: Patch

      Description

      The following MIME magic can be added to better detect these MIME types:

      1. application/vnd.ms-cab-compressed (.cab files) - pattern "MSCF" in the first 4 bytes
      2. application/vnd.xara (.xar files) - pattern "xar!" in the first 4 bytes
      3. application/x-mobipocket-ebook (.mobi files) - pattern "BOOKMOBI" starting at byte position 60
      4. video/quicktime (.mov files) - patterns "free" and "wide" seen starting at byte position 4

      The changes can be seen here:
      https://github.com/mkampasi/tika/commit/f7433daf434a44937ba3ae8b15813a768f95e334
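
      The offset-based checks listed above can be sketched as a small lookup table. This is only an illustration in Python, not the actual change: Tika declares its magic rules in tika-mimetypes.xml, and the names MAGIC_RULES and detect are hypothetical. The CAB signature is written here as "MSCF", the byte order used in the cabinet file header.

```python
# Illustrative sketch only: Tika itself declares these rules in tika-mimetypes.xml.
# Each rule is (byte offset, pattern, MIME type), taken from the list in this issue.
MAGIC_RULES = [
    (0, b"MSCF", "application/vnd.ms-cab-compressed"),
    (0, b"xar!", "application/vnd.xara"),
    (60, b"BOOKMOBI", "application/x-mobipocket-ebook"),
    (4, b"free", "video/quicktime"),
    (4, b"wide", "video/quicktime"),
]


def detect(header: bytes) -> str:
    """Return the first MIME type whose pattern matches at its offset."""
    for offset, pattern, mime in MAGIC_RULES:
        # Slicing past the end of the buffer yields a shorter bytes object,
        # so short inputs simply fail to match instead of raising.
        if header[offset:offset + len(pattern)] == pattern:
            return mime
    return "application/octet-stream"
```

      At least the first 68 bytes of a file are needed for the .mobi rule, since "BOOKMOBI" is 8 bytes long and starts at offset 60.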

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user mkampasi opened a pull request:

          https://github.com/apache/tika/pull/82

          Fix for TIKA-1882

          The following mime magic has been added to tika-mimetypes.xml to better detect the below mime-types:

          1. *application/vnd.ms-cab-compressed (.cab files)* - pattern "MSCF" in the first 4 bytes
          2. *application/vnd.xara (.xar files)* - pattern "xar!" in the first 4 bytes
          3. *application/x-mobipocket-ebook (.mobi files)* - pattern "BOOKMOBI" starting at byte position 60
          4. *video/quicktime (.mov files)* - patterns "free" and "wide" seen starting at byte position 4

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/mkampasi/tika master

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/82.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #82


          commit f7433daf434a44937ba3ae8b15813a768f95e334
          Author: Manisha Kampasi <manishakampasi22@gmail.com>
          Date: 2016-03-01T07:02:55Z

          Update tika-mimetypes.xml

          Updated mime-magic for 4 mime types (tika-mimetypes.xml):
          1. application/vnd.ms-cab-compressed (.cab files) - pattern "MSCF" in the first 4 bytes
          2. application/vnd.xara (.xar files) - pattern "xar!" in the first 4 bytes
          3. application/x-mobipocket-ebook (.mobi files) - pattern "BOOKMOBI" starting at byte position 60
          4. video/quicktime (.mov files) - patterns "free" and "wide" seen starting at byte position 4


          gagravarr Nick Burch added a comment -

          I'm not sure the quicktime pattern is correct - I have some MOV files without either there, and some MP4s which do have it. (MP4 and Quicktime MOV are related formats)

          kampasi@usc.edu Manisha Kampasi added a comment -

          Hi Nick,

          I based my analysis on the following sources of information:
          1. http://www.opensource.apple.com/source/file/file-23/file/magic/magic.mime
          2. http://www.filesignatures.net/index.php?search=MOV&mode=EXT
          3. http://www.garykessler.net/library/file_sigs.html

          I did not find these patterns in MP4 files of the data set that I am working with. However, since you did, it seems like these patterns are not good indicators of one container over the other and can be removed.

          Thanks,
          Manisha

          gagravarr Nick Burch added a comment -

          Just because other people think it's a magic doesn't necessarily mean it is - many others just blindly find a few bytes that look common without trying to understand the underlying format, and consequently can get it wrong...

          As the QuickTime container is a base for MP4, and our MP4 Video mime type declares QuickTime Video as its parent, if things are common then QuickTime is the right place to put it.

          I've had a go in bee1a87d7d9ad3a1c5f45cf65082b9505dbe9fc0 to better express the QuickTime/MP4 relationship in the mime types hierarchy. If you could merge that and re-test, and all tests pass, plus switch hex strings to text where possible (see pull request comments) then I think we should be fine to apply
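
          To illustrate why "free" and "wide" at byte position 4 cannot separate MOV from MP4: both containers are a sequence of atoms, each starting with a 4-byte big-endian size followed by a 4-byte type, and "free" and "wide" are generic placeholder atoms rather than format markers. The Python sketch below is an assumption-laden illustration, not Tika code; first_atoms and major_brand are hypothetical helper names. It lists the leading atoms and reads the major brand from an ftyp atom when one comes first.

```python
import struct


def first_atoms(data: bytes, limit: int = 4):
    """Return (type, size) for the leading top-level atoms of a QuickTime/MP4 buffer."""
    atoms, pos = [], 0
    while pos + 8 <= len(data) and len(atoms) < limit:
        size, kind = struct.unpack(">I4s", data[pos:pos + 8])
        atoms.append((kind.decode("latin-1"), size))
        # Sizes 0 (atom runs to end of file) and 1 (64-bit size follows)
        # are not handled in this sketch, so stop walking there.
        if size < 8:
            break
        pos += size
    return atoms


def major_brand(data: bytes):
    """Return the ftyp major brand if the buffer starts with an ftyp atom, else None."""
    if len(data) >= 12 and data[4:8] == b"ftyp":
        return data[8:12].decode("latin-1")
    return None
```

          When an ftyp atom is present, a major brand such as "qt  " points at QuickTime and one such as "isom" or "mp42" points at MP4. A file that starts with "wide" or "free" has no leading ftyp atom to consult, so only the shared parent type can be inferred from those bytes.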

          githubbot ASF GitHub Bot added a comment -

          GitHub user asfgit closed the pull request at:

          https://github.com/apache/tika/pull/82

          chrismattmann Chris A. Mattmann added a comment -

          Applied updated MIME per Nick Burch comments and Manisha Kampasi work. Thanks!

          LMC-053601:tika1.13 mattmann$ git push -u origin master
          Counting objects: 14, done.
          Delta compression using up to 8 threads.
          Compressing objects: 100% (10/10), done.
          Writing objects: 100% (14/14), 1.26 KiB | 0 bytes/s, done.
          Total 14 (delta 5), reused 0 (delta 0)
          remote: tika git commit: Record change for TIKA-1882 this closes #82.
          remote: tika git commit: Fix for TIKA-1882: .cab, .xar, .mobi and .mov files from the TREC-DD-Polar dataset. This closes #82.
          To https://git-wip-us.apache.org/repos/asf/tika.git
             f61a4ed..3d59471  master -> master
          Branch master set up to track remote branch master from origin.
          LMC-053601:tika1.13 mattmann$

          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #958 (See https://builds.apache.org/job/tika-trunk-jdk1.7/958/)
          Fix for TIKA-1882: .cab, .xar, .mobi and .mov files from the (mattmann: rev 1f96a0e9446fbd89fd724f12f103665b6250f201)
          • tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml

          Record change for TIKA-1882 this closes #82. (mattmann: rev 3d59471f9544ff3d9ce64078b235519043928a00)
          • CHANGES.txt

            People

            • Assignee: chrismattmann Chris A. Mattmann
            • Reporter: kampasi@usc.edu Manisha Kampasi
            • Votes: 0
            • Watchers: 5
