TIKA-913

MagicMime detection of msdos executables does not work

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1
    • Fix Version/s: 1.2
    • Component/s: mime
    • Labels:
    • Environment: Linux, JDK 1.6

      Description

      MIME detection does not work as I expected, in contrast to, e.g., the SourceForge mime-util detection or the "file" utility.
      For example, running detection on the PuTTY MS-DOS executable gives wrong results:

      krah@sf050:~$ java -jar /tmp/tika-app-1.1.jar --detect /tmp/putty
      application/octet-stream
      krah@sf050:~$ java -jar /tmp/tika-app-1.1.jar --detect /tmp/putty.jpg
      image/jpeg
      krah@sf050:~$ java -jar /tmp/tika-app-1.1.jar --detect /tmp/putty.exe
      application/x-msdownload

      It is the same binary every time, only under different names.
      In contrast, "file" outputs:

      krah@sf050:~$ file /tmp/putty
      /tmp/putty: PE32 executable for MS Windows (GUI) Intel 80386 32-bit
      krah@sf050:~$ file /tmp/putty.jpg
      /tmp/putty.jpg: PE32 executable for MS Windows (GUI) Intel 80386 32-bit
      krah@sf050:~$ file /tmp/putty.exe
      /tmp/putty.exe: PE32 executable for MS Windows (GUI) Intel 80386 32-bit

      So magic MIME detection should be able to detect that this is actually an executable, whatever the file is called.

      For a PDF, for example, it does work:

      krah@sf050:~$ java -jar /tmp/tika-app-1.1.jar --detect /tmp/print.pdf
      application/pdf
      krah@sf050:~$ java -jar /tmp/tika-app-1.1.jar --detect /tmp/print
      application/pdf
      krah@sf050:~$ java -jar /tmp/tika-app-1.1.jar --detect /tmp/print.jpg
      application/pdf

      Here Tika detects what is expected.
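
The content-based check requested above can be sketched in a few lines of Java. This is a minimal illustration, not Tika's detector: the class name `MagicSketch` and its methods are hypothetical, and the returned type for a PE file is simply the one the report shows for the `.exe` name. It relies on two well-known signatures: a PDF starts with `%PDF-`, and a PE image starts with `MZ` and stores, at offset 0x3C, a little-endian offset to the bytes `PE\0\0`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: not Tika's detector. All names here are hypothetical.
class MagicSketch {

    // Guess a media type from the leading bytes alone; the file name is never consulted.
    static String detect(byte[] head) {
        if (startsWith(head, 0, "%PDF-")) {
            return "application/pdf";
        }
        if (startsWith(head, 0, "MZ") && head.length >= 0x40) {
            // The DOS header stores a little-endian 32-bit offset to the PE signature at 0x3C.
            int peOffset = ByteBuffer.wrap(head, 0x3C, 4)
                    .order(ByteOrder.LITTLE_ENDIAN)
                    .getInt();
            if (peOffset > 0 && startsWith(head, peOffset, "PE\0\0")) {
                // Type name borrowed from the .exe result in the report.
                return "application/x-msdownload";
            }
        }
        return "application/octet-stream"; // fallback when no magic matches
    }

    // True if the bytes of `magic` appear in `data` starting at `offset`.
    private static boolean startsWith(byte[] data, int offset, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.ISO_8859_1);
        if (offset < 0 || offset > data.length - m.length) {
            return false;
        }
        for (int i = 0; i < m.length; i++) {
            if (data[offset + i] != m[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Because only the bytes are inspected, passing the first kilobyte or so of `/tmp/putty`, `/tmp/putty.jpg` and `/tmp/putty.exe` to `detect` would give the same answer for all three, which is the behaviour the report asks for.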

        Activity

        Torsten Krah created issue -
        Nick Burch added a comment -

        I believe that MS-DOS executables are not actually PE32 files. http://wiki.osdev.org/PE seems to have some good details on the PE32 and PE64 formats that should help with detection.

        Nick Burch added a comment -

        If anyone wants to add a parser for PE (32/64) files, this doc should be handy: <http://msdn.microsoft.com/en-us/windows/hardware/gg463119.aspx>. We should be able to get the odd common thing, like creation date, along with lots of other info too.

        Based on this info, and the osdev page, I've added mime magic for what look to be the common variants in r1336610.
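
As a rough illustration of the "creation date" idea in the comment above, the sketch below reads the COFF `TimeDateStamp` field from a PE header. It assumes that this field (seconds since the Unix epoch, written when the image was linked) is the date meant; the class `PeTimestampSketch` and its method are hypothetical and are not part of Tika.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

// Illustrative sketch only: a hypothetical helper, not a Tika parser.
class PeTimestampSketch {

    // Returns the COFF TimeDateStamp as an Instant, or null if the bytes are not a PE image.
    static Instant linkTime(byte[] head) {
        if (head.length < 0x40 || head[0] != 'M' || head[1] != 'Z') {
            return null; // no DOS "MZ" header
        }
        ByteBuffer buf = ByteBuffer.wrap(head).order(ByteOrder.LITTLE_ENDIAN);
        int peOffset = buf.getInt(0x3C); // offset of the PE signature
        // Need the 4-byte signature plus the 20-byte COFF header.
        if (peOffset < 0 || peOffset > head.length - 24) {
            return null;
        }
        if (buf.getInt(peOffset) != 0x00004550) { // "PE\0\0" read as a little-endian int
            return null;
        }
        // Layout after the signature: Machine (2 bytes), NumberOfSections (2), TimeDateStamp (4).
        long seconds = buf.getInt(peOffset + 8) & 0xFFFFFFFFL; // treat as unsigned
        return Instant.ofEpochSecond(seconds);
    }
}
```

A real parser would also need to handle truncated input and read the remaining header fields; this only shows where one commonly wanted value lives.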

        Nick Burch made changes -
        Field            Original Value    New Value
        Status           Open [ 1 ]        Resolved [ 5 ]
        Fix Version/s                      1.2 [ 12320169 ]
        Resolution                         Fixed [ 1 ]

          People

          • Assignee: Unassigned
          • Reporter: Torsten Krah
          • Votes: 0
          • Watchers: 1
