Tika / TIKA-3992

Add common missing mimes based on Common Crawl data

Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      In the latest Common Crawl crawl, Tika detected ~600k files as 'octet-stream'. It would be useful to extract those files (even if truncated) and run 'file' and 'siegfried' against them, since their types are unknown to Tika. We can then prioritize the most common file formats identified by file and siegfried for addition to our tika-mimetypes.xml.
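
A minimal sketch of the tallying step, assuming the extracted files sit in a local directory and that the `file` utility is on the PATH. The function names `identify_with_file` and `tally_formats` are hypothetical, and the same loop could wrap siegfried instead by passing a different `identify` callable.

```python
import subprocess
from collections import Counter
from pathlib import Path
from typing import Callable, Iterable


def identify_with_file(path: Path) -> str:
    """Return the MIME type that the `file` utility reports for one path."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", str(path)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def tally_formats(
    paths: Iterable[Path],
    identify: Callable[[Path], str] = identify_with_file,
) -> Counter:
    """Count how often each identifier label occurs across the given files."""
    counts: Counter = Counter()
    for path in paths:
        try:
            counts[identify(path)] += 1
        except (OSError, subprocess.CalledProcessError):
            # Unreadable files are tallied under a placeholder label
            # rather than silently dropped.
            counts["<error>"] += 1
    return counts
```

Calling `tally_formats(files).most_common(20)` on the extracted files would give a ranked list of candidate formats to review by hand.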

      Separately, we might also want to do the same thing for `application/zip`: there are likely zip-based file types that we could do a better job on.
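
One way to look for zip-based formats, sketched under the assumption that the first few entry names in a container hint at its real type. The helper names `marker_entries` and `tally_markers` are hypothetical. Truncated downloads usually lose the central directory at the end of the archive, so this sketch skips them.

```python
import zipfile
from collections import Counter
from typing import BinaryIO, Iterable, List, Union

ZipSource = Union[str, BinaryIO]


def marker_entries(source: ZipSource, limit: int = 3) -> List[str]:
    """Return the first `limit` entry names of a zip, or [] if it cannot be read."""
    try:
        with zipfile.ZipFile(source) as zf:
            return zf.namelist()[:limit]
    except (zipfile.BadZipFile, OSError):
        # A truncated zip typically has no central directory, so it is skipped.
        return []


def tally_markers(sources: Iterable[ZipSource], limit: int = 3) -> Counter:
    """Count how often each leading entry name appears across many zip files."""
    counts: Counter = Counter()
    for source in sources:
        counts.update(marker_entries(source, limit))
    return counts
```

Entry names that recur across many files detected only as `application/zip` would be candidates for closer inspection.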

      Thanks to snagel for a dump of stats on the most recent crawl.

      Attachments

        1. mimes.zip
          27.67 MB
          Tim Allison

        Issue Links

          1. audio/xm audio/x-mod (Sub-task, Open, Unassigned)
          2. Improve detection of text files (Sub-task, Open, Unassigned)
          3. Add file extension .rmd160 to tika-mimetypes.xml (Sub-task, Open, Unassigned)
          4. Add magic for ASPRS Lidar data (Sub-task, Reopened, Unassigned)
          5. Add magic for FAT Disk Image format (Sub-task, Open, Unassigned)
          6. Add magic for Mach-O format (Sub-task, Open, Unassigned)
          7. Add magic for MS-DOS Compression Format (SZDD Variant) (Sub-task, Open, Unassigned)
          8. Add magic for Jigdo Download Template format (Sub-task, Open, Unassigned)
          9. Add magic for Atari Floppy Disk Image Format (Sub-task, Open, Unassigned)
          10. Add magic for Guitar Pro format (Sub-task, Open, Unassigned)
          11. Add magic for TeX Virtual Font format (Sub-task, Open, Unassigned)
          12. Add magic for GRAPPA Database RADX File (Sub-task, Open, Unassigned)
          13. Add magic for Touhou Project Replay File format (Sub-task, Open, Unassigned)
          14. Add magic for Modified Maximum Method Digisonde Portable Sounder File format (Sub-task, Open, Unassigned)
          15. Add magic for IDL Binary Format Save File format (Sub-task, Open, Unassigned)
          16. Add magic for Planetary Data System Version 3 format (Sub-task, Open, Unassigned)
          17. Add magic for Planetary Data System Version 2 format (Sub-task, Open, Unassigned)
          18. Add magic for ZIM format (Sub-task, Open, Unassigned)
          19. Add magic for ClamAV CDiff files (Sub-task, Open, Unassigned)
          20. Add magic for SquashFS Format (Sub-task, Open, Unassigned)
          21. Add magic for Unreal Engine Package format (Sub-task, Open, Unassigned)
          22. Add magic for X-Moto Replay format (Sub-task, Open, Unassigned)
          23. Add magic for Warcraft III Map format (Sub-task, Open, Unassigned)
          24. Add magic for SEG Y format (Sub-task, Open, Unassigned)
          25. Add magic for Teeworlds/DDRace Map Format (Sub-task, Open, Unassigned)
          26. Add magic for SolidWorks eDrawing Electronic Assembly Data File format (Sub-task, Open, Unassigned)
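
For readers unfamiliar with what "magic" means in the sub-tasks above, here is a rough illustration of byte-signature matching for four of the listed formats. It is only a sketch: Tika itself reads its magic patterns from tika-mimetypes.xml rather than from code like this, and the table name `MAGICS`, the function `match_magic`, and the labels are hypothetical placeholders, not registered MIME types.

```python
from typing import List, Optional, Tuple

# (offset, signature bytes, descriptive label).
# Labels are placeholders for illustration, not registered MIME types.
MAGICS: List[Tuple[int, bytes, str]] = [
    (0, b"hsqs", "SquashFS (little-endian)"),
    (0, b"SZDD\x88\xf0\x27\x33", "MS-DOS compression (SZDD)"),
    (0, b"ZIM\x04", "ZIM"),
    (0, b"\xcf\xfa\xed\xfe", "Mach-O (64-bit, little-endian)"),
]


def match_magic(header: bytes) -> Optional[str]:
    """Return the label of the first signature found in `header`, else None."""
    for offset, magic, label in MAGICS:
        if header[offset:offset + len(magic)] == magic:
            return label
    return None
```

Because every signature here sits at offset 0, the first few bytes of a file are enough to match, so the check still works on truncated files.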

            People

              Assignee: Unassigned
              Reporter: Tim Allison (tallison)
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated: