Uploaded image for project: 'Tika'
  1. Tika
  2. TIKA-812

Improve the detection of Works Spreadsheet 7.0 files



    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 1.1
    • 1.1
    • mime
    • None


      This was originally part of ver3 of my patch submitted to TIKA-806.

      Works Spreadsheet files are weird. Versions up to 3.0 used a Quattro Pro magic, version 4.0 used its own magic, while version 7.0 (probably later ones as well) use an OLE2 structure and an MS Office magic. The 7.0 files also contain an entry labelled "Workbook". In Tika this makes both MimeTypes (due to the quirk recently discussed in TIKA-806) and the POIFSContainerDetector label them as Excel.

      "Conceptually" they should be vnd.ms-works, but "technically" they are vnd.ms-excel. A special media type seems like a good compromise, similar in vein to the compromise we reached with TIKA-798.

      I would like to mark them with a new media type: "application/x-tika-msworks-spreadsheet". It would be a subclass of vnd.ms-excel so that:

      1. With pure MimeTypes and no name, ms-excel could be returned.
      2. With MimeTypes with name and data, the correct type could be returned
      3. With POIFSContainerDetector the correct type could be returned
      4. They can also be added to the list of types supported by ExcelParser as it seems to be able to get some content from them


        1. testWORKSSpreadsheet7.0.xlr
          11 kB
          Antoni Mylka
        2. tika-812.patch
          8 kB
          Antoni Mylka
        3. tika-812-ver2.patch
          9 kB
          Antoni Mylka



            Unassigned Unassigned
            antheque Antoni Mylka
            0 Vote for this issue
            0 Start watching this issue