Tika / TIKA-812

Improve the detection of Works Spreadsheet 7.0 files

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1
    • Fix Version/s: 1.1
    • Component/s: mime
    • Labels: None

    Description

      This was originally part of ver3 of my patch submitted to TIKA-806.

      Works Spreadsheet files are weird. Versions up to 3.0 used a Quattro Pro magic, version 4.0 used its own magic, while version 7.0 (and probably later ones as well) uses an OLE2 structure and an MS Office magic. The 7.0 files also contain an entry labelled "Workbook". In Tika this makes both MimeTypes (due to the quirk recently discussed in TIKA-806) and the POIFSContainerDetector label them as Excel.

      "Conceptually" they should be vnd.ms-works, but "technically" they are vnd.ms-excel. A special media type seems like a good compromise, in a similar vein to the one we reached in TIKA-798.

      I would like to mark them with a new media type: "application/x-tika-msworks-spreadsheet". It would be a subclass of vnd.ms-excel so that:

      1. With pure MimeTypes and no name, ms-excel could be returned.
      2. With MimeTypes given both name and data, the correct type could be returned.
      3. With POIFSContainerDetector, the correct type could be returned.
      4. They can also be added to the list of types supported by ExcelParser, as it seems to be able to get some content from them.
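
      Points 1 and 2 can be illustrated with a small, self-contained sketch. This is not Tika's actual API: the class WorksTypeSketch and its methods are hypothetical, and the ".xlr" extension is an assumption taken from the name of the attached test file. It only models the idea that a name hint is accepted when it names a subclass of the type found by magic.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical model of the proposal; none of these names come from Tika itself.
public class WorksTypeSketch {
    public static final String EXCEL = "application/vnd.ms-excel";
    public static final String WORKS_SHEET = "application/x-tika-msworks-spreadsheet";

    // child type -> parent type: the new type is declared a subclass of Excel
    private static final Map<String, String> PARENT = Map.of(WORKS_SHEET, EXCEL);

    // True if 'type' equals 'ancestor' or descends from it in the hierarchy.
    public static boolean isSpecializationOf(String type, String ancestor) {
        for (String t = type; t != null; t = PARENT.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    // Name-based hint. The ".xlr" mapping is an assumption based on the test file name.
    public static String typeFromName(String name) {
        if (name == null) {
            return null;
        }
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".xlr")) {
            return WORKS_SHEET;
        }
        if (lower.endsWith(".xls")) {
            return EXCEL;
        }
        return null;
    }

    // Combine the magic-based result with the name hint: the hint wins only
    // when it refines (is a subclass of) what the magic already found.
    public static String detect(String magicType, String name) {
        String byName = typeFromName(name);
        if (byName != null && isSpecializationOf(byName, magicType)) {
            return byName;
        }
        return magicType;
    }

    public static void main(String[] args) {
        // Point 1: data only, no name
        System.out.println(detect(EXCEL, null));
        // Point 2: data plus name
        System.out.println(detect(EXCEL, "testWORKSSpreadsheet7.0.xlr"));
    }
}
```

      Because the hint is only accepted when it refines the magic result, a misleading file name on unrelated data would not change the detected type in this sketch.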

      Attachments

        1. tika-812-ver2.patch
          9 kB
          Antoni Mylka
        2. tika-812.patch
          8 kB
          Antoni Mylka
        3. testWORKSSpreadsheet7.0.xlr
          11 kB
          Antoni Mylka

          People

            Assignee: Unassigned
            Reporter: Antoni Mylka (antheque)
            Votes: 0
            Watchers: 0
