Tika / TIKA-1487

Add mime for pre-OLE2 xls file

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7
    • Component/s: None
    • Labels: None

    Description

      On the govdocs1 corpus, nearly 91% of xls exceptions have this stacktrace:

      Caused by: java.io.IOException: Invalid header signature; read 0x0010000000060409, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
          at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
          at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:115)
          at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:198)
          at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:184)
          at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:162)
          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
          ... 13 more
      

      Excel is able to open the few files that I tried, and it looks like Excel thinks these are version 4.

      On the POI user list, gagravarr identified this header as pre-OLE2 and asked that we add a mime type for it to Tika so that we can handle these files appropriately. A test file will be attached soon.
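
      The signature mismatch above can be illustrated with a small sketch. POI reports the first 8 bytes of the file read as a little-endian long, so the value in the stack trace corresponds to the bytes 09 04 06 00 00 00 10 00. The class name HeaderSniffer, its methods, and the returned labels are hypothetical and are not Tika's or POI's API. The BOF record ids and lengths for BIFF2/3/4 reflect my understanding of the pre-OLE2 Excel format and should be treated as assumptions, not as the magic rule that was actually added to Tika.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, for illustration only; not part of Tika or POI.
public class HeaderSniffer {

    // OLE2 signature as it appears in the POI message: the first 8 bytes
    // (D0 CF 11 E0 A1 B1 1A E1) interpreted as a little-endian long.
    static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;

    // Reads the first 8 bytes as a little-endian long, as POI's message does.
    static long leadingLong(byte[] header) {
        return ByteBuffer.wrap(header, 0, 8).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    // Classifies a header as OLE2, a pre-OLE2 BIFF stream, or unknown.
    static String classify(byte[] header) {
        if (header == null || header.length < 8) {
            return "unknown";
        }
        if (leadingLong(header) == OLE2_SIGNATURE) {
            return "OLE2";
        }
        // Assumption: a pre-OLE2 workbook starts directly with a BOF record,
        // i.e. a 2-byte record id and a 2-byte length, both little-endian.
        int recordId = (header[0] & 0xFF) | ((header[1] & 0xFF) << 8);
        int recordLength = (header[2] & 0xFF) | ((header[3] & 0xFF) << 8);
        if (recordId == 0x0009 && recordLength == 4) {
            return "pre-OLE2 BIFF2";
        }
        if (recordId == 0x0209 && recordLength == 6) {
            return "pre-OLE2 BIFF3";
        }
        if (recordId == 0x0409 && recordLength == 6) {
            return "pre-OLE2 BIFF4";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // Bytes reconstructed from the value in the stack trace.
        byte[] fromTrace = {0x09, 0x04, 0x06, 0x00, 0x00, 0x00, 0x10, 0x00};
        System.out.printf("0x%016X%n", leadingLong(fromTrace)); // 0x0010000000060409
        System.out.println(classify(fromTrace));                // pre-OLE2 BIFF4
    }
}
```

      Under these assumptions, the header from the stack trace is classified as BIFF4, which agrees with Excel treating the files as version 4.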

      Attachments

        1. 004444.xls
          39 kB
          Tim Allison

          People

            Assignee: Unassigned
            Reporter: Tim Allison
            Votes: 0
            Watchers: 3