Tika / TIKA-1813

Figure out file types for several unknown OLE files in Common Crawl

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      We're getting around 300 exceptions from "application/x-tika-msoffice" files in our current slice of Common Crawl documents that look roughly like this:

      java.lang.IllegalArgumentException: Position 86528 past the end of the file
          at org.apache.poi.poifs.nio.FileBackedDataSource.read
      

      I suspect these are non-MS OLE file formats. Any help identifying the file types and patching our OLE mime detector would be great.
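
As a first triage step for the attached files, a header check along these lines might help separate truncated OLE2 containers from intact ones that hold an unfamiliar format. This is only a sketch: `OleHeaderTriage` and its result labels are hypothetical names, not part of Tika or POI, and it assumes the standard compound file header layout (8-byte signature, 16-bit sector shift at offset 30, 32-bit first directory sector at offset 48, sector n starting at byte (n + 1) * sectorSize).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical triage helper; not part of Tika or POI. */
final class OleHeaderTriage {

    // OLE2 (compound file) signature expected in the first 8 bytes.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    static boolean hasOleMagic(byte[] data) {
        return data.length >= 8
            && Arrays.equals(Arrays.copyOf(data, 8), OLE2_MAGIC);
    }

    /** Classifies a file's bytes using only the compound file header. */
    static String triage(byte[] data) {
        // The header occupies the first 512 bytes of the file.
        if (data.length < 512 || !hasOleMagic(data)) {
            return "not-ole2";
        }
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int shift = buf.getShort(30) & 0xFFFF;
        // Assumption: only 512-byte (shift 9) and 4096-byte (shift 12) sectors are valid.
        if (shift != 9 && shift != 12) {
            return "ole2-unusual-sector-shift";
        }
        long sectorSize = 1L << shift;
        long firstDirSector = buf.getInt(48) & 0xFFFFFFFFL;
        // Sector n starts at (n + 1) * sectorSize because the header comes first.
        long dirEnd = (firstDirSector + 2) * sectorSize;
        return dirEnd > data.length ? "ole2-truncated" : "ole2-header-plausible";
    }

    public static void main(String[] args) throws IOException {
        // Usage: java OleHeaderTriage <file> [<file> ...]
        for (String arg : args) {
            byte[] data = Files.readAllBytes(Paths.get(arg));
            System.out.println(arg + "\t" + triage(data));
        }
    }
}
```

Files reported as "ole2-truncated" would point to incomplete crawl captures rather than a new format, while "ole2-header-plausible" files are the better candidates for a closer look at their directory entries. The check only covers the first directory sector, so a file can pass it and still be cut off further in.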

        Attachments

        1. 2FVEYARCLFMHZ3MPBUH4D3RGPY2EJ4RA
          48 kB
          Tim Allison
        2. 25JIANLV77U645GUSJ2E67YSM4B2TNSP
          16 kB
          Tim Allison
        3. 225HYXAEU2DKSBNQ3SVD3HXCYMSHXVTB
          8 kB
          Tim Allison
        4. 27BYDLE36XWCDZXA3PPV6MF524UQ6KAF
          404 kB
          Tim Allison
        5. unidentified_ole_docs_in_common_crawl_slice.csv
          28 kB
          Tim Allison

              People

              • Assignee: Unassigned
              • Reporter: Tim Allison (tallison@mitre.org)
              • Votes: 0
              • Watchers: 2
