  1. Tika
  2. TIKA-1813

Figure out file types for several unknown OLE files in Common Crawl

Details

    • Improvement
    • Status: Open
    • Minor
    • Resolution: Unresolved

    Description

      We're getting around 300 exceptions from "application/x-tika-msoffice" files in our current slice of Common Crawl documents that look roughly like this:

      java.lang.IllegalArgumentException: Position 86528 past the end of the file
          at org.apache.poi.poifs.nio.FileBackedDataSource.read
      

      I suspect these are non-MS OLE file formats. Any help identifying the file types and patching our OLE mime detector would be great.
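      As a starting point for triage, here is a minimal sketch (not Tika's or POI's actual detection logic) that checks a file for the OLE2 compound-file signature and tests whether the first directory sector named in the header lies past the end of the file. The class name OleHeaderTriage and its methods are hypothetical. The header offsets used (sector shift at byte 30, first directory sector at byte 48, sector n starting at (n + 1) * sectorSize) follow the published compound file layout. Note that 86528 = 169 * 512, which is consistent with a sector-aligned read using 512-byte sectors, though that alone does not say which structure POI was trying to read.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for triaging suspicious OLE2 files; not part of Tika or POI.
class OleHeaderTriage {
    // OLE2 / Compound File Binary signature: D0 CF 11 E0 A1 B1 1A E1
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** True if the data starts with the 8-byte OLE2 signature. */
    static boolean hasOle2Signature(byte[] data) {
        if (data.length < OLE2_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < OLE2_MAGIC.length; i++) {
            if (data[i] != OLE2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Sector size from the sector-shift field at offset 30, or -1 if implausible. */
    static int sectorSize(byte[] header) {
        if (header.length < 512) {
            return -1;
        }
        int shift = ByteBuffer.wrap(header)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getShort(30) & 0xFFFF;
        // Assumption: anything outside 7..16 is treated as a corrupt header.
        if (shift < 7 || shift > 16) {
            return -1;
        }
        return 1 << shift;
    }

    /** True if the first directory sector would end beyond fileLength. */
    static boolean directoryPastEof(byte[] header, long fileLength) {
        int size = sectorSize(header);
        if (size < 0) {
            return true; // unreadable header: treat as broken
        }
        long firstDirSector = ByteBuffer.wrap(header)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt(48) & 0xFFFFFFFFL;
        // Sector n starts at (n + 1) * sectorSize because the header fills sector "-1".
        long start = (firstDirSector + 1) * size;
        return start + size > fileLength;
    }

    public static void main(String[] args) throws Exception {
        for (String path : args) {
            byte[] data = Files.readAllBytes(Paths.get(path));
            boolean ole = hasOle2Signature(data);
            System.out.println(path
                    + " ole2Signature=" + ole
                    + " sectorSize=" + (ole ? sectorSize(data) : -1)
                    + " directoryPastEof=" + (ole && directoryPastEof(data, data.length)));
        }
    }
}
```

      Running this over the attached samples would separate files that are simply truncated (signature present, directory sector past EOF) from files that are structurally complete but in an unrecognized OLE-based format. Only the latter would need new entries in the mime detector.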

      Attachments

        1. unidentified_ole_docs_in_common_crawl_slice.csv
          28 kB
          Tim Allison
        2. 2FVEYARCLFMHZ3MPBUH4D3RGPY2EJ4RA
          48 kB
          Tim Allison
        3. 27BYDLE36XWCDZXA3PPV6MF524UQ6KAF
          404 kB
          Tim Allison
        4. 25JIANLV77U645GUSJ2E67YSM4B2TNSP
          16 kB
          Tim Allison
        5. 225HYXAEU2DKSBNQ3SVD3HXCYMSHXVTB
          8 kB
          Tim Allison

            People

              Assignee: Unassigned
              Reporter: Tim Allison (tallison)
              Votes: 0
              Watchers: 2
