Tika / TIKA-1813

Figure out file types for several unknown OLE files in Common Crawl

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      We're getting around 300 exceptions from files detected as "application/x-tika-msoffice" in our current slice of Common Crawl documents. They look roughly like this:

      java.lang.IllegalArgumentException: Position 86528 past the end of the file
          at org.apache.poi.poifs.nio.FileBackedDataSource.read
      

      I suspect these are non-MS OLE file formats. Any help identifying the file types and patching our OLE MIME detector would be great.
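One possible first triage step, sketched below, is to check whether each attachment really carries the OLE2 signature and whether its header points at a directory sector that lies beyond the end of the file. This is only a sketch under assumptions: the class `OleHeaderCheck` and its methods are hypothetical and not part of Tika or POI, and the header offsets (sector shift at byte 30, first directory sector at byte 48) follow the published Compound File Binary layout. For what it's worth, 86528 = 169 × 512, which would be the start of sector 168 if these files use 512-byte sectors; that could also point to truncated files, though this is a guess and not a diagnosis.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical triage helper; not part of Tika or POI.
final class OleHeaderCheck {
    // OLE2 (Compound File Binary) signature in the first 8 bytes of the file.
    private static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final int HEADER_SIZE = 512;

    static boolean isOle2(byte[] data) {
        if (data.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Sector n starts after the header sector, so its offset is (n + 1) * sectorSize.
    static long sectorOffset(int sectorIndex, int sectorSize) {
        return (sectorIndex + 1L) * sectorSize;
    }

    // True when the first directory sector named in the header would end past the data.
    static boolean directoryPastEof(byte[] data) {
        if (!isOle2(data) || data.length < HEADER_SIZE) {
            throw new IllegalArgumentException("not an OLE2 file or header too short");
        }
        ByteBuffer header = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int sectorShift = header.getShort(30) & 0xFFFF; // 9 -> 512 bytes, 12 -> 4096 bytes
        if (sectorShift != 9 && sectorShift != 12) {
            throw new IllegalArgumentException("unexpected sector shift: " + sectorShift);
        }
        int sectorSize = 1 << sectorShift;
        int firstDirSector = header.getInt(48);
        if (firstDirSector < 0) {
            return false; // high bit set: reserved marker value, no regular sector to check
        }
        return sectorOffset(firstDirSector, sectorSize) + sectorSize > data.length;
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] data = Files.readAllBytes(Paths.get(name));
            String verdict;
            try {
                verdict = directoryPastEof(data)
                        ? "directory sector past end of file (possibly truncated)"
                        : "directory sector within file";
            } catch (IllegalArgumentException e) {
                verdict = e.getMessage();
            }
            System.out.println(name + ": " + verdict);
        }
    }
}
```

Running it over the attachments (for example `java OleHeaderCheck.java 225HYXAEU2DKSBNQ3SVD3HXCYMSHXVTB`) would separate files that are not OLE2 at all, files with an unusual sector shift, and files whose directory lies beyond the end of the data. Only the remainder would then need format-specific inspection.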

        Attachments

        1. 225HYXAEU2DKSBNQ3SVD3HXCYMSHXVTB
          8 kB
          Tim Allison
        2. 25JIANLV77U645GUSJ2E67YSM4B2TNSP
          16 kB
          Tim Allison
        3. 27BYDLE36XWCDZXA3PPV6MF524UQ6KAF
          404 kB
          Tim Allison
        4. 2FVEYARCLFMHZ3MPBUH4D3RGPY2EJ4RA
          48 kB
          Tim Allison
        5. unidentified_ole_docs_in_common_crawl_slice.csv
          28 kB
          Tim Allison


              People

              • Assignee: Unassigned
              • Reporter: Tim Allison (tallison@apache.org)
              • Votes: 0
              • Watchers: 2
