Tika / TIKA-1072

AIOOBE when handling embedded document in .doc file

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.7
    • Component/s: parser
    • Labels:
      None

      Description

      I have a Word (.doc) document that hits an exception when I run:

      java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar /x/tmp/20-Force-on-a-current-S00.doc 
      

      Here's the exception:

      Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
      	at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
      	at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
      	at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
      	at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
      	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
      	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
      	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      

      It happens when we try to parse an OLE10 embedded object. The code
      that does this parsing catches and ignores Ole10NativeException and
      skips the entry, so I'm wondering if we should also catch
      ArrayIndexOutOfBoundsException (AIOOBE) and skip the entry? I.e.,
      maybe this entry really is not OLE10, and the Ole10Native code is
      failing to throw Ole10NativeException for it?
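
      To illustrate the second possibility, here is a minimal, self-contained sketch of a bounds-checked read that reports a short buffer through a checked exception, so the caller's existing "catch and skip" path can handle it. This is not POI or Tika code: FormatException is a hypothetical stand-in for Ole10NativeException, and getShortChecked and tryParse are names invented for this example.

```java
public class BoundsCheckedRead {
    /** Hypothetical stand-in for POI's Ole10NativeException. */
    static class FormatException extends Exception {
        FormatException(String msg) {
            super(msg);
        }
    }

    /** Reads a little-endian 16-bit value; throws a checked exception if fewer than 2 bytes remain. */
    static short getShortChecked(byte[] data, int ofs) throws FormatException {
        if (ofs < 0 || ofs + 2 > data.length) {
            throw new FormatException(
                "need 2 bytes at offset " + ofs + " but length is " + data.length);
        }
        int b0 = data[ofs] & 0xFF;
        int b1 = data[ofs + 1] & 0xFF;
        return (short) ((b1 << 8) | b0);
    }

    /** Returns true if the read succeeded, false if the entry was skipped as malformed. */
    static boolean tryParse(byte[] data, int ofs) {
        try {
            getShortChecked(data, ofs);
            return true;
        } catch (FormatException e) {
            System.err.println("Skipping malformed entry: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] entry = new byte[40];
        System.out.println(tryParse(entry, 38)); // bytes 38 and 39 exist
        System.out.println(tryParse(entry, 39)); // would need byte 40
    }
}
```

      The design choice here is to validate before reading rather than to catch ArrayIndexOutOfBoundsException after the fact, which keeps genuine programming errors from being silently swallowed.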

      1. Ole10NativeEntry.bin
        0.0 kB
        Michael McCandless
      2. 20-Force-on-a-current-S00.doc
        57 kB
        Michael McCandless

          Activity

          Michael McCandless added a comment -

          OK I did some digging on this. The DirectoryNode of this embedded document has these entries:

          ent=PICT size=797
          ent=ObjInfo size=4
          ent=Ole10Native size=40
          ent=Ole10FmtProgID size=13
          ent=OlePres000 size=40
          ent=CompObj size=82
          ent=PIC size=100
          ent=META size=582
          ent=Ole size=20
          

          And so I believe it really is an OLE10Native record... OLE10Native then tries to parse it, with plain=false, but then runs out of bytes on this line:

                flags2 = LittleEndian.getShort(data, ofs);
          

          It seems likely something is corrupt about this entry? Does 40 bytes seem way too small for an OLE10Native entry? If so, I wonder if we could fix AbstractPOIFSExtractor to log the exception, skip this one embedded document, and then go on to parsing the others? I.e., isolate the exception rather than aborting the entire extraction; in this case the main document extracts fine.
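
          The isolation idea could look roughly like the sketch below. It is a generic illustration only, not the AbstractPOIFSExtractor code: EntryHandler and handleAll are hypothetical names, and entries are represented by plain strings to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IsolateEmbedded {
    /** Hypothetical handler for one embedded entry; it may throw anything. */
    interface EntryHandler {
        void handle(String name) throws Exception;
    }

    /** Handles each entry in turn and returns the names of entries that failed and were skipped. */
    static List<String> handleAll(List<String> names, EntryHandler handler) {
        List<String> skipped = new ArrayList<>();
        for (String name : names) {
            try {
                handler.handle(name);
            } catch (Exception e) {
                // Log and move on so one bad entry does not abort the whole extraction.
                System.err.println("Skipping embedded entry " + name + ": " + e);
                skipped.add(name);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("first", "broken", "last");
        List<String> skipped = handleAll(names, name -> {
            if (name.equals("broken")) {
                throw new ArrayIndexOutOfBoundsException(40);
            }
        });
        System.out.println(skipped);
    }
}
```

          Catching Exception (rather than Throwable) per entry lets the remaining entries be processed while still letting errors such as OutOfMemoryError propagate.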

          Nick Burch added a comment -

          dev@poi.apache.org is probably the best place to ask for advice, hopefully someone lurking there will have more idea about the inner workings of the OLE10Native stuff.

          (Failing that, you'd need to go and get the file format specs from the Microsoft website, and see what they say about the validity)

          Michael McCandless added a comment -

          Thanks Nick, I'll try asking on dev@poi.

          I'll open a separate issue about continuing parsing even when an embedded doc hits an exception ...

          Michael McCandless added a comment -

          OK I opened TIKA-1074; this issue will explore whether this document is corrupt or not ...

          Michael McCandless added a comment -

          I'm attaching the 40-byte \u0001Ole10Native entry; here's the hex dump:

          00000000 24 00 00 00 02 00 01 01 00 0a 01 12 83 46 02 86 |$............F..|
          00000010 3d 12 83 49 12 83 6c 12 83 42 12 82 73 12 82 69 |=..I..l..B..s..i|
          00000020 12 82 6e 02 84 71 00 00 |..n..q..|
          00000028
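
          As a rough cross-check, the sketch below walks these 40 bytes under an assumed non-plain layout: a 4-byte little-endian total size, a 2-byte flags1, two null-terminated strings (label and file name), then a 2-byte flags2. That layout is an assumption made for this example, not something confirmed against the POI source or the Microsoft specs, and all class and method names are invented here.

```java
public class Ole10NativeWalk {
    // The 40 bytes from the hex dump above.
    static final String HEX =
        "24 00 00 00 02 00 01 01 00 0a 01 12 83 46 02 86 "
      + "3d 12 83 49 12 83 6c 12 83 42 12 82 73 12 82 69 "
      + "12 82 6e 02 84 71 00 00";

    /** Decodes the hex dump into a byte array. */
    static byte[] entry() {
        String[] parts = HEX.trim().split("\\s+");
        byte[] data = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            data[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return data;
    }

    /** Little-endian 32-bit value stored in the first four bytes. */
    static int totalSize(byte[] d) {
        return (d[0] & 0xFF) | (d[1] & 0xFF) << 8 | (d[2] & 0xFF) << 16 | (d[3] & 0xFF) << 24;
    }

    /** Length of the null-terminated string at ofs, counting the terminator. */
    static int stringLength(byte[] data, int ofs) {
        int len = 0;
        while (ofs + len < data.length && data[ofs + len] != 0) {
            len++;
        }
        return len + 1;
    }

    /** Offset at which the assumed layout would read flags2. */
    static int flags2Offset(byte[] data) {
        int ofs = 0;
        ofs += 4;                       // totalSize
        ofs += 2;                       // flags1
        ofs += stringLength(data, ofs); // label
        ofs += stringLength(data, ofs); // fileName
        return ofs;
    }

    public static void main(String[] args) {
        byte[] data = entry();
        int ofs = flags2Offset(data);
        System.out.println("length=" + data.length + " totalSize=" + totalSize(data));
        System.out.println("flags2 offset=" + ofs + ", last index needed=" + (ofs + 1));
    }
}
```

          Under that assumed layout, the size prefix is 36 (the 40 bytes minus the 4-byte prefix itself), and the flags2 read would start at offset 39, so a 2-byte read touches index 40, which matches the index in the reported exception.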

          Chris A. Mattmann added a comment -
          • push to 1.5, get ready for 1.4 RC #1.
          Dave Meikle added a comment -

          Pushed out to 1.6, preparing for 1.5 RC


            People

            • Assignee:
              Unassigned
              Reporter:
              Michael McCandless
            • Votes:
              0
              Watchers:
              4
