Tika / TIKA-577

IndexOutOfBounds Exception looking for Picture in Word 03 doc that has no pictures

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 0.8, 0.9
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None
    • Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4

    Description

      When parsing a Word 2003 document (which, unfortunately, I cannot upload because it contains client-confidential data), an IndexOutOfBoundsException occurs in the POI code used by the WordExtractor. To make up for the unavailable .doc file, I've included the results of a couple of hours stepping through the code to find the failure point. The error occurs because point[0] = point[1] = 301, while the upper bound of _paragraphs is 301, so index 301 is out of range. This is in the method org.apache.poi.hwpf.usermodel.Range.getCharacterRun(int).

      The method + line numbers are:

      public CharacterRun getCharacterRun(int index)

      line 792: int[] point = findRange(_paragraphs, _parStart, Math.max(chpx.getStart(), _start), chpx.getEnd());
      line 794: PAPX papx = _paragraphs.get(point[0]); // <<< This is the source of the exception
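
      The failing condition can be reproduced in isolation. The following is a minimal sketch, not POI code: it assumes _paragraphs behaves like a java.util.List holding 301 entries (valid indices 0 to 300), which is an inference from the values reported here. The class and method names (ParagraphIndexDemo, buildParagraphs, throwsOutOfBounds) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ParagraphIndexDemo {
    // Builds a stand-in for _paragraphs with the given number of entries.
    // (Assumption: the real field is a List of PAPX objects.)
    static List<String> buildParagraphs(int count) {
        List<String> paragraphs = new ArrayList<String>();
        for (int i = 0; i < count; i++) {
            paragraphs.add("papx-" + i);
        }
        return paragraphs;
    }

    // Returns true when list.get(index) throws IndexOutOfBoundsException.
    static boolean throwsOutOfBounds(List<String> list, int index) {
        try {
            list.get(index);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> paragraphs = buildParagraphs(301);
        int[] point = {301, 301}; // values observed in the debugger

        // Index 301 is one past the last valid index (300).
        System.out.println(throwsOutOfBounds(paragraphs, point[0])); // prints "true"
        System.out.println(throwsOutOfBounds(paragraphs, 300));      // prints "false"
    }
}
```

      If that assumption holds, the lookup fails because the start index returned by findRange is equal to the list size rather than to the last valid index.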

      STACK at time of exception:

      Range.getCharacterRun(int) line 794
      PicturesTable.getAllPictures() line 191
      WordExtractor$PicturesSource.<init>(HWPFDocument) line 429
      WordExtractor$PicturesSource.<init>(HWPFDocument, WordExtractor$1) line 419
      WordExtractor.parse(POIFSFileSystem, XHTMLContentHandler) line 75
      OfficeParser.parse(InputStream, ContentHandler, Metadata, ParseContext) line 187
      DefaultParser(CompositeParser).parse(InputStream, ContentHandler, Metadata, ParseContext) line 197
      AutoDetectParser(CompositeParser).parse(InputStream, ContentHandler, Metadata, ParseContext) line 197
      AutoDetectParser.parse(InputStream, ContentHandler, Metadata, ParseContext) line 137
      ... (my project) ...

      As noted, this occurs in a Word 2003 doc that has no pictures (its content is a table). The first pass finds 147 character runs (indices 0 - 146), and the problem occurs on that first pass (I am not sure whether there would be others), on the last run.
      The relevant code is in org.apache.poi.hwpf.model.PicturesTable.getAllPictures(), lines 186-191:

      public List<Picture> getAllPictures() {
          ArrayList<Picture> pictures = new ArrayList<Picture>();

          Range range = _document.getOverallRange();
          for (int i = 0; i < range.numCharacterRuns(); i++) {
              CharacterRun run = range.getCharacterRun(i); // throws when i = 146

      The error occurs in getCharacterRun(i) when i = 146, which is the last run in the range. If I change point[0] to 300 (in getCharacterRun), the call returns cleanly to
      WordExtractor$PicturesSource.<init>(HWPFDocument) line 429, setting the List 'all' to an empty List. It then fails again with the same error on a subsequent call to
      getAllPictures.

      POTENTIAL FIX: if point[0] >= _paragraphs.size(), return an EMPTY CharacterRun for the paragraph in question.
      Cannot send repro document - contains confidential client data.
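
      The idea behind the potential fix can be sketched as a bounds-checked lookup. This is only an illustration under stated assumptions, not the actual POI implementation and not a change that was made (the issue is resolved as Not A Problem). GuardedLookup and getOrDefault are hypothetical names, and plain strings stand in for PAPX and the "empty" result.

```java
import java.util.Arrays;
import java.util.List;

class GuardedLookup {
    // Returns the element at index, or the fallback when index lies
    // outside the valid range [0, items.size()).
    static <T> T getOrDefault(List<T> items, int index, T fallback) {
        if (index < 0 || index >= items.size()) {
            return fallback;
        }
        return items.get(index);
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList("p0", "p1", "p2");

        // Last valid index: the real element is returned.
        System.out.println(getOrDefault(paragraphs, 2, "EMPTY")); // prints "p2"

        // One past the end, as with point[0] = 301: the fallback is returned.
        System.out.println(getOrDefault(paragraphs, 3, "EMPTY")); // prints "EMPTY"
    }
}
```

      The comparison uses >= rather than >, because an index equal to the list size is already out of range.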

      Attachments

        1. X'd Out Doc for Tika.doc
          64 kB
          Dennis Adler

          People

            Assignee: Unassigned
            Reporter: Dennis Adler (dennisad)
            Votes: 1
            Watchers: 2