41076 – StringIndexOutOfBoundsException when extracting text from a Word document.

Status:	RESOLVED FIXED

Alias:	None

Product:	POI
Classification:	Unclassified
Component:	POI Overall
Version:	3.0-dev
Hardware:	Other other

Importance:	P1 critical with 4 votes
Target Milestone:	---
Assignee:	POI Developers List

URL:	http://marc.theaimsgroup.com/?l=poi-u...
Keywords:

Depends on:
Blocks:

Reported:	2006-11-29 05:44 UTC by Bj
Modified:	2008-11-27 08:06 UTC
CC List:	0 users

Attachments
Simplest possible testcase showing the StringIndexOutOfBoundsException (24.00 KB, application/msword) 2006-11-29 05:46 UTC, Bj
Here is a proposed fix to this issue. (1.09 KB, patch) 2007-03-21 09:56 UTC, Steve Polyak
A proposed fix which rewrites the loops (3.97 KB, patch) 2007-03-26 10:45 UTC, Eric Porter
One file that triggers a StringIndexOutOfBoundsException with POI 3.2 Final (206.00 KB, application/msword) 2008-11-27 08:06 UTC, Olivier Levillain

Description Bj 2006-11-29 05:44:46 UTC

I use POI through Nutch.

Many Word documents cause the following error when being parsed for text extraction:
Can't be handled as Microsoft document.
java.lang.StringIndexOutOfBoundsException: String index out of range: -520

Comment 1 Bj 2006-11-29 05:46:01 UTC

Created attachment 19200
Simplest possible testcase showing the StringIndexOutOfBoundsException

Comment 2 Steve Polyak 2007-03-19 15:06:56 UTC

Is this fixed in poi-bin-3.0-alpha3-20061212.zip? I just applied these jars and
I still see the same problem.

Comment 3 Steve Polyak 2007-03-21 09:56:21 UTC

Created attachment 19768
Here is a proposed fix to this issue. 

It simply catches the index out of bounds exception on the substring method
call and returns an empty string in that scenario.
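
For illustration only, the approach described above might look roughly like the
sketch below. The class and the helper name safeSubstring are hypothetical and
are not taken from the attached patch; the sketch only shows the general idea of
trading the exception for an empty result.

```java
class SafeSubstringSketch {
    // Hypothetical helper (an assumption for illustration, not the patch itself).
    // Returns the requested slice, or "" when the offsets fall outside the text.
    static String safeSubstring(String text, int start, int end) {
        try {
            return text.substring(start, end);
        } catch (StringIndexOutOfBoundsException e) {
            // Bad offsets: skip this slice instead of aborting the whole extraction.
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(safeSubstring("hello world", 0, 5));            // hello
        System.out.println(safeSubstring("hello world", 8, -512).isEmpty()); // true
    }
}
```

Note that this hides the symptom rather than the cause: any text belonging to a
slice with bad offsets is silently dropped from the output.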

Comment 4 Eric Porter 2007-03-26 10:45:18 UTC

Created attachment 19798
A proposed fix which rewrites the loops

The code gets a List of text runs and a List of text pieces.  The existing code
fails when the start of one text piece is not the same as the end of the
previous piece.  This contiguity assumption is made in several places.
My proposed patch rewrites the loop to make the code smaller and simpler.  It
makes the first proposed patch obsolete, because the
StringIndexOutOfBoundsException won't happen anymore.
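
As a rough illustration of a loop that does not assume contiguous pieces, here
is a minimal sketch. It is not the attached patch: the Piece class, the
textForRun method, and the half-open [start, end) offsets are all assumptions
made for the example, and the pieces are assumed to be sorted by start offset.
Each run is intersected with each piece, so a gap between pieces simply
contributes no characters instead of producing a negative substring index.

```java
import java.util.ArrayList;
import java.util.List;

class PieceOverlapSketch {
    /** Hypothetical stand-in for a text piece covering document offsets [start, end). */
    static final class Piece {
        final int start;
        final int end;
        final String text;

        Piece(int start, String text) {
            this.start = start;
            this.end = start + text.length();
            this.text = text;
        }
    }

    /** Collects the characters of the run [runStart, runEnd) that some piece covers. */
    static String textForRun(List<Piece> pieces, int runStart, int runEnd) {
        StringBuilder out = new StringBuilder();
        for (Piece p : pieces) {
            // Intersect the run with this piece; no assumption about the previous piece.
            int from = Math.max(runStart, p.start);
            int to = Math.min(runEnd, p.end);
            if (from < to) {
                out.append(p.text, from - p.start, to - p.start);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Piece> pieces = new ArrayList<>();
        pieces.add(new Piece(0, "Hello "));  // covers offsets 0..6
        pieces.add(new Piece(10, "world"));  // covers offsets 10..15, leaving a gap at 6..10
        System.out.println(textForRun(pieces, 0, 15)); // Hello world
        System.out.println(textForRun(pieces, 3, 12)); // lo wo
    }
}
```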

Comment 5 Nick Burch 2007-03-29 03:49:00 UTC

I might be being stupid, but I can't actually figure out what file the most
recent patch applies to...

The patch header refers to WordExtractor.java, but the code doesn't look
anything like org.apache.poi.hwpf.extractor.WordExtractor.

Comment 6 Olivier Levillain 2008-11-27 08:06:14 UTC

Created attachment 22957
One file that triggers a StringIndexOutOfBoundsException with POI 3.2 Final

I also use POI through Nutch, and I tried to install POI 3.2 on Nutch 0.9.1.
Although this bug is marked as fixed in POI 3.0, I can reproduce it on many documents (I attached one of them) with POI 3.2 FINAL.