46610 – [PATCH] Problems accessing documents containing unicode

Status:	RESOLVED FIXED

Alias:	None

Product:	POI
Classification:	Unclassified
Component:	HWPF
Version:	3.5-dev
Hardware:	PC Windows XP

Importance:	P2 normal
Target Milestone:	---
Assignee:	POI Developers List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2009-01-27 02:07 UTC by Benjamin Engele
Modified:	2009-06-19 06:51 UTC
CC List:	1 user

Attachments
A word file that triggers the exception. (26.50 KB, image/doc) 2009-01-27 02:10 UTC, Benjamin Engele
Patch for Exception triggered by utf.doc (773 bytes, patch) 2009-01-27 02:32 UTC, Benjamin Engele
Triggers a different cause of the Exception (28.00 KB, image/doc) 2009-01-27 02:34 UTC, Benjamin Engele
Patch for Exception triggered by utf2.doc (10.69 KB, patch) 2009-01-27 05:57 UTC, Benjamin Engele
Patch that fixes all problems with paragraph positions I had (12.80 KB, application/octet-stream) 2009-01-27 13:26 UTC, Benjamin Engele
Unicode patch (12.00 KB, patch) 2009-06-18 07:32 UTC, Maxim Valyanskiy
Unicode patch v.2 (13.23 KB, patch) 2009-06-19 04:58 UTC, Maxim Valyanskiy
MSWord file that shows broken paragraph problem (52.50 KB, application/msword) 2009-06-19 04:59 UTC, Maxim Valyanskiy
unit test case (1.43 KB, text/x-java) 2009-06-19 05:53 UTC, Maxim Valyanskiy

Description Benjamin Engele 2009-01-27 02:07:45 UTC

The problem is caused by Unicode text in the Word document.
Documents that reproduce the problem are attached.

Code to reproduce:
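import java.io.FileInputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

// args[0] is the path to one of the attached documents
HWPFDocument doc = new HWPFDocument(new FileInputStream(args[0]));
Range globalRange = doc.getRange();
// Read the text of every paragraph and of every character run inside it
for (int i = 0; i < globalRange.numParagraphs(); i++) {
	Paragraph p = globalRange.getParagraph(i);
	System.out.println(p.text());
	for (int j = 0; j < p.numCharacterRuns(); j++) {
		CharacterRun characterRun = p.getCharacterRun(j);
		characterRun.text();
	}
}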

Comment 1 Benjamin Engele 2009-01-27 02:10:53 UTC

Created attachment 23178
A word file that triggers the exception.

Comment 2 Benjamin Engele 2009-01-27 02:32:18 UTC

Created attachment 23179
Patch for Exception triggered by utf.doc

Comment 3 Benjamin Engele 2009-01-27 02:34:45 UTC

Created attachment 23180
Triggers a different cause of the Exception

Comment 4 Benjamin Engele 2009-01-27 05:57:11 UTC

Created attachment 23181
Patch for Exception triggered by utf2.doc

Rewrote the logic in BytePropertyNode that calculates the char index from the byte index.
The old approach, which checked whether the start index falls in a unicode text piece and divided the indexes by 2 in that case, is wrong.
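
To illustrate why halving the index goes wrong, here is a minimal sketch. It assumes a simplified model in which the document text is a sequence of pieces, each stored either as 1 byte per char or as 2 bytes per char. The `Piece` class and `byteToChar` method are hypothetical names made up for this illustration; they are not POI's TextPiece API and not the code in the attached patch. End offsets that sit exactly at the end of the last piece are not handled.

```java
import java.util.Arrays;
import java.util.List;

public class ByteToCharSketch {
    /** Hypothetical, simplified text piece (not POI's TextPiece class). */
    static final class Piece {
        final int byteStart;   // byte offset where the piece starts in the stream
        final int charCount;   // number of characters in the piece
        final boolean unicode; // true: 2 bytes per char, false: 1 byte per char

        Piece(int byteStart, int charCount, boolean unicode) {
            this.byteStart = byteStart;
            this.charCount = charCount;
            this.unicode = unicode;
        }

        int byteLength() {
            return unicode ? charCount * 2 : charCount;
        }
    }

    /**
     * Maps a byte offset to a char index by walking the pieces in document
     * order. Returns -1 if the offset lies in no piece.
     */
    static int byteToChar(List<Piece> pieces, int byteOffset) {
        int charsBefore = 0; // characters contributed by earlier pieces
        for (Piece p : pieces) {
            int byteEnd = p.byteStart + p.byteLength();
            if (byteOffset >= p.byteStart && byteOffset < byteEnd) {
                int within = byteOffset - p.byteStart;
                // Only the part inside a 2-byte piece is halved.
                return charsBefore + (p.unicode ? within / 2 : within);
            }
            charsBefore += p.charCount;
        }
        return -1;
    }

    public static void main(String[] args) {
        // 10 chars stored as 1 byte each, followed by 5 chars stored as 2 bytes each
        List<Piece> pieces = Arrays.asList(
                new Piece(0, 10, false),
                new Piece(10, 5, true));
        System.out.println(byteToChar(pieces, 14)); // prints 12
        System.out.println(14 / 2);                 // prints 7 (halving the whole offset)
    }
}
```

In this made-up layout, byte offset 14 is the third char of the second piece, so the char index is 12. Halving the whole offset gives 7 because it also halves the 10 bytes that belong to the 1-byte piece.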

Comment 5 Benjamin Engele 2009-01-27 05:59:19 UTC

The root cause of this defect also leads to other problems, such as paragraphs and character runs at the wrong positions.

Comment 6 Benjamin Engele 2009-01-27 13:13:54 UTC

The patch "Patch for Exception triggered by utf2.doc" doesn't resolve all problems with utf2.doc: the last paragraph is misplaced. This happens because of another error in translating byte positions from FormattedDiskPage to char positions in the TextPiece.

Some more notes:
Writing was neither tested nor changed. It is probably more broken now than it was before. BytePropertyNode.getStartBytes() and getEndBytes() definitely need to be fixed; they still use the wrong approach to calculate the byte index from the char index.

IMHO BytePropertyNode.isUnicode() should be removed as soon as get[Start/End]Bytes() has been fixed. I don't think the information that the start of the node is in a unicode text piece is useful.
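
For the reverse direction (char index to byte index, which get[Start/End]Bytes() needs), a minimal sketch under the same kind of simplified assumptions: the text is a sequence of pieces stored as either 1 or 2 bytes per char. `Piece` and `charToByte` are hypothetical names for illustration only, not POI code and not the eventual fix.

```java
import java.util.Arrays;
import java.util.List;

public class CharToByteSketch {
    /** Hypothetical, simplified text piece (not POI's TextPiece class). */
    static final class Piece {
        final int byteStart;   // byte offset where the piece starts in the stream
        final int charCount;   // number of characters in the piece
        final boolean unicode; // true: 2 bytes per char, false: 1 byte per char

        Piece(int byteStart, int charCount, boolean unicode) {
            this.byteStart = byteStart;
            this.charCount = charCount;
            this.unicode = unicode;
        }
    }

    /**
     * Maps a char index to a byte offset by finding the piece that holds the
     * char. Returns -1 if the index lies in no piece.
     */
    static int charToByte(List<Piece> pieces, int charIndex) {
        if (charIndex < 0) {
            return -1;
        }
        int charsBefore = 0; // characters contributed by earlier pieces
        for (Piece p : pieces) {
            if (charIndex < charsBefore + p.charCount) {
                int within = charIndex - charsBefore;
                // Only the part inside a 2-byte piece is doubled.
                return p.byteStart + (p.unicode ? within * 2 : within);
            }
            charsBefore += p.charCount;
        }
        return -1;
    }

    public static void main(String[] args) {
        // 10 chars stored as 1 byte each, followed by 5 chars stored as 2 bytes each
        List<Piece> pieces = Arrays.asList(
                new Piece(0, 10, false),
                new Piece(10, 5, true));
        System.out.println(charToByte(pieces, 12)); // prints 14
        System.out.println(12 * 2);                 // prints 24 (doubling the whole index)
    }
}
```

Here char index 12 maps to byte offset 14, while doubling the whole index because the node starts in a 2-byte piece gives 24. A single per-node unicode flag cannot capture this when earlier pieces use a different encoding.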

Comment 7 Benjamin Engele 2009-01-27 13:26:44 UTC

Created attachment 23184
Patch that fixes all problems with paragraph positions I had

Comment 8 Maxim Valyanskiy 2009-06-16 05:32:40 UTC

This patch greatly improves text extraction for Cyrillic documents on 3.5beta5. Unfortunately, it breaks a few test cases (TestRangeDelete, TestRangeInsertion, TestRangeProperties and TestSectionTable).

Also, the patch fails to apply on 3.5beta6 and the current trunk.

Comment 9 Maxim Valyanskiy 2009-06-18 07:30:35 UTC

I modified Benjamin Engele's patch:

1) Ported the patch to the current svn trunk (trivial).

2) Corrected the getStartBytes()/getEndBytes() methods in BytePropertyNode. This fixes the TestRangeDelete, TestRangeInsertion and TestSectionTable tests.

One test is still broken: TestRangeProperties.

Comment 10 Maxim Valyanskiy 2009-06-18 07:32:41 UTC

Created attachment 23829
Unicode patch

Comment 11 Benjamin Engele 2009-06-18 08:09:50 UTC

Actually, I didn't look at the test cases, so I am not much help in finding out why they fail. Happy to see that you managed to solve most of the test failures.

Comment 12 Maxim Valyanskiy 2009-06-19 04:57:16 UTC

New version:

Fixed a bug in the CPtoFC method and removed the FCtoCP method of SectionTable. All unit tests now pass.

Comment 13 Maxim Valyanskiy 2009-06-19 04:58:24 UTC

Created attachment 23833
Unicode patch v.2

Comment 14 Maxim Valyanskiy 2009-06-19 04:59:37 UTC

Created attachment 23834
MSWord file that shows broken paragraph problem

Comment 15 Yegor Kozlov 2009-06-19 05:44:59 UTC

Thanks for researching it. Is the patch ready to be committed?

Yegor

Comment 16 Maxim Valyanskiy 2009-06-19 05:53:30 UTC

Created attachment 23835
unit test case

src/scratchpad/testcases/org/apache/poi/hwpf/TestBug46610.java

Comment 17 Maxim Valyanskiy 2009-06-19 05:56:48 UTC

Yes, it is ready. This patch does not break the existing unit tests and fixes a few problems in text extraction. I do not have a real-world application to test writing with.

Please add the attached unit test and put the test files into src/scratchpad/testcases/org/apache/poi/hwpf/data/

utf.doc as Bug46610_1.doc
utf2.doc as Bug46610_2.doc
perl_o_fytbole_.doc as Bug46610_3.doc

Comment 18 Yegor Kozlov 2009-06-19 06:51:06 UTC

Benjamin and Maxim,

Thanks for researching this issue and providing the fix. The patch was applied in r786505.

Yegor