Bug 46610 - [PATCH] Problems accessing documents containing unicode
Summary: [PATCH] Problems accessing documents containing unicode
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: 3.5-dev
Hardware: PC Windows XP
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-01-27 02:07 UTC by Benjamin Engele
Modified: 2009-06-19 06:51 UTC



Attachments
A word file that triggers the exception. (26.50 KB, image/doc)
2009-01-27 02:10 UTC, Benjamin Engele
Patch for Exception triggered by utf.doc (773 bytes, patch)
2009-01-27 02:32 UTC, Benjamin Engele
Triggers a different cause of the Exception (28.00 KB, image/doc)
2009-01-27 02:34 UTC, Benjamin Engele
Patch for Exception triggered by utf2.doc (10.69 KB, patch)
2009-01-27 05:57 UTC, Benjamin Engele
Patch that fixes all problems with paragraph positions I had (12.80 KB, application/octet-stream)
2009-01-27 13:26 UTC, Benjamin Engele
Unicode patch (12.00 KB, patch)
2009-06-18 07:32 UTC, Maxim Valyanskiy
Unicode patch v.2 (13.23 KB, patch)
2009-06-19 04:58 UTC, Maxim Valyanskiy
MSWord file that shows broken paragraph problem (52.50 KB, application/msword)
2009-06-19 04:59 UTC, Maxim Valyanskiy
unit test case (1.43 KB, text/x-java)
2009-06-19 05:53 UTC, Maxim Valyanskiy

Description Benjamin Engele 2009-01-27 02:07:45 UTC
The problem is caused by Unicode text in the Word document.
Documents that reproduce the problem are attached.

Code to reproduce:
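import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

// Inside main(String[] args); args[0] is the path to the .doc file.
HWPFDocument doc = new HWPFDocument(new FileInputStream(args[0]));
Range globalRange = doc.getRange();
for (int i = 0; i < globalRange.numParagraphs(); i++) {
	Paragraph p = globalRange.getParagraph(i);
	System.out.println(p.text());
	// Read the text of every character run in the paragraph.
	for (int j = 0; j < p.numCharacterRuns(); j++) {
		CharacterRun characterRun = p.getCharacterRun(j);
		characterRun.text();
	}
}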
Comment 1 Benjamin Engele 2009-01-27 02:10:53 UTC
Created attachment 23178 [details]
A word file that triggers the exception.
Comment 2 Benjamin Engele 2009-01-27 02:32:18 UTC
Created attachment 23179 [details]
Patch for Exception triggered by utf.doc
Comment 3 Benjamin Engele 2009-01-27 02:34:45 UTC
Created attachment 23180 [details]
Triggers a different cause of the Exception
Comment 4 Benjamin Engele 2009-01-27 05:57:11 UTC
Created attachment 23181 [details]
Patch for Exception triggered by utf2.doc

The logic in BytePropertyNode that calculates the char index from the byte index has been rewritten.
The old approach, which checked whether the start index was in a Unicode text piece and divided the indexes by 2 in that case, is wrong.
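
To illustrate why halving the index is not enough, here is a minimal sketch of a piece-by-piece translation from byte offset to char index. It is not the code from the patch: the Piece class, its fields and byteToChar() are hypothetical names made up for this example, and the layout is simplified to one byte per char for non-Unicode pieces and two bytes per char for Unicode pieces.

```java
import java.util.Arrays;
import java.util.List;

public class ByteToCharSketch {
    /** Hypothetical stand-in for a text piece (not the POI TextPiece class). */
    public static final class Piece {
        final int fileOffset;   // byte offset of the piece's first character in the file
        final int charStart;    // position of that character in the document text
        final int charLength;   // number of characters in the piece
        final boolean unicode;  // true: 2 bytes per char, false: 1 byte per char

        public Piece(int fileOffset, int charStart, int charLength, boolean unicode) {
            this.fileOffset = fileOffset;
            this.charStart = charStart;
            this.charLength = charLength;
            this.unicode = unicode;
        }

        int bytesPerChar() {
            return unicode ? 2 : 1;
        }
    }

    /** Translates a file byte offset into a char index using the piece that contains it. */
    public static int byteToChar(List<Piece> pieces, int byteOffset) {
        for (Piece p : pieces) {
            int endByte = p.fileOffset + p.charLength * p.bytesPerChar();
            // The end offset is accepted too, so that "end of node" offsets resolve.
            if (byteOffset >= p.fileOffset && byteOffset <= endByte) {
                return p.charStart + (byteOffset - p.fileOffset) / p.bytesPerChar();
            }
        }
        throw new IllegalArgumentException("Offset not inside any piece: " + byteOffset);
    }

    public static void main(String[] args) {
        List<Piece> pieces = Arrays.asList(
            new Piece(1024, 0, 10, false),  // chars 0..9, one byte each
            new Piece(2048, 10, 5, true));  // chars 10..14, two bytes each
        System.out.println(byteToChar(pieces, 1030)); // prints 6
        System.out.println(byteToChar(pieces, 2052)); // prints 12; 2052 / 2 would give 1026
    }
}
```

The char index depends on where the containing piece starts in the file and on the chars held by every piece before it, so dividing the raw byte offset by 2 cannot give the right position.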
Comment 5 Benjamin Engele 2009-01-27 05:59:19 UTC
The root cause of this defect also leads to other problems, such as paragraphs and character runs at wrong positions.
Comment 6 Benjamin Engele 2009-01-27 13:13:54 UTC
The patch for the exception triggered by utf2.doc doesn't resolve all problems with utf2.doc: the last paragraph is misplaced. This happens because of another error in translating byte positions from FormattedDiskPage to char positions in the TextPiece.

Some more notes:
Writing was neither tested nor changed. It is probably more broken now than it was before. BytePropertyNode.getStartBytes() and getEndBytes() definitely need to be fixed; they still use the wrong approach to calculate the byte index from the char index.

IMHO BytePropertyNode.isUnicode() should be removed as soon as get[Start/End]Bytes() has been fixed. I don't think the information that the start of the node is in a Unicode text piece is useful.
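
For the reverse direction, a per-piece lookup could look like the sketch below. This is only an illustration under simplified assumptions, not the eventual fix to getStartBytes()/getEndBytes(): the Piece class and charToByte() are hypothetical names, and pieces are assumed to hold either one or two bytes per char.

```java
import java.util.Arrays;
import java.util.List;

public class CharToByteSketch {
    /** Hypothetical stand-in for a text piece (not the POI TextPiece class). */
    public static final class Piece {
        final int fileOffset;   // byte offset of the piece's first character in the file
        final int charStart;    // position of that character in the document text
        final int charLength;   // number of characters in the piece
        final boolean unicode;  // true: 2 bytes per char, false: 1 byte per char

        public Piece(int fileOffset, int charStart, int charLength, boolean unicode) {
            this.fileOffset = fileOffset;
            this.charStart = charStart;
            this.charLength = charLength;
            this.unicode = unicode;
        }
    }

    /** Translates a char index into a file byte offset using the piece that contains it. */
    public static int charToByte(List<Piece> pieces, int charIndex) {
        for (Piece p : pieces) {
            // Half-open range: a char index on a piece boundary belongs to the next piece.
            if (charIndex >= p.charStart && charIndex < p.charStart + p.charLength) {
                int bytesPerChar = p.unicode ? 2 : 1;
                return p.fileOffset + (charIndex - p.charStart) * bytesPerChar;
            }
        }
        throw new IllegalArgumentException("Char index not inside any piece: " + charIndex);
    }

    public static void main(String[] args) {
        List<Piece> pieces = Arrays.asList(
            new Piece(1024, 0, 10, false),  // chars 0..9, one byte each
            new Piece(2048, 10, 5, true));  // chars 10..14, two bytes each
        System.out.println(charToByte(pieces, 6));  // prints 1030
        System.out.println(charToByte(pieces, 12)); // prints 2052
    }
}
```

With a lookup like this, the encoding is a property of the piece found for each index, so a node-level flag such as isUnicode() is not needed for the calculation. End offsets that fall exactly on a piece boundary need a deliberate choice between the two neighbouring pieces; the sketch always picks the following piece.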
Comment 7 Benjamin Engele 2009-01-27 13:26:44 UTC
Created attachment 23184 [details]
Patch that fixes all problems with paragraph positions I had
Comment 8 Maxim Valyanskiy 2009-06-16 05:32:40 UTC
This patch greatly improves text extraction for Cyrillic documents on 3.5beta5. Unfortunately, it breaks a few test cases (TestRangeDelete, TestRangeInsertion, TestRangeProperties and TestSectionTable).

Also, the patch fails to apply on 3.5beta6 and the current trunk.
Comment 9 Maxim Valyanskiy 2009-06-18 07:30:35 UTC
I modified Benjamin Engele's patch:

1) Patch ported to current svn trunk (trivial)

2) Corrected getStartBytes()/getEndBytes() methods in BytePropertyNode. This fixes TestRangeDelete, TestRangeInsertion and TestSectionTable tests.

One test is still broken - TestRangeProperties
Comment 10 Maxim Valyanskiy 2009-06-18 07:32:41 UTC
Created attachment 23829 [details]
Unicode patch
Comment 11 Benjamin Engele 2009-06-18 08:09:50 UTC
Actually, I didn't look at the test cases, so I am not much help in finding out why they fail... Happy to see that you managed to solve most of the test failures.
Comment 12 Maxim Valyanskiy 2009-06-19 04:57:16 UTC
New version:

Fixed bugs in the CPtoFC method and removed the FCtoCP method of SectionTable. All unit tests now pass.
Comment 13 Maxim Valyanskiy 2009-06-19 04:58:24 UTC
Created attachment 23833 [details]
Unicode patch v.2
Comment 14 Maxim Valyanskiy 2009-06-19 04:59:37 UTC
Created attachment 23834 [details]
MSWord file that shows broken paragraph problem
Comment 15 Yegor Kozlov 2009-06-19 05:44:59 UTC
Thanks for researching it. Is the patch ready to be committed?

Yegor
Comment 16 Maxim Valyanskiy 2009-06-19 05:53:30 UTC
Created attachment 23835 [details]
unit test case

src/scratchpad/testcases/org/apache/poi/hwpf/TestBug46610.java
Comment 17 Maxim Valyanskiy 2009-06-19 05:56:48 UTC
Yes, it is ready. This patch does not break the existing unit tests and fixes a few problems in text extraction. I do not have a real-world application to test writing with.

Please add the attached unit test and put the test files into src/scratchpad/testcases/org/apache/poi/hwpf/data/

utf.doc as Bug46610_1.doc
utf2.doc as Bug46610_2.doc
perl_o_fytbole_.doc as Bug46610_3.doc
Comment 18 Yegor Kozlov 2009-06-19 06:51:06 UTC
Benjamin and Maxim,

Thanks for researching this issue and providing the fix. The patch was applied in r786505.

Yegor