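The problem is caused by Unicode text in the Word document. Documents that reproduce the problem are attached. Code to reproduce:

    import java.io.FileInputStream;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.usermodel.CharacterRun;
    import org.apache.poi.hwpf.usermodel.Paragraph;
    import org.apache.poi.hwpf.usermodel.Range;

    public class Repro {
        public static void main(String[] args) throws Exception {
            // args[0] is the path to one of the attached .doc files
            HWPFDocument doc = new HWPFDocument(new FileInputStream(args[0]));
            Range globalRange = doc.getRange();
            for (int i = 0; i < globalRange.numParagraphs(); i++) {
                Paragraph p = globalRange.getParagraph(i);
                System.out.println(p.text());
                // Walk every character run of the paragraph and read its text
                for (int j = 0; j < p.numCharacterRuns(); j++) {
                    CharacterRun characterRun = p.getCharacterRun(j);
                    characterRun.text();
                }
            }
        }
    }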
Created attachment 23178 [details] A word file that triggers the exception.
Created attachment 23179 [details] Patch for Exception triggered by utf.doc
Created attachment 23180 [details] Triggers a different cause of the Exception
Created attachment 23181 [details] Patch for Exception triggered by utf2.doc
The logic in BytePropertyNode that calculates the char index from the byte index has been rewritten. The old approach, which checked whether the start index lies in a Unicode text piece and divided the indexes by 2 in that case, is wrong.
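To illustrate why dividing by 2 cannot work, here is a minimal sketch of a per-piece conversion. It is not the code from the patch: Piece, byteToChar and samplePieces are hypothetical names, and the sketch assumes each text piece knows its file offset, its length in characters and whether it is stored as Unicode (2 bytes per char) or not (1 byte per char).

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: Piece and byteToChar are hypothetical names, not POI API. */
class ByteToCharSketch {

    /** One text piece: file offset of its first byte, length in characters, encoding. */
    static final class Piece {
        final int byteStart;
        final int charCount;
        final boolean unicode;

        Piece(int byteStart, int charCount, boolean unicode) {
            this.byteStart = byteStart;
            this.charCount = charCount;
            this.unicode = unicode;
        }

        int bytesPerChar() {
            return unicode ? 2 : 1;
        }

        int byteEnd() {
            return byteStart + charCount * bytesPerChar();
        }
    }

    /**
     * Converts a file byte offset to a character index.
     * Pieces are given in document (character) order; their byte ranges
     * may lie anywhere in the file.
     */
    static int byteToChar(List<Piece> pieces, int bytePos) {
        int charsBefore = 0; // characters held by the pieces already passed
        for (Piece p : pieces) {
            if (bytePos >= p.byteStart && bytePos <= p.byteEnd()) {
                // Only the offset inside this piece is scaled by its encoding
                return charsBefore + (bytePos - p.byteStart) / p.bytesPerChar();
            }
            charsBefore += p.charCount;
        }
        throw new IllegalArgumentException(
                "Byte offset " + bytePos + " is not inside any text piece");
    }

    /** Made-up layout: 10 single-byte chars at offset 100, then 5 Unicode chars at offset 200. */
    static List<Piece> samplePieces() {
        return Arrays.asList(new Piece(100, 10, false), new Piece(200, 5, true));
    }

    public static void main(String[] args) {
        List<Piece> pieces = samplePieces();
        System.out.println(byteToChar(pieces, 105));
        System.out.println(byteToChar(pieces, 204));
    }
}
```

With this made-up layout, byte offset 204 lies 4 bytes into the Unicode piece and maps to char 12 (10 chars from the first piece plus 4 / 2). Simply halving the offset would give 102, although the sample holds only 15 characters.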
The root problem of this defect also causes other problems, such as paragraphs and character runs being reported at wrong positions.
The patch "Patch for Exception triggered by utf2.doc" doesn't resolve all problems with utf2.doc: the last paragraph is misplaced. This happens because of another error in translating byte positions from the FormattedDiskPage to char positions in the TextPiece.
Some more notes:
Writing wasn't tested and wasn't changed. It is probably more broken now than it was before.
BytePropertyNode.getStartBytes() and getEndBytes() definitely need to be fixed; they still use the wrong approach to calculate the byte index from the char index.
IMHO BytePropertyNode.isUnicode() should be removed as soon as get[Start/End]Bytes() have been fixed. I don't think the information that the start of the node is in a Unicode text piece is useful.
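For the reverse direction that get[Start/End]Bytes() need, a rough sketch under the same assumptions could look like the following. Again, this is not the patch: Piece, charToByte and samplePieces are hypothetical names. The isEnd flag is an assumption of this sketch. It reflects that a char position on the boundary between two pieces belongs to the end of the earlier piece when it closes a range, and to the start of the later piece when it opens one.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: Piece and charToByte are hypothetical names, not POI API. */
class CharToByteSketch {

    /** One text piece: file offset of its first byte, length in characters, encoding. */
    static final class Piece {
        final int byteStart;
        final int charCount;
        final boolean unicode;

        Piece(int byteStart, int charCount, boolean unicode) {
            this.byteStart = byteStart;
            this.charCount = charCount;
            this.unicode = unicode;
        }

        int bytesPerChar() {
            return unicode ? 2 : 1;
        }
    }

    /**
     * Converts a character index to a file byte offset.
     * Pieces are given in document (character) order.
     * isEnd = true treats charPos as an exclusive range end, so a position
     * on a piece boundary resolves to the end of the earlier piece.
     */
    static int charToByte(List<Piece> pieces, int charPos, boolean isEnd) {
        if (charPos < 0) {
            throw new IllegalArgumentException("Negative char position: " + charPos);
        }
        int charsBefore = 0; // characters held by the pieces already passed
        for (Piece p : pieces) {
            int pieceEnd = charsBefore + p.charCount;
            boolean inside = isEnd ? charPos <= pieceEnd : charPos < pieceEnd;
            if (inside) {
                // Scale only the offset inside this piece by its encoding
                return p.byteStart + (charPos - charsBefore) * p.bytesPerChar();
            }
            charsBefore = pieceEnd;
        }
        throw new IllegalArgumentException(
                "Char position " + charPos + " is not inside any text piece");
    }

    /** Made-up layout: 10 single-byte chars at offset 100, then 5 Unicode chars at offset 200. */
    static List<Piece> samplePieces() {
        return Arrays.asList(new Piece(100, 10, false), new Piece(200, 5, true));
    }

    public static void main(String[] args) {
        List<Piece> pieces = samplePieces();
        System.out.println(charToByte(pieces, 10, false));
        System.out.println(charToByte(pieces, 10, true));
    }
}
```

In the made-up layout, char position 10 sits exactly between the two pieces: as a start it resolves to byte 200 (first byte of the Unicode piece), as an end to byte 110 (one past the last byte of the first piece). A single isUnicode() flag taken from the start of the node cannot express this.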
Created attachment 23184 [details] Patch that fixes all problems with paragraph positions I had
This patch greatly improves text extraction for Cyrillic documents on 3.5beta5. Unfortunately, it breaks a few test cases (TestRangeDelete, TestRangeInsertion, TestRangeProperties and TestSectionTable). The patch also fails to apply on 3.5beta6 and the current trunk.
I modified Benjamin Engele's patch: 1) Ported the patch to the current svn trunk (trivial). 2) Corrected the getStartBytes()/getEndBytes() methods in BytePropertyNode. This fixes the TestRangeDelete, TestRangeInsertion and TestSectionTable tests. One test is still broken: TestRangeProperties.
Created attachment 23829 [details] Unicode patch
Actually I didn't look at the test cases so I am no big help finding out why they fail... Happy to see that you managed to solve most test failures.
New version: fixed a bug in CPtoFC and removed the FCtoCP methods of SectionTable. All unit tests now pass.
Created attachment 23833 [details] Unicode patch v.2
Created attachment 23834 [details] MSWord file that shows broken paragraph problem
Thanks for researching it. Is the patch ready to be committed? Yegor
Created attachment 23835 [details] unit test case src/scratchpad/testcases/org/apache/poi/hwpf/TestBug46610.java
Yes, it is ready. This patch does not break existing unit tests and fixes a few problems in text extraction. I do not have a real-world application to test writing with. Please add the attached unit test and put the test files into src/scratchpad/testcases/org/apache/poi/hwpf/data/:
utf.doc as Bug46610_1.doc
utf2.doc as Bug46610_2.doc
perl_o_fytbole_.doc as Bug46610_3.doc
Benjamin and Maxim, thanks for researching this issue and providing the fix. The patch was applied in r786505. Yegor