Bug 44431 - HWPFDocument.write destroys fields
Summary: HWPFDocument.write destroys fields
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: unspecified
Hardware: Other other
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Reported: 2008-02-15 04:11 UTC by dnapoletano
Modified: 2011-07-24 18:41 UTC

Attachments
Sample document for testing writing (96.00 KB, application/msword)
2008-11-14 08:02 UTC, dnapoletano
Test doc after reading and rewriting (101.00 KB, application/msword)
2008-11-14 08:03 UTC, dnapoletano
Description dnapoletano 2008-02-15 04:11:42 UTC
When I open and resave a Word document with

// Read the document and write it back unchanged; both streams are closed automatically
try (InputStream is = new FileInputStream("/home/esempio.doc");
     OutputStream os = new FileOutputStream("/home/TEST_POI.doc")) {
    HWPFDocument docInput = new HWPFDocument(is);
    docInput.write(os);
}

all fields in the document (TOC items, STYLEREF, and so on) are destroyed and
converted to plain text; for example, a FILENAME field becomes "STYLEREF
TitoloDocumento \* MERGEFORMAT esempio.doc".

The problem may lie in the handling of control characters: fields in MS Word
are stored inline in the normal text, as a sequence like

0x13 <field info> 0x14 <field value> 0x15

and the text in the document saved by POI becomes

<field info> <field value>
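
As an illustration of the marker layout described above, here is a minimal Java sketch that pulls the instruction and the result out of each field in a raw text string. It is not POI code: the class name FieldMarkerSketch and the method findFields are hypothetical, it assumes the three control characters appear in the string as 0x13, 0x14 and 0x15, and it only handles non-nested fields that have a separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of POI).
final class FieldMarkerSketch {
    // 0x13 <field info> 0x14 <field value> 0x15, non-nested fields only.
    private static final Pattern FIELD = Pattern.compile(
            "\\x13([^\\x13\\x14\\x15]*)\\x14([^\\x13\\x14\\x15]*)\\x15");

    /** Returns one {instruction, result} pair per field found in the raw text. */
    static List<String[]> findFields(String raw) {
        List<String[]> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(raw);
        while (m.find()) {
            fields.add(new String[] { m.group(1).trim(), m.group(2) });
        }
        return fields;
    }

    public static void main(String[] args) {
        // Sample text built by hand; the control characters are written as escapes.
        String raw = "File name is \u0013 FILENAME \u0014esempio.doc\u0015";
        for (String[] f : findFields(raw)) {
            System.out.println("instruction=[" + f[0] + "] result=[" + f[1] + "]");
        }
    }
}
```

Once the markers are lost, as in the resaved document, the same text contains no recognisable field at all and findFields returns an empty list.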

The same problem also affects text extraction: a text portion like

File name is [esempio.doc]

in which "[esempio.doc]" represents a filename field, becomes

File name is STYLEREF TitoloDocumento \* MERGEFORMAT esempio.doc

in extracted text.
I've partially worked around this latter issue with the following Java method
(s is the text portion to clean)

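private static String rimuoviCampi(String s) {
	// Drop each field's instruction part: from the begin marker (0x13) through the separator (0x14)
	s = s.replaceAll("\\x13[^\\x13\\x14]*\\x14", "");
	// Drop the field end marker (0x15); the field result stays in the text
	s = s.replaceAll("\\x15", "");
	// Trim leading and trailing whitespace
	s = s.trim();
	return s;
}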

but the problem remains unsolved when saving the document.
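
An alternative to the regex workaround, sketched under the same assumption that the control characters 0x13, 0x14 and 0x15 are present in the extracted text: a small scan that tracks open fields, so that it also copes with fields that have no separator and with fields nested inside another field's instruction. The names FieldTextSketch and visibleText are hypothetical, this is not POI code, and like the regex it only helps text extraction, not saving.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, for illustration only (not part of POI).
final class FieldTextSketch {
    static final char FIELD_BEGIN = '\u0013';
    static final char FIELD_SEPARATOR = '\u0014';
    static final char FIELD_END = '\u0015';

    /** Keeps field results and ordinary text; drops markers and field instructions. */
    static String visibleText(String raw) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: true while still inside its instruction part.
        Deque<Boolean> open = new ArrayDeque<>();
        int inInstruction = 0; // number of open fields still in their instruction part
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == FIELD_BEGIN) {
                open.push(Boolean.TRUE);
                inInstruction++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost open field from instruction to result.
                if (!open.isEmpty() && open.peek()) {
                    open.pop();
                    open.push(Boolean.FALSE);
                    inInstruction--;
                }
            } else if (c == FIELD_END) {
                // A field closed without a separator was still in its instruction part.
                if (!open.isEmpty() && open.pop()) {
                    inInstruction--;
                }
            } else if (inInstruction == 0) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For example, calling visibleText on the raw form of "File name is [esempio.doc]" keeps the file name and drops the FILENAME instruction; text that carries no markers is returned unchanged.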

Thanks in advance

Domenico
Comment 1 Nick Burch 2008-11-12 06:46:15 UTC
There has been some HWPF work on fields that is in 3.2. Any chance you could re-test and see if it's now fixed?
Comment 2 dnapoletano 2008-11-14 08:02:37 UTC
Created attachment 22872
Sample document for testing writing
Comment 3 dnapoletano 2008-11-14 08:03:11 UTC
Created attachment 22873
Test doc after reading and rewriting
Comment 4 dnapoletano 2008-11-14 08:07:05 UTC
Sorry, but it's still destroying fields...

I've tested the latest POI source code (trunk revision) with the attached document, reading and writing the document as is with these two lines

HWPFDocument doc = new HWPFDocument(new FileInputStream("/home/jars/Desktop/FieldsTest.doc"));
doc.write(new FileOutputStream("/home/jars/Desktop/FieldsTest after.doc"));

where the document (contained in the FIRST attachment; the SECOND attachment contains the resaved document) contains:

1) a "num page" field, rendered *correctly*

2) a "num pages" field, rendered *correctly*

3) a "style ref" field, RENDERED AS TEXT: the original text 

   STYLEREF test

   with style "TitoloDocumento", becomes

   "TitoloDocumento"STYLEREF test

4) a "file name" field, RENDERED AS TEXT: the original text (the bare file name with extension) becomes

FILENAME FieldsTest.doc

5) a "TOC" field, RENDERED AS TEXT: the original TOC content

Heading paragraph in next page	2
Another heading paragraph in further page	3

becomes

TOC \f \o "1-9" \t "Intestazione 1;1" Heading paragraph in next page	2
Another heading paragraph in further page	3§
Comment 5 Yegor Kozlov 2011-06-24 08:17:41 UTC
The problem is still reproducible in trunk (as of r1138799).

Yegor
Comment 6 Sergey Vladimirov 2011-07-24 18:41:59 UTC
Seems to be fixed in trunk. Some formatting is missing though, but fields are in place, including headers and footers.