Opening and resaving a Word document with

    InputStream is = new FileInputStream("/home/esempio.doc");
    HWPFDocument docInput = new HWPFDocument(is);
    OutputStream os = new FileOutputStream("/home/TEST_POI.doc");
    docInput.write(os);

destroys all fields in the document (TOC items, STYLEREF and so on) and converts them to plain text; for example, a FILENAME field becomes "STYLEREF TitoloDocumento \* MERGEFORMAT esempio.doc".

The problem may lie in the handling of control characters: fields in MS Word are represented within the normal text as a sequence like

    0x13 <field info> 0x14 <field value> 0x15

and the text in the document saved by POI becomes

    <field info> <field value>

The same problem also affects text extraction: a text portion like

    File name is [esempio.doc]

in which "[esempio.doc]" represents a filename field, becomes

    File name is STYLEREF TitoloDocumento \* MERGEFORMAT esempio.doc

in the extracted text. I've partially solved this latter issue with the following Java method (s is the text portion to clean):

    private static String rimuoviCampi(String s) {
        // Drop everything from the field begin mark (0x13) up to the separator (0x14)
        s = s.replaceAll("\\x13[^\\x13\\x14]*\\x14", "");
        // Drop the field end marks (0x15), leaving only the field value
        s = s.replaceAll("\\x15", "");
        s = s.trim();
        return s;
    }

but the problem remains unsolved when saving the document.

Thanks in advance,
Domenico
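For reference, the 0x13 / 0x14 / 0x15 layout described above can also be handled with a small state machine instead of regular expressions. The following is only a sketch under assumptions: the class name FieldText and the method stripFieldCodes are hypothetical and not part of POI, and the input is assumed to be raw extracted text that still contains the three marker characters. Unlike the regex version, it also hides fields that have no separator and copes with nested fields.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of POI: keeps field values, drops field codes.
final class FieldText {
    static final char FIELD_BEGIN = 0x13;     // starts the field info (instruction)
    static final char FIELD_SEPARATOR = 0x14; // switches from field info to field value
    static final char FIELD_END = 0x15;       // closes the field

    static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: TRUE while we are still inside its field info.
        Deque<Boolean> open = new ArrayDeque<>();
        int hidden = 0; // number of open fields whose field info is being read
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                open.push(Boolean.TRUE);
                hidden++;
            } else if (c == FIELD_SEPARATOR) {
                if (!open.isEmpty()) {
                    if (open.pop()) {
                        hidden--;
                    }
                    open.push(Boolean.FALSE); // now inside the field value
                }
            } else if (c == FIELD_END) {
                if (!open.isEmpty() && open.pop()) {
                    hidden--; // field had no separator, so no value to show
                }
            } else if (hidden == 0) {
                out.append(c); // ordinary text or part of a field value
            }
        }
        return out.toString();
    }
}
```

Calling FieldText.stripFieldCodes on the text portion from the example above would keep "File name is esempio.doc" and drop the field info. This only helps with text extraction; it does nothing for the saving problem.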
There has been some hwpf work on fields that is in 3.2. Any chance you could re-test and see if it's now fixed?
Created attachment 22872: Sample document for testing writing
Created attachment 22873: Test doc after reading and rewriting
Sorry, but it's still destroying fields...

I've tested the latest POI source code (trunk revision) with the attached document, trying to read and write the document as is, with the two lines

    HWPFDocument doc = new HWPFDocument (new FileInputStream ("/home/jars/Desktop/FieldsTest.doc"));
    doc.write(new FileOutputStream ("/home/jars/Desktop/FieldsTest after.doc"));

The document (contained in the FIRST attachment; the SECOND attachment contains the resaved document) contains:

1) a "num page" field, rendered *correctly*

2) a "num pages" field, rendered *correctly*

3) a "style ref" field, RENDERED AS TEXT: the original text

    STYLEREF test

with style "TitoloDocumento", becomes

    "TitoloDocumento"STYLEREF test

4) a "file name" field, RENDERED AS TEXT: the original text (the bare file name with extension) becomes

    FILENAME FieldsTest.doc

5) a "TOC" field, RENDERED AS TEXT: the original TOC content

    Heading paragraph in next page 2
    Another heading paragraph in further page 3

becomes

    TOC \f \o "1-9" \t "Intestazione 1;1" Heading paragraph in next page 2
    Another heading paragraph in further page 3
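A quick way to check whether a round trip has lost fields is to count the field marker characters in the raw text before and after saving. This is a minimal sketch under assumptions: the class FieldMarkerCount and its methods are hypothetical names, not part of POI, and the two strings are assumed to be the raw document text (markers included) of the original and the resaved file, however that text is obtained.

```java
import java.util.Arrays;

// Hypothetical diagnostic, not part of POI: compares field marker counts.
final class FieldMarkerCount {
    // Returns {begin marks (0x13), separators (0x14), end marks (0x15)}.
    static int[] count(String text) {
        int[] counts = new int[3];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x13 && c <= 0x15) {
                counts[c - 0x13]++;
            }
        }
        return counts;
    }

    // True when both texts contain the same number of each marker.
    static boolean sameFieldStructure(String before, String after) {
        return Arrays.equals(count(before), count(after));
    }
}
```

If sameFieldStructure returns false for the two attachments, at least one field has been flattened to plain text. Equal counts do not prove the fields are intact, so this is only a first sanity check.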
The problem is still reproducible in trunk (as of r1138799).

Yegor
Seems to be fixed in trunk. Some formatting is still missing, but the fields are in place, including in headers and footers.