There is an extractor (converter?) from DOC to FO in the hdf package, but it is not based on HWPF code, and it supports neither images nor customization. The attached patch proposes a new extractor:
- it is based on HWPF code
- it builds the FO document through DOM, not by string concatenation
- with a suitable ImageHandler implementation it can even convert MathType equations (i.e. extract their WMF and let your ImageHandler do the rest)
Some things are not tested yet (for example images, shapes and nested tables), but the current code already produces nice PDF documents (with Apache FOP).
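To illustrate the "DOM creation, not string building" point, here is a minimal sketch that uses only the JDK's javax.xml classes. It is not the code from the patch: the class name FoDomSketch, the master name "page" and the method names are made up for this example. The benefit it shows is that text such as "&" or "<" is escaped by the serializer rather than by hand.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of POI: builds a one-block XSL-FO document via DOM.
class FoDomSketch {
    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    static Document buildFo(String text) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();

            Element root = doc.createElementNS(FO_NS, "fo:root");
            doc.appendChild(root);

            // Page layout: a single page master with a body region.
            Element layout = doc.createElementNS(FO_NS, "fo:layout-master-set");
            root.appendChild(layout);
            Element master = doc.createElementNS(FO_NS, "fo:simple-page-master");
            master.setAttribute("master-name", "page"); // name is an assumption
            layout.appendChild(master);
            master.appendChild(doc.createElementNS(FO_NS, "fo:region-body"));

            // Content: one page sequence, one flow, one block of text.
            Element sequence = doc.createElementNS(FO_NS, "fo:page-sequence");
            sequence.setAttribute("master-reference", "page");
            root.appendChild(sequence);
            Element flow = doc.createElementNS(FO_NS, "fo:flow");
            flow.setAttribute("flow-name", "xsl-region-body");
            sequence.appendChild(flow);
            Element block = doc.createElementNS(FO_NS, "fo:block");
            block.appendChild(doc.createTextNode(text)); // no manual escaping
            flow.appendChild(block);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(buildFo("Tom & Jerry <1>")));
    }
}
```

With string building, every text fragment would need explicit escaping before being appended. Here the DOM text node carries the raw text and the serializer takes care of the markup characters.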
Created attachment 27143 [details] patch
Thanks for the patch. Can you upload some samples so we can test your code? It would be nice to see a source .doc file, the produced XSL-FO and the resulting PDF. How do you run FOP, via command line? Please post the exact command. Regards, Yegor
How much effort are you going to invest in this utility? In its current form it is a proof of concept rather than a full-featured converter.

I was able to generate XSL-FO for some files from our collection of test Word documents. FOP 1.0 renders simple files, but stumbles on more complex ones. Several times I've seen this:

Caused by: org.xml.sax.SAXParseException: Character reference "" is an invalid XML character.
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)

and this:

javax.xml.transform.TransformerException: org.apache.fop.fo.ValidationException: Border and padding for fo:region-body "xsl-region-body" should be '0' (See 6.4.14 in XSL 1.1); non-standard values are allowed if relaxed validation is enabled. (See position 5:411)
    at org.apache.fop.cli.InputHandler.transformTo(InputHandler.java:302)
    at org.apache.fop.cli.InputHandler.renderTo(InputHandler.java:130)
    at org.apache.fop.cli.Main.startFOP(Main.java:174)

Also, I see warnings in the console:

Jun 11, 2011 7:19:10 PM org.apache.fop.events.LoggingEventListener processEvent
SEVERE: Invalid property value encountered in linefeed-treatment="false": org.apache.fop.fo.expr.PropertyException: file:/C:/temp/DiffFirstPageHeadFoot.xml:9:38: No conversion defined false; property:'linefeed-treatment' (See position 10:520)
Jun 11, 2011 7:19:10 PM org.apache.fop.events.LoggingEventListener processEvent
SEVERE: Invalid property value encountered in linefeed-treatment="false": org.apache.fop.fo.expr.PropertyException: file:/C:/temp/DiffFirstPageHeadFoot.xml:9:38: No conversion defined false; property:'linefeed-treatment' (See position 16:520)
Jun 11, 2011 7:19:10 PM org.apache.fop.events.LoggingEventListener processEvent
SEVERE: Invalid property value encountered in keep-together.within-page="true": org.apache.fop.fo.expr.PropertyException: file:/C:/temp/DiffFirstPageHeadFoot.xml:9:38: No conversion defined true; property:'keep-together.within-page' (See position 20:578)
Jun 11, 2011 7:19:10 PM org.apache.fop.events.LoggingEventListener processEvent
SEVERE: Invalid property value encountered in keep-with-next.within-page="true": org.apache.fop.fo.expr.PropertyException: file:/C:/temp/DiffFirstPageHeadFoot.xml:9:38: No conversion defined true; property:'keep-with-next.within-page' (See position 20:578)
Jun 11, 2011 7:19:10 PM org.apache.fop.events.LoggingEventListener processEvent

WordToFoExtractor is a very promising feature; the question is where it should live: in poi-examples as a simple demo, or with the rest of the HWPF code as well-tested code.

Regards, Yegor
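On the "invalid XML character" failure: the offending character reference did not survive in the message above, so which character triggered it is unknown. One plausible source (an assumption, not something confirmed in this report) is the set of control characters Word uses as internal marks, which XML 1.0 does not allow even as character references. A minimal sketch of filtering text before it goes into a DOM text node; the class and method names are hypothetical:

```java
// Hypothetical helper, not part of POI: drops characters that XML 1.0 forbids.
class XmlTextSanitizer {

    // The XML 1.0 "Char" production: tab, LF, CR, and the ranges below.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String stripInvalid(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u0001 and \u0013 are removed, the tab is kept.
        System.out.println(stripInvalid("a\u0001b\u0013c\td"));
    }
}
```

Silently dropping characters loses information, so a real converter might prefer to interpret such marks (fields, pictures, cell ends) before they reach the output. The filter only shows the minimum needed to keep the FO file parseable.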
Yegor, I will test and upload an updated patch and examples using the test doc files from the POI collection of test files. Currently FOP is called through an additional chunk of code that also handles images. That code is tightly linked to our internal system (for example, it includes converting images from WMF to SVG), so extracting an example is tricky. This is currently proof-of-concept code, but I believe it will be part of a production system in several months. I also believe the functionality of the new extractor is better than (or at least not worse than) the old org.apache.poi.hdf.extractor package. Personally I intend to continue supporting this code until it is ready for production usage, and maybe implement an xls-to-fo extractor. My goal is to create a one-way converter with maximum readability, i.e. without losing text, but perhaps without some formatting. Should I also note that doc-to-html should be easy to implement? ;) Regards, Sergey
Created attachment 27153 [details] Updated patch Fixed "keep" attributes; Removed ImageHandler interface (handle images by extending extractor)
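For context on the "keep" fix: FOP rejected "true" and "false" because these XSL-FO properties do not take boolean values. In XSL 1.1 the keep-* components accept "auto", "always" or an integer strength, and linefeed-treatment accepts "ignore", "preserve", "treat-as-space" or "treat-as-zero-width-space". How the updated patch maps Word flags to these values is not shown in this thread; the sketch below is only one possible mapping, with hypothetical names.

```java
// Hypothetical helper, not the code from the patch: maps boolean flags
// to values that XSL-FO actually defines for these properties.
class FoKeepValues {

    // keep-together.within-page, keep-with-next.within-page, etc.
    static String keep(boolean keepOn) {
        return keepOn ? "always" : "auto";
    }

    // linefeed-treatment: "treat-as-space" is the initial value in XSL 1.1.
    static String linefeedTreatment(boolean preserveLinefeeds) {
        return preserveLinefeeds ? "preserve" : "treat-as-space";
    }

    public static void main(String[] args) {
        System.out.println("keep-together.within-page=\"" + keep(true) + "\"");
        System.out.println("linefeed-treatment=\"" + linefeedTreatment(false) + "\"");
    }
}
```

Since "auto" and "treat-as-space" are the initial values, an alternative is simply to omit the attribute when the flag is off.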
POI doc sources, fo xml and PDF results http://www.sendspace.com/file/sf4o24 2.5 Mb in size
Patch applied in r1135414. I added a Java main() to WordToFoExtractor; this is how I tested your code:

Usage: WordToFoExtractor <inputFile.doc> <saveTo.fo>

Except for formatting and indentation, the output is identical to the XML in the uploaded archive.

> Personally I intend to continue supporting this code until it is ready for
> production usage, and maybe implement an xls-to-fo extractor. My goal is to create
> a one-way converter with maximum readability, i.e. without losing text, but perhaps
> without some formatting.

Great! It will be a really valuable contribution. An xls-to-fo converter has been requested several times on the mailing lists. We already have ToCSV and ToHtml apps in poi-examples, and xls-to-fo can borrow or share code from them.

> Should I also note that doc-to-html should be easy to implement? ;)

That would be nice to have too.

P.S. I'm leaving this ticket open for updates. Close it if you prefer to upload new patches in a new ticket.

Regards, Yegor
Created attachment 27155 [details] Add most of images handling, except cropping

Added all the image handling I could, except cropping: I can't find a way to obtain this information, nor the picture SPRMs. See testPictures.doc.pdf and testCroppedPictures.doc.pdf for examples. http://www.sendspace.com/file/adie2l
Applied in r1136001. It looks like you did not attach the most recent version of your code: createExternalGraphic is never called and the fo:external-graphic element is missing from the output. No problem, I expect it in the next patches. Regards, Yegor
Yegor, graphic handling won't be part of the extractor code. It needs a lot of additional code AND additional libraries such as Apache Batik, or even calls to ImageMagick. File creation and cleanup would have to be coded as well. So there is an empty processImage() method that should be implemented in a subclass if anyone wants images to be included in the XSL-FO. createExternalGraphic() and setImageProperties() are helper methods for those people.
> Graphic handling won't be part of extractor code. It's a lot of additional code
> AND additional libraries like Apache Batik or even ImageMagick calls. Also file
> creation and cleaning up should be coded.
>
> So there is an empty processImage() method that should be implemented in
> subclass if anyone wants image to be included in XSL FO. createExternalGraphic()
> and setImageProperties() are helper methods for those people.

I see, but we can provide default support for png/jpeg with minimal effort! I added the following code and it worked for me:

    protected void processImage(Element currentBlock, boolean inlined, Picture picture) {
        byte[] bytes = picture.getContent();
        String ext = picture.getMimeType();
        if (ext.equals("image/jpeg") || ext.equals("image/png")) {
            File file = new File(picture.suggestFullFileName());
            try {
                // dump images in the work dir
                FileOutputStream out = new FileOutputStream(file);
                out.write(bytes);
                out.close();
                Element graphics = createExternalGraphic(file.toURI().toASCIIString());
                WordToFoUtils.setPictureProperties(picture, graphics);
                currentBlock.appendChild(graphics);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

I agree that handling other mime types is not trivial and may involve third-party libraries, but jpeg and png are the most common and should be supported by default. Does that make sense to you?

Regards, Yegor
Yegor, yes, the provided code should work correctly for png and jpeg images. I assume FOP can also handle BMP, TIFF and GIF, so they could be listed there as well. Maybe even WMF, according to http://xmlgraphics.apache.org/fop/0.95/graphics.html (but I wouldn't rely on it). The main question, though, is about cleaning up the files after the work is done. Where should those image files be stored? Who should delete them, and when? What happens to those files in case of an exception? (And for your code: what happens in case of parallel processing?) Either this part is handled by external code, or it is handled by the extractor code. In the second case we will need some kind of close() or cleanup() method to delete those files after FOP processing. Regards, Sergey. P.S.: I subscribed to the poi-user/poi-dev mailing lists, so we can move the discussion there.
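One way the cleanup question could be answered, sketched under assumptions: each conversion gets its own temporary directory (which also sidesteps name clashes during parallel processing), and a close() method deletes everything once FOP has finished. ImageWorkDir and its methods are hypothetical names for this example, not POI API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of POI: owns the image files of one conversion.
class ImageWorkDir implements AutoCloseable {
    private final Path dir;
    private final List<Path> files = new ArrayList<>();

    ImageWorkDir() {
        try {
            // A fresh directory per instance, so parallel conversions never collide.
            dir = Files.createTempDirectory("word-to-fo-");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    Path directory() {
        return dir;
    }

    // Writes one image and returns its path; extension is e.g. ".png".
    Path store(String extension, byte[] content) {
        try {
            Path file = Files.createTempFile(dir, "img-", extension);
            Files.write(file, content);
            files.add(file);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deletes the stored files and then the directory itself.
    @Override
    public void close() {
        try {
            for (Path file : files) {
                Files.deleteIfExists(file);
            }
            Files.deleteIfExists(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        try (ImageWorkDir work = new ImageWorkDir()) {
            Path image = work.store(".png", new byte[] {1, 2, 3});
            System.out.println("stored " + image.toUri().toASCIIString());
            // ... run FOP here while the files still exist ...
        } // files are removed here, also when an exception is thrown
    }
}
```

The try-with-resources block covers the exception case: the files are removed whether or not rendering succeeds. The caller decides the lifetime, which matches the "handled by external code" option; an extractor-owned variant would expose the same close() itself.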
Created attachment 27177 [details] new patch Add hyperlinks support; Add common fields support; Split WordToFoExtractor and AbstractToFoExtractor
Created attachment 27178 [details] Additional test docs
Created attachment 27198 [details] Patch to fix ListEntryNoListTable and MBD001D0B89 tests; additional tests
(In reply to comment #15)
> Created attachment 27198 [details]
> Patch to fix ListEntryNoListTable and MBD001D0B89 tests; additional tests

Applied in r1138836

Yegor
I made a small change in TestWordToFoExtractorSuite and added an option to exclude certain files from the suite. I resolved an old Bug 33519, which complained that HWPF failed to open a document, and added the problematic file to our test collection. As a result, TestWordToFoExtractorSuite started to fail on Bug33519.doc with an NPE:

java.lang.NullPointerException
    at org.apache.poi.hwpf.extractor.WordToFoExtractor.processCharacters(WordToFoExtractor.java:255)
    at org.apache.poi.hwpf.extractor.WordToFoExtractor.processParagraph(WordToFoExtractor.java:492)
    at org.apache.poi.hwpf.extractor.WordToFoExtractor.processSectionParagraphes(WordToFoExtractor.java:571)
    at org.apache.poi.hwpf.extractor.WordToFoExtractor.processSection(WordToFoExtractor.java:519)
    at org.apache.poi.hwpf.extractor.WordToFoExtractor.processDocument(WordToFoExtractor.java:332)
    at org.apache.poi.hwpf.extractor.WordToFoExtractor.process(WordToFoExtractor.java:167)

Yegor
Created attachment 27204 [details] Workaround for NPE in Bug 33519

(In reply to comment #17)
> I resolved an old Bug 33519 which complained that HWPF failed on open a
> document and added the problematic file to our test collection. As result,
> TestWordToFoExtractorSuite started to fail on Bug33519.doc with a NPE:

Workaround in the proposed patch.
Created attachment 27205 [details] Workaround for NPE in Bug 33519
(In reply to comment #19)
> Created attachment 27205 [details]
> Workaround for NPE in Bug 33519

It still fails on Bug33519.doc, but with a different exception:

java.lang.IllegalArgumentException: The end (1077) must not be before the start (1985)
    at org.apache.poi.hwpf.usermodel.Range.sanityCheckStartEnd(Range.java:243)
    at org.apache.poi.hwpf.usermodel.Range.<init>(Range.java:176)
    at org.apache.poi.hwpf.usermodel.CharacterRun.<init>(CharacterRun.java:97)
    at org.apache.poi.hwpf.usermodel.Range.getCharacterRun(Range.java:802)
    at org.apache.poi.hwpf.extractor.WordToFoExtractor.processCharacters(WordToFoExtractor.java:243)
    at org.apache.poi.hwpf.extractor.WordToFoExtractor.processParagraph(WordToFoExtractor.java:529)
    at org.apache.poi.hwpf.extractor.WordToFoExtractor.processSectionParagraphes(WordToFoExtractor.java:608)
    at org.apache.poi.hwpf.extractor.WordToFoExtractor.processSection(WordToFoExtractor.java:556)
    at org.apache.poi.hwpf.extractor.WordToFoExtractor.processDocument(WordToFoExtractor.java:341)
    at org.apache.poi.hwpf.extractor.WordToFoExtractor.process(WordToFoExtractor.java:167)
    at org.apache.poi.hwpf.extractor.WordToFoExtractor.main(WordToFoExtractor.java:142)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)

Shall I commit this patch or wait for the next one?

Yegor
Created attachment 27207 [details] Latest patch

Okay, here is the latest patch. All tests pass. The problem raised by Bug 33519 is not solved; there is just a workaround. It seems that CHPX are NOT SORTED, so the Range class is wrong to assume that they are. If so, _charStart and _charEnd should be removed and all code linked to those fields should be rewritten. That is a big task, so I would like some kind of confirmation that my assumption about the missing CHPX order is correct. In addition, this patch includes (in fact, it is the main part of it) a doc-to-html extractor.
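A small diagnostic sketch for checking the "not sorted" hypothesis, kept independent of HWPF on purpose: it works on plain (start, end) pairs, which would have to be filled from the document's character runs by code not shown here. The class names and the sample offsets are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical diagnostic, not part of POI: checks whether runs are ordered by start offset.
class RunOrderCheck {

    static final class Run {
        final int start;
        final int end;

        Run(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Returns the index of the first run that starts before its predecessor, or -1 if ordered.
    static int firstOutOfOrder(List<Run> runs) {
        for (int i = 1; i < runs.size(); i++) {
            if (runs.get(i).start < runs.get(i - 1).start) {
                return i;
            }
        }
        return -1;
    }

    // Returns a copy ordered by start offset; the input list is left untouched.
    static List<Run> sortedCopy(List<Run> runs) {
        List<Run> copy = new ArrayList<>(runs);
        copy.sort(Comparator.comparingInt((Run r) -> r.start));
        return copy;
    }

    public static void main(String[] args) {
        // Made-up offsets: the third run starts before the second one.
        List<Run> runs = Arrays.asList(new Run(0, 100), new Run(300, 400), new Run(100, 300));
        System.out.println("first out-of-order index: " + firstOutOfOrder(runs));
        System.out.println("after sorting: " + firstOutOfOrder(sortedCopy(runs)));
    }
}
```

If such a check reports out-of-order runs for Bug33519.doc, that would support the assumption. Sorting a copy is only a cheap experiment and says nothing about whether Range should keep index-based fields.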
Created attachment 27208 [details] Latest patch
I'm closing this bug because all patches have been applied. New patches can go straight into SVN (I have committer access now). Patches from other users are still welcome (as new issues).