51351 – New Doc to FO extractor

Summary: New Doc to FO extractor

Status:	RESOLVED FIXED

Alias:	None

Product:	POI
Classification:	Unclassified
Component:	HWPF
Version:	3.8-dev
Hardware:	All All

Importance:	P2 enhancement
Target Milestone:	---
Assignee:	POI Developers List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2011-06-09 21:01 UTC by Sergey Vladimirov
Modified:	2011-07-04 19:53 UTC
CC List:	0 users

Attachments
patch (33.59 KB, patch) 2011-06-09 21:01 UTC, Sergey Vladimirov
Updated patch (31.53 KB, patch) 2011-06-13 23:09 UTC, Sergey Vladimirov
Add most of images handling, except cropping (16.53 KB, patch) 2011-06-14 14:58 UTC, Sergey Vladimirov
new patch (51.97 KB, patch) 2011-06-20 09:05 UTC, Sergey Vladimirov
Additional test docs (7.88 KB, application/zip) 2011-06-20 09:07 UTC, Sergey Vladimirov
Patch to fix ListEntryNoListTable and MBD001D0B89 tests; additional tests (6.54 KB, patch) 2011-06-23 08:11 UTC, Sergey Vladimirov
Workaround for NPE in Bug 33519 (5.21 KB, application/octet-stream) 2011-06-24 10:08 UTC, Sergey Vladimirov
Workaround for NPE in Bug 33519 (5.13 KB, patch) 2011-06-25 14:02 UTC, Sergey Vladimirov
Latest patch (138.40 KB, patch) 2011-06-27 09:15 UTC, Sergey Vladimirov
Latest patch (153.99 KB, patch) 2011-06-27 09:37 UTC, Sergey Vladimirov

Description Sergey Vladimirov 2011-06-09 21:01:28 UTC

There is an extractor (converter?) from DOC to FO in the hdf package, but it is not based on HWPF code, and it supports neither images nor customization.

The patch proposes a new extractor:
 - it is based on HWPF code
 - it builds the FO document through DOM, not by string building
 - with a suitable ImageHandler implementation it can even convert MathType equations (i.e. extract their WMF and let your ImageHandler do the rest)

Some things are not tested yet (for example, images, shapes or nested tables), but the current code already creates nice PDF documents (with Apache FOP).
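
To illustrate the "DOM creation, not string building" point above, here is a minimal sketch that builds an XSL-FO skeleton with the standard JAXP DOM API. It is not code from the patch: the class name FoDomSketch, the method buildMinimalFo and the master name "page" are made up for this example. Text is stored in DOM text nodes as-is, so escaping is left to whatever serializer writes the document out.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of the patch.
final class FoDomSketch {
    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    /** Builds an FO skeleton with one page master and one block holding the text. */
    static Document buildMinimalFo(String text) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable DOM implementation", e);
        }

        Element root = doc.createElementNS(FO_NS, "fo:root");
        doc.appendChild(root);

        // Page layout: a single page master with a body region.
        Element layout = doc.createElementNS(FO_NS, "fo:layout-master-set");
        root.appendChild(layout);
        Element pageMaster = doc.createElementNS(FO_NS, "fo:simple-page-master");
        pageMaster.setAttribute("master-name", "page");
        layout.appendChild(pageMaster);
        pageMaster.appendChild(doc.createElementNS(FO_NS, "fo:region-body"));

        // Content: one page sequence flowing into the body region.
        Element pageSequence = doc.createElementNS(FO_NS, "fo:page-sequence");
        pageSequence.setAttribute("master-reference", "page");
        root.appendChild(pageSequence);
        Element flow = doc.createElementNS(FO_NS, "fo:flow");
        flow.setAttribute("flow-name", "xsl-region-body");
        pageSequence.appendChild(flow);

        // The text goes into a text node, so no manual escaping is needed here.
        Element block = doc.createElementNS(FO_NS, "fo:block");
        block.appendChild(doc.createTextNode(text));
        flow.appendChild(block);
        return doc;
    }
}
```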

Comment 1 Sergey Vladimirov 2011-06-09 21:01:59 UTC

Created attachment 27143 [details]
patch

Comment 2 Yegor Kozlov 2011-06-10 07:19:27 UTC

Thanks for the patch.

Can you upload some samples so we can test your code? It would be nice to see a source .doc file, the produced XSL-FO and the resulting PDF. 

How do you run FOP, via command line? Please post the exact command.

Regards,
Yegor

Comment 3 Yegor Kozlov 2011-06-11 15:58:55 UTC

How much effort are you going to invest in this utility? In its current form it is more a proof of concept than a full-featured converter.

I was able to generate XSL-FO for some files from our collection of test Word documents. FOP 1.0 renders simple files, but stumbles on more complex ones:

Several times I've seen this:

Caused by: org.xml.sax.SAXParseException: Character reference "&#12" is an invalid XML character.
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)

and this:

javax.xml.transform.TransformerException: org.apache.fop.fo.ValidationException: Border and padding for fo:region-body "xsl-region-body" should be '0' (See 6.4.14 in XSL 1.1); non-standard values are allowed if relaxed validation is enabled.  (See position 5:411)
       at org.apache.fop.cli.InputHandler.transformTo(InputHandler.java:302)
       at org.apache.fop.cli.InputHandler.renderTo(InputHandler.java:130)
       at org.apache.fop.cli.Main.startFOP(Main.java:174)

Also, I see warnings in the console:

Jun 11, 2011 7:19:10 PM org.apache.fop.events.LoggingEventListener processEvent
SEVERE: Invalid property value encountered in linefeed-treatment="false": org.apache.fop.fo.expr.PropertyException: file:/C:/temp/DiffFirstPageHeadFoot.xml:9:38: No conversion defined false; property:'linefeed-treatment' (See position 10:520)
Jun 11, 2011 7:19:10 PM org.apache.fop.events.LoggingEventListener processEvent
SEVERE: Invalid property value encountered in linefeed-treatment="false": org.apache.fop.fo.expr.PropertyException: file:/C:/temp/DiffFirstPageHeadFoot.xml:9:38: No conversion defined false; property:'linefeed-treatment' (See position 16:520)
Jun 11, 2011 7:19:10 PM org.apache.fop.events.LoggingEventListener processEvent
SEVERE: Invalid property value encountered in keep-together.within-page="true": org.apache.fop.fo.expr.PropertyException: file:/C:/temp/DiffFirstPageHeadFoot.xml:9:38: No conversion defined true; property:'keep-together.within-page' (See position 20:578)
Jun 11, 2011 7:19:10 PM org.apache.fop.events.LoggingEventListener processEvent
SEVERE: Invalid property value encountered in keep-with-next.within-page="true": org.apache.fop.fo.expr.PropertyException: file:/C:/temp/DiffFirstPageHeadFoot.xml:9:38: No conversion defined true; property:'keep-with-next.within-page' (See position 20:578)
Jun 11, 2011 7:19:10 PM org.apache.fop.events.LoggingEventListener processEvent 

WordToFoExtractor is a very promising feature. The question is where it should live: in poi-examples as a simple demo, or with the rest of the HWPF code as well-tested code.

Regards,
Yegor
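
A note on the first exception in this comment: character 12 is form feed (U+000C), which XML 1.0 does not allow even as a character reference. One possible way to avoid it, sketched below, is to drop characters outside the XML 1.0 Char production before they reach a text node. This is only an illustration, not what the patch does; the class name XmlTextSanitizer and its methods are hypothetical, and a real converter might prefer to map such control characters to page or line breaks instead of dropping them.

```java
// Hypothetical helper, not part of the patch.
final class XmlTextSanitizer {

    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isValidXml10(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Returns the input without characters that XML 1.0 cannot represent. */
    static String stripInvalidXmlChars(String text) {
        StringBuilder result = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (isValidXml10(codePoint)) {
                result.appendCodePoint(codePoint);
            }
            // Advance by one or two chars depending on surrogate pairs.
            i += Character.charCount(codePoint);
        }
        return result.toString();
    }
}
```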

Comment 4 Sergey Vladimirov 2011-06-13 21:42:36 UTC

Yegor,

I will test and upload an updated patch and examples using doc files from the POI collection of test files.

Currently FOP is called through an additional bunch of code, including image handling. This is tightly linked to our internal system (for example, it includes converting images from WMF to SVG format), so extracting an example is tricky.

This is currently proof-of-concept code, but I believe it will be part of a production system in several months. I also believe the functionality of the new extractor is better than (or at least no worse than) the old org.apache.poi.hdf.extractor package.

Personally I intend to continue supporting this code until it is ready for production usage, and maybe implement an xls-to-fo extractor. My goal is to create a one-way converter with maximum readability, i.e. without lost text but maybe without some formatting.

Should I also note that doc-to-html should be easy to implement? ;)

Regards,
Sergey

Comment 5 Sergey Vladimirov 2011-06-13 23:09:58 UTC

Created attachment 27153 [details]
Updated patch

Fixed "keep" attributes;
Removed ImageHandler interface (handle images by extending extractor)
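
For context on the "keep" fix: the SEVERE messages in Comment 3 come from boolean literals ("true"/"false") written into properties that XSL 1.1 defines with keyword values. keep-together and keep-with-next accept auto, always or an integer, and linefeed-treatment accepts ignore, preserve, treat-as-space or treat-as-zero-width-space. The sketch below shows one possible mapping; it is not taken from the updated patch, and the class name FoKeepValues and its methods are hypothetical.

```java
import org.w3c.dom.Element;

// Hypothetical helper, not part of the patch.
final class FoKeepValues {

    /** Maps a boolean paragraph flag to an XSL-FO keep keyword. */
    static String keepValue(boolean keep) {
        return keep ? "always" : "auto";
    }

    /** Writes keep properties as FO keywords instead of "true"/"false". */
    static void setKeeps(Element block, boolean keepTogether, boolean keepWithNext) {
        block.setAttribute("keep-together.within-page", keepValue(keepTogether));
        block.setAttribute("keep-with-next.within-page", keepValue(keepWithNext));
    }
}
```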

Comment 6 Sergey Vladimirov 2011-06-13 23:18:00 UTC

POI doc sources, fo xml and PDF results
http://www.sendspace.com/file/sf4o24
2.5 Mb in size

Comment 7 Yegor Kozlov 2011-06-14 09:14:02 UTC

Patch applied in r1135414. I added a Java main() to WordToFoExtractor; this is how I tested your code:

Usage: WordToFoExtractor <inputFile.doc> <saveTo.fo>

Except for formatting and indentation, the output is identical to the XML in the uploaded archive. 

> 
> Personally I intend to continue supporting this code until it is ready for
> production usage, and maybe implement an xls-to-fo extractor. My goal is to
> create a one-way converter with maximum readability, i.e. without lost text
> but maybe without some formatting.
> 

Great! It will be a really valuable contribution.

An xls-to-fo converter has been asked for several times on the mailing lists. We already have ToCSV and ToHtml apps in poi-examples, and xls-to-fo can borrow/share code from them.

> Should I also note that doc-to-html should be easy to implement? ;)
> 

That would be nice to have too. 

P.S. I'm leaving this ticket open for updates. Close it if you prefer to upload new patches in a new ticket.

Regards,
Yegor

Comment 8 Sergey Vladimirov 2011-06-14 14:58:59 UTC

Created attachment 27155 [details]
Add most of images handling, except cropping

Added all possible image handling except cropping: I can't find a way to obtain this information, nor the picture SPRMs.

See testPictures.doc.pdf and testCroppedPictures.doc.pdf for examples.

http://www.sendspace.com/file/adie2l

Comment 9 Yegor Kozlov 2011-06-15 11:49:17 UTC

Applied in r1136001

It looks like the version of the code you attached is not the most recent one: createExternalGraphic is never called and the fo:external-graphic element is missing from the output. No problem, I expect it in the next patches.

Regards,
Yegor

Comment 10 Sergey Vladimirov 2011-06-15 12:05:49 UTC

Yegor,

Graphic handling won't be part of the extractor code. It needs a lot of additional code AND additional libraries like Apache Batik or even ImageMagick calls. File creation and cleanup would also have to be coded.

So there is an empty processImage() method that should be implemented in a subclass if anyone wants images to be included in the XSL-FO. createExternalGraphic() and setImageProperties() are helper methods for those people.

Comment 11 Yegor Kozlov 2011-06-16 08:03:03 UTC

> 
> Graphic handling won't be part of the extractor code. It needs a lot of
> additional code AND additional libraries like Apache Batik or even
> ImageMagick calls. File creation and cleanup would also have to be coded.
> 
> So there is an empty processImage() method that should be implemented in a
> subclass if anyone wants images to be included in the XSL-FO.
> createExternalGraphic() and setImageProperties() are helper methods for those people.

I see, but we can provide default support for png/jpeg with minimal effort! I added the following code and it worked for me:


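    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import org.apache.poi.hwpf.usermodel.Picture;
    import org.w3c.dom.Element;

    protected void processImage(Element currentBlock, boolean inlined,
            Picture picture) {

        byte[] bytes = picture.getContent();
        String mimeType = picture.getMimeType();
        // handle only JPEG and PNG by default; other types are skipped
        if ("image/jpeg".equals(mimeType) || "image/png".equals(mimeType)) {
            File file = new File(picture.suggestFullFileName());

            // dump images in the work dir; try-with-resources closes the
            // stream even if write() fails
            try (FileOutputStream out = new FileOutputStream(file)) {
                out.write(bytes);
            } catch (IOException e) {
                e.printStackTrace();
                return; // no graphic element if the file could not be written
            }

            // reference the dumped file from the FO output
            Element graphics = createExternalGraphic(file.toURI().toASCIIString());
            WordToFoUtils.setPictureProperties(picture, graphics);
            currentBlock.appendChild(graphics);
        }

    }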

I agree that handling other MIME types is not trivial and may involve third-party libraries, but JPEG and PNG are the most common and should be supported by default.

Does that make sense to you?

Regards,
Yegor

Comment 12 Sergey Vladimirov 2011-06-16 09:39:26 UTC

Yegor,

Yes, the provided code works correctly for PNG and JPEG images. I assume FOP can also handle BMP, TIFF and GIF, so they can be listed there as well. Maybe even WMF, according to http://xmlgraphics.apache.org/fop/0.95/graphics.html (but I wouldn't advise relying on it).

But the main question is about cleaning up the files afterwards. Where should those image files be stored? Who should delete them, and when? What happens to those files in case of an exception? (And for your code: what happens in case of parallel processing?)

Either this part is handled by external code, or it can be handled by the extractor code. In the second case we will need some kind of close() or cleanup() method to delete those files after FOP processing.

Regards,
Sergey.

P.S.: I subscribed to the poi-user/poi-dev mailing lists, so we can move the discussion there.
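
One way the close()/cleanup() idea from this comment could look is sketched below. It is an assumption, not code from any patch: the class ImageWorkDir and its methods are made up for this example. Each conversion gets its own temporary directory, which sidesteps file-name clashes during parallel processing, and close() removes everything once FOP has finished.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the patch.
final class ImageWorkDir implements AutoCloseable {
    private final Path dir;
    private final List<Path> files = new ArrayList<>();

    /** Creates a private temp directory for one conversion run. */
    ImageWorkDir() {
        try {
            dir = Files.createTempDirectory("doc2fo-images");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes one image and returns its path, e.g. for fo:external-graphic. */
    Path store(String extension, byte[] content) {
        Path target = dir.resolve("image" + files.size() + extension);
        try {
            Files.write(target, content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        files.add(target);
        return target;
    }

    /** Deletes all stored images and the directory itself. */
    @Override
    public void close() {
        try {
            for (Path file : files) {
                Files.deleteIfExists(file);
            }
            Files.deleteIfExists(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Used in a try-with-resources block around both the conversion and the FOP run, this deletes the files on the exception path as well.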

Comment 13 Sergey Vladimirov 2011-06-20 09:05:50 UTC

Created attachment 27177 [details]
new patch

Add hyperlinks support;
Add common fields support;
Split WordToFoExtractor and AbstractToFoExtractor

Comment 14 Sergey Vladimirov 2011-06-20 09:07:28 UTC

Created attachment 27178 [details]
Additional test docs

Comment 15 Sergey Vladimirov 2011-06-23 08:11:35 UTC

Created attachment 27198 [details]
Patch to fix ListEntryNoListTable and MBD001D0B89 tests; additional tests

Comment 16 Yegor Kozlov 2011-06-23 11:28:58 UTC

Applied in r1138836

Yegor

(In reply to comment #15)
> Created attachment 27198 [details]
> Patch to fix ListEntryNoListTable and MBD001D0B89 tests; additional tests

Comment 17 Yegor Kozlov 2011-06-24 08:51:47 UTC

I made a small change in TestWordToFoExtractorSuite and added an option to exclude certain files from the suite.

I resolved an old Bug 33519, which complained that HWPF failed to open a document, and added the problematic file to our test collection. As a result, TestWordToFoExtractorSuite started to fail on Bug33519.doc with an NPE:

java.lang.NullPointerException
	at org.apache.poi.hwpf.extractor.WordToFoExtractor.processCharacters(WordToFoExtractor.java:255)
	at org.apache.poi.hwpf.extractor.WordToFoExtractor.processParagraph(WordToFoExtractor.java:492)
	at org.apache.poi.hwpf.extractor.WordToFoExtractor.processSectionParagraphes(WordToFoExtractor.java:571)
	at org.apache.poi.hwpf.extractor.WordToFoExtractor.processSection(WordToFoExtractor.java:519)
	at org.apache.poi.hwpf.extractor.WordToFoExtractor.processDocument(WordToFoExtractor.java:332)
	at org.apache.poi.hwpf.extractor.WordToFoExtractor.process(WordToFoExtractor.java:167)
 
Yegor

Comment 18 Sergey Vladimirov 2011-06-24 10:08:52 UTC

Created attachment 27204 [details]
Workaround for NPE in Bug 33519

(In reply to comment #17)
> I resolved an old Bug 33519, which complained that HWPF failed to open a
> document, and added the problematic file to our test collection. As a result,
> TestWordToFoExtractorSuite started to fail on Bug33519.doc with an NPE:

Workaround in the proposed patch.

Comment 19 Sergey Vladimirov 2011-06-25 14:02:23 UTC

Created attachment 27205 [details]
Workaround for NPE in Bug 33519

Comment 20 Yegor Kozlov 2011-06-26 10:27:08 UTC

It still fails on Bug33519.doc, but with a different exception:

java.lang.IllegalArgumentException: The end (1077) must not be before the start (1985)
	at org.apache.poi.hwpf.usermodel.Range.sanityCheckStartEnd(Range.java:243)
	at org.apache.poi.hwpf.usermodel.Range.<init>(Range.java:176)
	at org.apache.poi.hwpf.usermodel.CharacterRun.<init>(CharacterRun.java:97)
	at org.apache.poi.hwpf.usermodel.Range.getCharacterRun(Range.java:802)
	at org.apache.poi.hwpf.extractor.WordToFoExtractor.processCharacters(WordToFoExtractor.java:243)
	at org.apache.poi.hwpf.extractor.WordToFoExtractor.processParagraph(WordToFoExtractor.java:529)
	at org.apache.poi.hwpf.extractor.WordToFoExtractor.processSectionParagraphes(WordToFoExtractor.java:608)
	at org.apache.poi.hwpf.extractor.WordToFoExtractor.processSection(WordToFoExtractor.java:556)
	at org.apache.poi.hwpf.extractor.WordToFoExtractor.processDocument(WordToFoExtractor.java:341)
	at org.apache.poi.hwpf.extractor.WordToFoExtractor.process(WordToFoExtractor.java:167)
	at org.apache.poi.hwpf.extractor.WordToFoExtractor.main(WordToFoExtractor.java:142)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)


Shall I commit this patch or wait for the next one?

Yegor

(In reply to comment #19)
> Created attachment 27205 [details]
> Workaround for NPE in Bug 33519

Comment 21 Sergey Vladimirov 2011-06-27 09:15:22 UTC

Created attachment 27207 [details]
Latest patch

Okay, here is the latest patch. All tests pass.

The problem raised by Bug 33519 is not solved; there is just a workaround. It seems that CHPX are NOT SORTED, so it is incorrect for the Range class to assume that they are. That means _charStart and _charEnd would have to be removed and all code linked to those fields rewritten.

It's a big task, so I would like some confirmation that my assumption about the missing CHPX order is correct.

In addition, this patch includes (in fact, as its main part) a doc-to-html extractor.
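
The ordering assumption discussed in this comment could be checked with something like the sketch below. It does not use the HWPF API: Run is a hypothetical stand-in for a character run with start and end offsets, and RunOrderCheck is a made-up class. The first method reports where a list stops being ordered by start offset, and the second returns a sorted copy that index-based code could rely on.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of HWPF.
final class RunOrderCheck {

    /** Stand-in for a character run covering the offsets [start, end). */
    static final class Run {
        final int start;
        final int end;

        Run(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Index of the first run that starts before its predecessor, or -1 if ordered. */
    static int firstOutOfOrder(List<Run> runs) {
        for (int i = 1; i < runs.size(); i++) {
            if (runs.get(i).start < runs.get(i - 1).start) {
                return i;
            }
        }
        return -1;
    }

    /** Returns a copy ordered by start offset; the input list is left untouched. */
    static List<Run> sortedByStart(List<Run> runs) {
        List<Run> copy = new ArrayList<>(runs);
        copy.sort(Comparator.comparingInt((Run run) -> run.start));
        return copy;
    }
}
```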

Comment 22 Sergey Vladimirov 2011-06-27 09:37:54 UTC

Created attachment 27208 [details]
Latest patch

Comment 23 Sergey Vladimirov 2011-07-04 19:52:49 UTC

I'm closing this bug because all patches have been applied. New patches can be committed to SVN directly (I have committer access now). Patches from other users are still welcome (as new issues).