Bug 54849 - Controlled content/Form (Std/StdBlock) content is not processed
Summary: Controlled content/Form (Std/StdBlock) content is not processed
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF (show other bugs)
Version: 3.9-FINAL
Hardware: PC All
: P2 normal with 3 votes (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-04-16 01:40 UTC by Tim Allison
Modified: 2013-06-20 00:11 UTC (History)
1 user (show)



Attachments
example file with content that cannot be extracted with 3.9-stable (25.92 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2013-04-16 01:40 UTC, Tim Allison
Details
[PATCH] POI 54849 (20.92 KB, patch)
2013-04-16 17:14 UTC, Tim Allison
Details | Diff
Round 2 patch, with interfaces and room to grow (47.98 KB, patch)
2013-05-10 01:38 UTC, Tim Allison
Details | Diff
Round 3 patch (54.26 KB, patch)
2013-05-16 17:34 UTC, Tim Allison
Details | Diff
Round 4 patch (11.43 KB, patch)
2013-06-14 17:28 UTC, Tim Allison
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Tim Allison 2013-04-16 01:40:12 UTC
Created attachment 30199 [details]
example file with content that cannot be extracted with 3.9-stable

Some docx files with controlled content/forms content are not being properly processed.  The result is that it is not possible to extract text from the controlled content fields currently (at least with stable 3.9).  

The technical issue is that some docx files place the content paragraphs under the SdtBlock.  When loading the file, the cursor only looks for paragraphs and tables right below body; the current cursor is not configured to look for paragraphs and/or tables below the first level below body.
Comment 1 Tim Allison 2013-04-16 01:43:09 UTC
A quick fix is to add:


else if ( o instanceof CTSdtBlock){
                	CTSdtBlock block = (CTSdtBlock)o;
                	CTSdtContentBlock content = block.getSdtContent();
                	List<CTP> ctps = content.getPList();
                	
                	for (CTP ctp : ctps){
                		 XWPFParagraph p = new XWPFParagraph((CTP) ctp, this);
                         bodyElements.add(p);
                         paragraphs.add(p);
                	}
                	List<CTTbl> cttables = content.getTblList();
                	for (CTTbl table : cttables){
                        XWPFTable t = new XWPFTable((CTTbl) table, this);
                        bodyElements.add(t);
                        tables.add(t);
                	}
                }

after:
 else if (o instanceof CTTbl) {
                    XWPFTable t = new XWPFTable((CTTbl) o, this);
                    bodyElements.add(t);
                    tables.add(t);
                }

in XWPFDocument's onDocumentRead

If someone can help me add a few files to poi-ooxml-schemas from the larger ooxml-schemas-1.1.jar, I'll be happy to submit a formal patch.
Comment 2 Tim Allison 2013-04-16 17:14:18 UTC
Created attachment 30203 [details]
[PATCH] POI 54849

Proposed patch is attached.  In the long run, it'd be great to build out the controlled content/form object (on par with XWPFParagraph and XWPFTable), but for now, this will at least read the text within one into the paragraphs, tables and bodyelements lists.

Any recommendations?
Comment 3 Nick Burch 2013-04-16 21:03:49 UTC
(In reply to comment #1)
 > If someone can help me add a few files to poi-ooxml-schemas from the larger
> ooxml-schemas-1.1.jar, I'll be happy to submit a formal patch.

Add a unit test which causes the extra CT classes to be used, and they'll be automatically included when you run the jar ant task
Comment 4 Nick Burch 2013-04-16 21:06:56 UTC
(In reply to comment #2)
> Created attachment 30203 [details]
> [PATCH] POI 54849
> 
> Proposed patch is attached.  In the long run, it'd be great to build out the
> controlled content/form object (on par with XWPFParagraph and XWPFTable),
> but for now, this will at least read the text within one into the
> paragraphs, tables and bodyelements lists.
> 
> Any recommendations?

I think it might be best to add a very simple body element for the controlled content, and have the inner paragraphs+tables made available from that. That would seem to be the most future proof way to go, as we can then add in future logic / information on the controlled content later, without changing the behaviour. If we do a simple parsing now and pop the text into the regular list, then when we come to add proper support we'd risk breaking existing code by causing the location of the text objects to move in the body hierarchy.
Comment 5 Tim Allison 2013-04-16 23:19:05 UTC
Agreed that that is much cleaner. XWPFControlledContent implements IBody?(In reply to comment #3)
> (In reply to comment #1)
>  > If someone can help me add a few files to poi-ooxml-schemas from the
> larger
> > ooxml-schemas-1.1.jar, I'll be happy to submit a formal patch.
> 
> Add a unit test which causes the extra CT classes to be used, and they'll be
> automatically included when you run the jar ant task
Thank you!
Comment 6 Tim Allison 2013-04-16 23:25:55 UTC
(In reply to comment #4)
> (In reply to comment #2)
> > Created attachment 30203 [details]
> > [PATCH] POI 54849
> > 
> > Proposed patch is attached.  In the long run, it'd be great to build out the
> > controlled content/form object (on par with XWPFParagraph and XWPFTable),
> > but for now, this will at least read the text within one into the
> > paragraphs, tables and bodyelements lists.
> > 
> > Any recommendations?
> 
> I think it might be best to add a very simple body element for the
> controlled content, and have the inner paragraphs+tables made available from
> that. That would seem to be the most future proof way to go, as we can then
> add in future logic / information on the controlled content later, without
> changing the behaviour. If we do a simple parsing now and pop the text into
> the regular list, then when we come to add proper support we'd risk breaking
> existing code by causing the location of the text objects to move in the
> body hierarchy.

Agreed. Makes it much cleaner.  

Not much time to burn on this...maybe by end of next week for first draft.
Comment 7 Tim Allison 2013-04-23 12:51:49 UTC
(In reply to comment #4)
> (In reply to comment #2)
> > Created attachment 30203 [details]
> > [PATCH] POI 54849
> > 
> > Proposed patch is attached.  In the long run, it'd be great to build out the
> > controlled content/form object (on par with XWPFParagraph and XWPFTable),
> > but for now, this will at least read the text within one into the
> > paragraphs, tables and bodyelements lists.
> > 
> > Any recommendations?
> 
> I think it might be best to add a very simple body element for the
> controlled content, and have the inner paragraphs+tables made available from
> that. That would seem to be the most future proof way to go, as we can then
> add in future logic / information on the controlled content later, without
> changing the behaviour. If we do a simple parsing now and pop the text into
> the regular list, then when we come to add proper support we'd risk breaking
> existing code by causing the location of the text objects to move in the
> body hierarchy.

Agree on future-proofing.  I'm trying to think of the smallest changes possible as a first step.  It looks like:

New classes:
XWPFControlledContent implements IBodyElement
XWPFControlledContentBlock implements IBody

Add entries to BodyType and ElementType

Add to IBody interface:
getControlledContents()
getControlledContentArray()

I think we'll need to make changes to all classes that implement IBody's getBodyElements().

For all classes that implement IBody, we'll need to add the new methods.

Does this seem like a reasonable first step?  
Is there a simpler option?
Is it ok if I leave stubs for dynamically adding controlled content (I propose "read only" from an existing docx for this patch).
Comment 8 Tim Allison 2013-04-25 10:56:36 UTC
We'll also have to make XWPFControlledContent equivalent to a run.
Comment 9 Nick Burch 2013-05-06 16:49:57 UTC
For an easy start, don't implement getControlledContents() and friends - just rely on people finding them from the IBody getBodyElements() method. That should make things simpler and minimise the code changes needed to get started. Otherwise, looks a good plan!
Comment 10 Tim Allison 2013-05-10 01:38:56 UTC
Created attachment 30268 [details]
Round 2 patch, with interfaces and room to grow

Tried to modify as little as possible.  Still wound up modifying more that I would have liked.  Plenty of room to clean up the stub interfaces.  Recommendations?
Comment 11 Tim Allison 2013-05-16 17:34:21 UTC
Created attachment 30289 [details]
Round 3 patch

Added headers, footers, footnotes and endnotes.
Comment 12 Tim Allison 2013-05-16 17:38:32 UTC
Comment on attachment 30289 [details]
Round 3 patch

updated metadata = patch
Comment 13 Tim Allison 2013-06-05 23:47:52 UTC
This issue was just raised as a bug in Tika (https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel).  If anyone has some time to review my patch, that would be much appreciated.  Thank you!
Comment 14 Nick Burch 2013-06-13 11:34:12 UTC
I've finally had a chance to review this, and on the whole it looks great, thanks!

A couple of minor points though:
 * We don't normally use @author tags in the code. The code is maintained, supported and extended by the whole community, so it's best not to give the impression that only one person "owns" a bit of code
 * It's generally best to have whitespace fixing up in a different patch/commit (where possible). Otherwise, it's harder to review the patch for the code changes, as they end up lost amidst the noise...

I'm minded to try to fix the whitespace stuff, commit that, then ask you to svn up and then produce an updated patch with just the logic changes in it. Would that work ok for you?
Comment 15 Tim Allison 2013-06-13 16:36:07 UTC
Nick, Great.  Thank you!

On the authorship attribution, I was following the checklist (http://poi.apache.org/guidelines.html) 
•the code includes the @author tag on any files you've altered or created.

I completely agree with you, though, and will remove author tags.

I don't wnat to take up your time with the spacing.  I'd like to get the spacing right.  Is the issue too many newlines/deleted new lines, or did I convert tabs to spaces or add too many spaces per line?  Or are both going on?  I'll repull and run the patch against the pull to see, but if you could offer feedback, that'd be great.

Finally, at the end of the Round 3 patch, I realized that I left one case not covered: a table cell can be equivalent to an SDT.  Should I add that in the same way that I've added the equivalence to paragraphs and runs?  Or should I save that for a follow on patch? My preference is that the current version (with space/author attrib fixed) be committed asap, and then I'll do some follow on work.  Let me know your preference.

Thank you, again.
Comment 16 Nick Burch 2013-06-13 18:19:52 UTC
Whoops, those guidelines are rather out of date... I've just had a quick stab at updating them based on what we've tended to do over the last couple of years. Hopefully that'll help!

I'll blitz through some of the more obvious whitespace bits shortly, then please re-do your patch if possible. XWPF has been contributed to by a wide range of people, and without code style documented it can be a bit of a hodge-podge!
Comment 17 Nick Burch 2013-06-13 18:53:32 UTC
As of r1492818, most of the more obvious javadoc mistakes and whitespace inconsistencies in the files you're working on should be fixed. Please rebase you patch on that and send us a new one, so we can commit the real logic!
Comment 18 Tim Allison 2013-06-14 17:28:46 UTC
Created attachment 30432 [details]
Round 4 patch

Added back logic.  Let me know if this works.  Thank you, Nick!
Comment 19 Nick Burch 2013-06-18 23:35:23 UTC
XWPFRun - had to keep getParagraph and the constructor to maintain backwards compatibility

Footnote doubling in xwpfdocument - I've kept that bit out. Will go into another commit shortly...

IRunBody + IRunElement + friends - javadoc moved to class, and expanded out

XWPFTable - removed commented system.err

Java 1.6-isms fixed

Thanks! Committed as described in r1494376.
Comment 20 Nick Burch 2013-06-18 23:40:07 UTC
Footnote doubling fix added in r1494379. Please could you update the associated bug(s) to mark them as resolved / flag them as needing their unit tests applying?
Comment 21 Tim Allison 2013-06-20 00:11:47 UTC
Nick,
  Thank you for your guidance, review and edits and, of course, for committing this patch!  I added a simple testcase for 55066 and marked it resolved.  If you could commit that testcase, that'd be great.  Apologies if I should have left it as "unit test needs to be applied."  Thank you, again.