51052 – NullPointerException with retrieve marker

Bug 51052 - NullPointerException with retrieve marker

Summary: NullPointerException with retrieve marker

Status:	NEW

Alias:	None

Product:	Fop - Now in Jira
Classification:	Unclassified
Component:	page-master/layout (show other bugs)
Version:	all
Hardware:	PC Linux

Importance:	P3 normal
Target Milestone:	---
Assignee:	fop-dev

URL:
Keywords:

Depends on:
Blocks:

Reported:	2011-04-12 05:09 UTC by Mehdi Houshmand
Modified:	2012-04-24 05:36 UTC (History)
CC List:	0 users

Attachments
NullPointerException (1.31 KB, text/x-xslfo) 2011-04-12 05:09 UTC, Mehdi Houshmand	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Mehdi Houshmand 2011-04-12 05:09:14 UTC

Created attachment 26875 [details]
NullPointerException 

The issue is that a NPE is thrown when a retrieve-marker retrieves a marker with only text content, when the retrieve-marker is a child of a table-cell. The real problem here is that there is no validation done on the retrieve marker to ensure that the markers child is a valid child of the retrieve markers parent (excuse the tongue twister), see XSL section 6.13.5:

"An fo:marker may contain any formatting objects that are permitted as a replacement of any fo:retrieve-marker or fo:retrieve-table-marker that retrieves the fo:marker's children."

Though the spec doesn't specify how to handle this error, I am proposing that rather than throw an NPE, FOP logs a warning and ignores the marker. The spec doesn't specifically suggest that this kind of error is recoverable, but others might agree that throwing an NPE isn't the "proper" way to deal with this error.

I've attached an example test FO and I'll attach a patch shortly.

Comment 1 Andreas L. Delmelle 2011-04-13 13:25:05 UTC

(In reply to comment #0)
> The issue is that a NPE is thrown when a retrieve-marker retrieves a marker
> with only text content, when the retrieve-marker is a child of a table-cell.
> The real problem here is that there is no validation done on the retrieve
> marker to ensure that the markers child is a valid child of the retrieve
> markers parent (excuse the tongue twister), see XSL section 6.13.5:

Indeed. Basic validation of child elements (i.e. FONode.validateChildNode()) is skipped when the marker is cloned. That seems like an oversight.

> Though the spec doesn't specify how to handle this error, I am proposing that
> rather than throw an NPE, FOP logs a warning and ignores the marker. 

Exactly. That would remain consistent with the behavior during normal FO tree building.
If some bare text is entered in a fo:table-cell (or basically any FO that expects block-content), then it will be ignored, rather than reported as an error. To remain consistent, we would have to match this behavior. So, if the child node is a FOText and the retrieve-marker's parent is not a FObjMixed, we should at most issue a warning, and assume 'no content'.

> The spec  doesn't specifically suggest that this kind of error is recoverable, but others
> might agree that throwing an NPE isn't the "proper" way to deal with this
> error.

Definitely. Rule of thumb: a NPE is *always* wrong. :-)

> 
> I've attached an example test FO and I'll attach a patch shortly.

Thanks!

Comment 2 Andreas L. Delmelle 2011-04-13 13:35:07 UTC

(In reply to comment #1)
> ... If some bare text is entered in a fo:table-cell (or basically any FO that
> expects block-content), then it will be ignored, rather than reported as an
> error...

Note: obviously only if the table-cell also contains at least an empty block. Otherwise, an error will be thrown, but it will be a complaint about an *empty* table-cell...

That might complicate matters slightly, as this case will also have to be reported as such if it happens during marker-retrieval.

Comment 3 Mehdi Houshmand 2011-04-14 04:08:22 UTC

After further looking into this, it appears as if the validateChildNode() was never intended to validate #PCDATA (or in our its object representation FOText). I say this because (as Andreas eluded to) FObjMixed is the progenitor of all the classes that create FOText. And these are the only objects that can accept #PCDATA as a child, so no validation really needs to be done on FOText, with respect to validateChildNode().

However, o.a.f.fo.flow.Marker inherits from FObjMixed, and so when Marker.characters() is called, it blindly creates an FOText object without checking the validity of the #PCDATA.

One solution would be to store the character array (or as a String) and postpone the creation of the FOText node, until the Marker is retrieved by the retrieve marker, and thus validation can be done.

The point at which both the marker and the retrieve marker are both accessible seems to be in AbstractPageSequenceLayoutManager.resolveRetrieveMarker(). Arguably, this object shouldn't be responsible for validation, so if we inspect resolveRetrieveMarker(), we find that AbstractRetrieveMarker.bind() is responsible for binding a Marker to a RetrieveMarker. It is possible that, just prior to this, we validate the children of the marker and either invoke the creation of an FOText object or trigger an event to display the error.

I thought this would be a quick fix, but now I'm not so sure. I don't know what the ramifications of postponing the creation of the FOText node will have. Nor if it's possible to postpone the creation since FObjMixed.flushText() seems to be responsible for binding the FOText object to a Block, and it's class private.

I'm afraid I don't have the time at the moment to investigate this further, but if and when I come back to it, or if anyone else want's to take a stab at this, it may save some time.

Comment 4 Andreas L. Delmelle 2011-04-14 14:45:35 UTC

(In reply to comment #3)

> However, o.a.f.fo.flow.Marker inherits from FObjMixed, and so when
> Marker.characters() is called, it blindly creates an FOText object without
> checking the validity of the #PCDATA. 

Yep, and at that point you cannot judge whether this will be appropriate, or not, in the retrieval context. It could be correct for some pages, but wrong on others, depending on where the retrieve-marker appears.

> One solution would be to store the character array (or as a String) and
> postpone the creation of the FOText node, until the Marker is retrieved by the
> retrieve marker, and thus validation can be done. 

That might be an option. Store the char array, and later on, trigger characters() on the retrieve-marker, if appropriate.
However, I believe it is not strictly necessary. Also, the current approach of constructing the FOText early and using clones later, has the convenient side-effect that the text nodes are stored in sequence with the rest of the marker's child nodes.

Ultimately, FOText _is_ already basically just a CharBuffer with some extra information.
The only real gain would be that the FOText property references can be avoided, so it might still be worthwhile to investigate, but more as a performance enhancement.

As I see it, what we definitely lack:
- a good/decent way to detect if a FO can have text children; I do not particularly like 'instanceof FObjMixed', since that does not cover possible extension classes that subclass FObj directly
- (more general) proper validation of the marker's immediate children against the parent of the retrieve-marker

The first would solve this particular bug, as it was first reported. The NPE can be avoided simply by ignoring the FOText (see further below).
The second would be more comprehensive, to avoid similar issues in the future.

> ... It is possible that, just prior to this, we validate the children of the
> marker and either invoke the creation of an FOText object or trigger
> an event to display the error.

Yes, and strictly speaking, this only needs to happen for the first level, as the other levels are already taken care of during normal FO tree building. 
That is, if you have:

<marker>
  <inline><block>Some text</block></inline>
</marker>

The only validation that is not done, is the inline against the parent of the retrieve-marker. The block will have been validated for the parent inline, though, so there is no need to repeat that step for every retrieval.

> I thought this would be a quick fix, but now I'm not so sure.

If someone really needs a quick fix:

* In AbstractRetrieveMarker cloneSingleNode(), before all else, first check

        if (newParent == this && child instanceof FOText
                && !(parent instanceof FObjMixed)) {
            return;
        }

* and add a similar condition --if (parent instanceof FObjMixed) -- 
   around the call to handleWhiteSpaceFor() in cloneFromMarker().

Tested and confirmed that the attachment just renders as a blank page, without any complaints whatsoever. No more NPE, but this does not address the key issues mentioned above... :-/

Comment 5 Mehdi Houshmand 2011-04-15 04:40:37 UTC

> That might be an option. Store the char array, and later on, trigger
> characters() on the retrieve-marker, if appropriate.
> However, I believe it is not strictly necessary. Also, the current approach of
> constructing the FOText early and using clones later, has the convenient
> side-effect that the text nodes are stored in sequence with the rest of the
> marker's child nodes.

Well, I guess that would depend on how this was implemented. If we were being puritanical, one could argue that if FOText was an object representation of #PCDATA (which I'm pretty sure it is), then by creating a #PCDATA child in the FOTree, we are creating an invalid node. Regardless of whether we address this invalidation later or not, validation should be done before binding nodes to the FOTree; invalid nodes shouldn't be bound to the FOTree in the first place. One way to solve this, is by postponing the creation of the FOText while still keeping the char array. This means that the data is kept, while we're maintaining the validity of the FOTree. Though I appreciate, if we did this, we'd have to maintain the rest of the current behaviour so as not to introduce a regression.

> 
> Ultimately, FOText _is_ already basically just a CharBuffer with some extra
> information.
> The only real gain would be that the FOText property references can be avoided,
> so it might still be worthwhile to investigate, but more as a performance
> enhancement.

You may be right, it might be prohibitively complex doing it the way I'm suggesting. It'd have to be investigated, this this is less about a performance enhancement since as you said, the creation of an FOText object is cheap.

> 
> As I see it, what we definitely lack:
> - a good/decent way to detect if a FO can have text children; I do not
> particularly like 'instanceof FObjMixed', since that does not cover possible
> extension classes that subclass FObj directly
> - (more general) proper validation of the marker's immediate children against
> the parent of the retrieve-marker

Yeah I agree, I was disinclined to use "instanceof FObjMixed".

</snip>

Thanks for the assistance, but I think we're going to avoid the "quick fix". We've been bitten by them before.

Comment 6 Andreas L. Delmelle 2011-04-16 14:35:44 UTC

(In reply to comment #5)
> Well, I guess that would depend on how this was implemented. If we were being
> puritanical, one could argue that if FOText was an object representation of
> #PCDATA (which I'm pretty sure it is), then by creating a #PCDATA child in the
> FOTree, we are creating an invalid node. 

No, we're not. #PCDATA is always a valid child node for a marker (i.e. the content model for a marker is "(#PCDATA|%inline;|%block;)*"). 
It will only, potentially, /become/ invalid in the retrieval context.

In the same sense, one could reason that the marker's child blocks, tables, lists etc. all should not be created, and we should store the FO source rather than parsing them into FONodes.
There might be good points for, but also against.

Comment 7 Andreas L. Delmelle 2011-04-16 14:54:27 UTC

(In reply to comment #6)
> In the same sense, one could reason that the marker's child blocks, tables,
> lists etc. all should not be created, and we should store the FO source rather
> than parsing them into FONodes.
> There might be good points for, but also against.

Note: I am not necessarily against this myself. It would be pretty cool, actually, if we were to store only the raw FO source of the marker-subtree, in a CharBuffer, to be parsed later. At first glance, it could turn out to be slightly more efficient in terms of memory footprint. I'd need to see proof to be certain, but it might...

Comment 8 Andreas L. Delmelle 2011-04-16 16:06:21 UTC

(In reply to comment #6)
> > Well, I guess that would depend on how this was implemented. If we were being
> > puritanical, one could argue that if FOText was an object representation of
> > #PCDATA (which I'm pretty sure it is), then by creating a #PCDATA child in the
> > FOTree, we are creating an invalid node. 

> No, we're not. #PCDATA is always a valid child node for a marker (i.e. the
> content model for a marker is "(#PCDATA|%inline;|%block;)*"). 
> It will only, potentially, /become/ invalid in the retrieval context.

I suddenly realize this needs more explanation, as there is obviously the remark about the retrieve-marker's parent...

Looking at it from FOP's perspective, at parse-time (i.e. when the FO tree is built), there is no way to know when --or even if-- a marker will actually be retrieved. Granted, we _could_ decide to throw an error if there is even the smallest probability of a mismatch, but we would never know for certain whether it would actually cause an error. 
I am far from convinced that this justifies the added computational complexity of walking up the tree, and checking all static-contents for a retrieve-marker that _might_ retrieve a particular marker.

What I mean is: it is not incorrect/invalid to create the #PCDATA node as a child of the marker. However, to concede to your point, it is definitely incorrect to blindly copy it, and re-bind it to the wrong parent.

(In reply tom comment #7)
> Note: I am not necessarily against this myself. It would be pretty cool,
> actually, if we were to store only the raw FO source of the marker-subtree, in
> a CharBuffer, to be parsed later. At first glance, it could turn out to be
> slightly more efficient in terms of memory footprint. I'd need to see proof to
> be certain, but it might...

... and after some tests, I can see that this is definitely not always so cool. ;-)
A lot depends on the actual structure of the subtree. FO is quite verbose, so even a small table already costs quite some chars, which does not always weigh up to simply instantiating the FONodes to store the data.

If we're really serious about further optimization, then in terms of footprint, the most optimal situation may just be to create a generic MarkerDescendant node type, and convert those into the proper FONode subclass later, if and when they are actually retrieved.
That is: as opposed to the current approach of immediately instantiating the proper type at parse time, and cloning those instances later, when the area tree is built.

Strictly speaking, in the current process, some space is still taken up by the unused references for members in flow.Block, flow.Table, FOText... That space is actually wasted, since the specified properties/attributes are stored in a Map that is associated with the Marker. 
If we strip the MarkerDescendants to be lean, basic FObj instances, that might save some in larger documents with a lot of markers, especially if only a relatively small amount are actually retrieved.

Comment 9 Mehdi Houshmand 2011-04-18 03:42:47 UTC

> I am far from convinced that this justifies the added computational complexity
> of walking up the tree, and checking all static-contents for a retrieve-marker
> that _might_ retrieve a particular marker.

I am of the same frame of mind, I think that's neither feasible nor really necessary.

> A lot depends on the actual structure of the subtree. FO is quite verbose, so
> even a small table already costs quite some chars, which does not always weigh
> up to simply instantiating the FONodes to store the data.
> 
> If we're really serious about further optimization, then in terms of footprint,
> the most optimal situation may just be to create a generic MarkerDescendant
> node type, and convert those into the proper FONode subclass later, if and when
> they are actually retrieved.
> That is: as opposed to the current approach of immediately instantiating the
> proper type at parse time, and cloning those instances later, when the area
> tree is built.

I think implementing this would/could get fairly involved. I'm not sure an intermediary object would be the proper approach. If we want the FOTree to be an object manifestation of the FO, adding objects types that aren't a part of the FO spec would break it's adherence with the spec.

I hadn't considered that the markers can have a hierarchy of children, or at least I hadn't considered that this could be several levels deep rather than just #PCDATA or a single-level block or two. That would make postponing instantiating any children unnecessarily difficult. It would also mean that we're parsing FO in the layout manager, which I don't think there's a precedent for, nor is that the duty of the layout manager. I'm beginning to think the initial approach, validating in AbstractRetrieveMarker as it binds objects to itself may be the way to go. We may be forced to do an "instanceof FObjMixed", to validate #PCDATA separately.

> 
> Strictly speaking, in the current process, some space is still taken up by the
> unused references for members in flow.Block, flow.Table, FOText... That space
> is actually wasted, since the specified properties/attributes are stored in a
> Map that is associated with the Marker. 
> If we strip the MarkerDescendants to be lean, basic FObj instances, that might
> save some in larger documents with a lot of markers, especially if only a
> relatively small amount are actually retrieved.

That would mean adding an object that wasn't a part of the FO spec into the FOTree. I'm not sure that's a good idea. It would be fairly confusing for anyone looking at this at a later date, they'd see MarkerDescendants, refer to the spec only to find it not there. No amount of commenting code would prevent this.

Comment 10 Glenn Adams 2012-04-07 01:41:40 UTC

resetting P2 open bugs to P3 pending further review

Comment 11 Glenn Adams 2012-04-08 08:40:37 UTC

(In reply to comment #0)
> I've attached an example test FO and I'll attach a patch shortly.

mehdi, any chance you'll be posting a patch?

Comment 12 Mehdi Houshmand 2012-04-08 09:13:30 UTC

(In reply to comment #11)
> (In reply to comment #0)
> > I've attached an example test FO and I'll attach a patch shortly.
> 
> mehdi, any chance you'll be posting a patch?

No patch. I was just posting this as a known bug, quite nuanced, but as always, when I get time, I'll look further into this issue.