[XMLBEANS-295] setLoadStripWhitespace() api errors when trimming white space characters - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: Version 2.2.1
Fix Version/s: TBD
Component/s: Validator
Labels: None
Environment: SunOS 5.9 and Microsoft Windows XP SP2, Java 1.4.2

Description

Situation Summary

We went into production using the setLoadStripWhitespace() API in XMLBeans. After some days we started seeing intermittent failures on occasional XML transactions.

After a week of investigation, and having eliminated all other factors, we concluded that the flushText() method itself was the cause. Specifically, we have determined that in character strings containing the & character, the space immediately after the & is stripped, e.g. <company>B & H Photo</company> becomes <company>B &H Photo</company>.

We realize that a patch is available for & processing, and we are currently testing it to see whether it cures the & problem (http://issues.apache.org/jira/browse/XMLBEANS-274).

However, we are also seeing an intermittent problem in our UNIX environment associated with the colon character (:). Other characters could be involved as well; we do not have a definitive list. Spaces are intermittently trimmed in various fields that do not contain "&" (the case originally reported in XMLBEANS-274). We cannot reproduce this on our Windows development systems, but it happens intermittently on SunOS.

Again, a space is stripped either immediately after the colon or later in the string. For tokenized elements, e.g., <urgent>Yes: Y</urgent> becomes <urgent>Yes:Y</urgent>, and the object then returns a NULL value because the result is no longer one of the allowed values in the tokenized list. Similarly, <location>USA: United States</location> became <location>USA: UnitedStates</location>. We suspect that some character preceding the colon triggers this behaviour, but we have not yet determined when or how. This illustrates how complex the issue is given the current XMLBeans implementation approach.

Analysis

We have looked at how and where XMLBeans trims white space while unmarshalling XML content. When it detects white space, it invokes a stripRight() method in a loop. We are not convinced that this is architecturally sound at the point where it is employed: it adds complexity and evidently leaves many edge conditions, with some character combinations not handled consistently or correctly.

Our preferred approach would be to defer the white space trim until after unmarshalling. The initial pass would treat the XML content between the angle brackets "as is", and the trim would be applied only once the content has been extracted. At that point a simple Java String trim() can be used. This could be offered as an alternative to the current setLoadStripWhitespace() API, one that iterates over the resulting object structure instead of the original XML stream. The only check required is whether the XML markup sets the xml:space="preserve" attribute on an element, in which case the trim() would be skipped automatically for that element's content.

What is happening right now is that the existing flushText() method mixes up XML markup and content. There needs to be a clear separation between the markup (element angle brackets and attribute quotes) and the content itself.
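To illustrate the deferred trim described above, here is a minimal sketch. It is not XMLBeans code: it uses the JDK's built-in DOM parser (Java 5 or later, so newer than the Java 1.4.2 in our environment) purely to show the two-pass idea. The class name DeferredTrim and its methods are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of XMLBeans.
public class DeferredTrim {

    /** Pass 1: parse the XML untouched. Pass 2: trim leaf element content. */
    public static Document parseThenTrim(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to look up xml:space by namespace
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            trim(doc.getDocumentElement(), false);
            return doc;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Recursively trims leaf elements unless xml:space="preserve" is in effect. */
    private static void trim(Element element, boolean inheritedPreserve) {
        boolean preserve = inheritedPreserve;
        String space = element.getAttributeNS(XMLConstants.XML_NS_URI, "space");
        if ("preserve".equals(space)) {
            preserve = true;
        } else if ("default".equals(space)) {
            preserve = false;
        }

        boolean hasChildElement = false;
        for (Node n = element.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                trim((Element) n, preserve);
            }
        }

        // Only leaf elements are trimmed. Entities such as &amp; were already
        // resolved by the parser, so trim() sees plain content, never markup.
        // Note: setTextContent also drops any comments inside the leaf element.
        if (!hasChildElement && !preserve) {
            element.setTextContent(element.getTextContent().trim());
        }
    }

    /** Convenience accessor: text of the first element with the given tag name. */
    public static String textOf(Document doc, String tag) {
        return doc.getElementsByTagName(tag).item(0).getTextContent();
    }
}
```

Because the trim runs on already-extracted strings, internal spaces such as the one after "&" or ":" are never candidates for removal; only leading and trailing white space is affected.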

One caveat: the current approach may be intended to run before the validity check on tokenized lists, to prevent failures there caused by extra spaces. Even so, it is not cleanly separated, and it would be simpler to apply the Java String trim() method within the tokenized evaluation itself, on just the string value.
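As a rough sketch of trimming inside the tokenized evaluation itself, the check could look like the following. The class TokenCheck and its allowed values are hypothetical, taken from the <urgent> example above; this is not how XMLBeans implements enumeration checks.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not part of XMLBeans.
public class TokenCheck {

    // Assumed allowed values for the <urgent> element.
    private static final Set<String> ALLOWED =
            new HashSet<String>(Arrays.asList("Yes: Y", "No: N"));

    /** Returns the matching allowed value, or null if the content is not allowed. */
    public static String resolve(String rawContent) {
        if (rawContent == null) {
            return null;
        }
        // Trim only the extracted string value; internal spaces are untouched.
        String candidate = rawContent.trim();
        return ALLOWED.contains(candidate) ? candidate : null;
    }
}
```

With this shape, resolve("  Yes: Y ") matches the allowed value, while the corrupted "Yes:Y" is rejected. Note that String.trim() removes only leading and trailing white space; the XML Schema "collapse" rule for token types also folds internal runs of white space into a single space, which this sketch does not attempt.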
