  1. XMLBeans
  2. XMLBEANS-295

setLoadStripWhitespace() api errors when trimming white space characters

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Version 2.2.1
    • Fix Version/s: TBD
    • Component/s: Validator
    • Labels:
      None
    • Environment:
      SunOS 5.9 and Microsoft Windows XP SP2, Java 1.4.2

      Description

      Situation Summary

      We deployed to production using the setLoadStripWhitespace() API in XMLBeans. After some days we started getting intermittent failures in occasional XML transactions.

      After a week of investigation, having eliminated all other factors, we concluded that the flushText() method itself was the cause. Specifically, in character strings containing the & character, spaces immediately after the & are stripped, e.g. <company>B & H Photo</company> becomes <company>B &H Photo</company>.

      We realize that a patch is available for & processing, and we are currently testing it to see if it cures the problem relating to & (http://issues.apache.org/jira/browse/XMLBEANS-274).

      However, we are also seeing an intermittent problem in our UNIX environment associated with the colon character : (other characters could be involved as well; we do not have a definitive list). Spaces are intermittently trimmed in various fields that do not contain "&" (the character in the original XMLBEANS-274 bug report). We cannot reproduce this one on our Windows development systems, but it happens intermittently on SunOS.

      Again, a space either immediately following the colon or later in the string is stripped in tokenized elements. For example, <urgent>Yes: Y</urgent> becomes <urgent>Yes:Y</urgent>, and the object then returns a NULL value because the result is not an allowed value for the tokenized list. Similarly, <location>USA: United States</location> became <location>USA: UnitedStates</location>. We suspect that a character preceding the colon might be triggering this behaviour, but we have not yet determined when or how. This illustrates how complex the issue is under the current XMLBeans implementation approach.

      Analysis

      We have looked at how and where XMLBeans trims white space while unmarshalling the XML content. When it detects white space, it invokes a stripRight() method loop. We are not convinced that this is architecturally sound at the point where it is employed: it adds complexity and evidently leaves many edge conditions, with some character combinations not handled consistently or correctly.

      Our preferred approach would be to defer the white space trim until after unmarshalling. The initial pass would treat the XML content between the angle brackets "as is", and the trim would be applied only once the content has been extracted. At that point a simple Java String trim() can be used. This could be offered as an alternative method call to the current setLoadStripWhitespace() API, one that iterates through the resulting structure of objects instead of the original XML stream. The only check needed is whether the XML markup sets the xml:space="preserve" attribute on an element, in which case the trim() would automatically be skipped for that content item. At present the existing flushText() method mixes up XML markup and content. There needs to be a clear separation between the markup (element angle brackets and attribute quotes) and the content itself.
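
      The deferred trim described above can be sketched as follows. This is only an illustration under stated assumptions: it uses the JDK's built-in DOM parser rather than XMLBeans internals, and the class and method names (DeferredTrim, parseThenTrim) are hypothetical. It parses the document untouched, then trims text nodes in a second pass, skipping any element that is in the scope of xml:space="preserve".

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; not part of XMLBeans.
final class DeferredTrim {

    /** Parses the XML as is, then trims text content in a separate pass. */
    static Document parseThenTrim(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // Merge adjacent text nodes so trim() only ever sees whole text runs.
            doc.getDocumentElement().normalize();
            trim(doc.getDocumentElement(), false);
            return doc;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static void trim(Element element, boolean preserve) {
        // xml:space is inherited; an explicit value overrides the parent's setting.
        String space = element.getAttribute("xml:space");
        if ("preserve".equals(space)) {
            preserve = true;
        } else if ("default".equals(space)) {
            preserve = false;
        }

        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                trim((Element) child, preserve);
            } else if (child.getNodeType() == Node.TEXT_NODE && !preserve) {
                // Leading and trailing white space only; interior spaces are untouched.
                child.setNodeValue(child.getNodeValue().trim());
            }
        }
    }
}
```

      Because the trim runs on already-decoded text, characters such as & or : are ordinary content at this stage. Note that the sketch trims every text node individually, so it is suitable for data-oriented documents; mixed content, where spaces between inline elements matter, would need a more careful rule.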

      One caveat: the current approach may be intended to run before error checking on tokenized lists, to prevent failures there caused by extra spaces. Even so, it is not cleanly enough separated, and it would be simpler to apply a Java String trim() to just the string within the tokenized evaluation itself.
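
      As a rough sketch of trimming inside the tokenized evaluation itself, the check below trims only the extracted string and then compares it with the allowed values. The class name UrgentFlag and the value "No: N" are hypothetical; "Yes: Y" comes from the <urgent> example above. This is not how XMLBeans implements enumeration checks, only an illustration of the idea.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical enumeration check; not XMLBeans code.
final class UrgentFlag {

    // Assumed allowed values for the <urgent> element.
    private static final Set<String> ALLOWED =
            new HashSet<String>(Arrays.asList("Yes: Y", "No: N"));

    /** Returns the matching allowed value, or null if the input is not allowed. */
    static String parse(String raw) {
        if (raw == null) {
            return null;
        }
        // trim() removes leading and trailing white space only.
        String candidate = raw.trim();
        return ALLOWED.contains(candidate) ? candidate : null;
    }
}
```

      With this shape, "  Yes: Y " is accepted, while "Yes:Y" is rejected only if the interior space was really absent from the input. One difference to keep in mind: String.trim() does not collapse interior runs of white space, which the XML Schema token type also requires, so a full implementation would need that extra step.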

      Suggested Solution

      Refactor the current setLoadStripWhitespace() API so that string manipulation on content is delayed until after the content and XML markup have been unpacked, instead of happening beforehand as it does now. This allows much simpler white space trim logic (it can simply use the Java String class method) that does not also need to look for markup artifacts.

      We are not clear on who owns this particular feature in XMLBeans, or whether they are currently available to assist, but we would be prepared to work with the team to develop a better solution here.

            People

            • Assignee:
              cezar Cezar Cristian Andrei
              Reporter:
              drrwebber David RR Webber
            • Votes:
              0
              Watchers:
              1
