Xerces2-J / XERCESJ-589

Bug with pattern restriction on long strings

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.10.0
    • Labels:
      None
    • Environment:
      Operating System: All
      Platform: All

      Description

      There is a bug with applying a pattern restriction on long strings while trying
      to validate an XML file against a schema. I'm including an xml file and xsd
      file that demonstrates this problem. One character less in <sequence> and the
      problem does not occur.

      As it is, I'm getting
      java.lang.StackOverflowError
          at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
          at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
          at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
          at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
          at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
          at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
          at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
          at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
          at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
          at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
          at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
          at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
          at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
          at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
          at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
          at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
          at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
      ...

      1. ASF.LICENSE.NOT.GRANTED--test.xml
        2 kB
        Mark Woon
      2. ASF.LICENSE.NOT.GRANTED--test.xsd
        2 kB
        Mark Woon
      3. test.zip
        0.8 kB
        Andy O'Brien
      4. RegularExpression.java
        33 kB
        aaron pieper
      5. RegularExpression.java
        106 kB
        Nick Sydenham
      6. RegularExpression.java
        135 kB
        Geoff Granum

        Activity

        Mark Woon created issue -
        Mark Woon added a comment -

        Created an attachment (id=4658)
        XML data file to validate that causes problem

        Mark Woon added a comment -

        Created an attachment (id=4659)
        XML schema file that causes the problem

        Mark Woon added a comment -

        This problem also exists in earlier versions of Xerces.

        Tetsuya Yoshida added a comment -

        This problem occurs under the following conditions:

        1. The schema contains a definition that restricts the length of the character data.
        2. The input character data is long.
        3. Validation is run two or more times in the same JRE.
        4. XercesImpl.jar is in the endorsed directory.

        Because JDK 1.3.x and earlier have no endorsed framework, this problem does
        not occur on JDK 1.3.x or earlier.

        Tetsuya Yoshida added a comment -

        Let me fix some information.

        I get this problem even if XercesImpl.jar is not in the endorsed directory. When
        XercesImpl.jar is not in the endorsed directory, I can validate more characters.
        In my case, the maximum size of the character data in the element is 2000 characters.

        I also got this problem on JDK1.3.x or earlier.

        Mark Woon added a comment -

        Xerces has gone through two releases since this bug was first reported. Does
        anyone have any idea if/when this bug will be fixed?

        Mark Woon added a comment -

        Any chance this bug will get some attention anytime soon?

        nddelima added a comment -

        Works for me with Xerces2J-2.6.0.

        Serge Knystautas made changes -
        Field Original Value New Value
        issue.field.bugzillaimportkey 16628 20649
        Andy O'Brien added a comment -

        I'm also getting StackOverflow exceptions for this issue. I'm using Xerces 2.6.2, Java 1.3.1_06

        Here's my element definition, where large values cause stack problems:

        <!-- Test digit tokens w/o beginning, trailing, or adjacent spaces -->

        <xsd:element name="foo">
          <xsd:simpleType>
            <xsd:restriction base="xsd:string">
              <xsd:pattern value="([0-9] ?)*[0-9]" />
            </xsd:restriction>
          </xsd:simpleType>
        </xsd:element>

        With 10,000 digits for <foo> above, I get a StackOverflow:

        Exception in thread "main" java.lang.StackOverflowError
            at org.apache.xerces.impl.xpath.regex.Op$UnionOp.elementAt(Unknown Source)
            at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
            at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
        . . . . . .

        My attachment has a simple instance (10,000 digits for <foo> above) and its schema.

        A variable size causes exceptions.

        Note that another very popular GUI XML tool has problems here too. In their case, they seem to limit the data to 1000 chars (for <foo> above). Maybe they're avoiding stack issues with such a static limit?

        In another XML parser though, I get no stack overflow exceptions, nor am I limited in size (at least I haven't seen a problem with it yet).

        Sounds like this is still an issue.
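A minimal sketch of how the report above could be reproduced through the standard JAXP validation API. The class name `PatternOverflowRepro` and its helper methods are hypothetical and are not part of the attached test.zip. The schema is the `<foo>` declaration quoted above, wrapped in an `xsd:schema` element. Whether the long value overflows the stack depends on which Xerces build backs the `SchemaFactory` and on the thread stack size, so the sketch only reports what happened.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class PatternOverflowRepro {
    // The <foo> declaration from the comment above, wrapped in a schema element.
    static final String XSD =
        "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">"
      + "<xsd:element name=\"foo\"><xsd:simpleType>"
      + "<xsd:restriction base=\"xsd:string\">"
      + "<xsd:pattern value=\"([0-9] ?)*[0-9]\"/>"
      + "</xsd:restriction></xsd:simpleType></xsd:element>"
      + "</xsd:schema>";

    /** Builds <foo>...</foo> containing the requested number of digits. */
    static String buildDocument(int digits) {
        StringBuilder sb = new StringBuilder("<foo>");
        for (int i = 0; i < digits; i++) {
            sb.append((char) ('0' + i % 10));
        }
        return sb.append("</foo>").toString();
    }

    /** Returns "valid", "rejected: ..." or "StackOverflowError". */
    static String validate(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return "valid";
        } catch (SAXException e) {
            // Schema problems and validation errors both surface as SAXException.
            return "rejected: " + e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (StackOverflowError e) {
            // The failure described in this issue.
            return "StackOverflowError";
        }
    }

    public static void main(String[] args) {
        System.out.println("10 digits:     " + validate(buildDocument(10)));
        System.out.println("10,000 digits: " + validate(buildDocument(10000)));
    }
}
```

To test a specific Xerces release rather than the parser bundled with the JDK, that release's jars would need to be on the classpath so that the `SchemaFactory` lookup picks them up.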

        Andy O'Brien made changes -
        Attachment test.zip [ 19340 ]
        Christiaan Janssen added a comment -

        I've run across this problem in every version of Xerces (that I've used) to date. Looking at the code for the matchString function, the problem appears to be the recursive nature of the function. No matter what size the stack is set to, you always run into this problem if you supply a large enough string to parse. This is due to the function pulling a chunk off of the parse string and recursively calling itself on the remainder. Given that a string could be any length (e.g. 10,000 characters or even more), that's a lot of recursive calls. The only real solution to this problem is to rewrite the function in an iterative form, thus alleviating the excessive usage of the stack.
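The recursion-versus-iteration point can be illustrated with a toy matcher. This is a sketch only, not the Xerces `matchString` code: `RecursionDepthDemo` and its methods are hypothetical names, and the "pattern" is hard-coded as "digits only". The recursive form uses one stack frame per input character, while the iterative form uses constant stack space.

```java
public class RecursionDepthDemo {
    /** Recursive form: consumes one character, then recurses on the remainder. */
    static boolean allDigitsRecursive(String s, int offset) {
        if (offset == s.length()) {
            return true;
        }
        char c = s.charAt(offset);
        if (c < '0' || c > '9') {
            return false;
        }
        // Java does not eliminate tail calls, so each character costs a frame.
        return allDigitsRecursive(s, offset + 1);
    }

    /** Iterative form: same result, stack usage independent of input length. */
    static boolean allDigitsIterative(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    /** Builds a string of n digit characters. */
    static String digits(int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append((char) ('0' + i % 10));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = digits(5000000);
        System.out.println("iterative: " + allDigitsIterative(input));
        try {
            System.out.println("recursive: " + allDigitsRecursive(input, 0));
        } catch (StackOverflowError e) {
            // Expected on typical default stack sizes for an input this long.
            System.out.println("recursive: StackOverflowError");
        }
    }
}
```

A real regular-expression matcher also has to backtrack, so an iterative rewrite needs an explicit stack or similar bookkeeping on the heap. The toy above only shows where the stack growth comes from.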

        Christiaan Janssen added a comment -

        Looks like I may be wrong (happens a lot); apparently the recursive nature of the function may not be the problem. When using JDOM (org.jdom.input.SAXBuilder) with validation turned on (as opposed to using javax.xml.parsers.SAXParser) the problem does not seem to come up. I've traced the code and as far as I can tell they are using the same underlying Xerces parser.

        The XML content I'm trying to parse is only 2800 characters long but nonetheless seems to bomb out every time. It just seems weird that it doesn't fail using JDOM.

        If anyone wants the schema (it's multipart, 1 main XSD with 10 includes) and the XML file I'm using, just let me know.

        Steve Handy added a comment -

        I ran into this issue today with Xerces 2.7.1 under JDK 1.5.0_01. The limit that I encountered is 1954 characters. I was using a multipart schema as well, which consisted of 1 main XSD with 17 includes.

        Lothar Krenzien added a comment -

        In Xerces 2.9 under JDK 1.6 the bug is still alive.

        With the recent svn snapshot (xercesImpl-gump-06062007.jar) it seems to work, but now the same exception occurs (for me) under some other conditions which I haven't evaluated yet.

        Philippe Lantin added a comment -

        We believe in some circumstances this issue may be avoided by increasing the thread stack size (-Xss). The default thread stack size depends on your OS and on whether you are using a 32-bit or 64-bit JVM. It may differ from JDK version to version as well. For example, the Solaris 32-bit default is 512k, while on Linux it is 128k.

        We'll report more findings as we get them, but if someone wants to try this out and report, that would be great.

        Philippe Lantin added a comment -

        Confirmed: we have resolved our Xerces StackOverflow issue using a 512k JVM thread stack size (-Xss512k).
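Where changing the JVM-wide `-Xss` setting is not convenient, a similar effect can sometimes be had by running only the validation on a thread that requests a larger stack, using the `Thread(ThreadGroup, Runnable, String, long)` constructor. The sketch below is an assumption-laden illustration: `BigStackRunner` and `callWithStack` are hypothetical names, and the JDK documents the stack-size argument as a hint that some platforms may ignore.

```java
import java.util.function.Supplier;

public class BigStackRunner {
    /**
     * Runs the task on a new thread that requests the given stack size
     * (in bytes) and returns the task's result.
     */
    @SuppressWarnings("unchecked")
    static <T> T callWithStack(Supplier<T> task, long stackBytes) {
        final Object[] result = new Object[1];
        final Throwable[] failure = new Throwable[1];
        Runnable body = () -> {
            try {
                result[0] = task.get();
            } catch (Throwable t) {
                // Includes StackOverflowError if the requested stack is still too small.
                failure[0] = t;
            }
        };
        Thread worker = new Thread(null, body, "big-stack-worker", stackBytes);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for task", e);
        }
        if (failure[0] != null) {
            throw new IllegalStateException("task failed", failure[0]);
        }
        return (T) result[0];
    }

    public static void main(String[] args) {
        // Placeholder task; a real caller would run its schema validation here.
        String name = callWithStack(() -> Thread.currentThread().getName(), 512L * 1024);
        System.out.println("ran on: " + name);
    }
}
```

As with `-Xss`, this only moves the limit: a sufficiently long value can still exhaust whatever stack is requested.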

        Michael Glavassevich made changes -
        Priority Blocker [ 1 ]
        Assignee Xerces-J Developers Mailing List [ xerces-j-dev@xml.apache.org ]
        Michael Glavassevich made changes -
        Priority Blocker [ 1 ] Major [ 3 ]
        Michael Glavassevich added a comment - - edited

        The number of recursive calls to matchString() is proportional to the length of the input. Increasing the stack size can be used to avoid the problem, but as Christiaan pointed out the string could be of any length so regardless of how large a stack you specify there will always be an input which is large enough to cause it to overflow.

        For folks watching this bug report who wonder why it has been open for so long without a fix, it's primarily been because none of the current developers know this part of the code particularly well (a lot of it is older than Xerces itself and not well commented; 8+ years) and haven't had the time to learn the details well enough to do the re-design (recursion -> iteration). Likely it will be fixed one day, but probably only in the near future if someone from the community contributes a patch with a description of the changes and some unit tests to help verify it.

        aaron pieper added a comment -

        I've worked around this issue by implementing a version of RegularExpression which is based on Sun's java.util.regex libraries. It is still incomplete, but allows us to validate XSDs without this problem occurring.

        aaron pieper made changes -
        Attachment RegularExpression.java [ 12360203 ]
        Michael Glavassevich added a comment -

        Sorry, java.util.regex isn't an option here. We cannot use it because it doesn't support the regular expression language defined by the XML Schema specification [1]. The one supported by java.util.regex is quite different. We also cannot use java.util.regex because it was introduced in Java 1.4. Xerces-J is still built with Java 1.3 (and should be able to run on Java 1.2).

        [1] http://www.w3.org/TR/xmlschema-2/#regexs
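One concrete example of the language mismatch: XML Schema defines the multi-character escape `\i` (initial name characters), which `java.util.regex` does not recognize. The sketch below only checks whether `java.util.regex.Pattern` accepts an expression; `XsdRegexDifferences` and `compilesInJava` are hypothetical names used for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class XsdRegexDifferences {
    /** Returns true if java.util.regex accepts the expression. */
    static boolean compilesInJava(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // \i is a legal XML Schema escape but an unsupported escape in Java.
        System.out.println("\\i+    accepted by java.util.regex: " + compilesInJava("\\i+"));
        // A plain character class is accepted by both languages.
        System.out.println("[0-9]+ accepted by java.util.regex: " + compilesInJava("[0-9]+"));
    }
}
```

Other differences are semantic rather than syntactic, so an expression that compiles in both languages is not guaranteed to match the same strings.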

        aaron pieper added a comment -

        Would it be a license violation to modify Sun's java.util.regex libraries as necessary and place the tweaked copy in Xerces (in a different package, of course)?

        Michael Glavassevich added a comment -

        Setting aside that this would probably be far more work than just fixing the problem in the current implementation (and more risky; likely to introduce more bugs given the size of the change) we cannot do this either. Even with Sun's move to open source, we cannot bring their GPL'd code into Xerces (or any Apache project for that matter). See Apache's third-party licensing policy here: http://people.apache.org/~cliffs/3party.html for more info.

        aaron pieper added a comment -

        I understand now the multiple reasons why my solution is infeasible. Thank you for taking the time to answer my questions.

        Nick Sydenham added a comment -

        I've altered the "public boolean matches(String target, int start, int end, Match match)" method so that it can handle some instances of large strings that have a pattern match applied to them. The change is not perfect and will likely not match groups of characters for instance that are separated by a comma or space. It will however work quite nicely where the pattern is checking that the string only contains valid characters.

        As a previous poster said the code is not well documented and would take serious study to refactor. To make this hack more palatable there could be a feature to turn it on or off (note that it only gets invoked for values over 200 characters already) and/or some separator searching (e.g. break on the nearest non-alphanumeric character). If this is desirable then if a core Xerces coder approves I'll attempt to add the feature(s).
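The trade-off described above can be sketched as follows. This is not the attached patch: `ChunkedMatchSketch` is a hypothetical illustration, it uses `java.util.regex` purely as a stand-in matcher, and the chunk size of 200 is an assumption borrowed from the threshold mentioned in the comment. Matching a long value chunk by chunk works when the pattern only constrains which characters may appear, but it can give the wrong answer when the pattern describes structure that spans a chunk boundary.

```java
import java.util.regex.Pattern;

public class ChunkedMatchSketch {
    /** Matches the value piecewise; every chunk must match the whole pattern. */
    static boolean matchesInChunks(Pattern pattern, String value, int chunkSize) {
        if (value.isEmpty()) {
            return pattern.matcher(value).matches();
        }
        for (int start = 0; start < value.length(); start += chunkSize) {
            int end = Math.min(start + chunkSize, value.length());
            if (!pattern.matcher(value.substring(start, end)).matches()) {
                return false;
            }
        }
        return true;
    }

    /** Repeats the unit string the given number of times. */
    static String repeat(String unit, int times) {
        StringBuilder sb = new StringBuilder(unit.length() * times);
        for (int i = 0; i < times; i++) {
            sb.append(unit);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Pattern charsOnly = Pattern.compile("[0-9]*");
        Pattern tokens = Pattern.compile("([0-9] ?)*[0-9]");

        String digits = repeat("7", 1000);        // only valid characters
        String spaced = repeat("1 ", 200) + "1";  // space-separated digits

        System.out.println("chars-only, chunked: " + matchesInChunks(charsOnly, digits, 200));
        System.out.println("tokens, whole value: " + tokens.matcher(spaced).matches());
        // The first 200-character chunk ends in a space, so it fails on its own.
        System.out.println("tokens, chunked:     " + matchesInChunks(tokens, spaced, 200));
    }
}
```

Breaking on a separator instead of a fixed offset, as suggested in the comment, would avoid this particular mismatch but still depends on the pattern treating each piece independently.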

        Nick Sydenham made changes -
        Attachment RegularExpression.java [ 12367155 ]
        Eddie Graham added a comment -

        Can we have an update on when this bug will be fixed?

        Geoff Granum added a comment -

        I wrote this patch quite a while ago and then stalled out re-writing a test suite to work with 1.3 (from 1.5). I ran it against the xmlschema2006-11-06 package. It exposed one new bug which is an even more extreme edge case than this. The conversation is on the mailing list:

        http://mail-archives.apache.org/mod_mbox/xerces-j-dev/200707.mbox/%3COF91A8D40B.56D4BE1C-ON85257313.0011DDA4-85257313.00158EEC@ca.ibm.com%3E

        While I feel I tested this fairly well I would not be comfortable giving a go ahead for commit directly. For one, I just pulled it off the shelf and dusted it off after a few months of neglect. Also, there are some debug comments in the attached code, from me trying to work out how to fix the new OOM bug mentioned above.

        So, if someone wants to adopt a working, tested but not fully trusted* patch, this is for you.

        If I could swear that I would have time to jump on it again I would clean up those printlines before posting, but I am, and have been, swamped. But rather than let it lie around...

        * I don't know the package very well, so am likely more paranoid than even the normal devs
        Geoff Granum made changes -
        Attachment RegularExpression.java [ 12376384 ]
        Maycon Oliveira added a comment -

        Guys, is there any change on this issue? Things are starting to get more complicated. My country's government created some schemas that cause this error when used with JAXB, DOM, or any other parser.

        I'm using JDK 1.6.14 and my OS is Windows XP. I cannot change the expression because it is set by law.

        Any workaround?

        Michael Glavassevich added a comment - - edited

        Workaround suggestions are already on this JIRA issue. See previous comments about increasing the stack size.

        aaron pieper added a comment -

        The stack size workaround is not always feasible. I am currently trying to validate a document against a schema which includes a 4000 character string with a pattern restriction. I have increased the stack size considerably (-Xss100m) but still encounter a StackOverflowError. I cannot increase the stack size further without getting an OutOfMemoryError.

        I can still get by with one of the unofficial patches listed in this thread. But it would be nice to have some resolution on this show-stopping 7-year-old bug.

        Michael Glavassevich made changes -
        Assignee Khaled Noaman [ knoaman@ca.ibm.com ]
        Khaled Noaman added a comment -

        I have checked in a fix that solves the problem.

        Khaled Noaman made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 2.10.0 [ 12314400 ]
        Resolution Fixed [ 1 ]
        Mark Thomas made changes -
        Workflow jira [ 30644 ] Default workflow, editable Closed status [ 12575817 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12575817 ] jira [ 12598827 ]

          People

          • Assignee:
            Khaled Noaman
            Reporter:
            Mark Woon
          • Votes:
            14
            Watchers:
            16
