Commons Digester
  1. Commons Digester
  2. DIGESTER-48

[digester] Digester does not keep "root" variable in sync...

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: Nightly Builds
    • Fix Version/s: 1.7
    • Labels:
      None
    • Environment:

      Operating System: Windows XP
      Platform: PC

      Description

      The parse() method returns a previously calculated root variable even after a
      clear() operation.

        Activity

        Hide
        Henri Yandell added a comment -

        (Fixing JIRA import error)

        Show
        Henri Yandell added a comment - (Fixing JIRA import error)
        Hide
        Simon Kitching added a comment -

        Re James Carman's proposal:

        I've added a resetRoot() method to the Digester class which makes it possible
        to reset this variable before parsing a second document using the same Digester
        instance. This does not mean that using a Digester instance in this way is
        supported, and this is noted in the javadoc.

        Re Eric Lucas' proposal:

        As there has been no followup to this since 17 June I think it reasonable to
        consider this dead.

        Show
        Simon Kitching added a comment - Re James Carman's proposal: I've added a resetRoot() method to the Digester class which makes it possible to reset this variable before parsing a second document using the same Digester instance. This does not mean that using a Digester instance in this way is supported, and this is noted in the javadoc. Re Eric Lucas' proposal: As there has been no followup to this since 17 June I think it reasonable to consider this dead.
        Hide
        Simon Kitching added a comment -

        Hi Eric,

        I'm definitely interested in viewing your patches to make digester
        "thread-safe". If you could get permission to attach these to this bug entry, it
        would be appreciated.

        NB: Please ensure you have appropriate permission from your employer before
        posting any code. If commons dev does like the changes, we may need to get some
        official release before incorporating the code; I guess we can deal with that if
        there is interest in merging your changes.

        Show
        Simon Kitching added a comment - Hi Eric, I'm definitely interested in viewing your patches to make digester "thread-safe". If you could get permission to attach these to this bug entry, it would be appreciated. NB: Please ensure you have appropriate permission from your employer before posting any code. If commons dev does like the changes, we may need to get some official release before incorporating the code; I guess we can deal with that if there is interest in merging your changes.
        Hide
        Eric Lucas added a comment -

        Hello,

        I've recently made some changes to digester to make it thread safe for a
        project here at ebay. We use digester for a long running batch like process
        where pooling wasn't really an optimal solution for us as the parse times are
        quite long (of the number of rules we have - hence filling up the pool when
        there is a large spike might be problematic) and the volume is uneven, peaking
        fairly high with a restrictive SLA for the system.

        I've written some unit tests (still need to write more) for the changes and am
        currently looking for optimization and clean up opportunities. Would a patch
        be of any interest to anyone?

        I will also send this message in an email to the commons-dev mail list.

        Thanks.

        -Eric Lucas

        Show
        Eric Lucas added a comment - Hello, I've recently made some changes to digester to make it thread safe for a project here at ebay. We use digester for a long running batch like process where pooling wasn't really an optimal solution for us as the parse times are quite long (of the number of rules we have - hence filling up the pool when there is a large spike might be problematic) and the volume is uneven, peaking fairly high with a restrictive SLA for the system. I've written some unit tests (still need to write more) for the changes and am currently looking for optimization and clean up opportunities. Would a patch be of any interest to anyone? I will also send this message in an email to the commons-dev mail list. Thanks. -Eric Lucas
        Hide
        rdonkin@apache.org added a comment -

        On the use of clear()...

        I'm not sure that clear() is very useful but it's there in the API and i see no reason to deprecate it or to
        patch it since is works pretty much as advertised. I'd much prefer a new reset() method which could do
        other things (such as nulling the root)

        On pooling Digester instances...

        I'm not against pooling Digester instances but I am a little sceptical about the performance gains in
        most common use cases. It isn't an itch of mine. So, i'm unlikely to actively work towards being able to
        safely pool Digester though i wouldn't object to backwards compatible changes to assist pooling.

        On reusing Rule instances...

        In general, people can create Rule implementations that cannot be used safely more than once. But I
        agree with Simon's analysis that most of the common Rule implementation can be reused safely.

        On compilation...

        This would be very cool but a lot of work. Not an itch for me.

        Show
        rdonkin@apache.org added a comment - On the use of clear()... I'm not sure that clear() is very useful but it's there in the API and i see no reason to deprecate it or to patch it since is works pretty much as advertised. I'd much prefer a new reset() method which could do other things (such as nulling the root) On pooling Digester instances... I'm not against pooling Digester instances but I am a little sceptical about the performance gains in most common use cases. It isn't an itch of mine. So, i'm unlikely to actively work towards being able to safely pool Digester though i wouldn't object to backwards compatible changes to assist pooling. On reusing Rule instances... In general, people can create Rule implementations that cannot be used safely more than once. But I agree with Simon's analysis that most of the common Rule implementation can be reused safely. On compilation... This would be very cool but a lot of work. Not an itch for me.
        Hide
        Simon Kitching added a comment -

        None of the Digester classes are "thread-safe" in the sense of being able to be
        used concurrently by multiple threads. This is already documented.

        But I expect that almost all Rule classes are "poolable", in the sense that they
        can be reused without their behaviour being affected by what happened the last
        time they were used. Because Rule objects are permitted (and expected) to be
        invoked recursively, they normally cannot keep any kind of context information
        using their own member variables; those that do are almost certainly broken in
        other ways.

        I had a quick peek at the standard Rule classes:

        • FactoryCreateRule may have problems when the ignoreCreateRule option is used
          [but that's pretty rare]. This could be fixed by modifying it to use the new
          "named stacks" feature of the Digester - and should be; it probably fails when
          invoked recursively.
        • NodeCreateRule may have problems if the previous parse terminated with an
          error because it doesn't reset the depth member on begin().

        These are pretty unusual "corner cases". I couldn't see any other issues with
        reusing instances of the standard Rule classes [NB: my opinion only].

        And I expect these problem cases will occur only when a parse fails, so simply
        ensuring that on parse failure the Rules object is removed from the pool and
        replaced by a new one should resolve these.

        Show
        Simon Kitching added a comment - None of the Digester classes are "thread-safe" in the sense of being able to be used concurrently by multiple threads. This is already documented. But I expect that almost all Rule classes are "poolable", in the sense that they can be reused without their behaviour being affected by what happened the last time they were used. Because Rule objects are permitted (and expected) to be invoked recursively, they normally cannot keep any kind of context information using their own member variables; those that do are almost certainly broken in other ways. I had a quick peek at the standard Rule classes: FactoryCreateRule may have problems when the ignoreCreateRule option is used [but that's pretty rare] . This could be fixed by modifying it to use the new "named stacks" feature of the Digester - and should be; it probably fails when invoked recursively. NodeCreateRule may have problems if the previous parse terminated with an error because it doesn't reset the depth member on begin(). These are pretty unusual "corner cases". I couldn't see any other issues with reusing instances of the standard Rule classes [NB: my opinion only] . And I expect these problem cases will occur only when a parse fails, so simply ensuring that on parse failure the Rules object is removed from the pool and replaced by a new one should resolve these.
        Hide
        James Carman added a comment -

        I'm not so worried about the Rules implementations themselves having any
        state. I'm more concerned with the individual Rule objects INSIDE the Rules
        object. I would imagine that a lot of Rule implementations are NOT thread-
        safe, so we would not be able to use a single Rules instance containing
        multiple Rule instances across threads. Correct?

        Show
        James Carman added a comment - I'm not so worried about the Rules implementations themselves having any state. I'm more concerned with the individual Rule objects INSIDE the Rules object. I would imagine that a lot of Rule implementations are NOT thread- safe, so we would not be able to use a single Rules instance containing multiple Rule instances across threads. Correct?
        Hide
        Simon Kitching added a comment -

        Hmm..RuleSetBase is abstract. I'm surprised that there isn't a concrete
        implementation of it, except in the xmlrules package.

        But it should also be safe to create a Rules object, and add the rules to it.
        You should be able to then reuse this Rules object in multiple different
        Digester objects.

        eg
        Rules myRules = new RulesBase();
        myRules.addRule("foo/bar", new ObjectCreateRule(String.class));

        Then
        digester = new Digester();
        digester.setRules(myRules);

        The RulesBase class doesn't have any state that changes during a parse as far as
        I can see (unlike the Digester class). So there's no need to "reset" it between
        parses. And this is likely to hold true for all the Rules implementations, I think.

        I suspect that creating the xml parser object will also be moderately costly, so
        suggest that is created once and passed to the Digester (or even more cleanly
        have the Digester instance set as the parser's content handler).

        Show
        Simon Kitching added a comment - Hmm..RuleSetBase is abstract. I'm surprised that there isn't a concrete implementation of it, except in the xmlrules package. But it should also be safe to create a Rules object, and add the rules to it. You should be able to then reuse this Rules object in multiple different Digester objects. eg Rules myRules = new RulesBase(); myRules.addRule("foo/bar", new ObjectCreateRule(String.class)); Then digester = new Digester(); digester.setRules(myRules); The RulesBase class doesn't have any state that changes during a parse as far as I can see (unlike the Digester class). So there's no need to "reset" it between parses. And this is likely to hold true for all the Rules implementations, I think. I suspect that creating the xml parser object will also be moderately costly, so suggest that is created once and passed to the Digester (or even more cleanly have the Digester instance set as the parser's content handler).
        Hide
        James Carman added a comment -

        Maybe what is needed is a way to "compile" or preprocess a set of rules. I
        don't think the curren RuleSet provides that type of functionality, but that's
        really what takes so long is the rule construction for a digester. If the
        rules could be "plugged in" to a Digester and then have it parse something
        that would eliminate the need for the Digester to be pooled.

        Show
        James Carman added a comment - Maybe what is needed is a way to "compile" or preprocess a set of rules. I don't think the curren RuleSet provides that type of functionality, but that's really what takes so long is the rule construction for a digester. If the rules could be "plugged in" to a Digester and then have it parse something that would eliminate the need for the Digester to be pooled.
        Hide
        Simon Kitching added a comment -

        Re the purpose of the "getRoot" method: Stacks don't have roots. Trees have
        roots. Digester's main purpose is to parse xml and generate a corresponding tree
        of java objects. And the getRoot method returns the object which is the root of
        the generated tree. Now that object might have been pushed onto the stack
        before parsing began (in which case it will still be on the stack when parsing
        finishes), or it may have been created by the first ObjectCreateRule to fire (in
        which case it will not be on the stack when parsing finishes).

        I guess this could be better explained; I will think about adding some javadoc
        to the getRoot method.

        Robert: I still am not sure that a user of Digester would ever have a reason to
        call clear, nor that there is a reason a Rule class would ever want to. The only
        reason I can see for this method to have ever existed is to provide "reset-like"
        behaviour. But it doesn't, and that behaviour is rather nasty to provide
        correctly [as I found out when I tried]. Can you suggest any situation in which
        calling clear would have a useful effect other than as preparation for calling
        parse a second time?

        I think that "pooling" digester instances is a really bad idea. And not
        necessary; if a "poolable" RuleSet is created to hold a reusable set of rules,
        and a "poolable" parser object is created, then the Digester object itself is
        pretty light-weight to create and configure for each document parsed.

        Comments?

        Show
        Simon Kitching added a comment - Re the purpose of the "getRoot" method: Stacks don't have roots. Trees have roots. Digester's main purpose is to parse xml and generate a corresponding tree of java objects. And the getRoot method returns the object which is the root of the generated tree. Now that object might have been pushed onto the stack before parsing began (in which case it will still be on the stack when parsing finishes), or it may have been created by the first ObjectCreateRule to fire (in which case it will not be on the stack when parsing finishes). I guess this could be better explained; I will think about adding some javadoc to the getRoot method. Robert: I still am not sure that a user of Digester would ever have a reason to call clear, nor that there is a reason a Rule class would ever want to. The only reason I can see for this method to have ever existed is to provide "reset-like" behaviour. But it doesn't, and that behaviour is rather nasty to provide correctly [as I found out when I tried] . Can you suggest any situation in which calling clear would have a useful effect other than as preparation for calling parse a second time? I think that "pooling" digester instances is a really bad idea. And not necessary; if a "poolable" RuleSet is created to hold a reusable set of rules, and a "poolable" parser object is created, then the Digester object itself is pretty light-weight to create and configure for each document parsed. Comments?
        Hide
        rdonkin@apache.org added a comment -

        i'm happy that clear does what it should do: it cleans the stacks. i'd probably favour a separate reset()
        method rather than breaking the contact for clear().

        it is possible (in some cases) for a single digester instance to parse multiple input documents in
        sequence. it's just that the typical use case is to create a new digester instance in each case and so
        multiple sequential usage has never really been supported.

        in terms of http://www.mail-archive.com/commons-user@jakarta.apache.org/msg06390.html, i'm a
        little sceptical about the performance of pooled digester instances verses creating new instances each
        time (at least in modern JVMs). depending on the level of concurrency and the number of digester rules,
        the cost of the synchronization required by the pool may outweigh the construction cost. i would be
        interested to see some actual timings.

        anyway, i suspect that it'd be possible to add a reset method that would work for most rulesets (those
        that clean up after themselves in finish). i'd prefer for any implementation to wait until after the 1.6.0
        release branch is taken.

        Robert

        Show
        rdonkin@apache.org added a comment - i'm happy that clear does what it should do: it cleans the stacks. i'd probably favour a separate reset() method rather than breaking the contact for clear(). it is possible (in some cases) for a single digester instance to parse multiple input documents in sequence. it's just that the typical use case is to create a new digester instance in each case and so multiple sequential usage has never really been supported. in terms of http://www.mail-archive.com/commons-user@jakarta.apache.org/msg06390.html , i'm a little sceptical about the performance of pooled digester instances verses creating new instances each time (at least in modern JVMs). depending on the level of concurrency and the number of digester rules, the cost of the synchronization required by the pool may outweigh the construction cost. i would be interested to see some actual timings. anyway, i suspect that it'd be possible to add a reset method that would work for most rulesets (those that clean up after themselves in finish). i'd prefer for any implementation to wait until after the 1.6.0 release branch is taken. Robert
        Hide
        James Carman added a comment -

        Ahhhh. I see what you mean about the "first object created" stuff. The
        Catalog example would fail. However, the documentation isn't really clear on
        how the stack is supposed to work. The idea that the first thing created by
        the rules will be returned isn't exactly obvious. In my mind, I don't imagine
        anything to be on the stack at the end of parsing unless I push something on
        there to begin with, using most of the default rules. So, when I use
        Digester, I usually push my "root" object onto the stack prior to parsing. I
        didn't realize that the first object pushed onto the stack is what's supposed
        to be returned from parse(). That's not clear from the documentation. Or,
        maybe that was just my misunderstanding. I don't really like that little
        twist, though. It doesn't seem as clean, if you ask me. But, there's a LOT
        of code out there using it the way it is now, so we'd better not break it.
        And, you're right, my implementation WILL break a LOT of stuff. Oops!

        Show
        James Carman added a comment - Ahhhh. I see what you mean about the "first object created" stuff. The Catalog example would fail. However, the documentation isn't really clear on how the stack is supposed to work. The idea that the first thing created by the rules will be returned isn't exactly obvious. In my mind, I don't imagine anything to be on the stack at the end of parsing unless I push something on there to begin with, using most of the default rules. So, when I use Digester, I usually push my "root" object onto the stack prior to parsing. I didn't realize that the first object pushed onto the stack is what's supposed to be returned from parse(). That's not clear from the documentation. Or, maybe that was just my misunderstanding. I don't really like that little twist, though. It doesn't seem as clean, if you ask me. But, there's a LOT of code out there using it the way it is now, so we'd better not break it. And, you're right, my implementation WILL break a LOT of stuff. Oops!
        Hide
        James Carman added a comment -

        My implementation of the getRoot() method takes the empty stack into account.
        Here's the method...

        public Object getRoot()

        { return stack.isEmpty() ? null : stack.peek( stack.size() - 1 ); }

        Why would we deprecate the clear() method rather than make it work properly if
        we can? The intention of the getRoot() method is to return the first object
        on the stack or null if the stack is empty, right? That's what it does, now.
        I don't see the harm in removing the root variable and replacing the
        references with a call to getRoot() instead (which is what I did). I would
        strongly suggest you add this patch regardless of what you do to the clear()
        method. This implementation is far less prone for errors.

        Re-using a Digester is desirable because it can take quite some time to set up
        a Digester object for parsing. The desire to re-use a Digester object came
        from a post on the user list (http://www.mail-archive.com/commons-
        user@jakarta.apache.org/msg06397.html).

        BTW, your implementation of testGetRoot() causes an error because the
        ObjectCreateRule automatically calls pop() during the end() method, so there
        will be nothing on the stack.

        Show
        James Carman added a comment - My implementation of the getRoot() method takes the empty stack into account. Here's the method... public Object getRoot() { return stack.isEmpty() ? null : stack.peek( stack.size() - 1 ); } Why would we deprecate the clear() method rather than make it work properly if we can? The intention of the getRoot() method is to return the first object on the stack or null if the stack is empty, right? That's what it does, now. I don't see the harm in removing the root variable and replacing the references with a call to getRoot() instead (which is what I did). I would strongly suggest you add this patch regardless of what you do to the clear() method. This implementation is far less prone for errors. Re-using a Digester is desirable because it can take quite some time to set up a Digester object for parsing. The desire to re-use a Digester object came from a post on the user list ( http://www.mail-archive.com/commons- user@jakarta.apache.org/msg06397.html). BTW, your implementation of testGetRoot() causes an error because the ObjectCreateRule automatically calls pop() during the end() method, so there will be nothing on the stack.
        Hide
        Simon Kitching added a comment -

        Hi James,

        After looking at the clear method, I'm not sure what the original intention of
        the method was. It may be that it was never intended to be a public method at
        all. I will ask on the email lists and see if anyone knows; digester arrived in
        commons from the Tomcat project with this method already present and declared
        public.

        It may be possible to perform repeated parses with the same Digester object
        under some circumstances, but I wouldn't bet on it. If this is what you are
        trying to achieve, I suggest you change your code to create a new Digester
        object for each input document instead.

        The root variable can't just be reset to null from within the clear method,
        because clear() is called automatically at the end of the parse() method. This
        leads me to think that maybe it is just intended to free up any memory and is
        not meant to be used directly [ie shouldn't be public]. And while it could be
        reset from the start of the parse method, there are heaps of other variables
        that would also need to be reset in order to safely perform a second parse with
        the same Digester object, eg namespaces and matches.

        I have therefore added a javadoc comment to the clear method saying it shouldn't
        be used. I think it might even be a good idea to deprecate it so it can be
        removed in the release-after-next.

        I don't believe your proposed patch for removing the root member entirely will
        work correctly, unfortunately. The digester's parse method can be called without
        any object present on the stack at start. In this case, the digester stack is
        empty when the parse method returns. Your new getRoot method will therefore fail
        to return the "first created object". See the "Catalog" example in CVS for an
        app that would fail with this patch.

        I'd like to leave this bug open as a reminder to think about the clear method a
        bit more before the next release.

        James, please add any comments you have to this bug entry.

        And by the way, the attached patch file is all screwed up, due to windows/unix
        line feed differences....

        Show
        Simon Kitching added a comment - Hi James, After looking at the clear method, I'm not sure what the original intention of the method was. It may be that it was never intended to be a public method at all. I will ask on the email lists and see if anyone knows; digester arrived in commons from the Tomcat project with this method already present and declared public. It may be possible to perform repeated parses with the same Digester object under some circumstances, but I wouldn't bet on it. If this is what you are trying to achieve, I suggest you change your code to create a new Digester object for each input document instead. The root variable can't just be reset to null from within the clear method, because clear() is called automatically at the end of the parse() method. This leads me to think that maybe it is just intended to free up any memory and is not meant to be used directly [ie shouldn't be public] . And while it could be reset from the start of the parse method, there are heaps of other variables that would also need to be reset in order to safely perform a second parse with the same Digester object, eg namespaces and matches. I have therefore added a javadoc comment to the clear method saying it shouldn't be used. I think it might even be a good idea to deprecate it so it can be removed in the release-after-next. I don't believe your proposed patch for removing the root member entirely will work correctly, unfortunately. The digester's parse method can be called without any object present on the stack at start. In this case, the digester stack is empty when the parse method returns. Your new getRoot method will therefore fail to return the "first created object". See the "Catalog" example in CVS for an app that would fail with this patch. I'd like to leave this bug open as a reminder to think about the clear method a bit more before the next release. James, please add any comments you have to this bug entry. And by the way, the attached patch file is all screwed up, due to windows/unix line feed differences....
        Hide
        James Carman added a comment -

        The Digester class currently maintains a reference to the root object on the
        stack via a member variable named "root." However, it doesn't properly keep
        this in sync upon a clear() operation (see attached, simplified test case).
        IMHO, the variable should be removed completely. You're (or I guess we're in
        the open source world) just asking for problems when trying to keep these
        values in sync. The patch (also attached) completely removes the root
        variable and re-implements the getRoot() method to use the stack API to
        reference the root element (peek using size - 1) instead. In order to remove
        the variable, I had to modify the SetRootRule class also as it directly
        referenced the root variable (bad idea). I changed it to use the getRoot()
        method instead. I also added some checks in the test case
        DigesterTestCase.testStackMethods() to exercise the code that I modified and
        to illustrate the bug that was there.

        Show
        James Carman added a comment - The Digester class currently maintains a reference to the root object on the stack via a member variable named "root." However, it doesn't properly keep this in sync upon a clear() operation (see attached, simplified test case). IMHO, the variable should be removed completely. You're (or I guess we're in the open source world) just asking for problems when trying to keep these values in sync. The patch (also attached) completely removes the root variable and re-implements the getRoot() method to use the stack API to reference the root element (peek using size - 1) instead. In order to remove the variable, I had to modify the SetRootRule class also as it directly referenced the root variable (bad idea). I changed it to use the getRoot() method instead. I also added some checks in the test case DigesterTestCase.testStackMethods() to exercise the code that I modified and to illustrate the bug that was there.
        Hide
        James Carman added a comment -

        Created an attachment (id=11780)
        A patch file that fixes the bug.

        Show
        James Carman added a comment - Created an attachment (id=11780) A patch file that fixes the bug.
        Hide
        James Carman added a comment -

        Created an attachment (id=11779)
        A test case that illustrates the bug.

        Show
        James Carman added a comment - Created an attachment (id=11779) A test case that illustrates the bug.

          People

          • Assignee:
            Unassigned
            Reporter:
            James Carman
          • Votes:
            3 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development