Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.2.7
    • Fix Version/s: 3.3.1
    • Component/s: core
    • Labels:
      None

      Description

      Form fields are saved as a list in a FieldStorage class instance. The class implements a _getitem_ method to provide dict-like behaviour. This method iterates over the complete list for every call to _getitem_. Applications that need to access all the fields when processing the form will show O(n^2) behaviour where n == the number of form fields. This overhead could be avoided by creating a dict (to use as an index) when the FieldStorage instance is created.

      Mike Looijmans has been investigating StringField and Field as well. It is probably reasonable to include information on his work in this issue as well, so that we can consider all of these efficiency issues in toto.

        Activity

        Hide
        jgallacher Jim Gallacher added a comment -

        The FieldStorage methods get, getitems, has_key, _len_, getfirst and getlist all iterate over the complete list of fields each time they are called and so would all benefit from some kind of indexing scheme. Here is a bit of code that will give you some idea of what I have in mind.

        class FieldStorage:

        def _init_():
        self.index = {} # holds references to the Field objects and can be used as an index.
        self.list = []

        ... blah blah blah ...
        ... create the field and get it's name ...

        self.list.append(field)
        if name in self.index:
        self.index[name].append(field)
        else:
        self.index[name] = [field,]

        def _getitem_(self, key):
        if key not in self.index:
        raise KeyError, key
        found = self.index[key]
        if len(found) == 1:
        return found[0]
        else:
        return found

        The other FieldStorage methods would need to be refactored to take advantage of the index.

        Show
        jgallacher Jim Gallacher added a comment - The FieldStorage methods get, getitems, has_key, _ len _, getfirst and getlist all iterate over the complete list of fields each time they are called and so would all benefit from some kind of indexing scheme. Here is a bit of code that will give you some idea of what I have in mind. class FieldStorage: def _ init _(): self.index = {} # holds references to the Field objects and can be used as an index. self.list = [] ... blah blah blah ... ... create the field and get it's name ... self.list.append(field) if name in self.index: self.index [name] .append(field) else: self.index [name] = [field,] def _ getitem _(self, key): if key not in self.index: raise KeyError, key found = self.index [key] if len(found) == 1: return found [0] else: return found The other FieldStorage methods would need to be refactored to take advantage of the index.
        Hide
        milosoftware Mike Looijmans added a comment -

        What it does:

        • Simplifies the creation of StringField objects. This was already
          marked as a TODO in the 3.2.5b code.
        • Does not create a file object (cStringIO) for each and every field.
        • FieldStorage.get() will always return the same instance(s) for any
          given name.
        • FieldStorage.get() is very cheap now (does not create new objects,
          which is expensive in Python)
        • use a dictionary-on-demand.
        • Lots of code removed for this one, and a few lines added.
        • items() function, which, contrary to common belief, preserves argument ordering.
        Show
        milosoftware Mike Looijmans added a comment - What it does: Simplifies the creation of StringField objects. This was already marked as a TODO in the 3.2.5b code. Does not create a file object (cStringIO) for each and every field. FieldStorage.get() will always return the same instance(s) for any given name. FieldStorage.get() is very cheap now (does not create new objects, which is expensive in Python) use a dictionary-on-demand. Lots of code removed for this one, and a few lines added. items() function, which, contrary to common belief, preserves argument ordering.
        Hide
        jgallacher Jim Gallacher added a comment -

        I was checking the documentation with respect to this patch and I noticed a couple of things.

        The FieldStorage constructor has 2 undocumented parameters, file_callback=None, field_callback=None. These were added in r160259 as part of Barry Pearce's fix for MODPYTHON-40. In the same patch, the make_file and make_field methods were added. These allow FieldStorage to be subclassed and provide an alternative to using the callbacks.

        In Mike's modpython325_util_py_dict.patch, the make_field and make_file methods have be removed, with the reasoning that the callbacks should be sufficient for controlling the upload.

        We never had a discussion of the merits of the 2 approaches - subclassing vs passing the callbacks in the FieldStorage constructor. Should we perhaps allow both, and reinstate make_field and make_file, or are the callbacks sufficient?

        Further infomation on the callbacks, including some nice examples, can be found in the python-dev archives. See http://article.gmane.org/gmane.comp.apache.mod-python.devel/760

        I'll update the docs as part of the fix for this issue.

        Show
        jgallacher Jim Gallacher added a comment - I was checking the documentation with respect to this patch and I noticed a couple of things. The FieldStorage constructor has 2 undocumented parameters, file_callback=None, field_callback=None. These were added in r160259 as part of Barry Pearce's fix for MODPYTHON-40 . In the same patch, the make_file and make_field methods were added. These allow FieldStorage to be subclassed and provide an alternative to using the callbacks. In Mike's modpython325_util_py_dict.patch, the make_field and make_file methods have be removed, with the reasoning that the callbacks should be sufficient for controlling the upload. We never had a discussion of the merits of the 2 approaches - subclassing vs passing the callbacks in the FieldStorage constructor. Should we perhaps allow both, and reinstate make_field and make_file, or are the callbacks sufficient? Further infomation on the callbacks, including some nice examples, can be found in the python-dev archives. See http://article.gmane.org/gmane.comp.apache.mod-python.devel/760 I'll update the docs as part of the fix for this issue.
        Hide
        jgallacher Jim Gallacher added a comment -

        As part of the improvments to FieldStorage, the Field class _init_ method was changed in trunk (3.3-dev) and backported to branches/3.2.x. Although the documentation states that the Field class is for internal use by
        FieldStorage, some software applications create Field instances directly. This change in the _init_ signature causes those applications to fail.

        Breaking compatibility may be acceptable for a minor version increment (3.2 > 3.2) but is undesirable for a patch release (3.2.8 > 3.2.9). Changing the Field class to be both forward and backward compatible wil not be difficult. As such I'll commit a fix to branches/3.2.x in time for a 3.2.9 release.

        Show
        jgallacher Jim Gallacher added a comment - As part of the improvments to FieldStorage, the Field class _ init _ method was changed in trunk (3.3-dev) and backported to branches/3.2.x. Although the documentation states that the Field class is for internal use by FieldStorage, some software applications create Field instances directly. This change in the _ init _ signature causes those applications to fail. Breaking compatibility may be acceptable for a minor version increment (3.2 > 3.2) but is undesirable for a patch release (3.2.8 > 3.2.9). Changing the Field class to be both forward and backward compatible wil not be difficult. As such I'll commit a fix to branches/3.2.x in time for a 3.2.9 release.
        Hide
        jgallacher Jim Gallacher added a comment -

        Reopened issue due to problems experienced by some applications such as Trac.

        We'll need to re-examine the code committed to make sure that FieldStorage behaves properly as a dictionary-like object.

        Show
        jgallacher Jim Gallacher added a comment - Reopened issue due to problems experienced by some applications such as Trac. We'll need to re-examine the code committed to make sure that FieldStorage behaves properly as a dictionary-like object.
        Hide
        grahamd Graham Dumpleton added a comment -

        Have we decided that we will not try and be compatible with Trac and expect people using mod_python 3.3 to use the next unreleased version of Trac? This next version of Trac works via a WSGI gateway and thus doesn't use FieldStorage and thus doesn't have the problems seen here. Can we mark the original issue as resolved for 3.3?

        Show
        grahamd Graham Dumpleton added a comment - Have we decided that we will not try and be compatible with Trac and expect people using mod_python 3.3 to use the next unreleased version of Trac? This next version of Trac works via a WSGI gateway and thus doesn't use FieldStorage and thus doesn't have the problems seen here. Can we mark the original issue as resolved for 3.3?
        Hide
        grahamd Graham Dumpleton added a comment -

        I have looked at the code for this again and to somehow accommodate older versions of Trac is only going to introduce horrible little hacks in mod_python that we don't need, so it just isn't worth the trouble to try and support the older versions. People will simply need to upgrade to at least Trac 0.10.0 if they want to use mod_python 3.3.

        If the hacks were done, there would be two parts to it. The first is that Trac derives from util.FieldStorage and adds a _setitem_() method. This would need to be overridden from the base class by replacing it with add_field().

        def _init_(self, ...):
        ...
        self._setitem_ = self.add_field

        It has to be done in the constructor and not just as a method, as doing it as a method means derived class overrides it. At that point in the constructor, derived class one is in place and so we effectively replace it.

        The second part would be more difficult and why it isn't worth the trouble. The problem here is that although the derived class adds _setitem(), it doesn't itself use it in the derived class constructor. Instead, it does the nasty thing of accessing the base class 'list' attribute directly, thereby duplicating the same code as in __setitem_() that we need to replace. The only way around this nastiness would be for the base class to override the 'append()' method of the 'list' attribute itself and do extra special magic to check for arguments of the old style and change things on the fly, updating the associated index attribute as well. This is just all too messy and magic and would surely come back and bite us.

        In summary, not worth the trouble. Trac people will just have to upgrade.

        Show
        grahamd Graham Dumpleton added a comment - I have looked at the code for this again and to somehow accommodate older versions of Trac is only going to introduce horrible little hacks in mod_python that we don't need, so it just isn't worth the trouble to try and support the older versions. People will simply need to upgrade to at least Trac 0.10.0 if they want to use mod_python 3.3. If the hacks were done, there would be two parts to it. The first is that Trac derives from util.FieldStorage and adds a _ setitem _() method. This would need to be overridden from the base class by replacing it with add_field(). def _ init _(self, ...): ... self._ setitem _ = self.add_field It has to be done in the constructor and not just as a method, as doing it as a method means derived class overrides it. At that point in the constructor, derived class one is in place and so we effectively replace it. The second part would be more difficult and why it isn't worth the trouble. The problem here is that although the derived class adds _ setitem (), it doesn't itself use it in the derived class constructor. Instead, it does the nasty thing of accessing the base class 'list' attribute directly, thereby duplicating the same code as in __setitem _() that we need to replace. The only way around this nastiness would be for the base class to override the 'append()' method of the 'list' attribute itself and do extra special magic to check for arguments of the old style and change things on the fly, updating the associated index attribute as well. This is just all too messy and magic and would surely come back and bite us. In summary, not worth the trouble. Trac people will just have to upgrade.
        Hide
        grahamd Graham Dumpleton added a comment -

        In response to question about dictionary like abilities on mailing list, I actually had my first good look at what the performance optimisation changes did. Previously I only looked at just the bit I thought mattered to the Trac issue. Having done this, I notice there are actually some problems in the code. Specifically, if add_field() is used after the first time the 'dictionary' attribute is accessed, then the new field doesn't actually get put in the dictionary and therefore isn't visible to _getitem_(), has_key() etc. In other words, the 'dictionary' gets set to a snapshot of 'list' and isn't updated there after. I also note that changes which introduced the method add_field() aren't reflected in documentation either.

        I'll spend some time reviewing the code and fixing things up. I might also make the 'dictionary' attribute a private attribute and perhaps look at direct changes to 'list' being automatically reflected in any index somehow. In respect of 'dictionary' I want to hide it so we avoid problems in the future like in Trac where they effectively delved into the implementation, making it hard to make changes.

        Show
        grahamd Graham Dumpleton added a comment - In response to question about dictionary like abilities on mailing list, I actually had my first good look at what the performance optimisation changes did. Previously I only looked at just the bit I thought mattered to the Trac issue. Having done this, I notice there are actually some problems in the code. Specifically, if add_field() is used after the first time the 'dictionary' attribute is accessed, then the new field doesn't actually get put in the dictionary and therefore isn't visible to _ getitem _(), has_key() etc. In other words, the 'dictionary' gets set to a snapshot of 'list' and isn't updated there after. I also note that changes which introduced the method add_field() aren't reflected in documentation either. I'll spend some time reviewing the code and fixing things up. I might also make the 'dictionary' attribute a private attribute and perhaps look at direct changes to 'list' being automatically reflected in any index somehow. In respect of 'dictionary' I want to hide it so we avoid problems in the future like in Trac where they effectively delved into the implementation, making it hard to make changes.
        Hide
        milosoftware Mike Looijmans added a comment -

        A very simple approach would be to delete the dictionary attribute when
        any change is made.

        The add_field method should be made private, and one could add a
        _setitem_ method like this:

        def _setitem_(self, key, value):
        if self._dict_.has_key('dictionary'):
        del self.dictionary
        self._add_field(key, value)

        A similar approach can be taken to add the _delitem_ functionality.

        Mike Looijmans
        Philips Natlab / Topic Automation

        Show
        milosoftware Mike Looijmans added a comment - A very simple approach would be to delete the dictionary attribute when any change is made. The add_field method should be made private, and one could add a _ setitem _ method like this: def _ setitem _(self, key, value): if self._ dict _.has_key('dictionary'): del self.dictionary self._add_field(key, value) A similar approach can be taken to add the _ delitem _ functionality. Mike Looijmans Philips Natlab / Topic Automation
        Hide
        grahamd Graham Dumpleton added a comment -

        I have committed some changes into repository now which
        fixes the issue with add_field() but also does things in such a way that
        code such as Trac which creates Field instances direct and adds them
        to the the list of form fields will still work. It does this by using a
        derived list class instance which invalidates the key lookup table on
        changes to the list so that the lookup table is rebuilt the next time
        it is required. Doing this wasn't as bad as I first thought it would be
        as hadn't gone to the extent of properly understanding the code before.
        Also include a semi hack in Field class constructor to detect old style
        use where code expects more than just the name of the field as
        argument and do appropriate initialisation as appropriate.

        In all, what the changes allow is for the following to work:

        from mod_python import util
        from StringIO import StringIO

        def index(req):
        req.content_type = 'text/plain'

        req.write('%s\n\n' % req.form.keys())

        for key in req.form.keys():
        req.write('%s %s\n' % (key, req.form.getlist(key)))

        req.write('\n')

        req.form.add_field('e', 'f')

        req.form.list.append(util.Field('g', StringIO('h'),
        "text/plain", {}, None, {}))
        req.form.list.extend([util.Field('i', StringIO('j'),
        "text/plain", {}, None, {})])
        req.form.list += [util.Field('k', StringIO('l'),
        "text/plain", {}, None, {})]

        req.write('%s\n\n' % req.form.keys())

        for key in req.form.keys():
        req.write('%s %s\n' % (key, req.form.getlist(key)))

        req.write('\n')

        return None

        If this is accessed with:

        http://localhost:8002/~grahamd/form/?a=b&c=d&e=f

        I get:

        ['a', 'c', 'e']

        a [Field('a', 'b')]
        c [Field('c', 'd')]
        e [Field('e', 'f')]

        ['a', 'c', 'e', 'g', 'i', 'k']

        a [Field('a', 'b')]
        c [Field('c', 'd')]
        e [Field('e', 'f'), Field('e', 'f')]
        g [Field('g', 'h')]
        i [Field('i', 'j')]
        k [Field('k', 'l')]

        As the req.form.keys() method works off the lookup table, you can see
        how updates direct to list work okay, meaning stuff like Trac should
        work.

        You will also note that the 'dictionary' attribute has gone and is hidden
        elsewhere. Hopefully people will not try and directly access the
        equivalent, but some stern notes in the documentation saying only
        to use documented interface might help.

        If we can test the updates and especially if someone can test it with
        older versions of Trac that would be great, but even if we can't will
        mark issue as resolved after a short wait.

        Show
        grahamd Graham Dumpleton added a comment - I have committed some changes into repository now which fixes the issue with add_field() but also does things in such a way that code such as Trac which creates Field instances direct and adds them to the the list of form fields will still work. It does this by using a derived list class instance which invalidates the key lookup table on changes to the list so that the lookup table is rebuilt the next time it is required. Doing this wasn't as bad as I first thought it would be as hadn't gone to the extent of properly understanding the code before. Also include a semi hack in Field class constructor to detect old style use where code expects more than just the name of the field as argument and do appropriate initialisation as appropriate. In all, what the changes allow is for the following to work: from mod_python import util from StringIO import StringIO def index(req): req.content_type = 'text/plain' req.write('%s\n\n' % req.form.keys()) for key in req.form.keys(): req.write('%s %s\n' % (key, req.form.getlist(key))) req.write('\n') req.form.add_field('e', 'f') req.form.list.append(util.Field('g', StringIO('h'), "text/plain", {}, None, {})) req.form.list.extend([util.Field('i', StringIO('j'), "text/plain", {}, None, {})]) req.form.list += [util.Field('k', StringIO('l'), "text/plain", {}, None, {})] req.write('%s\n\n' % req.form.keys()) for key in req.form.keys(): req.write('%s %s\n' % (key, req.form.getlist(key))) req.write('\n') return None If this is accessed with: http://localhost:8002/~grahamd/form/?a=b&c=d&e=f I get: ['a', 'c', 'e'] a [Field('a', 'b')] c [Field('c', 'd')] e [Field('e', 'f')] ['a', 'c', 'e', 'g', 'i', 'k'] a [Field('a', 'b')] c [Field('c', 'd')] e [Field('e', 'f'), Field('e', 'f')] g [Field('g', 'h')] i [Field('i', 'j')] k [Field('k', 'l')] As the req.form.keys() method works off the lookup table, you can see how updates direct to list work okay, meaning stuff like Trac should work. You will also note that the 'dictionary' attribute has gone and is hidden elsewhere. Hopefully people will not try and directly access the equivalent, but some stern notes in the documentation saying only to use documented interface might help. If we can test the updates and especially if someone can test it with older versions of Trac that would be great, but even if we can't will mark issue as resolved after a short wait.
        Hide
        grahamd Graham Dumpleton added a comment -

        Okay, bit the bullet and made FieldStorage a bit more dictionary like after a bit of discussion on the mailing lists by adding _setitem(), __delitem(), __iter(), __repr_() and clear() on top of other dictionary like methods it already had. To say it is dictionary like is probably only true though for the case where there is a single value against all form fields. Once you start having multiple values against a form field it starts to behave a bit differently. That it isn't strictly the same has been documented.

        First difference is that subscript operator when used to access fields returns the actual value when there is only one, but a list of values when there is more than one. To always ensure one only gets a single value, should use getfirst(), if always a list, use getlist() instead. The way the subscript operator behaves is the same as cgi.FieldStorage so not doing anything different here.

        Second difference is that subscript operator when used to set the value of a field actually adds the value to an existing value if one already exists. In other words, it does not replace the existing value but adds another. This is because it uses add_field(). There is no equivalent in cgi.FieldStorage, but this behaviour is consistent with how Trac was overriding the class and providing its own _setitem_() method. If one really wants to supplant an existing value, should use del operator first to remove existing values.

        So there is some strangeness in how it works. If anyone specifically sees a big problem with the _setitem_() method and how it works, speak up.

        Show
        grahamd Graham Dumpleton added a comment - Okay, bit the bullet and made FieldStorage a bit more dictionary like after a bit of discussion on the mailing lists by adding _ setitem (), __delitem (), __iter (), __repr _() and clear() on top of other dictionary like methods it already had. To say it is dictionary like is probably only true though for the case where there is a single value against all form fields. Once you start having multiple values against a form field it starts to behave a bit differently. That it isn't strictly the same has been documented. First difference is that subscript operator when used to access fields returns the actual value when there is only one, but a list of values when there is more than one. To always ensure one only gets a single value, should use getfirst(), if always a list, use getlist() instead. The way the subscript operator behaves is the same as cgi.FieldStorage so not doing anything different here. Second difference is that subscript operator when used to set the value of a field actually adds the value to an existing value if one already exists. In other words, it does not replace the existing value but adds another. This is because it uses add_field(). There is no equivalent in cgi.FieldStorage, but this behaviour is consistent with how Trac was overriding the class and providing its own _ setitem _() method. If one really wants to supplant an existing value, should use del operator first to remove existing values. So there is some strangeness in how it works. If anyone specifically sees a big problem with the _ setitem _() method and how it works, speak up.
        Hide
        grahamd Graham Dumpleton added a comment -

        Not heard of anyone who has specifically tested to see if this works with older versions of Trac or not, but code passing tests on various Python/Apache versions so that's good enough for now.

        Show
        grahamd Graham Dumpleton added a comment - Not heard of anyone who has specifically tested to see if this works with older versions of Trac or not, but code passing tests on various Python/Apache versions so that's good enough for now.
        Hide
        grahamd Graham Dumpleton added a comment -

        After further experimentation, following the Trac model of having _setitem_() be an alias for add_field() is a pain in the neck. As reminder, the Trac way of doing things mean't that assigning using subscript operator resulted in additional field values being added rather than existing ones being replaced as the subscript operator works in every other dictionary like data structure.

        Having it work this way, it is too easy to forget it is different and your code doesn't work as expected. For newbies this could cause a lot of grief. Thus, how Trac did things should be ignored and need to make _setitem_() replace all existing values for field like a normal dictionary. If someone wants to add additional values, they should use add_field() instead.

        Doing this will not cause any problems for Trac compatibility as Trac overrides _setitem_() in a derived class anyway.

        Show
        grahamd Graham Dumpleton added a comment - After further experimentation, following the Trac model of having _ setitem _() be an alias for add_field() is a pain in the neck. As reminder, the Trac way of doing things mean't that assigning using subscript operator resulted in additional field values being added rather than existing ones being replaced as the subscript operator works in every other dictionary like data structure. Having it work this way, it is too easy to forget it is different and your code doesn't work as expected. For newbies this could cause a lot of grief. Thus, how Trac did things should be ignored and need to make _ setitem _() replace all existing values for field like a normal dictionary. If someone wants to add additional values, they should use add_field() instead. Doing this will not cause any problems for Trac compatibility as Trac overrides _ setitem _() in a derived class anyway.

          People

          • Assignee:
            grahamd Graham Dumpleton
            Reporter:
            jgallacher Jim Gallacher
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development