Wicket
  1. Wicket
  2. WICKET-1627

AbstractRequestTargetUrlCodingStrategy improper user of URLEncoder.encode

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.1, 1.3.2, 1.3.3, 1.4-M1
    • Fix Version/s: 1.4-RC1
    • Component/s: wicket
    • Labels:
      None
    • Environment:
      Tomcat or Jetty (probably others)

      Description

      The use of URLEncoder.encode is incorrect in this scenario. The URLEncoder is meant for encoding query string values - not values that appear in the path portion of a URI.

      Because the AbstractRequestTargetUrlCodingStrategy is used by other classes to encode values that appear in the path, problems can occur when that path has spaces. For example, the parameter "message with spaces and+some+pluses" is encoded as follows in a URL:

      http://localhost:8080/bugs/home/message/message+with+spaces+and%2Bsome%2Bpluses/

      However, the resulting request.getServletPath() call returns this:

      /home/message/message+with+spaces+and+some+plusses=bug/

      Note that the + in the path are not turned back into spaces. This is the correct behavior and is seen in both Tomcat and Jetty.

      See the RFC (http://www.ietf.org/rfc/rfc2396.txt) for a full description of what should or should not be encoded.

      /**

      • Url encodes a string
      • @param string
      • string to be encoded
      • @return encoded string
        */
        protected String urlEncode(String string)
        {
        try { return URLEncoder.encode(string, Application.get().getRequestCycleSettings() .getResponseRequestEncoding()); }

        catch (UnsupportedEncodingException e)

        { log.error(e.getMessage(), e); return string; }

      }

        Activity

        Hide
        Johan Compagner added a comment -

        ehmm if i have this param:

        message=message with spaces and+some+plusses

        i add that to an url. Then no matter what url coding strategy i am using
        if i do later on:

        PageParameters.getString("message") i want exactly this: "message with spaces and+some+plusses" as the value..

        so how to encode that correctly?

        Show
        Johan Compagner added a comment - ehmm if i have this param: message=message with spaces and+some+plusses i add that to an url. Then no matter what url coding strategy i am using if i do later on: PageParameters.getString("message") i want exactly this: "message with spaces and+some+plusses" as the value.. so how to encode that correctly?
        Hide
        Doug Donohoe added a comment -

        I have code working to fix this. It is a bit involved - it took me all day to figure it out and I had to dig into the RFC2396 as well as jetty/tomcat source.

        I'll post the code shortly. I think there is another possible bug related to this area which I'll also note shortly.

        The problem boils down to the fact that you have to URL encode differently based on whether it is a query string parameter (?foo=bar) or a path parameter (/foo/bar/).

        Show
        Doug Donohoe added a comment - I have code working to fix this. It is a bit involved - it took me all day to figure it out and I had to dig into the RFC2396 as well as jetty/tomcat source. I'll post the code shortly. I think there is another possible bug related to this area which I'll also note shortly. The problem boils down to the fact that you have to URL encode differently based on whether it is a query string parameter (?foo=bar) or a path parameter (/foo/bar/).
        Hide
        Doug Donohoe added a comment -

        Here is the patch which address both 1624 and 1627.

        I had to create a WicketURLEncoder and WicketURLDecoder to replace java.net.URLEncoder and java.net.URLDecoder respectively.

        I search in the wicket project and replaced the usages with this new class. I also deprecated the RequestUtils decode/encode API.

        See the RFC http://www.ietf.org/rfc/rfc2396.txt for details of what is and is not allowed in URLs. The main issue boils down to the encoding of spaces with '+' as is done by java.net.URLEncoder. While this is okay for query parameters, it is not for the path portion of a URL. The server will not decode +'s to spaces in a path component.

        There are still some things that need to be addressed:

        1) Hardcoding of "UTF-8" versus Application.get().getRequestCycleSettings().getResponseRequestEncoding()

        There are many places (notably RequestCycle and RequestUtils) that hardcode UTF-8 when decoding and encoding values. Cleary this will cause an issue if the application's request encoding is not UTF-8.

        2) I think RequestCycle.urlFor has a bug in it. I noted this in the comments. Essentially, values are beeing URL-encoded and put into PageParameters. This likely leads to double-encoding if a mount strategy is used.

        I would request that developers with more experience in the code base review my changes and comment on #1 and #2 above.

        Show
        Doug Donohoe added a comment - Here is the patch which address both 1624 and 1627. I had to create a WicketURLEncoder and WicketURLDecoder to replace java.net.URLEncoder and java.net.URLDecoder respectively. I search in the wicket project and replaced the usages with this new class. I also deprecated the RequestUtils decode/encode API. See the RFC http://www.ietf.org/rfc/rfc2396.txt for details of what is and is not allowed in URLs. The main issue boils down to the encoding of spaces with '+' as is done by java.net.URLEncoder. While this is okay for query parameters, it is not for the path portion of a URL. The server will not decode +'s to spaces in a path component. There are still some things that need to be addressed: 1) Hardcoding of "UTF-8" versus Application.get().getRequestCycleSettings().getResponseRequestEncoding() There are many places (notably RequestCycle and RequestUtils) that hardcode UTF-8 when decoding and encoding values. Cleary this will cause an issue if the application's request encoding is not UTF-8. 2) I think RequestCycle.urlFor has a bug in it. I noted this in the comments. Essentially, values are beeing URL-encoded and put into PageParameters. This likely leads to double-encoding if a mount strategy is used. I would request that developers with more experience in the code base review my changes and comment on #1 and #2 above.
        Hide
        Doug Donohoe added a comment -

        I've revised the patch:

        1) Moved the fetching of the requerst/response encoding into the WicketURLEncoder/WicketURLDecoder. I did this because every caller was doing it anyhow and it cleans up the code.

        2) Removed hard-coded UTF-8 references used for encoding/decoding. Now all classes use the application setting.

        Show
        Doug Donohoe added a comment - I've revised the patch: 1) Moved the fetching of the requerst/response encoding into the WicketURLEncoder/WicketURLDecoder. I did this because every caller was doing it anyhow and it cleans up the code. 2) Removed hard-coded UTF-8 references used for encoding/decoding. Now all classes use the application setting.
        Hide
        Johan Compagner added a comment -

        WebRequestCodingStrategy.
        protected CharSequence encode(RequestCycle requestCycle,
        IListenerInterfaceRequestTarget requestTarget)

        there we create an url without any encoding

        for example i expected more test to fail this is for example an output StatelessStatefullUrlCodingStrategyTest
        and then the file: StatefulPage_QueryString_Result.html

        there this is the result:

        <a href="?wicket:bookmarkablePage=%3Aorg.apache.wicket.markup.html.autolink.Index" wicket:id="indexLink">go to index</a>
        <a href="?wicket:interface=:0:actionLink::ILinkListener::" wicket:id="actionLink">Link clicked <span wicket:id="linkClickCount">0</span> times</a>
        <form action="statefull?wicket:interface=%3A0%3Astatelessform%3A%3AIFormSubmitListener%3A%3A" wicket:id="statelessform" method="post" id="statelessform1"><div style="display:none"><input type="hidden" name="statelessform1_hf_0" id="statelessform1_hf_0" /></div>

        so you see the frist is a bookmarkable that seems to have encoded
        the second is a call that goes through the above encode method and nothing is done
        the third is a mount encoder that does encode the wicket:interface
        Why is for example there : in wicket:interface not encoded but it is in the value? (%3A)

        and the question is is it really needed? i like the clean ?wicket:interface=:0:actionLink::ILinkListener::

        i guess we should make a few more test also with QueryCoding strategy where the mount has utf8? and params also have some stuff that needs to be encoded

        Show
        Johan Compagner added a comment - WebRequestCodingStrategy. protected CharSequence encode(RequestCycle requestCycle, IListenerInterfaceRequestTarget requestTarget) there we create an url without any encoding for example i expected more test to fail this is for example an output StatelessStatefullUrlCodingStrategyTest and then the file: StatefulPage_QueryString_Result.html there this is the result: <a href="?wicket:bookmarkablePage=%3Aorg.apache.wicket.markup.html.autolink.Index" wicket:id="indexLink">go to index</a> <a href="?wicket:interface=:0:actionLink::ILinkListener::" wicket:id="actionLink">Link clicked <span wicket:id="linkClickCount">0</span> times</a> <form action="statefull?wicket:interface=%3A0%3Astatelessform%3A%3AIFormSubmitListener%3A%3A" wicket:id="statelessform" method="post" id="statelessform1"><div style="display:none"><input type="hidden" name="statelessform1_hf_0" id="statelessform1_hf_0" /></div> so you see the frist is a bookmarkable that seems to have encoded the second is a call that goes through the above encode method and nothing is done the third is a mount encoder that does encode the wicket:interface Why is for example there : in wicket:interface not encoded but it is in the value? (%3A) and the question is is it really needed? i like the clean ?wicket:interface=:0:actionLink::ILinkListener:: i guess we should make a few more test also with QueryCoding strategy where the mount has utf8? and params also have some stuff that needs to be encoded
        Hide
        Doug Donohoe added a comment - - edited

        RFC 2396 states that the colon is a reserved symbol in query strings. Thus, anything after a ? needs to have the colon encoded. Anything prior to a ? can have a colon.

        The WicketURLEncoder / WicketURLDecoder I created in the patch tries to capture these rules.

        I'll have to look at the examples above.

        Show
        Doug Donohoe added a comment - - edited RFC 2396 states that the colon is a reserved symbol in query strings. Thus, anything after a ? needs to have the colon encoded. Anything prior to a ? can have a colon. The WicketURLEncoder / WicketURLDecoder I created in the patch tries to capture these rules. I'll have to look at the examples above.
        Hide
        Johan Compagner added a comment -

        hmm but that is horrible if a colon is encoded
        because then the default wicket url (the redirect url) will be completely blown up.
        and as far as i can see it works fine.
        The strange thing is that colons in the url before the ? looks weird to me.. in the query it looks fine.

        Show
        Johan Compagner added a comment - hmm but that is horrible if a colon is encoded because then the default wicket url (the redirect url) will be completely blown up. and as far as i can see it works fine. The strange thing is that colons in the url before the ? looks weird to me.. in the query it looks fine.
        Hide
        Doug Donohoe added a comment - - edited

        The RFC says that a ":" in the query string is reserved. So, technically any wicket URL that passed a ":" in the query string that is not encoded is in violation of the RFC. Why the ":" is reserved is not clear (the RFC is not specific, but I suspect it is due to the use of colon earlier in the path like http://foo:8080).

        What happens in practice is obviously another matter. It clearly seems to work not encoded. We could perhaps mark the ':' as not needing encoding in the WicketURLEncoder.QUERY_INSTANCE. Will this cause problems in the future? Unlikely. However, does Wicket wish to violate the RFC purposefully?

        Regardless of the disposition of the colon, there are clearly places where encoding is done inconsistently, incorrectly or perhaps the worst violation: double-encoding. There is a very real need to encode path components differently from query string components. My patch attempts to lay the framework for the correct encoding logic by clearly having path and query encoding instances.

        Perhaps other Wicket developers can comment here. I've looked at the code a little more in-depth and there doesn't seem to be a centralized place that path/query encoding is done. What is the history here?

        Show
        Doug Donohoe added a comment - - edited The RFC says that a ":" in the query string is reserved. So, technically any wicket URL that passed a ":" in the query string that is not encoded is in violation of the RFC. Why the ":" is reserved is not clear (the RFC is not specific, but I suspect it is due to the use of colon earlier in the path like http://foo:8080 ). What happens in practice is obviously another matter. It clearly seems to work not encoded. We could perhaps mark the ':' as not needing encoding in the WicketURLEncoder.QUERY_INSTANCE. Will this cause problems in the future? Unlikely. However, does Wicket wish to violate the RFC purposefully? Regardless of the disposition of the colon, there are clearly places where encoding is done inconsistently, incorrectly or perhaps the worst violation: double-encoding. There is a very real need to encode path components differently from query string components. My patch attempts to lay the framework for the correct encoding logic by clearly having path and query encoding instances. Perhaps other Wicket developers can comment here. I've looked at the code a little more in-depth and there doesn't seem to be a centralized place that path/query encoding is done. What is the history here?
        Hide
        Doug Donohoe added a comment - - edited

        As an example of the inconsistency:

        When I have a normal form in my page [form = new Form(...)], the action gets output as follows:

        action="?wicket:interface=:1:form::IFormSubmitListener::"

        When I make the change to [form = new StatelessForm(...)], changing nothing else, the action changes to:

        action="search///0?wicket%3Ainterface=%3A1%3Aform%3A%3AIFormSubmitListener%3A%3A"

        Aside from the "search///0" addition, which is expected due to the mount strategy I'm using, the query string is URL encoded.

        Show
        Doug Donohoe added a comment - - edited As an example of the inconsistency: When I have a normal form in my page [form = new Form(...)] , the action gets output as follows: action="?wicket:interface=:1:form::IFormSubmitListener::" When I make the change to [form = new StatelessForm(...)] , changing nothing else, the action changes to: action="search/ / /0?wicket%3Ainterface=%3A1%3Aform%3A%3AIFormSubmitListener%3A%3A" Aside from the "search/ / /0" addition, which is expected due to the mount strategy I'm using, the query string is URL encoded.
        Hide
        Doug Donohoe added a comment -

        I strongly suspect that this bug:

        https://issues.apache.org/jira/browse/WICKET-1580

        is closely related.

        Show
        Doug Donohoe added a comment - I strongly suspect that this bug: https://issues.apache.org/jira/browse/WICKET-1580 is closely related.
        Hide
        Doug Donohoe added a comment -

        I'm working on a test to show how the four types of forms behave (all combinations of Form/Statelessform with GET/POST):

        RUNNING: org.apache.wicket.integration.encoding.StatefulGetPage
        <form action="" wicket:id="form" id="form1" method="get">
        <input type="hidden" name="form1_hf_0" id="form1_hf_0" />
        <input type="hidden" name="wicket:interface" value=":0:form::IFormSubmitListener::" />

        RUNNING: org.apache.wicket.integration.encoding.StatefulPostPage
        <form action="?wicket:interface=:1:form::IFormSubmitListener::" wicket:id="form" id="form2" method="post">
        <input type="hidden" name="form2_hf_0" id="form2_hf_0" />

        RUNNING: org.apache.wicket.integration.encoding.StatelessGetPage
        <form action="" wicket:id="form" id="form3" method="get">
        <input type="hidden" name="form3_hf_0" id="form3_hf_0" />
        <input type="hidden" name="wicket:bookmarkablePage" value="%3Aorg.apache.wicket.integration.encoding.StatelessGetPage" />
        <input type="hidden" name="wicket:interface" value="%3A2%3Aform%3A%3AIFormSubmitListener%3A%3A" />

        RUNNING: org.apache.wicket.integration.encoding.StatelessPostPage
        <form action="?wicket:bookmarkablePage=%3Aorg.apache.wicket.integration.encoding.StatelessPostPage&wicket:interface=%3A3%3Aform%3A%3AIFormSubmitListener%3A%3A" wicket:id="form" id="form4" method="post">
        <input type="hidden" name="form4_hf_0" id="form4_hf_0" />

        As you can see, sometimes the wicket:interface value is url-encoded; other times it is not. In the Stateless Get example, the wicket:interface is incorrectly URL encoded (it should not be because the browser does a 2nd URL-encode when submitting which causes BUG 1580). It should be HTML escaped however.

        Show
        Doug Donohoe added a comment - I'm working on a test to show how the four types of forms behave (all combinations of Form/Statelessform with GET/POST): RUNNING: org.apache.wicket.integration.encoding.StatefulGetPage <form action="" wicket:id="form" id="form1" method="get"> <input type="hidden" name="form1_hf_0" id="form1_hf_0" /> <input type="hidden" name="wicket:interface" value=":0:form::IFormSubmitListener::" /> RUNNING: org.apache.wicket.integration.encoding.StatefulPostPage <form action="?wicket:interface=:1:form::IFormSubmitListener::" wicket:id="form" id="form2" method="post"> <input type="hidden" name="form2_hf_0" id="form2_hf_0" /> RUNNING: org.apache.wicket.integration.encoding.StatelessGetPage <form action="" wicket:id="form" id="form3" method="get"> <input type="hidden" name="form3_hf_0" id="form3_hf_0" /> <input type="hidden" name="wicket:bookmarkablePage" value="%3Aorg.apache.wicket.integration.encoding.StatelessGetPage" /> <input type="hidden" name="wicket:interface" value="%3A2%3Aform%3A%3AIFormSubmitListener%3A%3A" /> RUNNING: org.apache.wicket.integration.encoding.StatelessPostPage <form action="?wicket:bookmarkablePage=%3Aorg.apache.wicket.integration.encoding.StatelessPostPage&wicket:interface=%3A3%3Aform%3A%3AIFormSubmitListener%3A%3A" wicket:id="form" id="form4" method="post"> <input type="hidden" name="form4_hf_0" id="form4_hf_0" /> As you can see, sometimes the wicket:interface value is url-encoded; other times it is not. In the Stateless Get example, the wicket:interface is incorrectly URL encoded (it should not be because the browser does a 2nd URL-encode when submitting which causes BUG 1580). It should be HTML escaped however.
        Hide
        Doug Donohoe added a comment - - edited

        So this is the most up-to-date RFC on the topic:

        http://tools.ietf.org/html/rfc3986#section-3.4

        This version also seems to indicate that the colon is reserved, but the query definition doesn't make explicit use of it, so I think it is safe to use the : in the query without URL-encoding...

        Show
        Doug Donohoe added a comment - - edited So this is the most up-to-date RFC on the topic: http://tools.ietf.org/html/rfc3986#section-3.4 This version also seems to indicate that the colon is reserved, but the query definition doesn't make explicit use of it, so I think it is safe to use the : in the query without URL-encoding...
        Hide
        Doug Donohoe added a comment -

        Ok, I've create a patch and fixed all the test cases. Prior to this pach, Wicket was inconsistent about when it encoded query params. For example, there were many instances where you would get

        &lta href="?wicket:bookmarkablePage=%3Aorg.apache.wicket.markup.html.link.XmlPage">Home</a>

        Note the : before the = and the %3A after.

        This patch makes everything totally consistent. I had to adjust the expected results for several unit tests.

        So, the summary of changes:

        1) New WicketURLEncoder and WicketURLDecoder classes with static instances for QUERY and PATH components.
        2) Consistent and correct usage of these classes. I think I've found every place in the code and use the appropriate one based on case.
        3) Fixed the double-unencoding of the servlet path.
        4) Fixed the double-encoding of some query params in requestCycle.urlFor
        5) Fixed incorrect Form hidden field encoding

        Show
        Doug Donohoe added a comment - Ok, I've create a patch and fixed all the test cases. Prior to this pach, Wicket was inconsistent about when it encoded query params. For example, there were many instances where you would get &lta href="?wicket:bookmarkablePage=%3Aorg.apache.wicket.markup.html.link.XmlPage">Home</a> Note the : before the = and the %3A after. This patch makes everything totally consistent. I had to adjust the expected results for several unit tests. So, the summary of changes: 1) New WicketURLEncoder and WicketURLDecoder classes with static instances for QUERY and PATH components. 2) Consistent and correct usage of these classes. I think I've found every place in the code and use the appropriate one based on case. 3) Fixed the double-unencoding of the servlet path. 4) Fixed the double-encoding of some query params in requestCycle.urlFor 5) Fixed incorrect Form hidden field encoding
        Hide
        Doug Donohoe added a comment -

        Minor change to patch - updated WicketURLEncoder to encode ; after I realized it is used in paths to separate jsession-id. Also made class more subclassable.

        Show
        Doug Donohoe added a comment - Minor change to patch - updated WicketURLEncoder to encode ; after I realized it is used in paths to separate jsession-id. Also made class more subclassable.
        Hide
        Johan Compagner added a comment -

        i applied the patch. what i still find strange is that we still have places where we dont encode stuff like at least 2 encode methods of WebRequestCodingStrategy, i guess we also need to look at that a bit more

        Show
        Johan Compagner added a comment - i applied the patch. what i still find strange is that we still have places where we dont encode stuff like at least 2 encode methods of WebRequestCodingStrategy, i guess we also need to look at that a bit more
        Hide
        Doug Donohoe added a comment -

        Thanks Johan. I'll take a look at that class to see if I can offer any insight.

        Show
        Doug Donohoe added a comment - Thanks Johan. I'll take a look at that class to see if I can offer any insight.
        Hide
        Martijn Dashorst added a comment -

        Moved to next milestone release.

        Show
        Martijn Dashorst added a comment - Moved to next milestone release.
        Hide
        Igor Vaynberg added a comment -

        doug, are we ready to close this and 1624?

        Show
        Igor Vaynberg added a comment - doug, are we ready to close this and 1624?
        Hide
        Doug Donohoe added a comment -

        Yes.

        Show
        Doug Donohoe added a comment - Yes.

          People

          • Assignee:
            Igor Vaynberg
            Reporter:
            Doug Donohoe
          • Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development