Apache AWF
  1. Apache AWF
  2. AWF-176

Use CharsetDecoder to convert ByteBuffer to String

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Labels:
      None

      Description

      My research has shown that CharsetDecoders are much faster for the conversion of a ByteBuffer into a String.

      Take a look here for this simple improvement:
      https://github.com/ilmich/deft/commit/6914e922051d4e0e050a58daea2790091c236b7c

        Activity

        Hide
        Julien Vermillard added a comment -

        Why do you change charset to UTF-8 ?

        Show
        Julien Vermillard added a comment - Why do you change charset to UTF-8 ?
        Hide
        Michele Zuccalà added a comment -

        why not!??!?!

        Show
        Michele Zuccalà added a comment - why not!??!?!
        Hide
        Michele Zuccalà added a comment -

        I've noticed that old code use two different charsets for the same things.
        Then I've thinked that UTF-8 can be good enough, but please explain me what are eventually wrong

        Show
        Michele Zuccalà added a comment - I've noticed that old code use two different charsets for the same things. Then I've thinked that UTF-8 can be good enough, but please explain me what are eventually wrong
        Hide
        Julien Vermillard added a comment -

        if it's only for decoding HTTP headers, I think good old ASCII 7 is enough, we need to check what the http rfc2616 require.

        Show
        Julien Vermillard added a comment - if it's only for decoding HTTP headers, I think good old ASCII 7 is enough, we need to check what the http rfc2616 require.
        Hide
        Julien Vermillard added a comment -

        after checking the RFC : US-ASCII is OK, UTF-8 is an overkill there

        Show
        Julien Vermillard added a comment - after checking the RFC : US-ASCII is OK, UTF-8 is an overkill there
        Hide
        Michele Zuccalà added a comment -

        You are right, i've take a fast look of rfc2616.
        Seems that HTTP protocol talk only in US_ASCII and other charsets are supported at higher level.
        i'll fix my code.

        but do you think that this patch can be useful!??!?!

        thanks

        Show
        Michele Zuccalà added a comment - You are right, i've take a fast look of rfc2616. Seems that HTTP protocol talk only in US_ASCII and other charsets are supported at higher level. i'll fix my code. but do you think that this patch can be useful!??!?! thanks
        Hide
        Roger Schildmeijer added a comment -

        ilmich, do you mind creating a proper patch against sandbox/trunk? Do that and I'll review it

        Show
        Roger Schildmeijer added a comment - ilmich, do you mind creating a proper patch against sandbox/trunk? Do that and I'll review it
        Hide
        Séven Le Mesle added a comment -

        Beware that the request body can use a different Charset. Headers are always US_ASCII but that can be quiet different for the body.
        Also notice that decoder has an internal state that should be flushed after use. The JavaDoc recommends to call first the reset method.
        See http://download.oracle.com/javase/6/docs/api/java/nio/charset/CharsetDecoder.html

        Show
        Séven Le Mesle added a comment - Beware that the request body can use a different Charset. Headers are always US_ASCII but that can be quiet different for the body. Also notice that decoder has an internal state that should be flushed after use. The JavaDoc recommends to call first the reset method. See http://download.oracle.com/javase/6/docs/api/java/nio/charset/CharsetDecoder.html
        Hide
        Roger Schildmeijer added a comment -

        good point Séven. We have https://issues.apache.org/jira/browse/DEFT-120 (HTTP Post decoding) that should solve the encoding for the body

        Show
        Roger Schildmeijer added a comment - good point Séven. We have https://issues.apache.org/jira/browse/DEFT-120 (HTTP Post decoding) that should solve the encoding for the body
        Hide
        Michele Zuccalà added a comment - - edited

        I'm a bit confused

        1) in Delft I have not found a clear distinction between the decoding of header and body, then what is the best charset !??!?!
        2) Deft can be multithreaded, so it would be safer to instantiate a new CharsetDecoder every request !?!??!

        Show
        Michele Zuccalà added a comment - - edited I'm a bit confused 1) in Delft I have not found a clear distinction between the decoding of header and body, then what is the best charset !??!?! 2) Deft can be multithreaded, so it would be safer to instantiate a new CharsetDecoder every request !?!??!
        Hide
        Roger Schildmeijer added a comment -

        Sorry about the confusion (Some of the confusion is probably my bad).

        The encoding/decoding code in the web server component (HttpRequest.of and HttpRequest.continueParsing) is not perfect and needs an overhaul..
        I think Séven is correct in the comment above regarding the charsets.

        Show
        Roger Schildmeijer added a comment - Sorry about the confusion (Some of the confusion is probably my bad). The encoding/decoding code in the web server component (HttpRequest.of and HttpRequest.continueParsing) is not perfect and needs an overhaul.. I think Séven is correct in the comment above regarding the charsets.
        Hide
        Michele Zuccalà added a comment - - edited

        don't worry, Roger
        I have a bit of trouble finding a clear documentation about this subject.
        However after some reading, the best solution seems to be this:

        1) protocol-level encode/decode with default charset ISO-8859-1 or US-ASCII
        2) other charset like utf-8, encoded/decoded by the application developer

        regarding CharsetDecoder, the javadoc says:

        "Instances of this class are not safe for use by multiple concurrent threads."

        so it seems safer to link a new instance of CharsetDecoder every single IOLoop(or better in HttpProtocol class), right ?!?!?

        can you tell me where I find the new repository!?!?
        if you agree I would for now, continue to update my old github fork, to generate the patches more quickly.

        Show
        Michele Zuccalà added a comment - - edited don't worry, Roger I have a bit of trouble finding a clear documentation about this subject. However after some reading, the best solution seems to be this: 1) protocol-level encode/decode with default charset ISO-8859-1 or US-ASCII 2) other charset like utf-8, encoded/decoded by the application developer regarding CharsetDecoder, the javadoc says: "Instances of this class are not safe for use by multiple concurrent threads." so it seems safer to link a new instance of CharsetDecoder every single IOLoop(or better in HttpProtocol class), right ?!?!? can you tell me where I find the new repository!?!? if you agree I would for now, continue to update my old github fork, to generate the patches more quickly.
        Hide
        Roger Schildmeijer added a comment -

        don't worry, Roger
        I have a bit of trouble finding a clear documentation about this subject.

        Thats probably why the existing encoding/decoding issues (in the current code base) ended up there from the beginning.

        However after some reading, the best solution seems to be this:

        (Thanks for the research)

        1) protocol-level encode/decode with default charset ISO-8859-1 or US-ASCII
        2) other charset like utf-8, encoded/decoded by the application developer

        Ok.

        regarding CharsetDecoder, the javadoc says:

        "Instances of this class are not safe for use by multiple concurrent threads."

        so it seems safer to link a new instance of CharsetDecoder every single IOLoop, right ?!?!?

        Sounds reasonable. Take a look at the ThreadLocal MessageDigest in org.deftserver.util.HttpUtil

        can you tell me where I find the new repository!?!?

        Take a look at http://incubator.apache.org/projects/deft.html.
        The svn repo is located at: https://svn.apache.org/repos/asf/incubator/deft/

        if you agree I would for now, continue to update my old github fork, to generate the patches more quickly.

        My best advice is to svn checkout from the Apache repo. (you could still use git locally, I do that..)

        Show
        Roger Schildmeijer added a comment - don't worry, Roger I have a bit of trouble finding a clear documentation about this subject. Thats probably why the existing encoding/decoding issues (in the current code base) ended up there from the beginning. However after some reading, the best solution seems to be this: (Thanks for the research) 1) protocol-level encode/decode with default charset ISO-8859-1 or US-ASCII 2) other charset like utf-8, encoded/decoded by the application developer Ok. regarding CharsetDecoder, the javadoc says: "Instances of this class are not safe for use by multiple concurrent threads." so it seems safer to link a new instance of CharsetDecoder every single IOLoop, right ?!?!? Sounds reasonable. Take a look at the ThreadLocal MessageDigest in org.deftserver.util.HttpUtil can you tell me where I find the new repository!?!? Take a look at http://incubator.apache.org/projects/deft.html . The svn repo is located at: https://svn.apache.org/repos/asf/incubator/deft/ if you agree I would for now, continue to update my old github fork, to generate the patches more quickly. My best advice is to svn checkout from the Apache repo. (you could still use git locally, I do that..)
        Hide
        Michele Zuccalà added a comment -

        ok, I will attach a patch soon
        Thanks Roger.

        Show
        Michele Zuccalà added a comment - ok, I will attach a patch soon Thanks Roger.
        Hide
        Séven Le Mesle added a comment -

        I am working on a Buffer Oriented HTTP Parser the code is available on my fork but I should integrate updates to the branch prior to propose the patch. This parser will be able to handle different encoding.

        Show
        Séven Le Mesle added a comment - I am working on a Buffer Oriented HTTP Parser the code is available on my fork but I should integrate updates to the branch prior to propose the patch. This parser will be able to handle different encoding.
        Hide
        Michele Zuccalà added a comment -

        After further research I found that using the method

        CharsetDecoder.decode (ByteBuffer)

        there is no need to reset, decode and flush () because it's automatic.
        So I think that with the current implementation of the HTTP parser, my first patch is just fine.
        I just have to change the charsets in US_ASCII and ISO-8859-1.

        Show
        Michele Zuccalà added a comment - After further research I found that using the method CharsetDecoder.decode (ByteBuffer) there is no need to reset, decode and flush () because it's automatic. So I think that with the current implementation of the HTTP parser, my first patch is just fine. I just have to change the charsets in US_ASCII and ISO-8859-1.

          People

          • Assignee:
            Unassigned
            Reporter:
            Michele Zuccalà
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:

              Development