HttpComponents HttpClient / HTTPCLIENT-293

Provide support for non-ASCII charsets in the multipart disposition-content header

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0 Alpha
    • Fix Version/s: 4.0 Beta 2
    • Component/s: HttpMime
    • Labels:
      None
    • Environment:
      Operating System: All
      Platform: All

      Description

      Because of the following line in getAsciiBytes:
      data.getBytes("US-ASCII");

      the returned string is modified if it contains non-ASCII (e.g. accented Latin) characters.

      Ex: Document non-controlé -> Document non-control?
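
The substitution described above can be reproduced with the JDK alone. The following is a minimal sketch, not the HttpClient source; the class and method names are made up for illustration. It uses the `getBytes(Charset)` overload, which is documented to replace unmappable characters with the charset's replacement byte (`?` for US-ASCII).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only illustrates the effect reported in this issue.
class AsciiLossDemo {
    // Encode to US-ASCII and decode again, as a receiver of the header would see it.
    static String viaAscii(String s) {
        // Characters that US-ASCII cannot represent are replaced with '?'.
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u00e9 is the accented "e" from the example in the description.
        System.out.println(viaAscii("Document non-control\u00e9"));
    }
}
```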

        Activity

        Transition           Time In Source Status   Execution Times   Last Executer       Last Execution Date
        Open -> Resolved     1851d 12h 1m            1                 Oleg Kalnichevski   02/Dec/08 12:50
        Resolved -> Closed   783d 22h 8m             1                 Oleg Kalnichevski   25/Jan/11 10:58
        Oleg Kalnichevski made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Oleg Kalnichevski made changes -
        Resolution Fixed [ 1 ]
        Status Open [ 1 ] Resolved [ 5 ]
        Oleg Kalnichevski added a comment -

        MultipartEntity now encodes non-ASCII characters in the Content-Disposition header using the content charset when used in the browser compatibility mode, and replaces non-ASCII characters with ? when used in the strict mode. One always has the option to encode the file name using one of the standard encoding mechanisms described in RFC 2231 and RFC 2047.

        Closing this issue as resolved.

        Oleg
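
The difference between the two modes described in this comment can be sketched with plain JDK calls. This is an illustration under stated assumptions, not the HttpMime implementation: the class, enum, and method names below are hypothetical and only mirror the behaviour the comment describes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the two behaviours; names are illustrative only.
class DispositionModes {
    enum Mode { STRICT, BROWSER_COMPATIBLE }

    // Returns the bytes that would be written for the file name in the header.
    static byte[] encodeFileName(String fileName, Mode mode, Charset contentCharset) {
        // Strict: US-ASCII only, so non-ASCII characters degrade to '?'.
        // Browser compatible: raw bytes in the content charset, as browsers send them.
        Charset cs = (mode == Mode.STRICT) ? StandardCharsets.US_ASCII : contentCharset;
        return fileName.getBytes(cs);
    }
}
```

In strict mode the receiver loses the original character; in browser compatibility mode the receiver has to know (or guess) the charset, because the header itself does not declare it.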

        Oleg Kalnichevski made changes -
        Fix Version/s 4.0 Alpha 5 [ 12313110 ]
        Fix Version/s 4.0 beta 2 [ 12313164 ]
        Oleg Kalnichevski made changes -
        Fix Version/s 4.0 Alpha 5 [ 12313110 ]
        Fix Version/s 4.0 Final [ 12311094 ]
        Roland Weber made changes -
        Component/s HttpClient [ 12311010 ]
        Component/s HttpMime [ 12312149 ]
        Ortwin Glück added a comment -

        Interesting, although I have never seen it being used in the wild. By the way, RFC 2184 is obsoleted by RFC 2231.

        Mohammad Rezaei added a comment -

        Ortwin, I think the RFC is worded strangely. It is certainly true that Section 2.3 says US-ASCII only, but it seems like that section is outdated.

        In Section 2, there is a very large note that reads:

        NOTE ON PARAMETER VALUE LENGHTS: A short (length <= 78 characters)
        parameter value containing only non-`tspecials' characters SHOULD be
        represented as a single `token'. A short parameter value containing
        only ASCII characters, but including `tspecials' characters, SHOULD
        be represented as `quoted-string'. Parameter values longer than 78
        characters, or which contain non-ASCII characters, MUST be encoded as
        specified in [RFC 2184].

        Looking at the types of parameters, 4 of them are dates and one is an integer. The only one that's a string is the filename, so the note above must refer to it. RFC 2184 describes how to encode the non-ASCII case. Interestingly, it looks like IE does not follow RFC 2184.

        Section 2.3 refers to RFC 2045, which is older than RFC 2184.

        Overall, I'd say the RFC is unclear on this issue.

        Thanks
        Moh
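
For reference, the encoding that RFC 2184 (and its successor RFC 2231) defines for non-ASCII parameter values looks roughly like `filename*=UTF-8''...` with percent-escaped bytes. The sketch below is a hand-rolled illustration, not code from HttpClient; the class name is hypothetical, the charset is fixed to UTF-8, the language tag is left empty, and the set of characters left unescaped is deliberately conservative.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the RFC 2231 extended parameter syntax.
class Rfc2231 {
    // Conservative set of punctuation left unescaped; everything else
    // that is not an ASCII letter or digit is percent-encoded.
    private static final String SAFE = "!#$&+-.^_`|~";

    static String encodeParam(String name, String value) {
        // Form: name*=charset'language'percent-encoded-value (language left empty).
        StringBuilder sb = new StringBuilder(name).append("*=UTF-8''");
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean alnum = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
            if (alnum || (c < 0x80 && SAFE.indexOf(c) >= 0)) {
                sb.append((char) c);
            } else {
                sb.append('%').append(String.format("%02X", c));
            }
        }
        return sb.toString();
    }
}
```

Whether a given server-side MIME parser understands this form is a separate question, which is the interoperability concern raised in this thread.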

        Ortwin Glück added a comment -

        We should be spec compliant and not "compatible with most" implementations - I don't care how wrongly IE7 implements this. RFC 2183, Section 2.3 clearly states the limitation to ASCII. People should just accept this limitation instead of trying to bend the standard to their needs. Standards are made to ensure interoperability, for $DEITY's sake. If you need to pass a non-ASCII filename, this is simply not the place for it. You could, for instance, add another text/plain MIME part with a well-defined charset and pass the file name there.
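
The workaround suggested here, carrying the real file name in a separate text/plain part with an explicit charset, could look roughly like this on the wire. This is a sketch that builds the raw bytes by hand with the JDK; the class name, the field name used in the test, and the choice of UTF-8 with 8bit transfer encoding are assumptions, and the receiving application would need to be written to read this extra part.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical builder for one extra multipart part that carries the file name as body text.
class FileNamePart {
    static byte[] build(String boundary, String fieldName, String realFileName) {
        // Headers stay pure ASCII; only the body carries non-ASCII data.
        String headers = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName + "\"\r\n"
                + "Content-Type: text/plain; charset=UTF-8\r\n"
                + "Content-Transfer-Encoding: 8bit\r\n"
                + "\r\n";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] h = headers.getBytes(StandardCharsets.US_ASCII);
        out.write(h, 0, h.length);
        // The body is encoded with the charset declared in Content-Type.
        byte[] body = realFileName.getBytes(StandardCharsets.UTF_8);
        out.write(body, 0, body.length);
        out.write('\r');
        out.write('\n');
        return out.toByteArray();
    }
}
```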

        Ahmed Ashour added a comment -

        One of the HtmlUnit users came across this bug while trying to upload a file with a non-ASCII name.

        Sniffing the traffic generated by IE7 shows that "filename" is encoded with the page charset, as Dolf has kindly suggested.

        However, IE7 does not send the charset after the 'Content-Type':

        ---------------------------
        Content-Disposition: form-data; name="field_name"; filename="C:\non_ascii.txt"
        Content-Type: text/plain
        ---------------------------

        So, to mimic this behaviour exactly, I would appreciate it if the part charset were separated from the "Content-Disposition" charset.

        Many thanks.
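
The separation requested here can be pictured as follows: the header bytes are produced with one charset (the page charset), while the Content-Type line carries no charset parameter at all. This is a sketch of the observed IE7 output only, with hypothetical class, method, and parameter names; it is not how HttpClient structures its parts.

```java
import java.nio.charset.Charset;

// Hypothetical sketch: header charset is chosen independently of any part charset.
class BrowserStyleHeader {
    static byte[] partHeader(String field, String fileName, String contentType, Charset headerCharset) {
        String h = "Content-Disposition: form-data; name=\"" + field
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: " + contentType + "\r\n"   // no "; charset=..." appended
                + "\r\n";
        // The file name is written as raw bytes in the header charset.
        return h.getBytes(headerCharset);
    }
}
```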

        Oleg Kalnichevski added a comment -

        Dolf,
        My bad. I overlooked the fact that the reference to MimeUtility was inside a comment block.

        Sebastian,
        I believe we can depend on JavaMail, as long as we do not have it in the repository and do not ship it with the release packages. Since we do not bundle dependencies with HttpClient anyway, this should not be a problem for us. Having said all that, I think Commons Codec, which HttpClient already depends upon, provides all the necessary codecs (BASE64 and quoted-printable). It is just a matter of someone taking up this job.

        Folks,
        Any objections to relaxing the compliance with the spec and applying the patch submitted by Dolf?

        Oleg
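
As a rough picture of what a BASE64-based codec would produce, here is an RFC 2047 "B" encoded-word built with the JDK's `java.util.Base64` rather than Commons Codec or JavaMail. The class name is hypothetical, the charset is fixed to UTF-8, and the sketch ignores the RFC 2047 limit of 75 characters per encoded-word, so long names would need to be split.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: wraps text in an RFC 2047 "B" (Base64) encoded-word.
class EncodedWordB {
    static String encode(String text) {
        String b64 = Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
        // Form: =?charset?encoding?encoded-text?=
        return "=?UTF-8?B?" + b64 + "?=";
    }
}
```

The result is pure ASCII, so it survives an ASCII-only header; the catch discussed in this thread is that the receiving MIME parser must decode it.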

        Sebb added a comment -

        Might even be a problem for 4.0 - the license for the JavaMail jar is such that it cannot be distributed by the ASF, as far as I am aware.

        Might be worth checking if Commons-Lang has anything suitable, e.g. in StringEscapeUtils.

        Dolf Dijkstra added a comment -

        Hi Oleg,

        Maybe the report is not clear.

        According to the multipart MIME spec, the correct behaviour would be to encode the filename via MimeUtility. The problem with that is that the MIME parsers I have tested with don't handle this correctly.
        When just encoding the filename with the charset of the request, it works, but it is not according to the spec.

        The patch I handed in works on most MIME parsers (as IE is doing this too) but is not according to the spec.

        I understand that you don't want to introduce a new dependency, but maybe you don't need to, as the patch works without MimeUtility. The line containing MimeUtility is commented out.

        Dolf

        Oleg Kalnichevski added a comment -

        Dolf,

        We simply cannot introduce a new dependency in the HttpClient 3.x code line. This will have to wait until 4.0.

        Oleg

        Dolf Dijkstra added a comment -

        Hi Oleg,

        Thanks for looking into this and sorry for not making clear where MimeUtility originates from.

        MimeUtility is from javax.mail (for instance http://java.sun.com/j2ee/sdk_1.3/techdocs/api/javax/mail/internet/MimeUtility.html).

        Dolf

        Oleg Kalnichevski added a comment -

        Dolf,
        What is MimeUtility and what package does it come from?

        Oleg

        Dolf Dijkstra added a comment -

        I have created a patch against revision 532277 for this problem. Although it is not according to the RFC, it does the job. For instance, IE does the same for multipart MIME uploads. I am not suggesting that IE is doing the right thing, but it does mean that many servers can probably deal with such a post.

        Index: src/java/org/apache/commons/httpclient/methods/multipart/FilePart.java
        ===================================================================
        --- src/java/org/apache/commons/httpclient/methods/multipart/FilePart.java (revision 532277)
        +++ src/java/org/apache/commons/httpclient/methods/multipart/FilePart.java (working copy)
        @@ -193,7 +193,11 @@
                 if (filename != null) {
                     out.write(FILE_NAME_BYTES);
                     out.write(QUOTE_BYTES);
        -            out.write(EncodingUtil.getAsciiBytes(filename));
        +            //still not the rigth thing according to RFC1522
        +            out.write( EncodingUtil.getBytes( filename, this.getCharSet() ) );
        +            /*TODO: the right thing would be to do this, but some MIMEDecoders can't handle it.
        +            String s = MimeUtility.encodeText(filename);
        +            */
                     out.write(QUOTE_BYTES);
                 }
             }

        Oleg Kalnichevski made changes -
        Assignee HttpComponents Dev [ httpclient-dev@jakarta.apache.org ]
        Francis Labaere added a comment -

        I just wanted to add some interesting RFCs for this feature request:

        RFC 2231
        RFC 2047
        RFC 2184

        Henri Yandell made changes -
        Field Original Value New Value
        issue.field.bugzillaimportkey 24504 12333852
        Oleg Kalnichevski added a comment -

        HTTPCLIENT-368 has been marked as a duplicate of this bug.

        Oleg Kalnichevski added a comment -

        Re-opened as a feature request

        Oleg Kalnichevski added a comment -

        Form-based File Upload in HTML specification (RFC 1867)
        <http://www.ietf.org/rfc/rfc1867.txt> that HttpClient implements follows the
        rules of all multipart MIME data streams as outlined in RFC 1521 and RFC 1522.
        MIME specification requires all non-ASCII content to be represented using ASCII
        charset only. Currently HttpClient does not perform such translation
        automatically. You will have to take care of filename encoding prior to passing
        it to the FilePart as a parameter.

        I was going to contribute a quoted-printable encoder/decoder to the Commons Codec
        library but never got the chance.

        To sum things up: if the relevant RFCs are to be strictly adhered to, the
        behaviour on the part of HttpClient is correct. However, I do agree that it
        would be nice if HttpClient took care of non-ASCII charset translation
        automatically. So, feel free to reopen this bug as a feature request.

        Oleg
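
To illustrate what "taking care of filename encoding" before handing the name to FilePart might involve, here is a hand-written RFC 2047 "Q" (quoted-printable style) encoded-word. It is a sketch under stated assumptions: the class name is hypothetical, the charset is fixed to UTF-8, every byte other than an ASCII letter or digit is escaped (more than the RFC strictly requires), and the 75-character limit per encoded-word is not enforced.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: produces an ASCII-only RFC 2047 "Q" encoded-word.
class EncodedWordQ {
    static String encode(String text) {
        StringBuilder sb = new StringBuilder("=?UTF-8?Q?");
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean alnum = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
            if (alnum) {
                sb.append((char) c);
            } else if (c == ' ') {
                sb.append('_');                        // Q encoding represents space as underscore
            } else {
                sb.append(String.format("=%02X", c));  // any other byte as =XX
            }
        }
        return sb.append("?=").toString();
    }
}
```

The resulting string contains only ASCII, so passing it through getAsciiBytes would leave it unchanged; the receiving side still has to decode it.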

        Eric Dofonsou added a comment -

        Hi Oleg.

        Yes, I am using a multipart post.

        In our application we want to upload files to a file server from a Java
        application via HTTP. We use multipart because we have to include extra
        information for the server application to be able to handle the data (i.e. link
        the file to a database object etc.). We also want to be able to upload
        multiple files (which works well as long as we have no accents in the filenames).

        Here is the code that builds the file parts:

        HttpClient client = new HttpClient();
        MultipartPostMethod httpsPost = new MultipartPostMethod(m_docServer);

        // Set header information
        httpsPost.setRequestHeader("Content-Type", "multipart/form-data; boundary=" + BOUNDS);

        // Add the main parts.
        StringPart partToAdd = new StringPart("ClassUID", classUID);
        partToAdd.setTransferEncoding(null);
        partToAdd.setContentType(null);
        httpsPost.addPart(partToAdd);

        partToAdd = new StringPart("MethodName", methodName);
        partToAdd.setTransferEncoding(null);
        partToAdd.setContentType(null);
        httpsPost.addPart(partToAdd);

        partToAdd = new StringPart("Params", params);
        partToAdd.setTransferEncoding(null);
        partToAdd.setContentType(null);
        httpsPost.addPart(partToAdd);

        // Add the file parts.
        int i = 0;
        Iterator iterator = parts.keySet().iterator();
        AI_DOCPART part;
        String partID;
        String partFile;
        FilePart fPart;

        // Loop until we have created all file parts.
        while (iterator.hasNext()) {
            part = (AI_DOCPART) iterator.next();
            partID = part.getIDAsString();
            partFile = (String) parts.get(part);
            try {
                fPart = new FilePart("FILE" + (i + 1), new File(partFile));
                //partToAdd.setContentType(null);
                //partToAdd.setTransferEncoding(null);
                httpsPost.addPart(fPart);
            } catch (FileNotFoundException e) {
                throw new AIException("ERR_INVALIDE_FILENAME", "",
                        GUIMediator.getStringResource("Corporate", "ERR_INVALIDE_FILENAME"), "");
            }

            partToAdd = new StringPart("PARTNUMBER" + (i + 1), partID);
            partToAdd.setContentType(null);
            partToAdd.setTransferEncoding(null);
            httpsPost.addPart(partToAdd);
            i++;
        }

        // Set timeout in milliseconds -> 30 seconds
        client.setConnectionTimeout(30000);

        // Send the data
        int status = 0;
        try {
            status = client.executeMethod(httpsPost);
        }

        ...

        Oleg Kalnichevski added a comment -

        Eric,
        Are you using MultipartPostMethod by any chance? Please give me a bit more
        details about what your application is supposed to do and what you are trying to
        accomplish, so I would not have to play a private detective.

        Oleg

        Eric Dofonsou added a comment -

        My fault, by document I was referring to a file (a physical file on the hard drive),
        i.e. c:\work\DocumentDeTèst.txt <-- This filename has an accent.

        I am using the latest version: 2.0 Rc2

        As to getAsciiBytes method, as its name implies it is supposed to return ASCII
        characters only. So, the behaviour of the method is correct.

        Precisely, but because of that the accented characters are converted to ?
        i.e. c:\work\DocumentDeTèst.txt --> c:\work\DocumentDeT?st.txt

        Oleg Kalnichevski added a comment -

        Eric,
        My apologies, but I do not quite understand the nature of the problem. What do
        you mean by 'cannot create a document'? What do you mean by a document in the
        first place? Request content body? Response content body?

        What version of HttpClient are you using, and what is it you are trying to get done?

        As to the getAsciiBytes method, as its name implies, it is supposed to return ASCII
        characters only. So, the behaviour of the method is correct.

        You might want to have a look at the HttpClient character encoding guide for
        more details:

        http://jakarta.apache.org/commons/httpclient/charencodings.html

        I'll have no choice but to mark the report as invalid unless more information is
        given

        Oleg

        Eric Dofonsou created issue -

          People

          • Assignee: Unassigned
          • Reporter: Eric Dofonsou
          • Votes: 3
          • Watchers: 3