  1. HttpComponents HttpClient
  2. HTTPCLIENT-293

Provide support for non-ASCII charsets in the multipart disposition-content header

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0 Alpha
    • Fix Version/s: 4.0 Beta 2
    • Component/s: HttpClient (classic)
    • Labels:
      None
    • Environment:
      Operating System: All
      Platform: All
    • Bugzilla Id:
      24504

      Description

      Because of the following line in getAsciiBytes
      data.getBytes("US-ASCII");

      the returned string is modified if it contains Latin (accented) characters.

      Ex : Document non-controlé -> Document non-control?
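The substitution can be reproduced with the JDK alone. This is a minimal sketch that assumes nothing about HttpClient beyond the `getBytes("US-ASCII")` call quoted above; the class name `AsciiLossDemo` and its method are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class AsciiLossDemo {
    // Encodes to US-ASCII and decodes again. String.getBytes replaces every
    // character the charset cannot represent with its replacement byte, '?'.
    static String roundTripAscii(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u00e9 is the accented 'e' from the example in the report
        System.out.println(roundTripAscii("Document non-control\u00e9"));
    }
}
```

The accented character is lost at encoding time, so nothing downstream can recover it.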

          Activity

          olegk Oleg Kalnichevski added a comment -

          Eric,
          My apologies, but I do not quite understand the nature of the problem. What do
          you mean by 'cannot create a document'? What do you mean by a document in the
          first place? Request content body? Response content body?

          What version of HttpClient are you using and what is it you are trying to get done?

          As to getAsciiBytes method, as its name implies it is supposed to return ASCII
          characters only. So, the behaviour of the method is correct.

          You might want to have a look at the HttpClient character encoding guide for
          more details:

          http://jakarta.apache.org/commons/httpclient/charencodings.html

          I'll have no choice but to mark the report as invalid unless more information is
          given

          Oleg

          ewrickspm@yahoo.com Eric Dofonsou added a comment -

          My fault, by document I was referring to a file (a physical file on the hard drive)
          ie : c:\work\DocumentDeTèst.txt <-- This filename has an accent.

          I am using the latest version : 2.0 Rc2

          As to getAsciiBytes method, as its name implies it is supposed to return ASCII
          characters only. So, the behaviour of the method is correct.

          Precisely, but because of that the accented characters are converted to ?
          ie : c:\work\DocumentDeTèst.txt --> c:\work\DocumentDeT?st.txt

          olegk Oleg Kalnichevski added a comment -

          Eric,
          Are you using MultipartPostMethod by any chance? Please give me a bit more
          details about what your application is supposed to do and what you are trying to
          accomplish, so I would not have to play a private detective.

          Oleg

          ewrickspm@yahoo.com Eric Dofonsou added a comment -

          Hi Oleg.

          Yes, I am using a multipart post.

          In our application we want to upload files to a file server from a Java
          application via HTTP. We use multipart because we have to include extra
          information for the server application to be able to handle the data (ie : link
          the file to a database object etc ...). We also want to be able to upload
          multiple files (which works well as long as we have no accent in the filenames)


          Here is the code that builds the file parts

          HttpClient client = new HttpClient();
          MultipartPostMethod httpsPost = new MultipartPostMethod(m_docServer);

          // Set header information
          httpsPost.setRequestHeader("Content-Type",
                  "multipart/form-data; boundary=" + BOUNDS);

          // Adding the main parts.
          StringPart partToAdd = new StringPart("ClassUID", classUID);
          partToAdd.setTransferEncoding(null);
          partToAdd.setContentType(null);
          httpsPost.addPart(partToAdd);

          partToAdd = new StringPart("MethodName", methodName);
          partToAdd.setTransferEncoding(null);
          partToAdd.setContentType(null);
          httpsPost.addPart(partToAdd);

          partToAdd = new StringPart("Params", params);
          partToAdd.setTransferEncoding(null);
          partToAdd.setContentType(null);
          httpsPost.addPart(partToAdd);

          // Adding the file parts.
          int i = 0;
          Iterator iterator = parts.keySet().iterator();
          AI_DOCPART part;
          String partID;
          String partFile;
          FilePart fPart;

          // Loop until we have created all file parts.
          while (iterator.hasNext()) {
              part = (AI_DOCPART) (iterator.next());
              partID = part.getIDAsString();
              partFile = (String) parts.get(part);
              try {
                  fPart = new FilePart("FILE" + (i + 1), new File(partFile));
                  //partToAdd.setContentType(null);
                  //partToAdd.setTransferEncoding( null );
                  httpsPost.addPart(fPart);
              } catch (FileNotFoundException e) {
                  throw new AIException("ERR_INVALIDE_FILENAME", "",
                          GUIMediator.getStringResource("Corporate", "ERR_INVALIDE_FILENAME"), "");
              }

              partToAdd = new StringPart("PARTNUMBER" + (i + 1), partID);
              partToAdd.setContentType(null);
              partToAdd.setTransferEncoding(null);
              httpsPost.addPart(partToAdd);
              i++;
          }

          // Set timeout in milliseconds -> 30 seconds
          client.setConnectionTimeout(30000);

          // Send the data
          int status = 0;
          try {
              status = client.executeMethod(httpsPost);
          }
          ...

          olegk Oleg Kalnichevski added a comment -

          Form-based File Upload in HTML specification (RFC 1867)
          <http://www.ietf.org/rfc/rfc1867.txt> that HttpClient implements follows the
          rules of all multipart MIME data streams as outlined in RFC 1521 and RFC 1522.
          MIME specification requires all non-ASCII content to be represented using ASCII
          charset only. Currently HttpClient does not perform such translation
          automatically. You will have to take care of filename encoding prior to passing
          it to the FilePart as a parameter.

          I was going to contribute a quoted-printable encoder/decoder to the Commons Codec
          library but never got a chance.

          To sum things up: if the relevant RFCs are to be strictly adhered to, the
          behaviour on the part of HttpClient is correct. However, I do agree that it
          would be nice if HttpClient took care of non-ASCII charset translation
          automatically. So, feel free to reopen this bug as a feature request.

          Oleg
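As an illustration of encoding the filename before handing it to FilePart, here is a minimal sketch of an RFC 2047 "B" encoded-word built with the JDK's Base64 encoder. The class and method names are hypothetical. It ignores the 75-character limit on encoded-words, and RFC 2047 itself restricts where encoded-words may appear, so whether a given server decodes the result is an assumption to verify.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class EncodedWordDemo {
    // Wraps the text in an RFC 2047 "B" encoded-word: =?charset?B?base64?=
    // The result contains only US-ASCII characters.
    static String encodeWord(String text) {
        String b64 = Base64.getEncoder()
                .encodeToString(text.getBytes(StandardCharsets.UTF_8));
        return "=?UTF-8?B?" + b64 + "?=";
    }

    // Reverses encodeWord; only handles the exact form produced above.
    static String decodeWord(String word) {
        String b64 = word.substring("=?UTF-8?B?".length(), word.length() - 2);
        return new String(Base64.getDecoder().decode(b64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encodeWord("DocumentDeT\u00e8st.txt"));
    }
}
```

The encoded string could then be supplied as the file name of the part instead of the name taken from the File object, assuming the FilePart constructor that accepts an explicit file name is available in the version in use.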

          olegk Oleg Kalnichevski added a comment -

          Re-opened as a feature request

          olegk Oleg Kalnichevski added a comment -

          HTTPCLIENT-368 has been marked as a duplicate of this bug.

          labaere Francis Labaere added a comment -

          I just wanted to add some interesting RFC for this feature request:

          RFC 2231
          RFC 2047
          RFC 2184

          ddijkstra Dolf Dijkstra added a comment -

          I have created a patch against revision 532277 for this problem. Although it is not according to the RFC, it does the job. For instance, IE does the same for multipart MIME uploads. I am not suggesting that IE is doing the right thing, but it does mean that many servers can probably deal with such a post.

          Index: src/java/org/apache/commons/httpclient/methods/multipart/FilePart.java
          ===================================================================
          --- src/java/org/apache/commons/httpclient/methods/multipart/FilePart.java (revision 532277)
          +++ src/java/org/apache/commons/httpclient/methods/multipart/FilePart.java (working copy)
          @@ -193,7 +193,11 @@
                   if (filename != null) {
                       out.write(FILE_NAME_BYTES);
                       out.write(QUOTE_BYTES);
          -            out.write(EncodingUtil.getAsciiBytes(filename));
          +            //still not the rigth thing according to RFC1522
          +            out.write( EncodingUtil.getBytes( filename, this.getCharSet() ) );
          +            /*TODO: the right thing would be to do this, but some MIMEDecoders can't handle it.
          +            String s = MimeUtility.encodeText(filename);
          +            */
                       out.write(QUOTE_BYTES);
                   }
               }

          olegk Oleg Kalnichevski added a comment -

          Dolf,
          What is MimeUtility and what package does it come from?

          Oleg

          ddijkstra Dolf Dijkstra added a comment -

          Hi Oleg,

          Thanks for looking into this and sorry for not making clear where MimeUtility originates from.

          MimeUtility is from javax.mail (for instance http://java.sun.com/j2ee/sdk_1.3/techdocs/api/javax/mail/internet/MimeUtility.html).

          Dolf

          olegk Oleg Kalnichevski added a comment -

          Dolf,

          We simply cannot introduce a new dependency for the HttpClient 3.x code line. This will have to wait until 4.0.

          Oleg

          ddijkstra Dolf Dijkstra added a comment -

          Hi Oleg,

          Maybe the report is not clear.

          According to the multipart MIME spec, the correct behaviour would be to encode the filename via MimeUtility. The problem with that is that the MIME parsers I have tested with don't handle this correctly.
          When just encoding the filename with the charset of the request, it works, but it is not according to the spec.

          The patch I handed in works on most MIME parsers (as IE is doing this too) but is not according to the spec.

          I understand that you don't want to introduce a new dependency, but maybe you don't need to, as the patch works without MimeUtility. The line containing MimeUtility is commented out.

          Dolf

          sebb@apache.org Sebb added a comment -

          Might even be a problem for 4.0 - the license for the JavaMail jar is such that it cannot be distributed by the ASF, as far as I am aware.

          Might be worth checking if Commons-Lang has anything suitable, e.g. in StringEscapeUtils.

          olegk Oleg Kalnichevski added a comment -

          Dolf,
          My bad. I overlooked the fact that the reference to MimeUtility was inside a comment block.

          Sebastian,
          I believe we can depend on JavaMail, as long as we do not have it in the repository and do not ship it with the release packages. Since we do not bundle dependencies with HttpClient anyway, this should not be a problem for us. Having said all that, I think Commons Codec, which HttpClient already depends upon, provides all the necessary codecs (BASE64 and quoted-printable). It is just a matter of someone taking up this job.

          Folks,
          Any objections to relaxing the compliance with the spec and applying the patch submitted by Dolf?

          Oleg

          asashour Ahmed Ashour added a comment -

          One of the HtmlUnit users came across this bug while trying to upload a file with a non-ASCII name.

          By sniffing the traffic generated by IE7, "filename" is encoded with page charset as Dolf has kindly suggested.

          However, IE7 does not send the charset after the 'Content-Type':

          ---------------------------
          Content-Disposition: form-data; name="field_name"; filename="C:\non_ascii.txt"
          Content-Type: text/plain
          ---------------------------

          So, to mimic this behaviour exactly, I would appreciate it if the part charset were separated from the "Content-Disposition" charset.

          Many thanks.

          oglueck Ortwin Glück added a comment -

          We should be spec compliant and not "compatible with most" implementations - I don't care how wrongly IE7 implements this. RFC 2183, Section 2.3 clearly states the limitation to ASCII. People should just accept this limitation instead of trying to bend the standard to their needs. Standards are made to ensure interoperability, for $DEITY's sake. If you need to pass a non-ASCII filename, this is simply not the place for it. You could, for instance, add another text/plain MIME part with a well-defined charset and pass the file name there.

          mrezaei Mohammad Rezaei added a comment -

          Ortwin, I think the RFC is worded strangely. It is certainly true that Section 2.3 says US-ASCII only, but it seems like that section is outdated.

          In Section 2, there is a very large note that reads:

          NOTE ON PARAMETER VALUE LENGHTS: A short (length <= 78 characters)
          parameter value containing only non-`tspecials' characters SHOULD be
          represented as a single `token'. A short parameter value containing
          only ASCII characters, but including `tspecials' characters, SHOULD
          be represented as `quoted-string'. Parameter values longer than 78
          characters, or which contain non-ASCII characters, MUST be encoded as
          specified in [RFC 2184].

          Looking at the types of parameters, 4 of them are dates and one is an integer. The only one that's a string is the filename, so the note above must refer to it. RFC 2184 describes how to encode the non-ASCII case. Interestingly, it looks like IE does not follow RFC 2184.

          Section 2.3 refers to RFC 2045, which is older than RFC 2184.

          Overall, I'd say the RFC is unclear on this issue.

          Thanks
          Moh

          oglueck Ortwin Glück added a comment -

          Interesting, although I have never seen it being used in the wild. By the way, RFC 2184 is obsoleted by RFC 2231.

          olegk Oleg Kalnichevski added a comment -

          MultipartEntity now encodes non-ASCII characters in the disposition-content header using content charset when used in the browser compatibility mode and replaces non-ASCII characters with ? when used in the strict mode. One always has an option to encode the file name using one of the standard encoding mechanisms as described in RFC2231 and RFC2047.

          Closing this issue as resolved.

          Oleg
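The difference between the two modes can be modelled with plain JDK calls. This is only an illustrative sketch of the behaviour described in this comment, not HttpClient's actual code; the class name, method name and the boolean `strict` flag are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DispositionModesDemo {
    // Strict mode: the file name is written as US-ASCII, so every non-ASCII
    // character becomes '?'. Browser-compatible mode: the file name is
    // written using the content charset, so the raw bytes are preserved.
    static byte[] filenameBytes(String filename, Charset contentCharset, boolean strict) {
        Charset cs = strict ? StandardCharsets.US_ASCII : contentCharset;
        return filename.getBytes(cs);
    }

    public static void main(String[] args) {
        String name = "T\u00e8st.txt";
        byte[] strict = filenameBytes(name, StandardCharsets.UTF_8, true);
        byte[] compat = filenameBytes(name, StandardCharsets.UTF_8, false);
        System.out.println(new String(strict, StandardCharsets.US_ASCII));
        System.out.println(compat.length + " bytes in browser-compatible mode");
    }
}
```

In the browser-compatible case the receiver has to know or guess the charset, since the header itself does not declare one.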

          sermojohn Ioannis Sermetziadis added a comment -

          I believe that the HttpClient should implement RFC2231 by using asterisks to support use of header parameter values in character sets other than US-ASCII, like in the Content-Disposition header.

          So, when a file is uploaded using MultipartEntity, the FormBodyPart should include a Content-Disposition header that follows the specification, in order to correctly encode the file name, in case it uses a character set other than US-ASCII.

          An example of such a header is:
          Content-Disposition=form-data; name=file; filename*=utf-8''test

          If you agree, I could submit a patch on this.
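For readers unfamiliar with the asterisk notation, a rough sketch of how such a parameter could be produced follows. It is not a proposed patch: the class and method names are hypothetical, the charset is fixed to UTF-8, the language tag is left empty, and the set of characters left unescaped is a conservative assumption rather than the full RFC 2231 grammar.

```java
import java.nio.charset.StandardCharsets;

class Rfc2231Demo {
    // Punctuation assumed safe to leave unescaped (a conservative subset).
    private static final String SAFE_PUNCTUATION = "!#$&+-.^_`|~";

    // Builds name*=utf-8''<percent-encoded UTF-8 bytes of value>.
    static String extendedParam(String name, String value) {
        StringBuilder sb = new StringBuilder(name).append("*=utf-8''");
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean safe = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9') || SAFE_PUNCTUATION.indexOf(c) >= 0;
            if (safe) {
                sb.append((char) c);
            } else {
                sb.append('%').append(String.format("%02X", c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(extendedParam("filename", "T\u00e8st.txt"));
    }
}
```

For a pure ASCII name such as "test" the output matches the example header parameter above; accented characters appear as percent-encoded UTF-8 bytes.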

          reschke Julian Reschke added a comment -

          Do you have evidence of anybody using RFC 2231 here? See <https://www.greenbytes.de/tech/webdav/rfc7578.html#form-charset> and <https://www.greenbytes.de/tech/webdav/rfc7578.html#rfc.section.4.2.p.5>.

          olegk Oleg Kalnichevski added a comment -

          @Ioannis Sermetziadis, @Julian Reschke: HttpClient presently supports three MIME multipart modes: strict (RFC 822, RFC 2045, RFC 2046), browser compatible, and RFC 6532 compatible. Even if RFC 2231 is not known to be widely used, if someone is willing to contribute support for it with proper test coverage, I see no reason why we should not take it.

          Oleg

          reschke Julian Reschke added a comment - - edited

          What's the point in implementing something that the applicable spec says "MUST NOT"? As far as I can tell, that spec defines a different approach which is supposed to be what at least some user agents already do.

          (And yes, it's entirely possible that the spec is incorrect, in which case proper tests and reporting the problem to the IETF would be the right answer)

          (Also, RFC 6532 seems to be entirely irrelevant in this context)

          olegk Oleg Kalnichevski added a comment -

          @Julian Reschke Julian, from your previous statement I understood that RFC 2231 had not been in widespread use, but not that it had been superseded by another spec. What applicable spec are you referring to? I am also fine with dropping RFC 6532 if there is a superseding spec.

          Oleg

          reschke Julian Reschke added a comment -

          Trying to clarify:

          a) AFAIU, RFC 2231 encoding is not used in multipart payloads.

          b) RFC 6532 is irrelevant (being about header fields in email).

          c) The current spec about multipart/form-data is RFC 7578 (<https://www.greenbytes.de/tech/webdav/rfc7578.html>) which I already quoted above (see <https://www.iana.org/assignments/media-types/media-types.xhtml#multipart>).
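As a rough illustration of the RFC 7578 approach, the sketch below percent-encodes the UTF-8 bytes of a file name and places the result in an ordinary `filename` parameter, without any `filename*` form. This reflects one reading of Section 4.2, where percent-encoding is optional; the class and method names are hypothetical, the choice of which ASCII characters to escape is an assumption, and the field name is assumed to be plain ASCII.

```java
import java.nio.charset.StandardCharsets;

class Rfc7578Demo {
    // Percent-encodes non-ASCII bytes, control characters and the few ASCII
    // characters that would break a quoted parameter value.
    static String contentDisposition(String fieldName, String filename) {
        StringBuilder encoded = new StringBuilder();
        for (byte b : filename.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            if (c < 0x20 || c >= 0x7F || c == '"' || c == '%' || c == '\\') {
                encoded.append('%').append(String.format("%02X", c));
            } else {
                encoded.append((char) c);
            }
        }
        return "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + encoded + "\"";
    }

    public static void main(String[] args) {
        System.out.println(contentDisposition("file", "T\u00e8st.txt"));
    }
}
```

A receiver has no in-band signal that the name was percent-encoded, so both sides would need to agree on this convention.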

          olegk Oleg Kalnichevski added a comment - - edited

          @Julian Reschke

          b) RFC 6532 is irrelevant (being about header fields in email).

          I fail to see why this makes it irrelevant, but I see no problem dropping RFC 6532 support in favor of an RFC 7578 conformant implementation.

          @Ioannis Sermetziadis

          Would you be interested in working on RFC 7578 compliance instead?

          Oleg

          sermojohn Ioannis Sermetziadis added a comment -

          I see how RFC 7578 (in section 2 and 4.2) handles the non-ASCII character usage in the multipart part's content-disposition header.

          Both RFC 7578 and RFC 2231 seem to provide a solution to the problem, but I do not know which approach is currently dominant. Based on the big difference in their release dates, I would assume that RFC 2231 was in wide use before RFC 7578 was released, so providing support for it would be a benefit. Additionally, supporting RFC 7578 would be valuable, as it seems simple and efficient.

          Sure, I would be interested to work on both or one of the options. Not sure, however, how the implementation might conflict with the existing HttpMultipartModes. For example, it is not clear to me which specifications the browser_compatible mode follows. Also, should the HttpClient be backwards compatible with the currently defined modes?
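To make the difference between the two approaches concrete, here is a minimal Java sketch of how the same file name could appear in a part's Content-Disposition header under each scheme. The class and method names (`FilenameParamForms`, `rfc7578`, `rfc2231`) and the field name `file` are hypothetical and not part of HttpClient. The set of characters left unencoded (the RFC 3986 "unreserved" set) is a conservative assumption, not something taken from the discussion above.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: names are hypothetical, not HttpClient API.
final class FilenameParamForms {

    // Percent-encodes every UTF-8 byte that is not an RFC 3986 "unreserved" character.
    // Assumption: this conservative safe set is acceptable for both header forms.
    static String percentEncode(String value) {
        StringBuilder out = new StringBuilder();
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9')
                    || c == '-' || c == '.' || c == '_' || c == '~';
            if (unreserved) {
                out.append((char) c);
            } else {
                out.append('%').append(String.format("%02X", c));
            }
        }
        return out.toString();
    }

    // RFC 7578 style: an ordinary quoted "filename" parameter with a percent-encoded value.
    static String rfc7578(String filename) {
        return "Content-Disposition: form-data; name=\"file\"; filename=\""
                + percentEncode(filename) + "\"";
    }

    // RFC 2231 style: extended "filename*" parameter carrying charset and an empty language tag.
    static String rfc2231(String filename) {
        return "Content-Disposition: form-data; name=\"file\"; filename*=UTF-8''"
                + percentEncode(filename);
    }
}
```

For the file name `Document non-controlé.txt`, `rfc7578` yields `filename="Document%20non-control%C3%A9.txt"` and `rfc2231` yields `filename*=UTF-8''Document%20non-control%C3%A9.txt`. Note that RFC 7578 section 4.2 rules out the `filename*` form (as profiled for HTTP by RFC 5987) for multipart/form-data.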

          olegk Oleg Kalnichevski added a comment -

          For example, it is not clear to me which specifications the browser_compatible mode follows.

          There is no specification to speak of. It just represents an attempt at simulating the behavior of common browsers.

          Also, should the HttpClient be backwards compatible with the currently defined modes

          Depends on what branch you decide to contribute to. It is certainly the case for 4.5.x but we can be more flexible in 5.x.

          Oleg

          githubbot ASF GitHub Bot added a comment -

          GitHub user sermojohn opened a pull request:

          https://github.com/apache/httpcomponents-client/pull/85

          HTTPCLIENT-293 Fix proposal based on RFC 7578

          I implemented the fix in two commits, because I believe that some refactoring was required in order to handle the part header field parameters (name, filename) properly in the Content-Disposition part header field. Also a unit test was implemented that fails on the first commit but succeeds on the second that includes the actual patch.

          I did some research on percent encoding which, as far as I understand, is quite loose about which characters must always be encoded, depending on the context. The actual percent-encoding implementation was copied from commons-codec's URLCodec, which could not be reused as is because that class includes URL-specific encoding (e.g. ' ' -> '+').

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/sermojohn/httpcomponents-client 4.6.x

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/httpcomponents-client/pull/85.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #85


          commit 4fbaa720ab2ee35f3281d193ce735d1f689dd175
          Author: Ioannis Sermetziadis <sermojohn@gmail.com>
          Date: 2017-09-28T20:56:49Z

          HTTPCLIENT-293 Refactored code in order to support multipart header field parameters in the data model and postpone the formatting and encoding of the parameters until the moment written into a stream, which is essential in order to avoid multiple encodings of the same value. Also provided a test case that fails due to incorrectly handling non US-ASCII characters in the filename field of the Content-Disposition header.

          commit 1882a011ea49ea9c824fbea22a2905eb09bbe9ef
          Author: Ioannis Sermetziadis <sermojohn@gmail.com>
          Date: 2017-09-28T21:12:13Z

          HTTPCLIENT-293 Implemented the percent encoding of the filename parameter of the Content-Disposition header based on RFC7578 sections 2 and 4.2. Unit test is updated to use the new HttpMultipartMode successfully. In the new MultipartForm implementation I included a PercentCodec that performs encoding/decoding to/from the percent encoding as described in RFC7578 and RFC3986. The PercentCodec class as well as some inner classes should be proposed to the commons-codec project, which apparently does not provide a generic one (URLCodec is not generic).
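The ' ' -> '+' issue mentioned in the PR description can be illustrated with a short Java sketch. This is not the PercentCodec from the PR. It uses the JDK's `java.net.URLEncoder` as a stand-in for commons-codec's URLCodec (an assumption made to avoid an external dependency), and the class `SpaceEncodingContrast` with its methods is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative only: contrasts form-style URL encoding with plain percent encoding.
final class SpaceEncodingContrast {

    // Characters copied through unchanged (RFC 3986 "unreserved"); an assumed safe set.
    private static final String UNRESERVED =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    // application/x-www-form-urlencoded style: a space becomes '+'.
    static String formUrlEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    // Plain percent encoding: a space becomes "%20", like any other unsafe byte.
    static String percentEncode(String value) {
        StringBuilder out = new StringBuilder();
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            char c = (char) (b & 0xFF);
            if (UNRESERVED.indexOf(c) >= 0) {
                out.append(c);
            } else {
                out.append('%').append(String.format("%02X", (int) c));
            }
        }
        return out.toString();
    }
}
```

With the input `my file é.txt`, `formUrlEncode` returns `my+file+%C3%A9.txt` while `percentEncode` returns `my%20file%20%C3%A9.txt`. A receiver that only reverses percent escapes would keep the `+` characters of the first form in the file name, which is why a URL-specific codec is a poor fit for header parameters.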


          sermojohn Ioannis Sermetziadis added a comment -

          I submitted a PR that includes an implementation of RFC 7578 as Julian Reschke suggested.
          https://github.com/apache/httpcomponents-client/pull/85

          I would appreciate your feedback!

          Ioannis

          githubbot ASF GitHub Bot added a comment -

          Github user ok2c commented on the issue:

          https://github.com/apache/httpcomponents-client/pull/85

          @sermojohn Do you mind making a few relatively minor changes to your patch?
          1. Could you please keep `MinimalField` immutable and move all construction logic to `FormBodyPartBuilder`?
          2. `FormBodyPartBuilder.encodeForHeader` appears to be no longer needed. It should probably be removed.
          3. Could you please avoid different formatting of imports? You might need to tweak your IDE settings for that.

          githubbot ASF GitHub Bot added a comment -

          Github user sermojohn commented on the issue:

          https://github.com/apache/httpcomponents-client/pull/85

          @ok2c pushed the suggested changes. Thank you for the feedback.

          Removed both FormBodyPartBuilder.encodeForHeader and the corresponding unit test, as I could not find this encoding rule anywhere in the spec.

          FYI, I could not build the 4.6.x branch with Java 1.6: animal-sniffer-maven-plugin reports that undefined references are used (AuthSchemeRegistry.java:127, SchemeRegistry.java:151, CookieSpecRegistry.java:139). The build worked when I switched to Java 1.8 for source and target.

          githubbot ASF GitHub Bot added a comment -

          Github user ok2c commented on the issue:

          https://github.com/apache/httpcomponents-client/pull/85

          @sermojohn Is there any reason why you encode parameters of the CONTENT_DISPOSITION field only? Why not all fields?

          githubbot ASF GitHub Bot added a comment -

          Github user sermojohn commented on the issue:

          https://github.com/apache/httpcomponents-client/pull/85

          Only the filename parameter of the Content-Disposition field is currently encoded, because it is the parameter that the specification (RFC 7578) defines as supporting non-US-ASCII characters. The other parameters, "name" (in Content-Disposition) and "charset" (in Content-Type), should only contain US-ASCII characters.

          I could modify the code that creates the "charset" parameter of the Content-Type field, but I noticed that this parameter always appears without quotes around its value, which makes it somewhat incompatible with the handling of the Content-Disposition parameters, which are always quoted. So I believe I should avoid changing that code, as it also seems unrelated to the current fix. I hope you agree.

          jira-bot ASF subversion and git services added a comment -

          Commit 582b28060335c443f971b7fe02bbfc9f3d44bf44 in httpcomponents-client's branch refs/heads/4.6.x from Ioannis Sermetziadis
          [ https://git-wip-us.apache.org/repos/asf?p=httpcomponents-client.git;h=582b280 ]

          HTTPCLIENT-293 Refactored code in order to support multipart header field parameters in the data model and postpone the formatting and encoding of the parameters until the moment written into a stream, which is essential in order to avoid multiple encodings of the same value.
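The idea of keeping raw parameter values in the data model and encoding them only at write time can be sketched roughly as follows. This is an assumption-laden illustration, not the committed code: `DeferredField`, its constructor, `writeTo`, `render`, and the rule that only `filename` is encoded are all hypothetical names and simplifications.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: an immutable header field that stores raw (unencoded) parameters.
final class DeferredField {
    private final String name;
    private final String value;
    private final Map<String, String> rawParams;

    DeferredField(String name, String value, Map<String, String> rawParams) {
        this.name = name;
        this.value = value;
        this.rawParams = new LinkedHashMap<>(rawParams); // defensive copy, keeps order
    }

    // Encoding happens exactly once, here, when the field is serialized.
    void writeTo(OutputStream out) throws IOException {
        StringBuilder line = new StringBuilder(name).append(": ").append(value);
        for (Map.Entry<String, String> p : rawParams.entrySet()) {
            // Assumption: only "filename" may carry non-ASCII and is percent-encoded.
            String v = "filename".equals(p.getKey())
                    ? percentEncode(p.getValue()) : p.getValue();
            line.append("; ").append(p.getKey()).append("=\"").append(v).append('"');
        }
        line.append("\r\n");
        out.write(line.toString().getBytes(StandardCharsets.US_ASCII));
    }

    // Percent-encodes all UTF-8 bytes outside the RFC 3986 "unreserved" set.
    static String percentEncode(String raw) {
        StringBuilder out = new StringBuilder();
        for (byte b : raw.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean safe = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9')
                    || c == '-' || c == '.' || c == '_' || c == '~';
            if (safe) {
                out.append((char) c);
            } else {
                out.append('%').append(String.format("%02X", c));
            }
        }
        return out.toString();
    }

    // Convenience for the example: serialize a field to a String.
    static String render(DeferredField field) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            field.writeTo(buf);
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory stream
        }
        return new String(buf.toByteArray(), StandardCharsets.US_ASCII);
    }
}
```

The hazard being avoided is easy to reproduce: `percentEncode("é")` gives `%C3%A9`, but encoding that result again gives `%25C3%25A9`, because the `%` itself gets escaped. Storing the raw value and encoding only in `writeTo` means no earlier step can hand over an already encoded value.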

          jira-bot ASF subversion and git services added a comment -

          Commit 278e47d277568045ab8ff3c42677c791f0227d03 in httpcomponents-client's branch refs/heads/4.6.x from Ioannis Sermetziadis
          [ https://git-wip-us.apache.org/repos/asf?p=httpcomponents-client.git;h=278e47d ]

          HTTPCLIENT-293 Implemented the percent encoding of the filename parameter of the Content-Disposition header based on RFC7578 sections 2 and 4.2. In the new MultipartForm implementation I included a PercentCodec that performs encoding/decoding to/from the percent encoding as described in RFC7578 and RFC3986.

          Closes #85

          jira-bot ASF subversion and git services added a comment -

          Commit 7aef825326a9accca5b1fb8c0ee82597ac7105d6 in httpcomponents-client's branch refs/heads/master from Ioannis Sermetziadis
          [ https://git-wip-us.apache.org/repos/asf?p=httpcomponents-client.git;h=7aef825 ]

          HTTPCLIENT-293 Refactored code in order to support multipart header field parameters in the data model and postpone the formatting and encoding of the parameters until the moment written into a stream, which is essential in order to avoid multiple encodings of the same value.

          jira-bot ASF subversion and git services added a comment -

          Commit b017a1f2b9b8e12f913843f58137962502d343a8 in httpcomponents-client's branch refs/heads/master from Ioannis Sermetziadis
          [ https://git-wip-us.apache.org/repos/asf?p=httpcomponents-client.git;h=b017a1f ]

          HTTPCLIENT-293 Implemented the percent encoding of the filename parameter of the Content-Disposition header based on RFC7578 sections 2 and 4.2. In the new MultipartForm implementation I included a PercentCodec that performs encoding/decoding to/from the percent encoding as described in RFC7578 and RFC3986.

          githubbot ASF GitHub Bot added a comment -

          Github user ok2c commented on the issue:

          https://github.com/apache/httpcomponents-client/pull/85

          @sermojohn
          I committed your changes with some modifications. I fixed multiple style check violations, escaped non Latin chars in test cases, and redesigned your code to generate fewer intermediate objects / garbage on the heap.

          Please review and close this PR. Feel free to raise another PR with follow-up changes.

          jira-bot ASF subversion and git services added a comment -

          Commit 9560aef4765ec7c50d29cd7ca7ee735bf3a6c3b6 in httpcomponents-client's branch refs/heads/master from Ioannis Sermetziadis
          [ https://git-wip-us.apache.org/repos/asf?p=httpcomponents-client.git;h=9560aef ]

          HTTPCLIENT-293 Refactored code in order to support multipart header field parameters in the data model and postpone the formatting and encoding of the parameters until the moment written into a stream, which is essential in order to avoid multiple encodings of the same value.

          jira-bot ASF subversion and git services added a comment -

          Commit a424709d89504aadd7b3d59129902666d79d0c15 in httpcomponents-client's branch refs/heads/master from Ioannis Sermetziadis
          [ https://git-wip-us.apache.org/repos/asf?p=httpcomponents-client.git;h=a424709 ]

          HTTPCLIENT-293 Implemented the percent encoding of the filename parameter of the Content-Disposition header based on RFC7578 sections 2 and 4.2. In the new MultipartForm implementation I included a PercentCodec that performs encoding/decoding to/from the percent encoding as described in RFC7578 and RFC3986.

          githubbot ASF GitHub Bot added a comment -

          Github user ok2c commented on the issue:

          https://github.com/apache/httpcomponents-client/pull/85

          @sermojohn Please close this PR.

          githubbot ASF GitHub Bot added a comment -

          Github user sermojohn closed the pull request at:

          https://github.com/apache/httpcomponents-client/pull/85


            People

            • Assignee:
              Unassigned
              Reporter:
              ewrickspm@yahoo.com Eric Dofonsou
            • Votes:
              3
              Watchers:
              8