
Key: HTTPCLIENT-293
Type: Improvement
Status: Resolved
Resolution: Fixed
Priority: Minor
Assignee: Unassigned
Reporter: Eric Dofonsou
Votes: 3
Watchers: 3
HttpComponents HttpClient

Provide support for non-ASCII charsets in the multipart disposition-content header

Created: 08/Nov/03 12:48 AM   Updated: 02/Dec/08 12:50 PM
Component/s: HttpMime
Affects Version/s: 1.0 Alpha
Fix Version/s: 4.0 Beta 2

Time Tracking:
Not Specified

Environment:
Operating System: All
Platform: All

Bugzilla Id: 24504
Resolution Date: 02/Dec/08 12:50 PM


Description
Because of the following line in getAsciiBytes
 data.getBytes("US-ASCII");

The returned string is modified if it has Latin characters.

Ex : Document non-controlé -> Document non-control?
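
The substitution described above can be reproduced with the JDK alone. The following is a minimal sketch, not HttpClient code: the class and method names are hypothetical, and it only mirrors the `getBytes("US-ASCII")` call quoted in the report.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only mirrors the getBytes("US-ASCII") call from the report.
class AsciiLossDemo {
    // Encode to US-ASCII and decode again, as a receiver of the bytes would see them.
    static String roundTripAscii(String data) {
        byte[] bytes = data.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u00e9 is the accented e from the example in the report.
        System.out.println(roundTripAscii("Document non-control\u00e9"));
        // prints "Document non-control?"
    }
}
```

Characters that US-ASCII cannot represent are replaced with `?` during encoding, so the original character cannot be recovered on the receiving side.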

Oleg Kalnichevski added a comment - 08/Nov/03 07:29 PM
Eric,
My apologies, but I do not quite understand the nature of the problem. What do
you mean by 'cannot create a document'? What do you mean by a document in the
first place? Request content body? Response content body?

What version of HttpClient are you using, and what is it you are trying to get done?

As to getAsciiBytes method, as its name implies it is supposed to return ASCII
characters only. So, the behaviour of the method is correct.

You might want to have a look at the HttpClient character encoding guide for
more details:

http://jakarta.apache.org/commons/httpclient/charencodings.html

I'll have no choice but to mark the report as invalid unless more information is
given

Oleg

Eric Dofonsou added a comment - 14/Nov/03 08:36 AM

My fault, by document I was referring to a file (a physical file on the hard drive)
ie : c:\work\DocumentDeTèst.txt <-- This filename has an accent.


I am using the latest version : 2.0 Rc2

> As to getAsciiBytes method, as its name implies it is supposed to return ASCII
> characters only. So, the behaviour of the method is correct.

Precisely, but because of that the accented characters are converted to ?
ie : c:\work\DocumentDeTèst.txt --> c:\work\DocumentDeT?st.txt

Oleg Kalnichevski added a comment - 14/Nov/03 04:58 PM
Eric,
Are you using MultipartPostMethod by any chance? Please give me a bit more
details about what your application is supposed to do and what you are trying to
accomplish, so I would not have to play a private detective.

Oleg

Eric Dofonsou added a comment - 19/Nov/03 12:34 AM
Hi Oleg.

Yes, I am using a multipart post.

In our application we want to upload files to a file server from a java
application via HTTP. We use multipart because we have to include extra
information for the server application to be able to handle the data (ie : link
the file to a database object etc ...). We also want to be able to upload
multiple files (which works well as long as we have no accents in the filenames).

--
Here is the code that builds the file parts

HttpClient client = new HttpClient();
MultipartPostMethod httpsPost = new MultipartPostMethod ( m_docServer );

//Set header information
httpsPost.setRequestHeader("Content-Type", "multipart/form-data; boundary=" + BOUNDS);

//Adding the main parts.
StringPart partToAdd = new StringPart("ClassUID", classUID);
partToAdd.setTransferEncoding(null);
partToAdd.setContentType(null);
httpsPost.addPart( partToAdd );

partToAdd = new StringPart("MethodName", methodName);
partToAdd.setTransferEncoding(null);
partToAdd.setContentType(null);
httpsPost.addPart( partToAdd );

partToAdd = new StringPart("Params", params);
partToAdd.setTransferEncoding(null);
partToAdd.setContentType(null);
httpsPost.addPart( partToAdd );


//Adding the file parts.
int i=0;
Iterator iterator = parts.keySet().iterator();
AI_DOCPART part;
String partID;
String partFile;
FilePart fPart;

//loop until we have created all file parts.
while(iterator.hasNext()){
  part = (AI_DOCPART)(iterator.next());
  partID = part.getIDAsString();
  partFile = (String) parts.get(part);
  try {
    fPart = new FilePart("FILE"+(i+1), new File(partFile));
    //partToAdd.setContentType(null);
    //partToAdd.setTransferEncoding( null );
    httpsPost.addPart(fPart);
  }
  catch (FileNotFoundException e) {
    throw new AIException("ERR_INVALIDE_FILENAME", "",
        GUIMediator.getStringResource("Corporate", "ERR_INVALIDE_FILENAME"), "");
  }
  partToAdd = new StringPart("PARTNUMBER"+(i+1) , partID);
  partToAdd.setContentType(null);
  partToAdd.setTransferEncoding( null );
  httpsPost.addPart( partToAdd );
  i++;
}

//Set timeout in milliseconds -> 30 seconds
client.setConnectionTimeout( 30000 );

//Send the data
int status=0;
try {
status = client.executeMethod(httpsPost);
}
...

Oleg Kalnichevski added a comment - 19/Nov/03 01:30 AM
Form-based File Upload in HTML specification (RFC 1867)
<http://www.ietf.org/rfc/rfc1867.txt> that HttpClient implements follows the
rules of all multipart MIME data streams as outlined in RFC 1521 and RFC 1522.
MIME specification requires all non-ASCII content to be represented using ASCII
charset only. Currently HttpClient does not perform such translation
automatically. You will have to take care of filename encoding prior to passing
it to the FilePart as a parameter.

I was going to contribute a quoted-printable encoder/decoder to the Commons Codec
library but never got a chance.

To sum things up: if the relevant RFCs are to be strictly adhered to, the
behaviour on the part of HttpClient is correct. However, I do agree that it
would be nice if HttpClient took care of non-ASCII charset translation
automatically. So, feel free to reopen this bug as a feature request.

Oleg
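
For readers who want to pre-encode a filename as suggested in the comment above, here is a minimal sketch of an RFC 2047 style encoded-word, using Base64 ("B" encoding) rather than quoted-printable. It is an illustration under assumptions, not HttpClient or Commons Codec API: the class and method names are hypothetical, UTF-8 is an arbitrary choice of charset, and the 75-character limit on encoded-words is ignored.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper; not part of HttpClient or Commons Codec.
class EncodedWordSketch {
    // Returns the text unchanged when it is pure ASCII; otherwise wraps the
    // UTF-8 bytes in an encoded-word of the form =?charset?B?base64?=
    static String encodeWord(String text) {
        if (StandardCharsets.US_ASCII.newEncoder().canEncode(text)) {
            return text;
        }
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return "=?UTF-8?B?" + Base64.getEncoder().encodeToString(utf8) + "?=";
    }

    public static void main(String[] args) {
        // \u00e8 is the accented e from the filename in the report.
        System.out.println(encodeWord("DocumentDeT\u00e8st.txt"));
    }
}
```

The result contains only ASCII characters, so it survives getAsciiBytes unchanged. Note that RFC 2047 restricts where encoded-words may appear and excludes Content-Disposition parameters, so whether the receiving server decodes such a filename has to be tested against that server.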

Oleg Kalnichevski added a comment - 13/Jan/04 07:30 PM
Re-opened as a feature request

Oleg Kalnichevski added a comment - 08/Aug/04 08:06 PM
*** HTTPCLIENT-368 has been marked as a duplicate of this bug. ***

Henri Yandell made changes - 12/May/06 02:14 PM
Field Original Value New Value
issue.field.bugzillaimportkey 24504 12333852
Francis Labaere added a comment - 30/Oct/06 12:37 PM
I just wanted to add some interesting RFC for this feature request:

RFC 2231
RFC 2047
RFC 2184

Oleg Kalnichevski made changes - 12/Dec/06 07:49 PM
Assignee HttpComponents Dev [ httpclient-dev@jakarta.apache.org ]
Dolf Dijkstra added a comment - 25/Apr/07 10:54 AM
I have created a patch against revision 532277 for this problem. Although it does not follow the RFC, it does the job. For instance, IE does the same for multipart MIME uploads. I am not suggesting that IE is doing the right thing, but it does mean that many servers can probably deal with such a post.

Index: src/java/org/apache/commons/httpclient/methods/multipart/FilePart.java
===================================================================
--- src/java/org/apache/commons/httpclient/methods/multipart/FilePart.java (revision 532277)
+++ src/java/org/apache/commons/httpclient/methods/multipart/FilePart.java (working copy)
@@ -193,7 +193,11 @@
         if (filename != null) {
             out.write(FILE_NAME_BYTES);
             out.write(QUOTE_BYTES);
- out.write(EncodingUtil.getAsciiBytes(filename));
+ //still not the rigth thing according to RFC1522
+ out.write( EncodingUtil.getBytes( filename, this.getCharSet() ) );
+ /*TODO: the right thing would be to do this, but some MIMEDecoders can't handle it.
+ String s = MimeUtility.encodeText(filename);
+ */
             out.write(QUOTE_BYTES);
         }
     }



Oleg Kalnichevski added a comment - 27/Apr/07 01:48 PM
Dolf,
What is MimeUtility and what package does it come from?

Oleg

Dolf Dijkstra added a comment - 27/Apr/07 03:40 PM
Hi Oleg,

Thanks for looking into this and sorry for not making clear where MimeUtility originates from.

MimeUtility is from javax.mail (for instance http://java.sun.com/j2ee/sdk_1.3/techdocs/api/javax/mail/internet/MimeUtility.html).

Dolf


Oleg Kalnichevski added a comment - 27/Apr/07 03:55 PM
Dolf,

We simply cannot introduce a new dependency for the HttpClient 3.x code line. This will have to wait until 4.0.

Oleg

Dolf Dijkstra added a comment - 27/Apr/07 04:31 PM
Hi Oleg,

Maybe the report is not clear.

According to the multipart MIME spec, the correct behaviour would be to encode the filename via MimeUtility. The problem with that is that the MIME parsers I have tested with don't handle this correctly.
When just encoding the filename with the charset of the request, it works but it is not according to the spec.

The patch I handed in, works on most mime parsers (as IE is doing this too) but is not according to the spec.

I understand that you don't want to introduce a new dependency, but maybe you don't need to, as the patch works without MimeUtility. The line containing MimeUtility is commented out.


Dolf


Sebb added a comment - 27/Apr/07 04:35 PM
Might even be a problem for 4.0 - the license for the JavaMail jar is such that it cannot be distributed by the ASF, as far as I am aware.

Might be worth checking if Commons-Lang has anything suitable, e.g. in StringEscapeUtils.

Oleg Kalnichevski added a comment - 27/Apr/07 09:52 PM
Dolf,
My bad. I overlooked the fact that the reference to MimeUtility was inside a comment block.

Sebastian,
I believe we can depend on JavaMail, as long as we do not have it in the repository and do not ship it with the release packages. Since we do not bundle dependencies with HttpClient anyway, this should not be a problem for us. Having said all that, I think Commons Codec, which HttpClient already depends upon, provides all the necessary codecs (BASE64 and quoted-printable). It is just a matter of someone taking up this job.

Folks,
Any objections to relaxing the compliance with the spec and applying the patch submitted by Dolf?

Oleg

Ahmed Ashour added a comment - 24/Oct/07 06:27 AM
One of the HtmlUnit users came across this bug while trying to upload a file with a non-ASCII name.

By sniffing the traffic generated by IE7, "filename" is encoded with page charset as Dolf has kindly suggested.

However, IE7 does not send the charset after the 'Content-Type':

---------------------------
Content-Disposition: form-data; name="field_name"; filename="C:\non_ascii.txt"
Content-Type: text/plain
---------------------------

So, to mimic this behaviour exactly, I would appreciate it if the part charset were separated from the "Content-Disposition" charset.

Many thanks.

Ortwin Glück added a comment - 24/Oct/07 09:38 AM
We should be spec compliant and not "compatible with most" implementations - I don't care how wrong IE7 implements this. RFC 2183, Section 2.3 clearly states the limitation to ASCII. People should just accept this limitation instead of trying to bend the standard to their needs. Standards are made to ensure interoperability, for $DIETY's sake. If you need to pass a non-ASCII filename, this is simply not the place for it. You could add another text/plain MIME part with a well-defined charset and pass the file name there for instance.

Mohammad Rezaei added a comment - 24/Oct/07 02:39 PM
Ortwin, I think the RFC is worded strangely. It is certainly true that Section 2.3 says US-ASCII only, but it seems like that section is outdated.

In Section 2, there is a very large note that reads:

NOTE ON PARAMETER VALUE LENGHTS: A short (length <= 78 characters)
   parameter value containing only non-`tspecials' characters SHOULD be
   represented as a single `token'. A short parameter value containing
   only ASCII characters, but including `tspecials' characters, SHOULD
   be represented as `quoted-string'. Parameter values longer than 78
   characters, or which contain non-ASCII characters, MUST be encoded as
   specified in [RFC 2184].

Looking at the types of parameters, 4 of them are dates and one is an integer. The only one that's a string is the filename, so the note above must refer to it. RFC 2184 describes how to encode the non-ASCII case. Interestingly, it looks like IE does not follow RFC 2184.

Section 2.3 refers to RFC 2045, which is older than RFC 2184.

Overall, I'd say the RFC is unclear on this issue.

Thanks
Moh
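
To make the encoding referred to above concrete: RFC 2184 and its successor RFC 2231 define an extended parameter form, `name*=charset'language'percent-encoded-value`. The following is a minimal sketch assuming UTF-8 and an empty language tag. The class and method names are hypothetical, and the set of characters left unescaped is deliberately more conservative than the RFC requires.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the extended parameter syntax.
class ExtendedParamSketch {
    // Builds filename*=UTF-8''<value>, percent-encoding every byte that is
    // not an ASCII letter, digit, '.', '-' or '_'.
    static String filenameStar(String filename) {
        StringBuilder sb = new StringBuilder("filename*=UTF-8''");
        for (byte b : filename.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean safe = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9') || c == '.' || c == '-' || c == '_';
            if (safe) {
                sb.append((char) c);
            } else {
                sb.append('%').append(String.format("%02X", c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(filenameStar("DocumentDeT\u00e8st.txt"));
        // prints "filename*=UTF-8''DocumentDeT%C3%A8st.txt"
    }
}
```

The header stays pure ASCII while the charset is declared explicitly, which is the property Section 2.3 asks for. As noted in this thread, not every receiver understands this form.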


Ortwin Glück added a comment - 24/Oct/07 02:52 PM
Interesting, although I have never seen it being used in the wild. By the way, RFC 2184 is obsoleted by RFC 2231.

Roland Weber made changes - 07/Feb/08 08:34 PM
Component/s HttpClient [ 12311010 ]
Component/s HttpMime [ 12312149 ]
Oleg Kalnichevski made changes - 16/Apr/08 05:49 PM
Fix Version/s 4.0 Final [ 12311094 ]
Fix Version/s 4.0 Alpha 5 [ 12313110 ]
Oleg Kalnichevski made changes - 21/May/08 05:18 PM
Fix Version/s 4.0 Alpha 5 [ 12313110 ]
Fix Version/s 4.0 beta 2 [ 12313164 ]
Oleg Kalnichevski added a comment - 02/Dec/08 12:50 PM
MultipartEntity now encodes non-ASCII characters in the disposition-content header using content charset when used in the browser compatibility mode and replaces non-ASCII characters with ? when used in the strict mode. One always has an option to encode the file name using one of the standard encoding mechanisms as described in RFC2231 and RFC2047.

Closing this issue as resolved.

Oleg
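
The difference between the two modes described in the resolution can be sketched with the JDK alone. This is not the MultipartEntity implementation: the class, the method, and the boolean flag are hypothetical stand-ins that only show what bytes each policy would put on the wire for a filename.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the two policies; not HttpMime code.
class DispositionModesSketch {
    // strict == true : encode as US-ASCII, so non-ASCII characters become '?'
    // strict == false: encode with the content charset (browser-compatible behaviour)
    static byte[] filenameBytes(String filename, Charset contentCharset, boolean strict) {
        Charset charset = strict ? StandardCharsets.US_ASCII : contentCharset;
        return filename.getBytes(charset);
    }

    public static void main(String[] args) {
        String name = "DocumentDeT\u00e8st.txt";
        byte[] strict = filenameBytes(name, StandardCharsets.UTF_8, true);
        byte[] compat = filenameBytes(name, StandardCharsets.UTF_8, false);
        // In strict mode the accented character is lost.
        System.out.println(new String(strict, StandardCharsets.US_ASCII));
        // In compatible mode the receiver must decode with the same charset.
        System.out.println(new String(compat, StandardCharsets.UTF_8));
    }
}
```

In the compatible case nothing in the header declares the charset, so sender and receiver have to agree on it out of band; that is the trade-off debated earlier in this thread.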

Oleg Kalnichevski made changes - 02/Dec/08 12:50 PM
Resolution Fixed [ 1 ]
Status Open [ 1 ] Resolved [ 5 ]